DATA Pill feed

DATA Pill #142 - From RAG to fabric, Don’t count rows in ETL, use Delta Log metrics!

ARTICLES

How AI and Machine Learning are Fixing Data Quality Fast | 4 min | AI | Michał Kardach, Katarzyna Kusznierczuk | GetInData | Part of Xebia Blog
Discover how AI-driven tools like Monte Carlo and Talend Data Fabric improve data quality for faster insights.

TUTORIALS

Every System is a Log: Avoiding coordination in distributed applications | 13 min | Software Engineering | Stephan Ewen, Jack Kleeman, Giselle van Dongen | Restate Blog
Reduce complexity in distributed applications by using a single log to manage failures and concurrency.
Don’t count rows in ETL, use Delta Log metrics! | 7 min | Data Engineering | Adrian Chodkowski | Seequality Blog
Leverage Delta Lake’s transaction log for automated ETL monitoring and performance optimization.
Explore RAG architecture and strategies for optimizing retrieval-augmented generation models.
Databricks Lakehouse Optimization: A deep dive into Delta Lake’s VACUUM | 4 min | Data Engineering | Frank Mbonu | Xebia Blog
Learn to cut Databricks storage costs with VACUUM and advanced cleanup techniques.

DATA LIBRARY

Agents | AI | Julia Wiesinger, Patrick Marlow, Vladimir Vuskovic | Google
See how AI agents enhance decision-making with external tool access for real-time actions.

TOOL

Cellm | LLM
An Excel extension that integrates Large Language Models (LLMs) like ChatGPT into formulas.

DATA TUBE

“The Coding Machine” at Meta | Software Engineering | 1 h 15 min | Gergely Orosz, Michael Novati | The Pragmatic Engineer
Inside Meta’s engineering culture, career growth, and hiring process with insights from top engineers.

CONFS, EVENTS AND MEETUPS

Looker Community Event #6 | Rotterdam | 6th February
Join talks on Looker Explore Assistant, BI transformation, and AI-powered reporting.
DuckDB Amsterdam Meetup #2 | Amsterdam | 20th February
Dive into DuckDB with Unity Catalog, WASM-powered spreadsheets, and Postgres Data Warehouses.

PINNACLE PICKS

Your top picks from last week:
Airflow in a multi-teams / multi-tenant environment. Deployment strategies | 22 min | Data Engineering | Kacper Muda | GetInData | Part of Xebia Blog
Explore deployment solutions for Apache Airflow in multi-team environments. Highlights include resource isolation, shared access options, and a glimpse at Airflow 3's upcoming capabilities.
drawdata | Data Visualization
A Python library for interactive dataset creation directly in Jupyter notebooks. Perfect for machine learning tutorials and algorithm demos.
Paimon 1.0: Unified Lake Format for Data + AI | 4 min | AI | Martin Grund, Stefania Leone | Alibaba Cloud Blog
Introducing Apache Paimon, a groundbreaking data lakehouse solution integrating batch and streaming operations for real-time AI workflows.
________________________
Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub