ARTICLES
Databricks Orchestration: Databricks Workflows, Azure Data Factory, and Airflow | 12 min | Data Orchestration | Mariusz Kujawski | Personal Blog
Comparison of three orchestration tools and how they handle ingestion, ETL, and transformation across cloud platforms using code, low-code, or native Spark workflows.
AI SQL functions + Iceberg | 6 min | AI | Julien Hurault | Personal Blog
How AI SQL functions in Snowflake plus Apache Iceberg make large-scale, SQL-native text processing and lakehouse operations seamless and scalable.
99% of AI Startups Will Be Dead by 2026 — Here’s Why | 14 min | AI | Srinivas Rao | Personal Blog
Most AI startups are fragile wrappers over rented intelligence. This post warns of their looming collapse and highlights what it takes to survive the next wave.
Real-Time Fraud Detection Using Complex Event Processing | 10 min | Streaming Data | Giannis Polyzos | Ververica Blog
Real-time fraud detection with CEP on streaming data: everything from impossible travel to suspicious transaction spikes is caught in-flight.
What the hell is MCP? | 8 min | AI | Saurabh Shah | Personal Blog
MCP is a new protocol standardizing how AI agents interact with APIs. Like HTTP for agents, it solves tool-use chaos and promotes a shared interface.
TUTORIALS
DuckLake: SQL as a Lakehouse Format | 20 min | Data Engineering | Mark Raasveldt, Hannes Mühleisen | DuckDB Blog
DuckLake simplifies data lakes by using SQL databases for metadata and Parquet for storage. It’s fast, transactional, and DuckDB-native.
Building My First MCP Server - Schema Registry | 6 min | AI | Roman Melnyk | Personal Blog
Roman uses MCP and Claude Desktop to build a schema registry tool that replaces complex UIs with conversational commands.
NEWS
Introducing Apache Spark 4.0 | 7 min | Data Engineering | Wenchen Fan, Serge Rielau, Herman van Hövell, Hyukjin Kwon, Allison Wang, Anish Shrigondekar, Daniel Tenedorio, Martin Grund, DB Tsai, Xiao Li and Reynold Xin | Databricks Blog
Spark 4.0 brings SQL scripting, native plotting in PySpark, multi-language support in Spark Connect, and major streaming and API improvements.
CONFS, EVENTS AND MEETUPS
dbt F✦SION engine (BETA) | SQL Engines
A Rust-based rewrite of dbt Core with better speed, strict YAML validation, SQL dialect support, and modern dev tooling. Built for the future of data transformation.
DATA TUBE
Unpacking the tech | 2 h | AI | Jay Parikh, Charles Lamanna, Scott Guthrie | Microsoft Build
Jay Parikh, Scott Guthrie & others break down Day 1 announcements—Copilot, Azure, GitHub, and what developers need to navigate the AI era.
PINNACLE PICKS
Your top picks from last week:
Agentic GraphRAG for Commercial Contracts | 21 min | RAG | Tomaz Bratanic | Personal Blog
Build a system that reads and queries legal contracts using a graph-based RAG setup. Great example of structured retrieval for messy data.
How I Cut Docker Image Size by Switching to a Distroless Base Image | 9 min | DevOps | Dorian Grasset | Teads Engineering Blog
Going distroless, multi-stage builds, and running as non-root helped trim a Node.js Docker image from 380MB to 60MB. Cleaner, faster, and more secure.
MCP: future automation killer or a promise to be kept? | 4 min | AI | Giovanni Lanzani | Xebia Blog
The Model Context Protocol could become the standard for how AI agents talk to tools and APIs. Still early, but full of potential.
________________________
Have any interesting content to share in the DATA Pill newsletter?