DATA Pill #184 – Zero-State Streaming Joins, OneLake <> UC, EMR+Iceberg Speedups & LLM Prompt Tracking

ARTICLES

Run Apache Spark + Apache Iceberg Write Jobs 2× Faster with Amazon EMR | AWS Big Data Blog | Atul Payapilly, Akshaya KP, Giovanni Matteo Fumarola & Hari Kishore Chaparala | 7 min | Data Lakehouse

AWS boosts Iceberg write-path performance on EMR by parallelizing metadata commits, reducing small files, and optimizing Spark tasks. Benchmarks show up to 2× faster writes on TPC-DS. The post also covers EMR 8.0 tuning tips for large Iceberg tables.

Expanding Support for OneLake in Unity Catalog | Databricks | Michelle León, Ji-Shan Pappa & Jonathan Keller | 5 min | Interop & Governance

Databricks deepens interoperability with Microsoft OneLake: UC now supports direct governance of OneLake files, automatic lineage capture, simplified credential passthrough and cross-cloud table management. A big step toward multi-engine, multi-cloud lake governance.

Introducing the Era of Zero-State Streaming Joins | Ververica | Giannis Polyzos | 10 min | Streaming Systems

Streaming joins typically rely on state stores that grow without bound. Ververica presents a new pattern, zero-state joins, that uses pre-indexed materialized views and bounded retention to eliminate unbounded state. This reduces cost, improves latency and simplifies operations for Flink-based stream processing.
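
To make the idea concrete, here is a minimal Python sketch of a join that keeps no per-key state inside the join operator: each incoming event is enriched by a lookup against an index that is maintained elsewhere. The names (`Order`, `customer_index`, `lookup_join`) are hypothetical and a plain dict stands in for the pre-indexed view; this illustrates the general lookup-style pattern and is not Ververica's or Flink's implementation.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class Order:
    order_id: int
    customer_id: int
    amount: float


def lookup_join(
    orders: Iterable[Order],
    customer_index: Dict[int, str],
) -> Iterator[Tuple[Order, Optional[str]]]:
    """Enrich each order via an index lookup; the operator buffers nothing."""
    for order in orders:
        # Assumption: the index is owned and kept current outside the join
        # operator (a dict stands in for a pre-indexed materialized view).
        yield order, customer_index.get(order.customer_id)


customer_index = {1: "Ada", 2: "Grace"}
orders = [Order(100, 1, 9.5), Order(101, 3, 4.0)]
joined = list(lookup_join(orders, customer_index))
```

Because the operator only reads from the index, its memory footprint does not depend on how many keys have been seen; retention becomes a property of the index rather than of the join.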

Track, compare, and optimize your LLM prompts with Datadog LLM Observability | Datadog | Jacob Simpher, Barry Eom, Yahoo Mouman, Will Potts | 6 min | LLM Ops

Datadog explains why prompt tracking is essential for debugging, evaluating and securing LLM apps. Key topics: multi-step prompt chains, attribution for cost & latency, structured logging, hallucination detection hints, plus examples of production logging patterns.
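
As a rough illustration of structured prompt logging with latency attribution, the Python sketch below wraps a model call and emits one JSON record per invocation. Everything here is an assumption for illustration: `PromptRecord`, `tracked_call` and `fake_llm` are hypothetical names, character counts stand in for token counts, and none of this is Datadog's API.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, Dict, Tuple


@dataclass
class PromptRecord:
    prompt_id: str
    prompt_version: str
    template_hash: str   # identifies the exact template text that was used
    latency_ms: float
    input_chars: int     # stand-in for token counts (assumption)
    output_chars: int


def tracked_call(
    prompt_id: str,
    prompt_version: str,
    template: str,
    variables: Dict[str, str],
    llm: Callable[[str], str],
) -> Tuple[str, PromptRecord]:
    """Render the template, call the model, and return output plus a log record."""
    rendered = template.format(**variables)
    start = time.perf_counter()
    output = llm(rendered)
    latency_ms = (time.perf_counter() - start) * 1000
    record = PromptRecord(
        prompt_id=prompt_id,
        prompt_version=prompt_version,
        template_hash=hashlib.sha256(template.encode("utf-8")).hexdigest()[:12],
        latency_ms=latency_ms,
        input_chars=len(rendered),
        output_chars=len(output),
    )
    return output, record


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model client.
    return prompt.upper()


output, record = tracked_call(
    "summarize", "v2", "Summarize: {text}", {"text": "hello"}, fake_llm
)
log_line = json.dumps(asdict(record))  # one structured log line per call
```

Keying each record on a prompt ID, version and template hash is what makes it possible to compare two prompt versions on latency or cost after the fact.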

7 Things That Always Surprise People in Our Intro to AI Course | Xebia | Lysanne van Beek | 8 min | AI Foundations

A light but insightful walkthrough of core AI/ML concepts learners consistently misjudge: model complexity vs. performance, what “intelligence” actually means in ML, how much data is really needed, and why evaluation metrics are usually misunderstood. Useful if you teach or onboard newcomers.

NEWS

Snowflake to Acquire Select Star | Snowflake | 4 min | Data Discovery

Snowflake announces its intent to acquire Select Star, the popular data discovery & governance platform. This should bring automated column-level lineage, usage-based prioritization and semantic enrichment natively into Snowflake’s governance stack.

DATA TUBE

HUGE Claude Code Update: Opus 4.5, Desktop App + More! | 12 min video | Ray Amjad

What's new in Delta Lake 4.0 | 40 min video | NextGenLakehouse

Delta Lake 4.0 introduces major enhancements focused on reliability, performance, and features that tackle the growing complexity of open data lakehouses. Key changes include new table management options, richer schema evolution, enhanced multi-engine writes, smarter metadata, and streamlined data modeling.

Why Gemini 3 Will Change How You Build AI Agents with Ravin Kumar | 1 h video | Vanishing Gradients

Gemini 3 is only a few days old, and its leap in performance and reasoning has big implications for builders. As models begin to self-heal, builders are tearing out functionality they built just months ago, ripping out defensive code and reshipping their agent harnesses entirely.

TOOL

osmextract

Extract OpenStreetMap data into Parquet/GeoParquet with a clean, geospatial-friendly schema. Great for quick geospatial prototyping, analytics and ML feature pipelines.

CONFS, EVENTS, WEBINARS AND MEETUPS

Operating Model Under Pressure: How Zilveren Kruis Is Redesigning Its Data & AI Model | 50 min | December 4, 4 PM CET

The Netherlands’ largest health insurer is reimagining its data and AI operating model to handle generative AI, self‑service analytics and plug‑and‑play tools.

Data & AI Platform Updates and Roadmap | Databricks Webinar | Dec 3, 2025

Latest announcements across Unity Catalog, DBRX, governance and OneLake interoperability.

From Chaos to Control: MLOps in 2025 | JFrog Webinar | Dec 12, 2025

A practical workshop on stabilizing ML pipelines, packaging models, managing dependencies and reducing drift. Includes demos on automating model lifecycle workflows with artifact repositories.

PINNACLE PICKS

Your top picks from last week:

Introducing SodaBricks | 11 min | Data Quality & Tools | Marta Radziszewska

Data quality in Databricks often suffers from scattered checks and poor monitoring. SodaBricks combines Soda Core checks with a GitHub‑driven deployment: analysts define rules in YAML, version them in Git and deploy via automated workflows. The results live in a single table and a dashboard provides accessible, consistent monitoring. The article explains why data checks are critical and walks through a simple example using two configuration files and a GitHub workflow to generate and deploy Databricks notebooks.

Cursor Composer-1 vs Claude 4.5 Sonnet: The Better Cod | 6 min | AI Coding Tools | Composio Team

A head-to-head benchmark comparing model reasoning depth, code synthesis accuracy, refactoring capabilities and multi-file editing. Composer-1 excels at structured code generation and executing tool-augmented workflows, while Sonnet leads in natural reasoning and debugging explanations. Includes traceable examples and repo-level scoring.

_____________________

Have any interesting content to share in the DATA Pill newsletter?

2025-11-27 08:00