DATA Pill feed

DATA Pill #176 - DuckDB vs Spark, AI-Ready Data, and Real-World Kafka Lessons

ARTICLES

Most GenAI pilots stall after demos. McKinsey’s rules for survival and Zhou’s Agentic Engineering framework aim to make AI scalable and sustainable
Is Your Data “AI-Ready”? Why Good Data Isn’t Enough Anymore | 5 min | Data Infrastructure | Sanjeev Mohan | Medium
Six pillars of AI-ready data: contextual, unified, accessible, governed, accurate, and iterative. Clean data alone is not enough.
DuckDB Can Be 100x Faster Than Spark | 7 min | Data Engineering | Zach Wilson | DataExpert.io
Benchmarking shows DuckDB crushing Spark on 500M+ row datasets, highlighting when a lightweight in-memory engine outperforms a multi-node cluster.
Kafka Outages at PagerDuty: What Happened and How We’re Improving | 6 min | DevOps | PagerDuty Engineering Blog
A post-mortem of a major Kafka outage with root cause, monitoring gaps, and fixes to strengthen resilience.

TUTORIALS

Adding Document Understanding to Claude Code | 7 min | AI Agents | Jerry Liu | LlamaIndex Blog
How Claude Code gains PDF and DOCX understanding with LlamaIndex, unlocking real enterprise workflows.

DATA LIBRARY

Data & AI Monitor Report 2025–2026 | AI & Data Strategy | Xebia
Industry benchmarks on GenAI adoption, MLOps maturity, and modernization trends shaping the next two years.

TOOLS

Vortex| Data Infrastructure
An open, next-gen columnar file format optimized for analytics and AI, with modern compression and extensibility.
JupyterLite | Dev Tool
A full JupyterLab running in the browser via WebAssembly. Ideal for demos, lightweight analysis, and docs.
K8s-native integration of Spark History Server for monitoring jobs and logs within Kubeflow pipelines.

DATA TUBE

A demo-driven talk on building LLM support agents with Effect, covering architecture, tradeoffs, and reliability.

EVENTS, CONFS, AND MEETUPS

Join Marek Wiewiórka and Radosław Szmit as they debate the strengths and trade-offs of vendor vs. open-source data catalogs in real-world data lakehouse.

PINNACLE PICKS

Your last week top picks:
Online Feature Store for AI and ML with Apache Kafka and Flink | 8 min | Streaming & AI | Kai Waehner | Personal Blog
Step-by-step on building low-latency feature pipelines for consistent ML training and serving.
An open-source platform that merges streaming and OLAP for real-time analytics with SQL queries and time travel.
Practical takeaways from building agents at five startups, covering prompt design, tool use, memory, and user feedback loops.
________________________
Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub
Made on
Tilda