DATA Pill #176 - DuckDB vs Spark, AI-Ready Data, and Real-World Kafka Lessons

ARTICLES

MIT Says 95% of AI Pilots Fail. McKinsey Explains Why. Agentic Engineering Shows How to Fix It| 8 min | AI Strategy | Yi Zhou | Personal Blog

Most GenAI pilots stall after demos. McKinsey’s rules for survival and Zhou’s Agentic Engineering framework aim to make AI scalable and sustainable

Is Your Data “AI-Ready”? Why Good Data Isn’t Enough Anymore | 5 min | Data Infrastructure | Sanjeev Mohan | Medium

Six pillars of AI-ready data: contextual, unified, accessible, governed, accurate, and iterative. Clean data alone is not enough.

DuckDB Can Be 100x Faster Than Spark | 7 min | Data Engineering | Zach Wilson | DataExpert.io

Benchmarking shows DuckDB crushing Spark on 500M+ row datasets, highlighting when a lightweight in-memory engine outperforms a multi-node cluster.

Kafka Outages at PagerDuty: What Happened and How We’re Improving | 6 min | DevOps | PagerDuty Engineering Blog

A post-mortem of a major Kafka outage with root cause, monitoring gaps, and fixes to strengthen resilience.

TUTORIALS

Adding Document Understanding to Claude Code | 7 min | AI Agents | Jerry Liu | LlamaIndex Blog

How Claude Code gains PDF and DOCX understanding with LlamaIndex, unlocking real enterprise workflows.

DATA LIBRARY

Data & AI Monitor Report 2025–2026 | AI & Data Strategy | Xebia

Industry benchmarks on GenAI adoption, MLOps maturity, and modernization trends shaping the next two years.

TOOLS

Vortex| Data Infrastructure

An open, next-gen columnar file format optimized for analytics and AI, with modern compression and extensibility.

JupyterLite | Dev Tool

A full JupyterLab running in the browser via WebAssembly. Ideal for demos, lightweight analysis, and docs.

Kubeflow MCP Apache Spark History Server | DevOps

K8s-native integration of Spark History Server for monitoring jobs and logs within Kubeflow pipelines.

DATA TUBE

Building Reliable Support Agents Using the Effect Typescript Library | 7 min | AI | Michael Fester | AI Engineer

A demo-driven talk on building LLM support agents with Effect, covering architecture, tradeoffs, and reliability.

EVENTS, CONFS, AND MEETUPS

Clash of the Data Catalogs – Market Leaders vs. Challengers | 7th October | Online

Join Marek Wiewiórka and Radosław Szmit as they debate the strengths and trade-offs of vendor vs. open-source data catalogs in real-world data lakehouse.

PINNACLE PICKS

Your last week top picks:

Online Feature Store for AI and ML with Apache Kafka and Flink | 8 min | Streaming & AI | Kai Waehner | Personal Blog

Step-by-step on building low-latency feature pipelines for consistent ML training and serving.

Marmot

An open-source platform that merges streaming and OLAP for real-time analytics with SQL queries and time travel.

I have built around 300 agents, worked at 5 startups. Here's what I learnt about AI Agent | 8 min | AI Agents | Sai Yashwanth | Personal Blog

Practical takeaways from building agents at five startups, covering prompt design, tool use, memory, and user feedback loops.

________________________

Have any interesting content to share in the DATA Pill newsletter?

➡ Join us on G itHub