DATA Pill feed

DATA Pill #190 – Hybrid Mesh, Hudi Lakehouse, LLM Memory & Sovereign Cloud

ARTICLES

Apache Hudi at Uber | Prashant Wason, Balajee Nagasubramaniam, Surya Prasanna Kumar Yalla, Meenal Binwade, Xinli Shang & Jack Song | Uber Engineering | 9 min | Data Lakehouse
Uber explains how Apache Hudi underpins its lakehouse by providing ACID transactions, schema evolution and efficient upserts at massive scale. The post details ingestion patterns, compaction strategies and Hudi’s evolution to handle trillions of records, enabling low‑latency analytics without rewriting entire datasets.
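The key idea behind Hudi's upserts is record-level merging by key, so only affected records change rather than the whole dataset being rewritten. A toy sketch of that merge semantics in plain Python (purely illustrative, not Hudi's actual copy-on-write implementation; the `upsert` helper and `id` key are hypothetical):

```python
def upsert(table, batch, key="id"):
    """Merge a batch of records into a table, upsert-style:
    existing keys are updated in place, new keys are appended,
    untouched records are left alone (no full-table rewrite)."""
    index = {row[key]: i for i, row in enumerate(table)}  # key -> position
    for row in batch:
        if row[key] in index:
            table[index[row[key]]] = row   # update existing record
        else:
            table.append(row)              # insert new record
    return table

# Usage: the second batch updates id=1 and inserts id=3.
t = []
upsert(t, [{"id": 1, "fare": 10}, {"id": 2, "fare": 12}])
upsert(t, [{"id": 1, "fare": 11}, {"id": 3, "fare": 9}])
```

Hudi layers ACID guarantees, indexing and compaction on top of this basic merge so the same semantics hold across trillions of records.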
The author argues that neither a pure data mesh nor a fully centralised warehouse fits enterprise realities. She proposes a hybrid hub-and-spoke pattern where domain teams own their data products but connect through a shared integration layer and platform guardrails. This preserves a single source of truth while still allowing local autonomy.
Cutting LLM Memory by 84%: A Deep Dive into Fused Kernels | Ryan Pégoud | Towards Data Science | 10 min | LLM Systems
Pégoud tackles the logit bottleneck: the final layer of an LLM multiplies the hidden state by a massive output matrix, and that single step dominates memory usage. By fusing the linear projection and cross-entropy loss into a single Triton GPU kernel, he reduces peak memory by ~84%, unlocking larger batch sizes and longer contexts.
Alternatives to MinIO for Single-Node Local S3 | Robin Moffatt | rmoff.net | 6 min | Data Engineering
With MinIO pivoting away from open source, Moffatt evaluates alternatives for a simple, S3-compatible storage layer. He lists essential criteria (Docker support, S3 API compatibility, simplicity and an active community) and surveys options such as SeaweedFS, Cloudflare R2, Zenko and others, comparing their trade-offs.

NEWS

AWS announces a physically and logically separate cloud located entirely within the EU, designed to meet sovereignty and data‑residency requirements. The European Sovereign Cloud will expand with new Local Zones in Belgium, the Netherlands and Portugal, and AWS plans to invest over €7.8 billion in Germany.
DeepSeek Introduces Engram | 7 min | AI Infrastructure
Engram explores conditional memory as a new sparsity axis for large language models. By augmenting Transformer backbones with an n-gram lookup table, Engram offloads static pattern reconstruction to host memory, freeing the network to focus on reasoning. The team shows that Engram-27B models beat MoE baselines under equal parameter and FLOPs budgets.
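The core idea, roughly: the last few tokens index a table kept in host memory, and a hit lets the model retrieve a memorised pattern instead of reconstructing it in its weights. A hypothetical count-based toy (the `NGramMemory` class is an illustration, not the paper's learned-embedding architecture):

```python
from collections import Counter, defaultdict

class NGramMemory:
    """Toy conditional memory: remember next-token counts per (n-1)-gram
    context in a host-side table; unseen contexts defer to the network."""
    def __init__(self, n=3):
        self.n = n
        self.table = defaultdict(Counter)

    def observe(self, tokens):
        # Record every n-gram: context of n-1 tokens -> next token.
        for i in range(len(tokens) - self.n + 1):
            ctx = tuple(tokens[i:i + self.n - 1])
            self.table[ctx][tokens[i + self.n - 1]] += 1

    def lookup(self, context):
        ctx = tuple(context[-(self.n - 1):])
        if ctx in self.table:
            return self.table[ctx].most_common(1)[0][0]  # memorised pattern
        return None  # no hit: let the model predict

mem = NGramMemory(n=3)
mem.observe(["new", "york", "city", "is", "new", "york", "city"])
```

Engram's table is learned and hash-indexed rather than count-based, but the division of labour is the same: static patterns live in cheap host memory, and FLOPs are spent only where the table misses.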

TUTORIALS & BOOKS

A short course for software engineers covering three themes: continual learning with agents.md, token maxxing—designing tasks around token budgets—and parallel workflows using multiple agents.
This updated Airflow manual teaches readers how to build reliable, scalable pipelines with the latest Airflow features. Authors Bas Harenslak and Julian de Ruiter draw on extensive engineering experience and contributions to Airflow’s codebase.

TOOLS

CloudWork
CloudWork is a research preview of a collaborative workspace built on Claude Code. Available to Claude Max subscribers, it lets users give Claude file-level access to read, edit and create documents on their machine, queue multiple tasks for execution, and maintain safety via explicit permissions and sandboxing.

DATATUBE

The AI Roadmap for 2026 | The Neural Maze | Miguel Otero Pedrido | 1h
A recorded walkthrough of key themes shaping AI in 2026: agentic systems, evaluation, observability, and the shift from models to end-to-end AI system design.

CONFS, EVENTS, WEBINARS & MEETUPS

Xebia webinar on how Model Context Protocol (MCP), spec-driven development and conversational data interfaces enable AI-assisted data-lakehouse engineering.

PINNACLE PICKS

Your top picks from the last edition:
Ilya Sutskever – We’re moving from the age of scaling to the age of research | 21 min | Dwarkesh Patel
Sutskever discusses why the era of endlessly scaling models is giving way to research into new architectures, interpretability and safety. He argues that bigger models alone won’t unlock the next leaps in capability and that research‑driven innovation will define AI progress.

Lyft’s Feature Store – Architecture, Optimisation and Evolution | Rohan Varshney | Lyft Engineering | 10 min | ML Infrastructure
The post covers Lyft’s feature-store architecture and its evolution over five years. The platform centralises feature engineering, provides low-latency access and high-throughput processing, and enables consistent features across models. Varshney explains how the store optimises feature management at scale and shares lessons on performance, developer experience and long-term evolution.
_____________________
Have any interesting content to share in the DATA Pill newsletter?
2026-01-16 09:41