DATA Pill #156 - Ray at Uber, Rust vs. Kafka, and Copilot’s Agent Mode

ARTICLES

Uber’s Journey to Ray on Kubernetes: Resource Management | 6 min | MLOps | Bharat Joshi, Anant Vyas, Ben Wang, Axansh Sheth, Abhinav Dixit | Uber Engineering Blog

Uber enhanced Kubernetes with custom schedulers and GPU-aware logic to scale multi-tenant Ray workloads for ML efficiently and reliably.

Copilot ask, edit, and agent modes: What they do and when to use them | 6 min | AI | Ashley Willis | Github Engineering

A breakdown of GitHub Copilot’s three modes, showing how to use each for tasks ranging from quick answers to autonomous code changes.

Building a Scalable Analytics Platform: Why Microsoft Clarity Chose ClickHouse | 6 min | Stream Analytics | Omar Bazaraa | Microsoft Clarity Blog

Learn how ClickHouse replaced Spark and ElasticSearch to power real-time analytics at massive scale for Microsoft Clarity.

Evolution of Product Classification at Shopify: From Categories to Comprehensive Product Understanding | 5 min | LLM | Shopify Engineering Blog

Shopify details its transition from ML classifiers to a powerful Vision Language Model system for AI-driven product classification at scale.

TUTORIALS

Build more climate-friendly data applications with Rust | 5 min | Data Engineering | R. Tyler Croy | Buoyant Data Blog

A real-world example of replacing Kafka pipelines with Rust, cutting CO₂ emissions and cloud costs by 99%.

TOOLS

GlassFlow for ClickHouse Streaming ETL | Streaming ETL Tool

GlassFlow simplifies building real-time pipelines between Kafka and ClickHouse with built-in support for deduplication and temporal joins.

Kroxylicious | Data Streaming

An open-source Kafka proxy focused on encryption, multi-tenancy, and schema validation—snappy and production-ready.

DATA LIBRARY

LakeVilla: Multi-Table Transactions for Lakehouses | Data Lakehouses | Tobias Götz, Daniel Ritter, Jana Giceva

LakeVilla introduces multi-query transactional guarantees for Iceberg and Delta Lake, with minimal performance impact.

DATA TUBE

The Advent of the Open Data Lake | Data Architecture | 46 min | Julien Le Dem | Apache Iceberg

A talk on how modular open standards like Iceberg and Arrow are reshaping modern, composable data platforms.

CONFS, EVENTS AND MEETUPS

Data Lakehouse Storage Layer - openness, interoperability and performance | May 27th

Join us for a deep dive into the data lakehouse storage layer. We'll explore the evolution of file formats like Parquet and Avro, the rise of open table formats such as Delta, Hudi, and Iceberg, and how openness shapes modern data architecture. Learn performance tuning tips like partitioning and Z-ordering, plus key topics like deletion vectors, encryption, and GDPR-compliant lifecycle policies. The session includes a live demo.

Topics include:

What is Cloud Object Storage?
Overview of Big Data File Formats
Columnar vs. Row-Oriented: Parquet and Avro
Benefits of Delta, Hudi, and Iceberg
Optimizing Files for Query Performance

PINNACLE PICKS

Your last week top picks:

How I (Barely) Survived Setting Up Polaris As An Iceberg Rest Catalog | 4 min | Data Engineering | Daniel Beach | Personal Blog

A raw, honest take on setting up Polaris as an Iceberg REST catalog in production. Spoiler: it hurt, but it worked.

Starlake: Open Source Data Integration & ETL Platform | 4 min | Data Engineering | Starlake Blog

Starlake lets you define extract, load, transform, and test tasks in YAML, and auto-generates DAGs—like Terraform for your data pipelines.

PyCharm| 3 min | Data Engineering | Valerie Andrianova | jetbrains Blog

PyCharm merges Community and Pro editions into a single product with a free Pro trial and built-in Jupyter support for all.

________________________

Have any interesting content to share in the DATA Pill newsletter?

➡ Join us on G itHub