DATA Pill #164 - Ray at Pinterest, Netflix’s UDA, and Why Fine-Tuning LLMs Is Overrated

ARTICLES

Scaling Pinterest ML Infrastructure with Ray: From Training to End-to-End ML Pipelines | 8 min | ML | Andrew Yu, Jiahuan Liu, Qingxian Lai, Kritarth Anand | Pinterest Engineering Blog

Pinterest unified its ML stack using Ray to enable scalable training, hyperparameter tuning, and modular end-to-end pipelines.

Measuring Commercial Impact at Scale at Canva | 6 min | Data Analytics | Jun Ye | Canva Engineering Blog

Canva connects experimentation with business outcomes by measuring impact at scale across its product ecosystem.

The Transactional Outbox Pattern: Transforming Real-Time Data Distribution at SeatGeek | 7 min | Data Engineering | ChairNerd Blog

SeatGeek shares how it ensures reliable and fault-tolerant event publishing across microservices using the transactional outbox pattern.

High concurrency mode for Fabric notebooks in pipelines | 4 min | Data Engineering | Adrian Chodkowski | SeeQuality Blog

High-concurrency mode lets Microsoft Fabric notebooks that run in pipelines share a Spark session, for faster and more efficient pipeline execution.

Fine-Tuning LLMs is a Huge Waste of Time | 8 min | ML | Personal Blog

This opinionated take argues that RAG and prompt engineering are often more effective than fine-tuning large language models.

TUTORIAL

Model Once, Represent Everywhere: UDA (Unified Data Architecture) at Netflix | 15 min | Data Management | Alex Hutter, Alexandre Bertails, Claire Wang, Haoyuan He, Kishore Banala, Peter Royal, Shervin Afshar | Netflix Engineering Blog

Netflix introduces UDA, its Unified Data Architecture, which lets teams model a business concept once and represent it consistently across data systems.

NEWS

Introducing BigQuery ObjectRef: Supercharge your multimodal data and AI processing | 7 min | Data Analytics | Jamy Su, Gaurav Soni | Google Cloud Blog

Google BigQuery introduces ObjectRef, a data type that references unstructured objects such as PDFs, images, and audio so they can be queried and processed alongside structured data.

TOOLS

marimo

A new open-source Python notebook for building reactive dashboards with reproducible, modular code and minimal boilerplate.

Introducing Firebolt Core - Self-Hosted Firebolt, For Free, Forever | 3 min | Data Warehouse | Mosha Pasumansky, Benjamin Wagner | Firebolt Blog

Firebolt releases its high-speed query engine as a free, self-hosted option for local or hybrid data environments.

Databricks Genie Slack Integration Solution Accelerator

A solution accelerator that connects Genie with Slack through n8n to trigger workflows and automate operations from chat.

DATA TUBE

A Framework for GenAI App and Agent Development | 52 min | GenAI | Jerry Liu, Richie Cotton | DataCamp

In this podcast, the LlamaIndex CEO breaks down how to build GenAI systems that handle complex document workflows and scale in the enterprise.

PINNACLE PICKS

Your top picks from last week:

How did Meta modernize their lakehouse? | 10 min | Lakehouse | Vu Trinh | Data Engineer Things Blog

How Meta’s initial approach caused problems, and how the company worked to fix them at organizational scale.

Preventing Revenue Loss With Real-Time A/B Test Monitoring | 15 min | Streaming | Lukasz Krawiec | Expedia Group Technology - Engineering Blog

How Expedia uses real-time A/B test monitoring with Apache Flink to detect anomalies early, preventing revenue loss and improving experiment reliability.

Dimensional Data Modeling with Databricks | 15 min | Lakehouse Architecture | Mariusz Krajewski | Personal Blog

A practical guide to dimensional data modeling in Databricks using Delta Lake, Unity Catalog, and Delta Live Tables to build scalable, BI-ready star schemas and fact/dimension tables.

____________________

Have any interesting content to share in the DATA Pill newsletter?

➡ Join us on GitHub

2025-07-03 14:00