DATA Pill #169 - Persona vectors, HA Postgres on K8s, streaming lakehouses

ARTICLES

Persona vectors: Monitoring and controlling character traits in language models | 7 min | AI Research | Anthropic

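Anthropic identifies persona vectors: patterns of activity inside a model's neural network that correspond to character traits such as sycophancy or a tendency to hallucinate. They can be used to monitor and steer those traits without touching the prompt, a practical path toward more controllable AI.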

Achieving High Availability with distributed database on Kubernetes at Airbnb | 6 min | Platform Engineering | Artem Danilov | Airbnb Tech Blog

Airbnb explains how they built a multi-tenant database platform with native Kubernetes tools, custom operators, and automated failover strategies.

5 Ways Dremio Makes Apache Iceberg Lakehouses Easy | 8 min | Data Lakehouse | Alex Merced | Dremio Blog

Dremio cuts the complexity of working with Iceberg using catalog federation, instant metadata refresh, and no-copy table creation.

Five Python Tips You Won’t Find in Most Curriculums | 5 min | Data Engineering | Lucy Sheppard | Xebia Blog

Quick wins for cleaner Python. Learn about else clauses in loops, function factories, and assignment expressions that keep your code lean and expressive.
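
For a quick taste of the three features named above, here is a minimal sketch. The function names and sample data are our own illustrations, not taken from the Xebia article.

```python
def first_negative(numbers):
    """Return the first negative number, or None if there is none."""
    for n in numbers:
        if n < 0:
            found = n
            break
    else:
        # The else clause runs only when the loop ends without `break`.
        found = None
    return found


def make_multiplier(factor):
    """Function factory: build a function that remembers `factor`."""
    def multiply(x):
        return x * factor
    return multiply


def long_words(words, min_len=6):
    """Assignment expression: compute len(w) once and reuse it."""
    return [(w, n) for w in words if (n := len(w)) >= min_len]


double = make_multiplier(2)
print(first_negative([3, 1, -4, 1]))
print(double(21))
print(long_words(["kafka", "iceberg", "flink", "polaris"]))
```

Assignment expressions need Python 3.8 or newer; the other two patterns work on any Python 3 release.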

TUTORIAL

Build a Streaming Lakehouse with Flink, Kafka, Iceberg, and Polaris | 8 min | Data Engineering | Gilles Philippart | Personal Blog

A hands-on guide to setting up a streaming data lakehouse with schema evolution and end-to-end reliability using open-source tools.

NEWS

Apache Flink 2.1.0: Ushers in a New Era of Unified Real-Time Data + AI with Comprehensive Upgrades | 6 min | Streaming & AI | Apache Flink

New AI-native connectors, unified batch and stream processing, improved autoscaling, and hardened production stability make this Flink's most capable release yet.

Announcing Snowpark Connect for Apache Spark™ in Public Preview. Your Spark Client, Now Powered by Snowflake. | 6 min | Data Infra | Shruti Anand, Nimesh Bhagat | Snowflake Blog

Spark pipelines can now read Snowflake data without data movement. This new integration simplifies hybrid workflows and keeps full access control in place.

TOOLS

Introducing LangExtract: A Gemini powered information extraction library | 4 min | NLP | Akshay Goel, Atilla Kiraly | Google for Developers Blog

A lightweight Python library for information extraction with built-in schema validation and few-shot support. Built for fast, type-safe NLP pipelines.

Databricks Labs LSQL | LLM | Databricks Labs

lsql turns LLMs into SQL-native copilots with prompt optimization and semantic query generation. Great for analytics teams building natural language interfaces.

EVENTS, CONFS, AND MEETUPS

Data Expo 2025 | 10-11 September | Utrecht

The largest data event in the Netherlands returns with 100+ vendors, 150+ sessions, and a packed agenda for engineers, scientists, and data leaders. Free to attend.

PINNACLE PICKS

Your top picks from last week:

Announcing Kedro 1.0 | 6 min | ML | QuantumBlack, AI by McKinsey

Kedro reaches 1.0 with improved modularity, long-term support, and new hooks for ML pipelines.

Stream Kafka Topic to the Iceberg Tables with Zero-ETL | 12 min | Data Streaming | Vu Trinh | Data Engineer Things

Learn how to stream Kafka data into Iceberg tables using Flink for real-time, zero-ETL pipelines.

Why Startups Are Betting Everything on Apache DataFusion | 5 min | Databases | Andrew Lamb | The New Stack Blog

DataFusion is winning over startups with its fast Rust-based query engine and plug-and-play architecture.

________________________

Have any interesting content to share in the DATA Pill newsletter?

➡ Join us on GitHub