DATA Pill #111 - Stream enrichment with Flink SQL, Ray Infrastructure

ARTICLES

Video annotator: a framework for efficiently building video classifiers using vision-language models and active learning | 6 min | Machine Learning | Amir Ziai, Aneesh Vartakavi, Kelli Griggs, Eugene Lok, Yvonne Jukes, Alex Alonso, Vi Iyengar, Anna Pulido | Netflix Tech Blog

Read about a framework that leverages active learning and large vision-language models to streamline the annotation process, empowering domain experts and enhancing model efficiency.

Unlocking Business Value and proving the value of data teams | 7 min | Data Engineering | Robert Sahlin | Personal Blog

The author argues that data teams need to prove their value in today's economic climate, stresses the importance of productionizing data, and outlines practical steps for supporting operational use cases. The post also addresses the complexities of operationalizing analytical data.

How Apache Iceberg is Built for Open Optimized Performance | 22 min | Data Engineering | Alex Merced | Dremio Blog

Apache Iceberg is a table format designed for data lakehouses. It offers both ACID transactions and rich metadata to enhance performance. This article explores the powerful mechanisms within Iceberg that enable query engines, such as Dremio, to optimize data queries and improve overall efficiency.

ETA (Estimated Time of Arrival) Reliability at Lyft | 8 min | ML | Rachita Naik | Lyft Engineering blog

Explore the challenge of balancing speed and accuracy in Lyft's ETA predictions, using machine learning and real-time data to enhance reliability. By understanding and addressing various factors impacting ETAs, Lyft aims to provide riders with timely and dependable arrival estimates.

TUTORIALS

Why we no longer use LangChain for building our AI agents | 11 min | LLM | Fabian Both | Octomind Blog

At Octomind, the team employs AI agents with multiple LLMs to automate end-to-end tests in Playwright. In this blog post, they share their experience transitioning away from the LangChain framework due to its rigid high-level abstractions. Switching to modular building blocks simplified their codebase and significantly boosted their productivity and satisfaction.

In this article, Marek compares different types of joins available in the Flink SQL engine for effective and efficient stream enrichment.

Ray Infrastructure at Pinterest | 14 min | ML | Chia-Wei Chen, Raymond Lee, Alex Wang, Saurabh Vishwas Joshi, Karthik Anantha Padmanabhan, Se Won Jang | Pinterest Engineering Blog

This blog series explores Pinterest's integration of Ray to address critical business challenges. The current installment details the complexities of implementing Ray in a large-scale environment, covering the journey from initial prototypes to production readiness.

PODCAST

Meryem Arik on LLM Deployment, State-of-the-art RAG Apps, and Inference Architecture Stack | 38 min | LLM | Meryem Arik, Srini Penchikala | The InfoQ Podcast

Meryem Arik, Co-founder/CEO at TitanML, discusses innovations in Generative AI and LLMs, covering the current state of LLMs, their deployment, state-of-the-art RAG applications, and the inference architecture stack for LLM applications.

DATA TUBE

Understanding User Behavior using Knowledge Graphs | 7 min | AI | RelationalAI

This demo showcases how RelationalAI, deeply integrated with the Snowflake Data Cloud, can uncover user behavior patterns by running powerful graph algorithms directly on your existing Snowflake data.

CONFS, EVENTS AND MEETUPS

Data Summer School | Amsterdam | 12-16th August

Join one of our four specialized Data Science, Data Analysis, Analytics Engineering, or Data Literacy cohorts. Each cohort offers targeted, hands-on training sessions scheduled throughout the week for an immersive learning experience.

________________________

Have any interesting content to share in the DATA Pill newsletter?

➡ Join us on GitHub

➡ Dig into previous editions of DATA Pill