DATA Pill #118 - Realtime Streaming, Data Lakehouse, Flink & Kubernetes

ARTICLES

AWS Lambda vs. Cloudflare Workers Detailed Comparison | 7 min | Data Engineering | Kiryl Anoshka | Fively Blog

This article compares AWS Lambda and Cloudflare Workers, focusing on their theoretical capabilities and practical differences across key categories such as performance, runtime, and pricing. It also includes insights on which platform excels and a cold start comparison to highlight their distinctions, particularly for smaller tasks.

Apache Flink® on Kubernetes | 15 min | Streaming | Ran Zhang | Airbnb Tech Blog

The evolution of the Flink architecture at Airbnb, comparing their prior Hadoop YARN platform with the current Kubernetes-based architecture.

Machine Learning in Content Moderation at Etsy | 15 min | ML | David Azcona | Etsy Blog

Transforming Sports Data with Databricks | 12 min | Data Infrastructure | Jared Chavez | Personal Blog

Basketball Analytics looked to the cloud for its next evolution, and the organization turned toward centralization to dramatically reduce operational costs and improve synergy across our brands. This article is about redesigning the infrastructure of our respective departments and redefining how data operated within the organization.

TUTORIALS

How we built RudderStack’s real-time personalization engine | 9 min | Real-time personalization | Mackenzie Hastings, Matt Kelliher-Gibson, Chandler Van De Water, Eric Dodds | RudderStack Blog

Creating real-time personalized website and app experiences. From identity resolution to tracking success, this tutorial will walk you through how to build a dynamic, user-focused experience that drives engagement and conversions.

Making WAF ML models go brrr: saving decades of processing time | 23 min | ML | Alex Bocharov | The Cloudflare Blog

This one covers the performance optimizations for our WAF ML product, showcasing code examples, benchmarks, and the impressive latency reductions achieved.

Setting up Flink with Hive Metastore Service (HMS) as an alternative to platforms like Ververica. Discover how to avoid duplicating table definitions and efficiently manage sources and sinks across various projects.

Crazy Challenge: Run Llama 405B on a 8GB VRAM GPU | 4 min | LLM | Gavin Li | AI Advances Blog

This article addresses the challenge of running the massive 820GB Llama 3.1 405B model on a GPU with just 8GB of VRAM.
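
As a rough sanity check on the scale involved, here is a back-of-envelope sketch of the raw weight footprint at common precisions. The parameter count is taken from the model name; the storage precisions are assumptions for illustration, since the summary above does not say which one the 820GB figure refers to.

```python
# Back-of-envelope weight footprint for a 405B-parameter model.
# Assumption: only raw weights are counted (no activations, KV cache,
# or file-format overhead), and 1 GB = 1e9 bytes.
PARAMS = 405e9  # "405B" parameters, from the model name

# Standard widths of common storage dtypes, in bytes per parameter.
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}


def weight_size_gb(params: float, bytes_per_param: float) -> float:
    """Return the size of the raw weights in decimal gigabytes."""
    return params * bytes_per_param / 1e9


sizes = {name: weight_size_gb(PARAMS, width) for name, width in BYTES_PER_PARAM.items()}

for name, gb in sizes.items():
    print(f"{name:>9}: {gb:,.1f} GB")
```

Under these assumptions the 16-bit estimate comes out at 810 GB, in the same ballpark as the 820GB quoted, and every precision listed is far above 8GB.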

DATA LIBRARY

Accelerate ETL, data warehousing, BI and AI | ebook | Databricks

Building applications with traditional AI and generative AI
Databricks Data Intelligence Platform

DATA TUBE

Realtime Streaming with Data Lakehouse - End to End Data Engineering Project | 1h | Streaming | CodeWithYu

How to design, implement and maintain secure, scalable and cost-effective lakehouse architectures leveraging Apache Spark, Apache Kafka, Apache Flink, Delta Lake, AWS, and open-source tools.

CONFS, EVENTS AND MEETUPS

Airflow Summit 2024 | San Francisco | 10-12 September

This conference needs no introduction. On the agenda:

Mastering LLM Batch Pipelines: Handling Rate Limits, Asynchronous APIs, and Cloud Scalability
OpenLineage: From Operators to Hooks by Maciej Obuchowski - our community member 👏
How we use Airflow at Booking to orchestrate Big Data workflows