DATA Pill feed

DATA Pill #107 - dbt 1.8 is just wow, How Twitter processes 4 billion events in real-time daily

ARTICLES

How We Solve Load Balancing Challenges in Apache Kafka | 11 min | Data Engineering | Yifan Huang | Agoda Engineering Blog
Explore how Kafka's partitioning and load-balancing strategies help efficiently manage our daily data flow while addressing common challenges like workload imbalances and different hardware capabilities.
Tabular Data, RAG, & LLMs: Improve Results Through Data Table Prompting | 10 min | LLM | Eduardo Rojas Oviedo, Ezequiel Lanza | Intel Tech Blog
This post explores how a RAG system can help analysts quickly identify market trends, investment opportunities, and economic risks. The focus is handling tabular data embedded within documents to provide accurate and efficient insights.
How Twitter processes 4 billion events in real-time daily | 5 min | Real-time analytics | Vu Trinh | Personal Blog
Twitter handles 400 billion real-time events daily, generating a petabyte of data from diverse sources. By transitioning from a Lambda to a Kappa architecture, Twitter has improved latency, throughput, and accuracy in its data processing pipelines.
dbt 1.8 is just wow | 8 min | Data Engineering | Charles Verleyen | Astrafy Blog
Delve into the release's core feature, "unit testing," and explore other notable features, like the "empty" flag. This blog includes code snippets and a public repository, allowing readers to test these new features in a sandbox project immediately.

TOOL

Marimo | Data Engineering
Marimo is a reactive Python notebook: run a cell or interact with a UI element, and Marimo automatically runs dependent cells (or marks them as stale), keeping code and outputs consistent. Marimo notebooks are stored as pure Python, executable as scripts, and deployable as apps.

TUTORIAL

DREAM: Distributed RAG Experimentation Framework | 7 min | RAG | Aishwarya Prabhat | MLOps Community
DREAM is a Distributed RAG Experimentation Framework that simplifies the complex process of determining the best combination of RAG parameters for your use case. By leveraging a Kubernetes-native architecture and various open-source technologies, DREAM enables efficient experimentation, evaluation, and tracking of RAG methods in a distributed manner.
Rust vs Python: Choosing the Right Language for Your Data Project | 8 min | Data Engineering | Amberle McKee | DataCamp Blog
Let’s compare Rust and Python. We'll look at how they stack up on various topics to help you make an informed decision on which to use for your project.

PODCAST

Data Migration Strategies for Large Scale Systems | 1 h | Data Engineering | Tobias Macey, Sriram Panyam | Data Engineering Podcast
Any software system will eventually need migration or evolution, especially when dealing with the data layer, which adds complexity. Sriram Panyam, with experience in high-traffic data migration projects, shares his insights on ensuring their success.

CONFS, EVENTS AND MEETUPS

The AI Summit London | London | 12-13th June
The AI Summit London unites the most forward-thinking technologists and business professionals to explore the real-world applications of AI. Think unparalleled opportunities for learning, deep-dive discovery, and non-stop networking (not to mention the incredible line-up of heavyweight speakers).
________________________
Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub
➡ Dig into previous editions of DATA Pill