DATA Pill feed

DATA Pill #116 - Apache Flink vs. Spark, The Levels of RAG

ARTICLES

Comparing Apache Flink and Spark for Modern Stream Data Processing | 6 min | Real-time Processing | Eric Xiao | Decodable Blog
This blog post examines why Decodable chose Apache Flink as the foundation for its real-time data processing platform. It covers Flink's architectural design, stateful stream processing capabilities, and production readiness, and shows how these enable agile, scalable solutions for dynamic ETL, ELT, and data movement workflows.
The Levels of RAG | 6 min | RAG | Jeroen Overschie | Xebia Blog
Jeroen explores the benefits of RAG for machine learning engineers. Learn how to overcome challenges and improve data retrieval and model accuracy using RAG techniques.

NEWS

Maestro: Data/ML Workflow Orchestrator at Netflix | 18 min | Data/ML | Jun He, Natallia Dzenisenka, Praneeth Yenugutala, Yingyi Zhang, and Anjali Norwood | Netflix Tech Blog
Netflix introduces Maestro, a horizontally scalable workflow orchestrator that is now open source. Learn how Maestro manages large-scale Data/ML workflows, handles retries, and integrates seamlessly with tools like Docker, providing a robust solution for complex data processing needs.
Polaris Catalog Is Now Open Source | 4 min | Data Engineering | Tyler Akidau, Russell Spitzer, Scott Teal | Snowflake
In June 2024, Snowflake announced the Polaris Catalog to enhance data control and interoperability for organizations and the Iceberg community. Now open source under Apache 2.0 and available on GitHub, it is also in public preview for Snowflake customers.

DATA LIBRARY

Apple Intelligence Foundation Language Models | Takes time to read | LLMs | Apple
How is Apple training LLMs for Apple Intelligence? A new technical report reveals insights into the architecture, training, distillation, and benchmarks for the 2.7B on-device (iPhone) model and a larger server-based model for Private Cloud Compute.

TUTORIAL

This tutorial covers a real-time data engineering project using Apache Spark Structured Streaming, Kafka, Cassandra, and Airflow. It involves retrieving random user data from an API, processing it in real time, and storing it for analysis, all containerized with Docker for seamless deployment.
Data Quality Rules: enforcing reliability of datasets. Data Quality Assurance using AWS Glue DataBrew | 14 min | Data Quality | Mateusz Krupski | GetInData | Part of Xebia Blog
This blog post explores the importance of maintaining data quality and integrity using AWS Glue DataBrew. Discover practical strategies and tools from the upcoming eBook, "Data Quality No-Code Automation with AWS Glue DataBrew: A Proof of Concept," and learn how to implement effective data quality rules for accurate and reliable datasets.

PODCAST

DataOps, Observability, and The Cure for Data Team Blues | 1 h 5 min | Data Engineering | Christopher Bergh, Alexey Grigorev | Data Talks Club
Outline:
  • What is DataOps?
  • Productivity, failure, and emotional crisis in data & analytics teams - are LLMs (not) the solution?
  • Lean Agile, Lean, and DevOps
  • Conclusions from a decade in DataOps and the future

DATA TUBE

Unity Catalog Lakeguard: Data Governance for Multi-User Apache™ Spark Clusters | 25 min | Data Governance | Stefania Leone, Martin Grund | Databricks
This presentation covers how Shared Clusters and Unity Catalog reduce costs and minimize operational toil, allowing secure and economical workload execution on shared compute resources.

CONFS EVENTS AND MEETUPS

Data Summer School | Amsterdam | 12-16 August
Join one of our four specialized Data Science, Data Analysis, Analytics Engineering, or Data Literacy cohorts. Each cohort offers targeted, hands-on training sessions scheduled throughout the week for an immersive learning experience.
________________________
Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub
➡ Dig into previous editions of DATA Pill
2024-07-31 14:12