DATA Pill #108 - Orchestrating 2000+ dbt Models, Databricks + Tabular


How we orchestrate 2000+ DBT models in Apache Airflow | 13 min | Data Engineering | Alexandre Magno Lima Martins | Apache Airflow Blog
This article explains how Airflow is used to orchestrate a dbt Core project, creating an intuitive pipeline for data analysts and product owners to develop and maintain their data models. With just SQL and basic Git knowledge, anyone in the business can turn their models into Airflow DAGs within minutes, ready for execution with built-in alerting, data quality tests, and access control. Importantly, they do not need to understand Airflow DAGs; they only need to interact with the UI. Key areas covered include:

  • Mono vs. Multi DAG approach
  • Project structure and DAGs layout
  • DAG generation pipeline
  • Creation of DBTOperator
  • Conclusion and plans
An LLM Journey: From POC to Production | 12 min | LLM | Adva Nakash Peleg | CyberArk Engineering Blog
This blog explores the journey of taking an LLM project from concept to completion, highlighting key steps, tips, and considerations to ensure success.
Data skew in Flink SQL | 10 min | Data Processing | Maciej Maciejko | GetInData | Part of Xebia Blog
Real-time data processing is vital for businesses, and Apache Flink excels in this area. This blog explores strategies to tackle data skew in Flink SQL, ensuring efficient and balanced processing.
What 10 Years at Uber, Meta and Startups Taught Me About Data Analytics | 9 min | Data Analytics | Torsten Walbaum | Towards Data Science Blog
Over the last 10 years, Torsten worked in analytics at various companies, from startups to big tech firms. Each company had unique challenges and data cultures. Key learnings include the importance of data storytelling, business acumen, and pragmatism in analytics.
What We Learned from a Year of Building with LLMs (Part I) | 13 min | LLM | Eugene Yan, Bryan Bischof, Charles Frye, Hamel Husain, Jason Liu and Shreya Shankar | O'Reilly Blog
It's an exciting time for LLMs, which are now effective for real-world applications and driving significant AI investments. While creating a proof-of-concept is easy, building a successful product remains challenging. This post shares key lessons and tips for developing LLM-based products based on practical experiences.
Is star-schema a thing in 2024? A closer look at the OBTs | 8 min | Real-time analytics | Adrian Bednarz | Personal Blog
This text explores the pros and cons of denormalized models, the challenges of managing changes, and the evolving landscape of real-time streaming technologies, ultimately questioning the balance between performance and data modeling.


Databricks + Tabular | 3 min | Data Engineering | Adam Conway, Ali Ghodsi, Arsalan Tavakoli-Shiraji, Reynold Xin | Databricks blog
Databricks announces its acquisition of Tabular, Inc., bringing together the creators of Apache Iceberg™ and Delta Lake to lead in data compatibility. This blog will outline Databricks' plans to collaborate with the Iceberg and Delta Lake communities to achieve format compatibility and evolve towards a single open standard of interoperability.
Introducing Polaris Catalog: An Open Source Catalog for Apache Iceberg | 3 min | Data Engineering | Saurin Shah, Martin Lee, James Malone, Scott Teal | Snowflake blog
Snowflake announces Polaris Catalog, offering enhanced data choice, flexibility, and security with interoperability across major platforms like AWS, Google Cloud, and Azure. Snowflake plans to open-source Polaris within 90 days; it enables data interoperability without moving or copying data.


Run multiple notebooks in parallel using runMultiple in Microsoft Fabric | 7 min | Data Orchestration | Adrian Chodkowski | Seequality blog
Orchestration manages multiple systems and tasks to make workflows run smoothly and efficiently. This tutorial shows how to manage and run various notebooks from a main notebook using the runMultiple method in Microsoft Fabric. You'll learn to easily create and execute notebooks with built-in dependencies, helping streamline your data processing tasks.


Data Streaming Platform Demo | 6 min | Data Streaming | Maciej Kluczny | GetInData | Part of Xebia
In this video, you will dive into the platform architecture and see how a real-life streaming application works based on SQL queries, using Apache Flink and Jupyter Notebooks.


Demand Forecasting at Scale | 55 min | AI | Ruben van de Geer, Rogier van der Geer, Daniel van Dijk | Xebia
Watch how Albert Heijn optimized their demand forecasting services. Learn why they chose a custom solution, which processes, people, and technology it required, and the challenges of scaling forecasts.


RADAR AI | Online | 26-27th June
ChatGPT was only the beginning. Generative AI is now revolutionizing every industry. Join us for RADAR: AI Edition, exploring how businesses and individuals can unlock their full potential with AI.
Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub
➡ Dig into previous editions of DATA Pill