DATA Pill feed

DATA Pill #070 - 3 dbt SQL engines, Machine Learning Platform at Walmart


Last Mile Data Processing with Ray | 8 min | ML | Raymond Lee, Qingxian Lai, Karthik Anantha Padmanabhan, Se Won Jang | Pinterest Engineering Blog
The Pinterest team assesses the bottlenecks that slow ML developer velocity and describes how it integrated Ray, an open-source framework, into its ML Platform. The integration substantially improved dataset iteration speed, cutting it from days to hours, and raised GPU utilization to over 90%.
Head-to-head comparison of 3 dbt SQL engines | 8 min | SQL | Niels Claeys | Data Minded Blog
This blog post compares three popular open-source SQL engines (DuckDB, Trino and Spark) for use with dbt in data pipelines. The benchmark uses TPC-DS with medium-sized datasets and finds that DuckDB is the fastest in 75% of cases thanks to its single-node advantage, while Trino is the fastest in the remaining 25%. The post also touches on the user experience and the differences in SQL dialects between these engines when integrating them with dbt.
Combining Kedro and Streamlit to build a simple LLM-based Reading Assistant | 10 min | LLM | Piotr Chaberski | Part of Xebia Blog
This article shows how to effectively harness the power of LLMs by combining robust language understanding capabilities with clean implementation and a user-friendly experience, using commercial LLM APIs, Kedro and Streamlit.
Scaling Kafka to Support PayPal’s Data Growth | 12 min | Data Engineering | Monish Koppa | The PayPal Technology Blog
Let's dive into how Kafka handles trillions of daily messages at PayPal, with insight into PayPal's strategies for effective Kafka operations: configuration management, automation, monitoring, security and the pivotal role of metrics and tools in upholding high availability and performance.
Machine Learning Platform at Walmart | 19 min | ML | Thomas Vengal, Pamidi Pradeep, Bagavath Subramaniam, Hema Rajesh, Girish Ramachandran Pillai, Ravishankar, Anirban Chatterjee, Kunal Banerjee, Rahul Rawat, Anil Madan | Walmart Global Tech Blog
How Walmart's Element ML Platform addresses AI implementation challenges, focusing on faster innovation, higher scalability, cost reduction and stronger governance.
Best Practices for LLM Evaluation of RAG Applications | 16 min | LLM | Quinn Leng, Kasey Uhlenhuth, Alkis Polyzotis | Databricks Blog
Databricks recommends the following procedure when using an LLM judge:

  1. Use a 1-5 grading scale
  2. Use GPT-4 as an LLM judge with no examples to understand grading rules
  3. Switch your LLM judge to GPT-3.5 with one example per score
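The procedure above can be illustrated with a minimal prompt-building sketch. It only assembles the text a judge model would receive and calls no API. The function name `build_judge_prompt`, the rubric wording and the prompt layout are hypothetical and are not taken from the Databricks post.

```python
from typing import Dict, Optional

SCORES = [1, 2, 3, 4, 5]  # the 1-5 grading scale


def build_judge_prompt(
    question: str,
    answer: str,
    examples: Optional[Dict[int, str]] = None,
) -> str:
    """Build a grading prompt (hypothetical wording).

    With examples=None the prompt is zero-shot (the GPT-4 step).
    Otherwise `examples` must map every score to one sample answer
    (the GPT-3.5 step with one example per score).
    """
    lines = [
        "Grade the answer to the question on an integer scale from 1 (worst) to 5 (best).",
        "Reply with the score only.",
    ]
    if examples is not None:
        missing = [s for s in SCORES if s not in examples]
        if missing:
            raise ValueError(f"need one example per score, missing: {missing}")
        lines.append("Reference answers, one per score:")
        for score in SCORES:
            lines.append(f"Score {score}: {examples[score]}")
    lines.append(f"Question: {question}")
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)
```

Calling `build_judge_prompt(question, answer)` gives the zero-shot variant, while passing a dictionary with one sample answer per score gives the few-shot variant. Sending the prompt to a model is left out on purpose.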


Building a Real-Time Service Marketplace with Confluent Cloud | 9 min | Cloud | Arpita Agarwal | Confluence Tech Blog
How a leading service marketplace overcame challenges and harnessed Confluent Cloud, a managed Apache Kafka® solution, to build a centralized streaming platform for event-driven processing. Learn about advanced techniques such as data integrity, security and real-time analytics that empower the creation of a scalable, responsive and reliable ecosystem for tradespeople and clients.
This post demonstrates how to build an end-to-end implementation that processes data from MSK Serverless using an AWS Glue streaming extract, transform and load (ETL) job, connects to MSK Serverless from the AWS Glue job with IAM authentication, and queries the data using Amazon Athena.


Introducing Infrastructure Manager: Provision Google Cloud resources with Terraform | 3 min | Cloud | Danny Hammo, Vlad Ouzienko | Google Cloud Blog
Infrastructure Manager allows users to efficiently oversee their Google Cloud infrastructure through Infrastructure as Code (IaC) principles, all powered by Terraform's robust foundation. This approach offers the benefits of both worlds - a managed, streamlined method for deploying, configuring and managing cloud resources using declarative configurations.


This tool simplifies identifying source and target tables in SQL commands. It handles the parser intricacies using libraries like SQLFluff and sqlparse and generates a user-friendly lineage graph.
StarRocks | Data Warehouse
StarRocks is a high-performance data warehouse designed for real-time, multi-dimensional analytics. It features MPP architecture, vectorized execution and columnar storage with real-time update support. StarRocks offers seamless data ingestion, direct data lake analysis and MySQL protocol compatibility. It is known for scalability and reliability and serves various OLAP use cases, including real-time analytics and ad-hoc queries.


Productivization of Data | 2 h 43 min | AI | guest: Kristofer Ågren | AIAW Podcast
Let’s explore Telia Division X's future, organizational structure, customer-driven vs. tech-driven development, data product creation while ensuring user privacy, generative AI and LLMs, the changing landscape of transportation and coding in an AI-driven world, the call for an AI race pause and Kristofer's plans.


Google Cloud Summit Poland | On-site | 26th October | Warsaw
Join the most significant Google Cloud event of the year, organized for the first time in Poland at the Palace of Culture and Science in Warsaw. Google Cloud Summit Poland is a free event bringing everyone together in the cloud community.

Discover advancements in artificial intelligence, application modernization, collaboration tools, data cloud solutions, open infrastructure and cutting-edge security measures. The event is designed to propel your digital transformation efforts and enhance your business outcomes.
Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub
➡ Dig into previous editions of DataPill