
DATA Pill #100 - dbt vs. Dataform, RAG for Quality Engineers, Text-to-SQL at Pinterest


We want to make the next 100 DATA Pill editions even better. Please fill out the form and let us know what you think of the newsletter and what we should change or develop!



Dbt vs. Dataform: Which one should you choose? | 11 min | Data Engineering | Na Nguyen (Anna) | Joon Solutions Global Blog
The launch of Dataform has complicated the choice between it and dbt, especially for clients integrated with the GCP ecosystem. To determine the better option, Anna explored Dataform's capabilities through an entire code lifecycle, aiming to identify any pitfalls and compare it fairly with dbt. The review is based on six main aspects:

  • Development
  • Collaboration
  • Deployment
  • Governance
  • Integration
  • Platform Cost
How we built Text-to-SQL at Pinterest | 8 min | Gen AI | Adam Obeng, J.C. Zhong, Charlie Gu | Pinterest Engineering Blog
The Pinterest team took the rising availability of LLMs as an opportunity to explore whether they could help their data users write queries by developing a Text-to-SQL feature that transforms analytical questions directly into code.
Real-time Fraud Detection with Yoda and ClickHouse | 8 min | Real-time analytics | Nick Shieh, Shen Zhu, Xiaobing Xia | Instacart Tech
Read about Instacart's fraud platform, Yoda, its use of ClickHouse for real-time data analysis, and its role in combating fraud. Yoda's rules and ClickHouse's analytics help quickly identify and address fraudulent activities like fake accounts, payment fraud, and conspiracy.
Top 6 Mixpanel Alternatives for Product Analytics in 2024 | 8 min | Data Analytics | NetSpring Blog
Read about alternatives that address the challenges of comprehensive data analysis, custom reporting, and integration capabilities, reflecting the changing demands in tracking and understanding customer behavior.


Building a RAG System With Gemma, Hugging Face & Elasticsearch | 9 min | Gen AI | Ashish Tiwari | Elastic Search Labs
This blog will show you how to build a RAG system using Elasticsearch and Python to perform a semantic search and create a question-answering service that runs on your private data set. You will fetch the most relevant documents as a context window and send them to the Gemma model along with a question to be answered.
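The retrieve-then-ask flow described above can be sketched in a few lines. This is only a minimal illustration, not the article's code: it swaps Elasticsearch semantic search for a toy bag-of-words cosine similarity, and it stops at building the prompt rather than calling Gemma. The function names (`retrieve`, `build_prompt`) and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the question.

    Assumption: a toy stand-in for the semantic search step.
    """
    q = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda d: cosine(q, Counter(tokenize(d))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, context_docs: list[str]) -> str:
    """Pack the retrieved documents and the question into one prompt."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical private data set
documents = [
    "A refund is processed within five business days.",
    "Our office is closed on public holidays.",
    "To request a refund, open a ticket with your order number.",
]
question = "How do I request a refund?"
top_docs = retrieve(question, documents, k=2)
prompt = build_prompt(question, top_docs)
print(prompt)
```

In a real system the `retrieve` step would query a vector index and the resulting prompt would be sent to the language model; the shape of the flow stays the same.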
RAG for Quality Engineers | 15 min | Gen AI | Blake Norrish | Slalom Build
This is an introduction to RAG concepts and patterns from a testing and quality point of view. It starts by explaining why RAG is valuable, then discusses how the many design decisions inherent in building a production-quality RAG system seek to improve it.
An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin | 11 min | LLM | Eduardo Rojas Oviedo, Ezequiel Lanza | Intel Blog
In this post, we’ll demonstrate the use of four common NLP techniques to clean text before it’s ingested and converted into chunks for further processing by the LLM. We’ll also illustrate how these techniques can significantly enhance the model’s response to a prompt.


Four Data Cleaning Techniques to Improve LLM Performance | 11 min | LLM | Alex Williams, Sanjeev Mohan | The New Stack
The Kubernetes community focuses on improving the DevOps experience, which has historically outpaced the data science sector. However, the rise of AI engineering has accelerated data science advancements, bridging this gap. Challenges in Kubernetes' handling of stateful data tasks have been noted, especially for data science applications. Still, Kubernetes is increasingly crucial for managing resources in AI workloads, including the costly training of LLMs and GPU utilization.
Diving into Uber's Cutting-Edge Data Infrastructure | 5 min | Data Engineering | Girish Baliga | OnehouseHQ
As an astoundingly successful global transportation provider, Uber has a voracious appetite for up-to-the-minute data. In response to this demand, Apache Hudi emerged from Uber nearly a decade ago, and the team has not stopped innovating since.


GenAI and RAG with Google Cloud | Zurich, Switzerland | 28th May
Delve into GenAI's opportunities and challenges, including organizational strategies and the RAG approach that combines semantic search with large language models. Discover how to integrate Google Cloud services with Open Source tools, and apply GenAI in business.
Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub
➡ Dig into previous editions of DATA Pill