ARTICLES
3 Steps to AI-Ready Data | 4 min | AI | Barr Moses | Monte Carlo Blog
A guide to preparing reliable data for GenAI by migrating to the cloud, adding semantic meaning, and implementing data quality and governance strategies for scalable use cases.
Apache Kafka + Vector Database + LLM = Real-Time GenAI | 12 min | Gen AI | Kai Waehner | Personal Blog
An exploration of how event-driven architectures with Kafka and Flink enable real-time GenAI use cases, combining large language models (LLMs) with vector databases for semantic search.

Building a Unified Healthcare Data Platform: Architecture | 14 min | Data Platform | Alexandre Guitton | Doctolib blog
Doctolib's shift from a centralized monolithic platform to a data mesh architecture that supports scalable AI, analytics, and robust data governance.
The Turning Point: Agentic AI, Inference Optimization, and Society's Next Challenge | 5 min | AI | Wesley Pasfield | Personal Blog
An overview of trends in AI, highlighting agentic workflows, inference optimizations, and the societal impacts of AI-driven automation.
TUTORIALS
Delta Lake and restore - traveling in time differently | 3 min | Data Engineering | Bartosz Konieczny | Personal Blog
An overview of Delta Lake's RESTORE command, explaining how it reverts tables to past versions, records new commits, and handles production reversion scenarios.
Deep Dive into New Amazon S3 Tables | 7 min | Data Lake | Narendra Srinivasula | AWS Tip
An introduction to Amazon S3 Tables using Apache Iceberg, detailing table structure, access control, and benefits for scalable, secure analytics.
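For readers new to the RESTORE behaviour described in the Delta Lake entry above, here is a minimal sketch of the idea that a restore appends a new commit instead of deleting history. It is a toy in-memory model, not the Delta Lake API; the class `ToyVersionedTable` and its methods are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVersionedTable:
    """Toy append-only commit log (hypothetical; NOT the Delta Lake API)."""

    # Each commit stores (operation name, full snapshot of the rows).
    commits: list = field(default_factory=list)

    def write(self, rows):
        """Record a new version containing `rows`; return its version number."""
        self.commits.append(("WRITE", list(rows)))
        return len(self.commits) - 1

    def restore(self, version):
        """Revert to `version` by appending a new commit; history is kept."""
        _, snapshot = self.commits[version]
        self.commits.append(("RESTORE", list(snapshot)))
        return len(self.commits) - 1

    def current(self):
        """Return the rows of the latest version."""
        return list(self.commits[-1][1])


table = ToyVersionedTable()
v0 = table.write(["a", "b"])
v1 = table.write(["a", "b", "bad row"])  # a faulty production write
v2 = table.restore(v0)                   # revert, recorded as a new commit
```

In this sketch the faulty version `v1` stays in the log, so the restore itself can be audited or undone later.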

TOOL
Fast Semantic Text Deduplication | Data Engineering
SemHash is a tool for deduplicating datasets using semantic similarity, combining fast embedding generation and efficient similarity search for text and multi-column data.
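To make the idea of similarity-based deduplication concrete, here is a rough sketch that does not use SemHash itself. As loudly stated assumptions: a word-count vector stands in for a real embedding model, a brute-force cosine comparison stands in for an efficient similarity search, and the function names and the 0.8 threshold are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: a lowercase word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def deduplicate(texts, threshold=0.8):
    """Keep a text only if it is not too similar to any text already kept."""
    kept, kept_vecs = [], []
    for text in texts:
        vec = toy_embed(text)
        # Brute-force comparison; a real tool would use a faster index.
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append(text)
            kept_vecs.append(vec)
    return kept


records = [
    "Delta Lake restore reverts a table",
    "delta lake restore reverts a table",
    "Kafka streams events in real time",
]
unique = deduplicate(records)
```

Word counts only catch near-identical wording; catching paraphrases is the part that real embeddings add.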
ON-DEMAND WEBINAR
Understanding Retrieval-Augmented Generation (RAG) on Google Cloud Platform (GCP) | 30 min | ML | Jeroen Overschie | Xebia
Learn how RAG integrates external data with LLMs to enhance query accuracy, avoiding outdated responses and hallucinations in real-world AI applications.
DATA TUBE
Unity Catalog Lakeguard: Data Governance for Multi-User Apache™ Spark Clusters | 25 min | Data Governance | Martin Grund, Stefania Leone | Databricks
A talk on how Unity Catalog Lakeguard brings data governance to multi-user Apache Spark clusters.
CONFS, EVENTS AND MEETUPS
AWS User Group 3city Workshop: Enhancing Accessibility Using Serverless | Gdynia | 30th January
A workshop hosted by AWS User Group 3city on enhancing accessibility with serverless technology.
PINNACLE PICKS
Last week's top picks:
The wall that wasn’t | 13 min | AI | Duncan Anderson | Barnacle Labs
Explore how AI models like OpenAI’s o3 and Google’s Gemini use Chain of Thought to push reasoning capabilities forward.
Monte Carlo vs. Collibra vs. Talend Data Fabric vs. Ataccama One vs. Dataprep by Trifacta vs. AWS Glue DataBrew: Which Data Quality Tool is Right for You? | 5 min | Data Quality | Michał Kardach, Katarzyna Kusznierczuk | GetInData | Part of Xebia Blog
Compare top data quality tools to find the right solution for governance, observability, and no-code prep.
Building a SQL Bot with LangChain, Azure OpenAI, and Microsoft Fabric | 8 min | Data Science | Mariusz Kujawski | Personal Blog
Create SQL bots using LLMs and Microsoft Fabric to turn natural language into actionable queries.

________________________
Have any interesting content to share in the DATA Pill newsletter?