DATA Pill #140 - Apache Kafka + Vector Database + LLM = Real-Time GenAI, 3 Steps to AI-Ready Data

ARTICLES

3 Steps to AI-Ready Data | 4 min | AI | Barr Moses | Monte Carlo Blog

A guide to preparing reliable data for GenAI by migrating to the cloud, adding semantic meaning, and implementing data quality and governance strategies for scalable use cases.

Apache Kafka + Vector Database + LLM = Real-Time GenAI | 12 min | GenAI | Kai Waehner | Personal Blog

An exploration of how event-driven architectures with Kafka and Flink enable real-time GenAI use cases, combining large language models (LLMs) with vector databases for semantic search.

Building a Unified Healthcare Data Platform: Architecture | 14 min | Data Platform | Alexandre Guitton | Doctolib blog

Doctolib's shift from a centralized monolithic platform to a data mesh architecture that supports scalable AI, analytics, and robust data governance.

The Turning Point: Agentic AI, Inference Optimization, and Society's Next Challenge | 5 min | AI | Wesley Pasfield | Personal Blog

An overview of trends in AI, highlighting agentic workflows, inference optimizations, and the societal impacts of AI-driven automation.

TUTORIALS

Delta Lake and restore - traveling in time differently | 3 min | Data Engineering | Bartosz Konieczny | Personal Blog

An overview of Delta Lake's RESTORE command, explaining how it reverts tables to past versions, records new commits, and handles production reversion scenarios.

Deep Dive into New Amazon S3 Tables | 7 min | Data Lake | Narendra Srinivasula | AWS Tip

An introduction to Amazon S3 Tables using Apache Iceberg, detailing table structure, access control, and benefits for scalable, secure analytics.

TOOL

Fast Semantic Text Deduplication | Data Engineering

SemHash is a tool for deduplicating datasets using semantic similarity, combining fast embedding generation and efficient similarity search for text and multi-column data.

ON-DEMAND WEBINAR

Understanding Retrieval-Augmented Generation (RAG) on Google Cloud Platform (GCP) | 30 min | ML | Jeroen Overschie | Xebia

Learn how RAG integrates external data with LLMs to enhance query accuracy, avoiding outdated responses and hallucinations in real-world AI applications.

DATA TUBE

Unity Catalog Lakeguard: Data Governance for Multi-User Apache™ Spark Clusters | 25 min | Data Governance | Martin Grund, Stefania Leone | Databricks

Learn how to build ChatGPT-like models with this comprehensive Stanford lecture.

CONFS, EVENTS AND MEETUPS

AWS User Group 3city Workshop: Enhancing Accessibility Using Serverless | Gdynia | 30th January

Master Git in this beginner-friendly series, starting with why Git is essential for developers.

PINNACLE PICKS

Your top picks from last week:

The wall that wasn’t | 13 min | AI | Duncan Anderson | Barnacle Labs

Explore how AI models like OpenAI’s o3 and Google’s Gemini use Chain of Thought to push reasoning capabilities forward.

Compare top data quality tools to find the right solution for governance, observability, and no-code prep.

Building a SQL Bot with LangChain, Azure OpenAI, and Microsoft Fabric | 8 min | Data Science | Mariusz Kujawski | Personal Blog

Create SQL bots using LLMs and Microsoft Fabric to turn natural language into actionable queries.

________________________

Have any interesting content to share in the DATA Pill newsletter?

➡ Join us on GitHub