DATA Pill feed

DATA Pill #182 – Lakehouse Integrations, Text‑to‑SQL Advances & Multimodal RAG

ARTICLES

Smarter, Faster and Snowflake‑Native: Real‑Time Text2SQL Behind Snowflake Intelligence | 7 min | Gen AI & SQL | Lukasz Borchmann, Gaurav Nuti, Aurick Qiao, Zhewei Yao, Yuxiong He
Snowflake’s AI Research team unveils Arctic‑Text2SQL‑R1.5, a model fine‑tuned for the Snowflake SQL dialect. General‑purpose LLMs often have high latency and struggle with dialect‑specific constructs; in contrast, Arctic‑Text2SQL‑R1.5 achieves high accuracy and low latency, outperforming GPT 5, Claude 4.5 and Gemini 2.5 on internal benchmarks. The team uses transfer learning across dialects and execution‑based reinforcement learning on Snowflake’s multicluster warehouses to train the model, reaching 45% single‑turn execution accuracy and delivering near‑instant conversational analytics.
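The execution‑based reinforcement‑learning idea can be sketched with a toy binary reward: run the candidate SQL, run the gold SQL, and reward a match. This is a minimal illustration using SQLite, not Snowflake's actual training code; the function and table names are invented.

```python
import sqlite3

def execution_reward(candidate_sql: str, gold_sql: str, conn: sqlite3.Connection) -> float:
    """Binary execution reward: 1.0 if the candidate query returns the
    same rows as the gold query, 0.0 on mismatch or execution error."""
    try:
        got = sorted(conn.execute(candidate_sql).fetchall())
    except sqlite3.Error:
        return 0.0
    want = sorted(conn.execute(gold_sql).fetchall())
    return 1.0 if got == want else 0.0

# Toy database standing in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])

print(execution_reward("SELECT SUM(amount) FROM orders",
                       "SELECT SUM(amount) FROM orders", conn))  # 1.0
print(execution_reward("SELECT COUNT(*) FROM orders",
                       "SELECT SUM(amount) FROM orders", conn))  # 0.0
```

Because the signal comes from actually executing queries, a dialect‑specific construct that parses but returns the wrong rows is penalised just like a syntax error.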
Introducing pg_lake: Integrate Your Data Lakehouse with Postgres | 6 min | Data Lakehouse | Craig Kerstiens
Snowflake open‑sources pg_lake, a set of PostgreSQL extensions that turns Postgres into an Iceberg‑ready lakehouse. pg_lake lets users create Iceberg tables directly in Postgres, query Parquet/CSV files in object storage and perform flexible data import/export between S3 and database tables. Full Postgres transaction semantics and hybrid queries across Iceberg, Delta and Postgres tables enable a unified lakehouse without external services.
Introducing SodaBricks | 11 min | Data Quality & Tools | Marta Radziszewska
Data quality in Databricks often suffers from scattered checks and poor monitoring. SodaBricks combines Soda Core checks with a GitHub‑driven deployment: analysts define rules in YAML, version them in Git and deploy via automated workflows. The results live in a single table and a dashboard provides accessible, consistent monitoring. The article explains why data checks are critical and walks through a simple example using two configuration files and a GitHub workflow to generate and deploy Databricks notebooks.
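The YAML‑rules‑to‑results‑table flow can be sketched in a few lines of plain Python. The rule names and result schema below are illustrative, not Soda Core's actual syntax; in the article's setup the rules would live in a versioned YAML file and results would land in a Databricks table.

```python
# Rules as a table -> list of named checks; stands in for parsed YAML.
rules = {
    "orders": [
        {"name": "row_count_positive", "check": lambda rows: len(rows) > 0},
        {"name": "no_null_amounts", "check": lambda rows: all(r["amount"] is not None for r in rows)},
    ]
}

def run_checks(table: str, rows: list[dict]) -> list[dict]:
    """Evaluate every rule for a table; each result row mimics the
    'single results table' the article describes."""
    return [
        {"table": table, "rule": rule["name"], "passed": rule["check"](rows)}
        for rule in rules.get(table, [])
    ]

results = run_checks("orders", [{"amount": 10.0}, {"amount": None}])
# One row per check, ready to append to a central results table.
```

Versioning the rules in Git and deploying them through a workflow is what turns this simple pattern into consistent, auditable monitoring.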
Robotics with Python: AI Humanoid | 9 min | AI & Robotics | Mauro Di Pietro
Humanoid robots are forecast to number in the billions by 2050. This tutorial shows how to build a 3D simulation for a humanoid robot using Gymnasium and MuJoCo, letting you train reinforcement‑learning agents in simulation. The author explains the Humanoid‑v4 environment, its observation and action spaces, and provides Python code to reset the environment, run random actions and render the simulation.
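The reset/step loop at the heart of the tutorial can be sketched without installing MuJoCo by using a stub that mirrors Gymnasium's env API; swap in gymnasium.make("Humanoid-v4", render_mode="human") for the real environment. The stub's dynamics are fake; only the observation size (376) and action size (17, bounded to ±0.4) match Humanoid‑v4.

```python
import random

class StubHumanoidEnv:
    """Tiny stand-in mirroring Gymnasium's reset/step API so the control
    loop below runs anywhere; real training would use gymnasium + MuJoCo."""

    def reset(self, seed=None):
        random.seed(seed)
        return [0.0] * 376, {}  # Humanoid-v4 observations are 376-dimensional

    def step(self, action):
        obs = [random.gauss(0, 1) for _ in range(376)]
        reward = 1.0                            # stand-in for the alive bonus
        terminated = random.random() < 0.05     # fake episode end
        return obs, reward, terminated, False, {}

    def sample_action(self):
        # Humanoid-v4 has 17 actuators with torques in [-0.4, 0.4]
        return [random.uniform(-0.4, 0.4) for _ in range(17)]

env = StubHumanoidEnv()
obs, info = env.reset(seed=42)
total_reward = 0.0
for _ in range(100):
    obs, reward, terminated, truncated, info = env.step(env.sample_action())
    total_reward += reward
    if terminated or truncated:
        obs, info = env.reset()
```

Random actions like these are the usual sanity check before handing the environment to an RL algorithm.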
Few chatbots return images and tables alongside text because LLM‑generated captions often lose context. Sarkar builds a multimodal RAG that uses context‑aware image summaries: for each figure, he extracts the surrounding text and author‑provided caption and combines them into a rich summary. At generation time, the system selects images based on the text response, delivering accurate multimodal answers even for complex documents with tables and charts.
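The two stages, building context‑aware figure summaries at ingest time and selecting figures at generation time, can be sketched with stdlib Python. A plain concatenation stands in for the LLM‑written summary, and word overlap stands in for the embedding similarity a real system would use; all names here are invented.

```python
def summarize_figure(caption: str, surrounding_text: str) -> str:
    """Combine the author-provided caption with nearby body text into a
    context-aware summary (a real pipeline would ask an LLM to write this)."""
    return f"{caption}. Context: {surrounding_text}"

def pick_images(response: str, summaries: dict[str, str], top_k: int = 1) -> list[str]:
    """Select figures whose summaries share the most words with the
    generated text response."""
    resp_words = set(response.lower().split())
    scored = sorted(
        summaries.items(),
        key=lambda item: len(resp_words & set(item[1].lower().split())),
        reverse=True,
    )
    return [fig_id for fig_id, _ in scored[:top_k]]

summaries = {
    "fig1": summarize_figure("Quarterly revenue by region", "Revenue grew fastest in EMEA"),
    "fig2": summarize_figure("Model architecture diagram", "The encoder feeds a fusion layer"),
}
print(pick_images("Revenue in EMEA grew every quarter", summaries))  # ['fig1']
```

The key point survives even in this toy form: retrieval matches against the caption plus surrounding text, not against a caption generated from pixels alone.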
Willison champions code research, answering technical questions by writing and running experiments. Asynchronous coding agents such as Claude Code, Codex Cloud, Jules and GitHub Copilot can run research tasks autonomously and return results via pull requests, freeing developers from babysitting experiments. He recommends creating dedicated GitHub repositories for agents and granting full network access to allow them to install dependencies and fetch data.
Code execution with MCP: Building more efficient agents | 5 min | AI Agents | Anthropic Engineering Team
The Model Context Protocol (MCP) standard connects agents to thousands of tools, but loading every tool definition and intermediate result into the model bloats token usage. Anthropic proposes a code‑execution approach: treat MCP servers as code APIs, letting the agent write code to load only the needed tools and process large data outside the model. This reduces context overhead and scales to thousands of tool integrations.
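The core of the code‑execution idea can be sketched as follows: instead of piping a tool's full output through the model's context window, the agent writes a snippet that calls the tool as an ordinary API, aggregates in the sandbox, and surfaces only a small summary. The tool and function names below are hypothetical, not part of the MCP spec.

```python
def fetch_sales_rows() -> list[dict]:
    """Stand-in for an MCP tool call that would return thousands of rows."""
    return [{"region": "EMEA", "amount": i} for i in range(10_000)]

def agent_written_snippet() -> dict:
    """Code the agent generates and runs outside the model: aggregate the
    bulky tool output locally, return only what the model needs."""
    rows = fetch_sales_rows()
    total = sum(r["amount"] for r in rows)
    return {"row_count": len(rows), "total_amount": total}

summary = agent_written_snippet()
# Two numbers enter the model's context instead of 10,000 rows.
```

The same pattern lets the agent import only the tool wrappers a task needs, rather than front‑loading every tool definition into the prompt.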

NEWS

Apache Fluss 0.8 release notes | 7 min | Streaming Lakehouse
The release notes highlight new real‑time streaming lakehouse features, including Iceberg/Lance support, Delta Joins and multimodal AI analytics.
Lightdash has rolled out a new project connection feature that allows users to connect Lightdash projects directly to their data warehouses and dbt projects. The documentation explains that setting up the connection requires linking to a warehouse and a dbt project. Supported warehouses include BigQuery, Postgres, Redshift, Snowflake, Databricks, Trino and ClickHouse, and read‑only permissions are recommended for security.

DATA TUBE

Matt Maher reveals that Claude Code includes hidden subagents named Plan and Explore. The Plan agent decomposes tasks and sets goals while Explore searches external resources. Working together, they form a “team mode” that improves coding workflows and debugging efficiency.
See why some developers are moving away from MCP servers and watch three alternative solutions for connecting AI agents to external tools. The video covers the limitations of the MCP pattern and suggests code‑first and file‑based approaches that reduce token usage and simplify integration.

TOOL

Supermetal is a Rust + Apache Arrow data replication platform that syncs transactional databases to data warehouses or other databases. It ships as a single binary with a built‑in UI, management APIs and metrics, eliminating the need for Kafka, Debezium or complex orchestration. Its single‑process pipeline streams data from source to target using Arrow record batches, reducing serialization overhead and operational complexity and supporting parallel snapshots with type preservation.

CONFS, EVENTS, WEBINARS AND MEETUPS

Kafka at Scale: Smarter Architectures for Real‑Time Business Impact | 20 Nov 2025, 08:00 AM PT / 11:00 AM ET (online).
Explore how to build resilient, low‑latency data pipelines by rethinking Kafka architectures. Experts will discuss replacing Kafka‑centric pipelines with object‑store‑native designs and reducing serialization overhead, and share lessons from real‑world deployments.
Compute Layer Unbundled – Market Trends Shaping Data Lakehouse in 2025 | 50 min | November 18, 3 PM CET | Data Engineering | Marek Wiewiórka & Radosław Szmit | Xebia
Part 4 of Xebia’s “Towards Data Lakehouse Architecture” webinar series. The session maps the compute landscape from single‑node to serverless options and examines emerging technologies like vectorised back‑ends and multi‑engine stacks.
The Data & AI Warsaw Tech Summit (April 21–22 2026) is extending its call for presentations until Nov 17. With over 600 expected participants and dozens of speakers, this is a chance to share your data or AI story with a broad audience. The selection committee includes experts from top data‑driven companies.

PINNACLE PICKS

Last week's top picks:
Why can’t your business afford to wait for AI adoption? | 6 min | Strategy & AI | Joris Conijn, Xebia
Many enterprises spend heavily on cloud, data lakes and analytics, yet execution remains manual and slow. Conijn argues that ‘agentic AI’ – autonomous software agents that interpret goals and execute tasks across systems – is the missing link. Agentic AI differs from traditional automation and generative copilots because it independently interprets intent, executes tasks and learns from outcomes; early adopters report efficiency gains of 30–50% and higher customer satisfaction.

Multi‑Agent SQL Assistant, Part 2: Building a RAG Manager | 21 min | Data Engineering & ML | Alle Sravani, Towards Data Science

Passing an entire database schema to an LLM can blow up token usage. This article introduces a Retrieval‑Augmented Generation (RAG) manager that selects relevant tables and columns using four strategies: a baseline with no RAG, Keyword RAG using domain‑specific keywords, FAISS RAG using vector similarity and Chroma RAG using a persistent vector store. Sravani implements a BaseRAG abstract class to unify these strategies and compares their pros and cons in terms of token savings and accuracy.
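The unifying abstraction can be sketched with an abstract base class plus one concrete strategy; class and method names here are illustrative, not the article's exact code.

```python
from abc import ABC, abstractmethod

class BaseRAG(ABC):
    """Unified interface over schema-retrieval strategies, so keyword-,
    FAISS- and Chroma-based implementations are interchangeable."""

    @abstractmethod
    def select_tables(self, question: str, schema: dict[str, list[str]]) -> dict[str, list[str]]:
        ...

class KeywordRAG(BaseRAG):
    """Keep only tables whose name or columns appear in the question;
    vector strategies would rank by embedding similarity instead."""

    def select_tables(self, question, schema):
        words = set(question.lower().split())
        return {
            table: cols
            for table, cols in schema.items()
            if table in words or any(c in words for c in cols)
        }

schema = {"orders": ["id", "amount"], "users": ["id", "email"], "logs": ["ts", "msg"]}
rag = KeywordRAG()
print(rag.select_tables("total amount of orders per user", schema))
# {'orders': ['id', 'amount']}
```

Only the selected tables' definitions reach the LLM, which is where the token savings the article measures come from.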
Kubetorch
Kubetorch aims to bridge the gap between Kubernetes and open‑source ML frameworks with a zero‑cost abstraction for infrastructure. The post explains that current ML frameworks lack fault tolerance and scalability; Kubetorch’s open‑source core provides a portable interface across OSS frameworks and Kubernetes, with a serverless option and an interactive CLI and Python API. The commercial platform remains separate, emphasising community‑driven development.
_____________________
Have any interesting content to share in the DATA Pill newsletter?