DATA Pill #106 - OpenAI GPT-4o to query your database, Postgres on Kubernetes

ARTICLES

My First Billion (of Rows) in DuckDB | 12 min | Data Processing | João Pedro | Towards Data Science

João explores DuckDB by revisiting the challenge of processing Brazilian electronic ballot box logs to calculate vote-time metrics, providing a benchmark for performance and user experience.

Scale Real-Time Streams to Delta Lakehouse with Apache Flink on Azure HDInsight on AKS | 7 min | Real-Time Streaming | Sairam Yeturi, Keshav Singh | Microsoft Blog

This blog explores using Delta format as a source and sink for Apache Flink stream processing. Delta, an ACID-compliant lakehouse format, supports petabyte-scale processing and acts as a single source of truth, seamlessly integrating with Microsoft Fabric.

Unveiling the Future of Streaming Data Platforms | 10 min | Data Streaming | Filip Yonov, Kaye Lincoln | Ververica Blog

Filip Yonov, Head of Streaming at Aiven, is joining this year's Flink Forward Program Committee. Read a short Q&A session about his journey with streaming data platforms and his insights on upcoming industry trends.

How to use OpenAI GPT-4o to query your database? | 5 min | SQL | Howard Chi | WrenAI Blog

This post will guide you through setting up GPT-4o with WrenAI to query your PostgreSQL database, enhancing your data retrieval process with faster responses and cost efficiency.

Fine-tuning AWS ASGs with Attribute Based Instance Selection | 5 min | Data Engineering | Ajay Pratap Singh | Yelp Engineering

This post covers how attribute-based instance selection improved Yelp's autoscaling, and how the team switched from Clusterman to Karpenter.

NEWS

Amazon DocumentDB zero-ETL integration with Amazon OpenSearch Service is now available | 5 min | Data Analytics | AWS Blog

Amazon DocumentDB's zero-ETL integration with Amazon OpenSearch Service simplifies your data architecture and boosts search capabilities. Read about the setup process and how it makes advanced search analytics easier.

TUTORIAL

Building a Real-Time Data Pipeline | 11 min | Data Engineering | Andy Sawyer | Personal Blog

Andy demonstrates creating a real-time data pipeline using Kafka, Polars, and Delta Lake. It’s easier than you might think, and you can find the code on their GitHub to try it yourself.

PODCAST

Postgres on Kubernetes | 1 h 12 min | Data Engineering | Álvaro Hernández | Kubernetes Podcast

Álvaro Hernández is the founder and CEO of OnGres, a company that provides, among other things, a distribution of Postgres that runs on Kubernetes, called “StackGres”.

DATA TUBE

What's next for AI agentic workflows ft. Andrew Ng of AI Fund | 14 min | AI | Andrew Ng | Sequoia Capital

Andrew Ng, founder of DeepLearning.AI and AI Fund, shows the difference between non-agentic (plain LLM-based) and agentic workflows in a smooth, insightful talk built around examples: zero-shot vs. iterative workflows, RAG vs. agentic RAG. See how the agentic approach gives a better outcome.

CONFS, EVENTS AND MEETUPS

Data Learning Week | Online | 28-31st May

Would you like to test one of our courses before investing? Then come to our Data Learning Week, a series of 4 free workshops. Each session is a free trial lesson from the full training. Choose your topic, check the agenda, and sign up:

GenAI taster: discover the power of ChatGPT
dbt Learn training taster: the new standard for data transformation
Find valuable data use cases with Analytics Translation
Power BI in an hour

________________________

Have any interesting content to share in the DATA Pill newsletter?

➡ Join us on GitHub

➡ Dig into previous editions of DATA Pill