DATA Pill #104 - What can LLMs never do?, Kafka Connect: A Love/Hate Relationship

ARTICLES

Building an Efficient ETL/ELT Process for Data Delivery | 15 min | Data Science | Mateusz Kujawski | Personal Blog

This post outlines strategies for constructing a resilient ingestion and ETL/ELT process to facilitate seamless data delivery for our data platform.

Kafka Connect: A Love/Hate Relationship | 12 min | Data Streaming | Abraham Leal | Personal Blog

Apache Kafka is the leading streaming platform businesses use, with Kafka Connect facilitating data transfer to and from Kafka using pre-made, configuration-driven Connectors. This article discusses the advantages and challenges of using Kafka Connect, including practical solutions for common issues and examples with Debezium's Postgres Connector and Confluent's JDBC Sink Connector.

What can LLMs never do? | 9 min | LLM | Rohit Krishnan | Personal Blog

Despite LLMs excelling at complex tasks, they struggle with simple ones, and the reasons behind these failures still need to be clarified. This article explores LLMs' limitations, showing that their failures reveal more about their capabilities than their successes. It analyzes why models like GPT-4 and Opus fail at tasks like Wordle and cellular automata, highlighting challenges in reasoning and generalization.

ML & Gen AI for data teams | 12 min | ML | Mikkel Dengsøe | Personal Blog

In this one, Mikkel explores how AI transforms roles within data teams, using real-world examples like Klarna's AI-driven customer service success to illustrate the impact.

The post covers:

・State of AI in data teams: how data practitioners use and expect to use ML and AI
・AI and ML use cases: the most popular ways data teams can apply different types of models
・Data quality in AI and ML systems: common data quality issues and how to spot them
・The five steps to reliable data: how to build AI and ML systems that are fit for purpose for business-critical use cases

This is the next part of the takeaways from selected topics presented during the Big Data Tech Warsaw Summit ’24. The first part shared insights from the Spotify, Dropbox, Ververica, HelloFresh and Agile Lab case studies. This part focuses on the Allegro and Play case studies, as well as on migration from Spark on Hadoop to Databricks, the Agile Data Ecosystem, and how to build your personal RAG-backed Data Copilot.

It’s Time to Retire Terraform | 10 min | LLM | Eric Larssen | Real Kinetic Blog

Terraform's challenges, from bespoke configurations to drift management, point to the need for better solutions such as the Kubernetes operator pattern. This pattern offers improved automation and collaboration through tools like GCP Config Connector and AWS Controllers for Kubernetes. The article argues that this shift is essential as organizations aim to streamline infrastructure management and keep pace with evolving technological demands.

SKILL LAKE

LLM Zoomcamp | 10 weeks | LLM | DataTalks.Club

It’s a free online course about real-life applications of LLMs. In 10 weeks you will learn how to build an AI bot that can answer questions about your knowledge base.

Topics:

・Introduction to LLMs and RAG
・Open-Source LLMs and self-hosting LLMs
・Vector databases
・LLM orchestration
・Monitoring and Guardrails
・Tips and Tricks

TUTORIALS

Creating a Kubernetes cluster from scratch in 1 hour using automation | 18 min | DevOps | Martin Hodges | Personal Blog

This tutorial walks through creating a Kubernetes cluster from scratch in 8 steps, including:

・Creating the infrastructure using Terraform
・Bootstrapping the servers
・Setting up an OpenVPN connection
・Configuring ingress and egress
・Creating a storage solution
・Installing Kubernetes
・Configuring the applications and tools on the cluster

Optimizing RAG - Supervised Embeddings & Reranking with your data with LlamaIndex | 12 min | RAG | Aruna Withanage | EffectZ.AI Blog

This tutorial shows how to develop a RAG system using LlamaIndex, which offers various embedding and reranking options. It guides readers through training an embedding model on their own data and refers newcomers to an introductory blog post on LlamaIndex.

NEWS

Stack Overflow and OpenAI Partner to Strengthen the World’s Most Popular LLMs | 4 min | LLM | Stack Overflow

Stack Overflow and OpenAI today announced a new API partnership that will empower developers by combining the collective strengths of the world’s leading knowledge platform for highly technical content with the world’s most popular LLMs for AI development.

DATA TUBE

Cloud-Native LLM Deployments Made Easy Using LangChain | 34 min | LLM | Ezequiel Lanza, Arun Gupta | CNCF

This talk walks you through how to smoothly and efficiently turn your trained models into working applications by deploying an end-to-end, containerized LangChain LLM application in a cloud-native environment. You'll learn how quickly and easily it can be achieved.

CONFS, EVENTS AND MEETUPS

Infoshare | Gdańsk | 22nd-23rd May

Software developers, business leaders, and tech enthusiasts gather annually in Gdańsk for Infoshare, CEE's most prominent event, to connect and evolve. This year, DataMass collaborates with Infoshare to introduce a new stage focused on AI/ML innovation, data engineering, and cloud scalability.

Use the code SC24-DATAPill10 to get a 10% discount.

CONTEST!

Win a free Developer Pass to InfoShare!

🤔 What would be the most clickbait and most cringe presentation title for the DataMass Stage at the InfoShare Conference?

We already have some examples made by your competitors:

Mining crypto and VBA macros: A match made in heaven ❤️
How to : 10 row=1$ with LLM CryptoCoin Mining
Big data mining with Excel using futuristic AI

So, what do you do to win an InfoShare pass?

👉 Answer the above question
and optionally:
👉 Subscribe to the datapill.tech weekly data & AI newsletter
👉 Follow InfoShare

🏆 Rules:

1. Submit your suggestion in the comments of this post or by emailing your answer to the DATA Pill newsletter by 13th May, 23:55.
2. The Organizer of the contest is GetInData.
3. The winner will be chosen based on the most interesting proposal, as selected by the Organizer. We value your creativity and unique ideas, and we're excited to see what you come up with!
4. By submitting your proposal, you agree that the Organizer may use this idea for marketing purposes.
5. We will announce the winner in the comments on 14th May.
6. The Organizer reserves the right not to select a winner if the proposed answers are not distinctive, or are offensive or discriminatory.

________________________

Have any interesting content to share in the DATA Pill newsletter?

➡ Join us on GitHub

➡ Dig into previous editions of DATA Pill