DATA Pill feed

DATA Pill #104 - What can LLMs never do?, Kafka Connect: A Love/Hate Relationship


Building an Efficient ETL/ELT Process for Data Delivery | 15 min | Data Science | Mateusz Kujawski | Personal Blog
This post outlines strategies for constructing a resilient ingestion and ETL/ELT process to facilitate seamless data delivery for our data platform.
Kafka Connect: A Love/Hate Relationship | 12 min | Data Streaming | Abraham Leal | Personal Blog
Apache Kafka is the leading streaming platform businesses use, with Kafka Connect facilitating data transfer to and from Kafka using pre-made, configuration-driven Connectors. This article discusses the advantages and challenges of using Kafka Connect, including practical solutions for common issues and examples with Debezium's Postgres Connector and Confluent's JDBC Sink Connector.
What can LLMs never do? | 9 min | LLM | Rohit Krishnan | Personal Blog
Despite LLMs excelling at complex tasks, they struggle with simple ones, and the reasons behind these failures still need to be clarified. This article explores LLMs' limitations, showing that their failures reveal more about their capabilities than their successes. It analyzes why models like GPT-4 and Opus fail at tasks like Wordle and cellular automata, highlighting challenges in reasoning and generalization.
ML & Gen AI for data teams | 12 min | ML | Mikkel Dengsøe | Personal Blog
In this one, Mikkel explores how AI transforms roles within data teams, using real-world examples like Klarna's AI-driven customer service success to illustrate the impact.

The post touches on:

  • State of AI in data teams — how data practitioners use and expect to use ML and AI
  • AI and ML use cases — the most popular ways data teams can apply different types of models
  • Data quality in AI and ML systems — common data quality issues and how to spot them
  • The five steps to reliable data — how to build AI and ML systems that are fit for purpose for business-critical use cases
Private RAG-backed Data Copilot, Allegro and PLAY case studies | 10 min | Data, RAG | Piotr Menclewicz, Michał Kardach, Sylwia Kołpuć | GetInData | Part of Xebia Blog
This is the next part of our takeaways from selected topics presented during the Big Data Tech Warsaw Summit ‘24. In the first part, which you can read here, we shared insights from the Spotify, Dropbox, Ververica, Hellofresh and Agile Lab case studies. This time we focus on the Allegro and Play case studies, as well as the migration from Spark on Hadoop to Databricks, the Agile Data Ecosystem, and how to build your personal RAG-backed Data Copilot.
It’s Time to Retire Terraform | 10 min | LLM | Eric Larssen | Real Kinetic Blog
Terraform's challenges, from bespoke configurations to drift management, point to better alternatives such as the Kubernetes operator pattern. This pattern offers improved automation and collaboration through tools like GCP Config Connector and AWS Controllers for Kubernetes. This shift is essential as organizations aim to streamline infrastructure management and align with evolving technological demands.


LLM Zoomcamp | 10 weeks | LLM | DataTalks.Club
It’s a free online course about real-life applications of LLMs. In 10 weeks you will learn how to build an AI bot that can answer questions about your knowledge base.


・Introduction to LLMs and RAG
・Open-Source LLMs and self-hosting LLMs
・Vector databases
・LLM orchestration
・Monitoring and Guardrails
・Tips and Tricks


Creating a Kubernetes cluster from scratch in 1 hour using automation | 18 min | DevOps | Martin Hodges | Personal Blog
This tutorial discusses the 8 steps to create a Kubernetes cluster from scratch. It carries out the following:

  • Creating the infrastructure using Terraform
  • Bootstrapping the servers
  • Setting up an OpenVPN connection
  • Configuring ingress and egress
  • Creating a storage solution
  • Installing Kubernetes
  • Configuring the applications and tools on the cluster
Learn how to develop a RAG system using LlamaIndex, which offers various embedding and reranking options. It guides readers through training an embedding system with their data and refers newcomers to an introductory blog post on LlamaIndex.


Stack Overflow and OpenAI today announced a new API partnership that will empower developers with the collective strengths of the world’s leading knowledge platform for highly technical content with the world’s most popular LLM models for AI development.


Cloud-Native LLM Deployments Made Easy Using LangChain | 34 min | LLM | Ezequiel Lanza, Arun Gupta | CNCF
This talk walks you through how to smoothly and efficiently transition your trained models to working applications by deploying an end-to-end, containerized LangChain LLM application in a cloud-native environment. You'll learn how quickly and easily it can be achieved.


Infoshare | Gdańsk | 22nd-23rd May
Software developers, business leaders, and tech enthusiasts gather annually in Gdańsk for Infoshare, CEE's most prominent event, to connect and evolve. This year, DataMass is collaborating with Infoshare to introduce a new stage focused on AI/ML innovation, data engineering, and cloud scalability.

Use the SC24-DATAPill10 code to get a 10% discount.


Win a free Developer Pass to InfoShare!

🤔 How would you name the most clickbait and the most cringe presentation title for the DataMass Stage at the InfoShare Conference?

We already have some examples made by your competitors:

  • Mining crypto and VBA macros: A match made in heaven ❤️
  • How to : 10 row=1$ with LLM CryptoCoin Mining
  • Big data mining with Excel using futuristic AI

So, what do you do to win an InfoShare pass?

👉 Answer the above question
and optional:
👉 Subscribe to weekly data & AI newsletter
👉 Follow InfoShare

🏆 Rules:

1. Submit your suggestion in the comments on this post or send your answer to the DATA Pill newsletter email by 13th May, 23:55.
2. The Organizer of the contest is GetInData.
3. The winner will be chosen based on the most interesting proposal, as selected by the Organizer. We value your creativity and unique ideas, and we're excited to see what you come up with!
4. By submitting your proposal, you agree that the Organizer may use this idea for marketing purposes.
5. We will announce the winner in the comments on 14th May.
6. The Organizer reserves the right not to select a winner if the proposed answers are not distinctive, or are offensive or discriminatory.
Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub
➡ Dig into previous editions of DATA Pill