DATA Pill feed

DATA Pill #098 - Deploy LLM in your Private Kubernetes Cluster, The Real Cost of Self-Hosting MLflow


Data Quality Error Detection powered by LLMs | 17 min | LLM | Simon Grah | Towards Data Science Blog
Start with a review of the introductory article on the Data Dirtiness Score, which explains the key assumptions and demonstrates how to calculate the score. This article is the second in a series about cleaning data using Large Language Models (LLMs), with a focus on identifying errors in tabular data sets.
Unlocking Kafka's Potential: Tackling Tail Latency with eBPF | 7 min | Data Engineering | Maciej Mościcki, Piotr Rżysko | Allegro Tech Blog
This blog post describes the Allegro team's journey: how they used Kafka protocol sniffing and eBPF to identify and remove a performance bottleneck.
Evaluating Large Language Model (LLM) systems: Metrics, challenges, and best practices | 11 min | LLM | Jane Huang, Kirk Li, Daniel Yehdego | Data Science at Microsoft
This article thoroughly examines LLM system evaluation, distinguishing between model and system evaluation and scrutinizing online and offline strategies. It focuses on AI assessing AI and Responsible AI metrics. The article highlights the relevance of diverse evaluation tools and frameworks across application scenarios, urging readers to stay informed about evolving metrics and frameworks for a comprehensive understanding.
How we expose data in BigQuery | 8 min | Data Engineering | Roxanne Ricci | Back Market Blog
This transition highlights a user-centric approach, focusing on building a domain-oriented, self-service data platform through experimentation. Back Market aims to improve user experience and operational efficiency by prioritizing seamless data organization and access policies.
The Real Cost of Self-Hosting MLflow | 5 min | ML | Aurimas Griciunas | blog

  • MLflow is a popular experiment-tracking and end-to-end ML platform
  • Since MLflow is open source, it’s free to download, and hosting an instance does not incur license fees
  • Hosting MLflow requires multiple infrastructure components and comes with maintenance responsibilities, the cost of which can be difficult to estimate

On AWS, which offers various options for hosting MLflow, a medium-sized instance comes in at about $200 per month, plus storage and data transfer costs.
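To make the "plus storage and data transfer" part concrete, here is a rough back-of-the-envelope sketch. Only the $200 instance figure comes from the summary above; the per-GB rates, the usage volumes, and the function name are illustrative assumptions, not quoted AWS prices.

```python
def estimate_monthly_cost(
    instance_usd: float,
    storage_gb: float,
    transfer_gb: float,
    storage_usd_per_gb: float = 0.023,   # assumed rate, check current pricing
    transfer_usd_per_gb: float = 0.09,   # assumed rate, check current pricing
) -> float:
    """Return a rough monthly bill: instance + artifact storage + data transfer."""
    storage_cost = storage_gb * storage_usd_per_gb
    transfer_cost = transfer_gb * transfer_usd_per_gb
    return instance_usd + storage_cost + transfer_cost


# Hypothetical usage: 500 GB of stored artifacts, 100 GB transferred out per month.
total = estimate_monthly_cost(instance_usd=200.0, storage_gb=500.0, transfer_gb=100.0)
print(f"Estimated monthly cost: ${total:.2f}")
```

This leaves out the maintenance effort the article mentions, which is usually the harder part to estimate.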


Data Learning Week | Online | 8-11th April
Would you like to test one of our courses before investing money in it? Then come to our Data Learning Week, a series of 4 free hands-on workshops. Each session is a free first-trial lesson for the full training. We will also have a special bonus from the Academy for all workshop participants.

Choose your topic, check the agenda, and sign up:


This tutorial dives into such a custom solution:

  • Deploy our ML model using a custom Docker image.
  • Use a blue-green deployment strategy to ensure there is no downtime when deploying our model.
  • Run smoke tests to see if our deployment is working as expected, before we replace our previous model.
  • Use the Azure ML Python SDK to configure and manage deployment to Azure ML.
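The blue-green flow with smoke tests listed above can be sketched in plain Python. This is a toy simulation under stated assumptions: the `Endpoint` class, `smoke_test`, and `blue_green_deploy` are hypothetical names made up for illustration and are not the Azure ML Python SDK, which the tutorial itself uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

Model = Callable[[dict], dict]


@dataclass
class Endpoint:
    """Toy stand-in for a managed endpoint (hypothetical, not a real SDK class)."""
    deployments: Dict[str, Model] = field(default_factory=dict)
    traffic: Dict[str, int] = field(default_factory=dict)  # percent per deployment

    def invoke(self, payload: dict, deployment: Optional[str] = None) -> dict:
        # Without an explicit target, route to the deployment holding the most traffic.
        if deployment is None:
            deployment = max(self.traffic, key=self.traffic.get)
        return self.deployments[deployment](payload)


def smoke_test(endpoint: Endpoint, deployment: str) -> bool:
    """Call the new deployment directly and check the response shape."""
    try:
        result = endpoint.invoke({"inputs": [1.0, 2.0]}, deployment=deployment)
    except Exception:
        return False
    return isinstance(result, dict) and "prediction" in result


def blue_green_deploy(endpoint: Endpoint, name: str, model: Model) -> bool:
    """Add a deployment at 0% traffic, smoke-test it, then switch traffic over."""
    endpoint.deployments[name] = model
    endpoint.traffic[name] = 0
    if not smoke_test(endpoint, name):
        # Roll back: the previous deployment keeps serving all traffic.
        del endpoint.deployments[name]
        del endpoint.traffic[name]
        return False
    for other in endpoint.traffic:
        endpoint.traffic[other] = 0
    endpoint.traffic[name] = 100
    return True


# Hypothetical usage with lambdas standing in for model containers.
endpoint = Endpoint()
blue_green_deploy(endpoint, "blue", lambda p: {"prediction": sum(p["inputs"])})
ok = blue_green_deploy(endpoint, "green", lambda p: {"prediction": 2 * sum(p["inputs"])})
bad = blue_green_deploy(endpoint, "broken", lambda p: {"error": "boom"})
```

The key design point is that the new deployment receives no live traffic until the smoke test passes, so a failed release never affects users.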


How to Deploy LLM in your Private Kubernetes Cluster in 5 STEPS | 17 min | LLM | Marcin Zabłocki | GetInData | Part of Xebia
In this tutorial, Marcin Zabłocki shows how to deploy an LLM in your private Kubernetes cluster in 5 simple steps, using Mistral as the example.
Streams Forever: Kafka Summit London 2024 Keynote | 1 h 48 min | LLM | Jay Kreps | Confluent
Jay Kreps, Co-creator of Apache Kafka and CEO of Confluent, will present his vision of unifying the operational and analytical worlds with data streams and showcase exciting new product capabilities. During this keynote, the winner and finalists of the $1M Data Streaming Startup Challenge will showcase how their use of data streaming is disrupting their categories.


ML for Finance and Storytelling through Data | 1 h 7 min | ML | Daniel Bashir, Ben Wellington
On challenges for ML in quantitative trading and investing, and telling stories through data.


Big Data Technology Warsaw Summit | Warsaw and Online | 10th and 11th April
Join this independent conference, whose presentations are arranged into nine categories, and find the topics that interest you most. Examples include:

  • Data Engineering
  • Streaming and real-time analytics
  • ML & Data Science
  • Gen AI

And more! Learn from speakers from companies like Dropbox, IKEA, Cloudera, Allegro, Ververica, and Freenow.

Shhh… Use the code DataPill200 to get a 200 PLN discount!
Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub
➡ Dig into previous editions of DataPill