DATA Pill #103 - Semantic chunking for RAG + free InfoShare pass contest

ARTICLES

Solving a Data Engineering task with pragmatism and asking WHY? | 7 min | Data Engineering | Volker Janz | Personal Blog

Constructing a data service calls for a deep technical breakdown and careful consideration, but this article emphasizes understanding the underlying motivations (what you want and why) before diving into the how. By exploring various technical strategies and their practical applications, it shows how clarity of purpose simplifies technology choices and can drive more effective outcomes.

Do We Need the Lakehouse Architecture? | 10 min | Data Engineering | Vu Trinh | Data Engineer Things Blog

Initially viewed as a marketing term, "Lakehouse" has emerged as a critical concept in data management. It merges data lakes and warehouses to improve the handling of large-scale data analytics and storage. This architecture integrates the flexibility of data lakes with the robust features of traditional data warehouses, aiming to improve data reliability, accessibility, and analytical performance.

Tasty takeaways from Spotify, Dropbox, Ververica | Original creators of Apache Flink®, HelloFresh, Agile Lab, and insights from the 10th edition of the Big Data Tech.

Among other topics:

Data Quality
Real-time Clickstream Analytics
Replacing lambda with kappa architecture
GreenOps

SKILL LAKE

The Full Stack 7-Steps MLOps Framework | Takes time | MLOps | Paul Iusztin

You will learn how to build, train, serve, and monitor an ML system using a batch architecture. We will show you how to integrate an experiment tracker, a model registry, a feature store, Docker, Airflow, GitHub Actions and more!

TUTORIALS

Technical Guide: End-to-End CI/CD DevOps with Jenkins, Terraform, Docker, Kubernetes, SonarQube, ArgoCD, AWS EC2, EKS, and GitHub Actions (Django Deployment) | 48 min | DevOps | Joel Wembo | Django Unleashed Blog

Read a guide to automating Django deployments, leveraging the power of Jenkins, Kubernetes, Terraform, and GitHub Actions. By utilizing a solid CI/CD pipeline composed of AWS EC2, EKS, Docker, SonarQube, and ArgoCD, developers can improve the consistency and speed of their deployment processes.

Semantic Chunking for RAG | 17 min | RAG | Plaban Nayak | The AI Forum

Learn about chunking and RAG as methods to improve LLM performance. Chunking divides text into smaller, manageable segments that fit within an LLM's context window, which helps with common issues such as hallucinations, where LLMs generate incorrect information. RAG makes retrieval more accurate by encoding these chunks into vector embeddings and storing them for efficient access at query time.
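To make the idea concrete, here is a minimal sketch of semantic chunking. It is not the code from the linked article. It assumes a toy bag-of-words `embed` function standing in for a real embedding model, and a hypothetical similarity `threshold` of 0.2. A new chunk starts whenever two neighbouring sentences look too dissimilar.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    """Group consecutive sentences; split where similarity drops below threshold."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(curr)) < threshold:
            chunks.append([curr])  # topic shift: start a new chunk
        else:
            chunks[-1].append(curr)  # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


sample = (
    "Kafka stores events in topics. Kafka topics are split into partitions. "
    "Cats sleep most of the day. Cats also sleep at night."
)
for chunk in semantic_chunks(sample):
    print(chunk)
```

In a real RAG pipeline, `embed` would be replaced by a sentence-embedding model and the resulting chunks would be stored in a vector index. The threshold usually needs tuning per corpus.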

NEWS

Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPU! | 4 min | LLM | Gavin Li | AI Advances

The newly released Llama3 model has stirred excitement with its ability to run on minimal hardware and compete with major models like GPT-4. This guide explores Llama3's advanced features and compares them to industry leaders. It also provides practical steps for deploying it on a single GPU, underscoring the growing significance of open-source models in AI.

Snowflake Arctic: The Best LLM for Enterprise AI — Efficiently Intelligent, Truly Open | 7 min | LLM | Snowflake Blog

The Snowflake AI Research Team introduces Snowflake Arctic, a top-tier enterprise-focused LLM that pushes the frontiers of cost-effective training and openness.

DATA TUBE

Real Estate End to End Data Engineering using AI | 2 h 8 min | Data Engineering | Yusuf Ganiyu | CodeWithYou

In this video, you'll learn how to build a comprehensive Real Estate data engineering pipeline, covering everything from data gathering and ingestion to processing and storage. This tutorial uses ChatGPT, WebSocket, Chrome DevTools Protocol, Docker, Apache Kafka, Spark with Master Worker Architecture, Zookeeper, Confluent Control Center, and Cassandra.

PODCAST

The Rise of Modern Data Management | 58 min | Data Management | Chad Sanderson | MLOps.community Podcast

Chad Sanderson explores how cloud technology has transformed data management in today’s AI-driven era. He discusses modern practices like data change detection, data contracts, and CI/CD tests, emphasizing the roles of data producers and consumers.

DBRX and the Future of Open LLMs | 45 min | LLM | Ben Lorica, Hagay Lupesko | The Data Exchange Podcast

Hagay Lupesko from Databricks MosaicAI introduces DBRX, an innovative open LLM that merges quality with cost-effectiveness for AI. He discusses improving AI performance using high-quality training data and a mixture-of-experts model, particularly for coding and math tasks. By leveraging the open-source community and efficient deployment, DBRX aims to make advanced AI more accessible and continuously improve AI development.

CONFS EVENTS AND MEETUPS

Infoshare | Gdańsk | 22nd-23rd May

Software developers, business leaders, startuppers, investors, marketers, and technology enthusiasts gather in Gdańsk to learn and get inspired at this celebration of the digital world. Every year we bring together thousands of people looking for a platform to connect and evolve, which makes the Infoshare conference the biggest tech and startup event in CEE. This year DataMass is joining forces with Infoshare to create a new stage devoted to AI/ML innovation, data engineering efficiency, and cloud scalability.

Pssst! Use the SC24-DATAPill10 code to get a 10% discount. The price will increase on 8th May!

CONTEST!

Win a free Developer Pass to InfoShare!

🤔 What would be the most clickbait, most cringe presentation title for the DataMass Stage at the InfoShare Conference?

✨ Will it be powered by AI?

✨ Will it be GenAI-related?

✨ Will it start with "AI will take your job"?

So, what do you do to win an InfoShare pass?

👉 Answer the above question

👉 Subscribe to the datapill.tech weekly data & AI newsletter

👉 Follow InfoShare

🏆 Rules:

1. Submit your suggestion in the comments of this post or by emailing your answer to the DATA Pill newsletter address by 13th May, 23:55.
2. The Organizer of the contest is GetInData.
3. The winner will be chosen based on the most interesting proposal, as selected by the Organizer. We value your creativity and unique ideas, and we're excited to see what you come up with!
4. By submitting your proposal, you agree that the Organizer may use this idea for marketing purposes.
5. We will announce the winner in the comments on 14th May.
6. The Organizer reserves the right not to select a winner if the proposed answers are not distinctive, or are offensive or discriminatory.

________________________

Have any interesting content to share in the DATA Pill newsletter?

➡ Join us on GitHub

➡ Dig into previous editions of DATA Pill