DATA Pill feed

DATA Pill #103 - Semantic chunking for RAG + free InfoShare pass contest


Solving a Data Engineering task with pragmatism and asking WHY? | 7 min | Data Engineering | Volker Janz | Personal Blog
Building a service calls for a deep technical breakdown, but this post emphasizes understanding the underlying motivations (what you want and why) before diving into the how. By exploring various technical strategies and their practical applications, it shows how clarity of purpose simplifies technology choices and can drive more effective outcomes.
Do We Need the Lakehouse Architecture? | 10 min | Data Engineering | Vu Trinh | Data Engineer Things Blog
Initially viewed as a marketing term, "Lakehouse" has emerged as a critical concept in data management. It merges data lakes and warehouses to improve the handling of large-scale data analytics and storage. This architecture integrates the flexibility of data lakes with the robust features of traditional data warehouses, aiming to improve data reliability, accessibility, and analytical performance.
Takeaways from Spotify, Dropbox, Ververica, HelloFresh and Agile Lab | 14 min | Data, AI, Cloud | Maciej Maciejko, Michał Kardach, Adam Kawa, Sylwia Kołpuć | GetInData | Part of Xebia Blog
Tasty takeaways from Spotify, Dropbox, Ververica (original creators of Apache Flink®), HelloFresh, and Agile Lab, plus insights from the 10th edition of the Big Data Tech.

Among other topics:

  • Data Quality
  • Real-time Clickstream Analytics
  • Replacing lambda with kappa architecture
  • GreenOps


The Full Stack 7-Steps MLOps Framework | Takes time | MLOps | Paul Iusztin
You will learn how to build, train, serve, and monitor an ML system using a batch architecture. We will show you how to integrate an experiment tracker, a model registry, a feature store, Docker, Airflow, GitHub Actions and more!


Read a guide to automating Django deployments, leveraging the power of Jenkins, Kubernetes, Terraform, and GitHub Actions. By utilizing a solid CI/CD pipeline composed of AWS EC2, EKS, Docker, SonarQube, and ArgoCD, developers can improve the consistency and speed of their deployment processes.
Semantic Chunking for RAG | 17 min | RAG | Plaban Nayak | The AI Forum
Learn about chunking and RAG as methods to improve LLM performance. Chunking divides text into smaller, manageable segments that fit within an LLM's context window. RAG encodes these chunks into vector embeddings and stores them for efficient retrieval, so the model can ground its answers in relevant text, which helps with common issues such as hallucinations, where LLMs generate incorrect information.
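To make the idea concrete, here is a minimal sketch of semantic chunking followed by retrieval. It is not the method from the linked article: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the names `semantic_chunks`, `retrieve`, and the `threshold` value of 0.3 are assumptions chosen for illustration. A new chunk starts wherever the similarity between neighbouring sentences drops below the threshold.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector. Assumption: stands in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(text, threshold=0.3):
    """Group consecutive sentences; start a new chunk when similarity drops.

    The threshold is a hypothetical value tuned for the toy embedding above.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        # Compare the incoming sentence with the last sentence of the open chunk.
        if current and cosine(embed(current[-1]), embed(sentence)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query (the retrieval step of RAG)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


if __name__ == "__main__":
    sample = (
        "Kafka streams events to consumers. Kafka consumers read events in order. "
        "Cassandra stores rows in partitions. Cassandra partitions hold many rows."
    )
    for chunk in semantic_chunks(sample):
        print("-", chunk)
    print(retrieve("Where are rows stored?", semantic_chunks(sample)))
```

In a real pipeline the toy `embed` would be replaced by a sentence-embedding model and the chunks would be stored in a vector index, but the split-on-similarity-drop logic would look broadly the same.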


The newly released Llama3 model has stirred excitement with its ability to run on minimal hardware and compete with major models like GPT-4. This guide explores Llama3's advanced features and compares them to industry leaders. It also provides practical steps for deploying it on a single GPU, underscoring the growing significance of open-source models in AI.
The Snowflake AI Research Team introduces Snowflake Arctic, a top-tier enterprise-focused LLM that pushes the frontiers of cost-effective training and openness.


Real Estate End to End Data Engineering using AI | 2 h 8 min | Data Engineering | Yusuf Ganiyu | CodeWithYou
In this video, you'll learn how to build a comprehensive Real Estate data engineering pipeline, covering everything from data gathering and ingestion to processing and storage. This tutorial uses ChatGPT, WebSocket, Chrome DevTools Protocol, Docker, Apache Kafka, Spark with Master Worker Architecture, Zookeeper, Confluent Control Center, and Cassandra.


The Rise of Modern Data Management | 58 min | Data Management | Chad Sanderson | Podcast
Chad Sanderson explores how cloud technology has transformed data management in today’s AI-driven era. He discusses modern practices like data change detection, data contracts, and CI/CD tests, emphasizing the roles of data producers and consumers.
DBRX and the Future of Open LLMs | 45 min | LLM | Ben Lorica, Hagay Lupesko | The Data Exchange Podcast
Hagay Lupesko from Databricks MosaicAI introduces DBRX, an innovative open LLM that merges quality with cost-effectiveness for AI. He discusses improving AI performance using high-quality training data and a mixture-of-experts model, particularly for coding and math tasks. By leveraging the open-source community and efficient deployment, DBRX aims to make advanced AI more accessible and continuously improve AI development.


Infoshare | Gdańsk | 22nd-23rd May
Software developers, business leaders, startuppers, investors, marketers, and enthusiasts of technology gather in Gdańsk to learn and get inspired at this celebration of the digital world. Every year we bring together thousands of people looking for a platform to connect and evolve, which makes the Infoshare conference the biggest tech and startup event in CEE. This year DataMass is joining forces with Infoshare to create a new stage devoted to AI/ML innovation, data engineering efficiency, and cloud scalability.

Pssst! Use the code SC24-DATAPill10 to get a 10% discount. The price will increase on 8th May!


Win a free Developer Pass to InfoShare!

🤔 What would be the most clickbaity, most cringeworthy presentation title for the DataMass Stage at the InfoShare Conference?

✨ Will it be powered by AI?

✨ Will it be GenAI-related?

✨ Will it start with "AI will take your job"?

So, what do you do to win an InfoShare pass?

👉 Answer the above question

👉 Subscribe to the weekly data & AI newsletter

👉 Follow InfoShare

🏆 Rules:

1. Submit your suggestion in the comments of this post or by emailing your answer to the DATA Pill newsletter address by 13th May, 23:55.
2. The Organizer of the contest is GetInData.
3. The winner will be chosen based on the most interesting proposal, as selected by the Organizer. We value your creativity and unique ideas, and we're excited to see what you come up with!
4. By submitting your proposal, you agree that the Organizer may use this idea for marketing purposes.
5. We will announce the winner in the comments on 14th May.
6. The Organizer reserves the right not to select a winner if the proposed answers are not distinctive, or are offensive or discriminatory.
Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub
➡ Dig into previous editions of DATA Pill