DATA Pill feed

DATA Pill #054 - 10 best open-source repos, LLM, Flink and Apache Iceberg + Python


Top 10 Best Open Source Projects on GitHub 2023 | 7 min | Data Science | Open Data Analytics
It’s no secret that open-source software has revolutionized the way software is developed today. Check out the list of the top 10 fastest-growing open-source GitHub repositories. (A few of them are data/analytics projects.)
Safer deployment of streaming applications | 9 min | Flink | Shi Kai Ng | Grab Tech Blog
In this article, you will explore how we ensure that deploying Flink applications remains safe as we incorporate the lessons learned through our journey to continuous delivery.

Find out how users interact with Grab’s systems to develop and deploy Flink applications in three different ways, what problems arise with each, and how they were solved.
Casual data engineering, or: A poor man's Data Lake in the cloud - Part I | 20 min | Data Engineering | Tobias Müller | Toblig
Take a step-by-step tour of how to create a scalable, cost-effective data lake on AWS. Whether you're building solutions for a startup, a small business, or a large enterprise, this guide will help you unlock the power of big data without breaking the bank.
Finding your way through the Large Language Models Hype | 12 min | ML | Piotr Chaberski | GetInData | Part of Xebia Blog
With the birth of ChatGPT, the potential of LLMs is gaining a lot of attention, as it is a BIG thing not only in the machine learning world but also in the everyday lives of many people.

So the question is: what can LLMs really offer us here and now, given the rapidly changing landscape?
Connect, Process, and Share Trusted Data Faster Than Ever: Kora Engine, Data Quality Rules, and More | 8 min | Streaming | Bharath Venkat, David Araujo | Confluent Blog
Take a look at the necessary steps to ensure that data is accurate, complete, and reliable. This article also touches upon the benefits of sharing data between departments and the challenges that arise when multiple teams are involved.


Last week you could find Marcin’s blog post about running machine learning / deep learning models in BigQuery. This time, Marcin uses the latest (still in preview) capabilities of BigQuery ML that allow it to run ONNX models within BigQuery itself.
An Introduction to the Hudi and Flink Integration | 7 min | Data Engineering | Danny Chan | OneHouse Blog
Let’s dive into an overview of how Danny created the Hudi-Flink integration. The background story, use cases and demo are waiting for you.
3 Ways to Use Python with Apache Iceberg | 8 min | Data Engineering | Alex Merced | Dremio Blog
Let’s take a quick look at three ways you can use Python code to work with Apache Iceberg data:

1. Using PySpark to interact with the Apache Spark engine.
2. Using PyArrow or pyodbc to connect to engines like Dremio.
3. Using PyIceberg, the native Apache Iceberg Python API.


Guidance | LLM
Guidance enables you to control modern language models more effectively and efficiently than traditional prompting or chaining. Guidance programs allow you to interleave generation, prompting, and logical control into a single continuous flow matching how the language model actually processes the text. Simple output structures like Chain of Thought and its many variants (e.g. ART, Auto-CoT, etc.) have been shown to improve LLM performance. The advent of more powerful LLMs like GPT-4 allows for even richer structure, and Guidance makes that structure easier and cheaper.
Artie Transfer | Data Engineering
Are you looking for a real-time data integration tool, with an open-source version? Meet Artie Transfer, an open source data integration platform that enables real-time data replication between databases and data warehouses.


8 things that are now possible on ChatGPT thanks to the latest updates, in a Twitter thread.


Amazon has launched an AI-powered everyday experience for shopping called Amazon Assistant.

This feature is similar to the Alexa assistant: it allows shoppers to voice-search for products, compare prices, and make purchases without navigating multiple pages.

The technology behind Amazon Assistant is expected to be integrated into other Amazon products, such as the Amazon App and Alexa, making shopping more convenient and personalized for customers.
Why Microsoft is combining all its data analytics products into Fabric | 5 min | Data Analytics | Anirban Ghoshal | InfoWorld
Microsoft has released Fabric, a data platform combining its data services into one suite. It is mostly a repackaging of existing services, but with a few extras, such as OneLake (similar to GCP BigLake), Synapse Data Science, and Eventstream in Real-Time Analytics. The goal is to create a unified platform experience.


The A.I. Dilemma | 1 h 7 min | AI | Tristan Harris, Aza Raskin | Center for Humane Technology
This talk took place just before GPT-4 was released. Tristan and Aza discussed how existing A.I. capabilities already pose catastrophic risks to a functional society, how A.I. companies are caught in a race to deploy as quickly as possible without adequate safety measures, and what it would mean to upgrade our institutions to a post-A.I. world.


European Women in Technology | 28-29th June | Amsterdam
European Women in Technology offers multiple touchpoints to learn, experience, network, and be inspired, so you can grow your personal brand and shine in your tech career.

Premium ticket holders will gain access to 60+ Strategic Workshops that will provide you with the opportunity to get hands-on and deep-dive into soft and technical topics in a more intimate environment.
ODSC Europe 2023 | 14-15th June | London & Online
With 300 hours of content, the conference features a wide range of sessions for data scientists at every level: the latest advances in machine learning, NLP, LLMs, data analytics, responsible AI and more.
Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub
➡ Dig into previous editions of DATA Pill