DATA Pill #039 - unleashing ML power at Lyft and Spotify, BigData funeral and lots of news

ARTICLES

Powering Millions of Real-Time Decisions with LyftLearn Serving | 7 min | ML | Hakan Baba & Mihir Mathur | Lyft Engineering Blog

The key component for our ML platform is LyftLearn Serving. LyftLearn Serving is a robust, performant, and decentralized system for deploying and serving ML models; it can be used by any team at Lyft to easily infer models online through network calls.

In this article you will learn about the major components and important design decisions, key ideas, lessons learned and the next steps.

Unleashing ML Innovation at Spotify with Ray | 10 min | ML | Divita Vohra, Keshi Dai, David Xia & Praveen Ravichandran | Spotify Engineering Blog

The machine learning journey at Spotify. How ML has developed. What the lifecycle of ML projects at Spotify looks like and what the next steps are.

Enabling MLOPs in Three Simple Steps | 7 min | MLOps | Dustin Liu | Toward Data Science Blog

Dustin shares his experience with a project he engaged in based on involving the implementation of a multi-class classification prediction system, utilizing the financial transactional data, comprising over 10 million records and over 70 classes.

Through this project, he constructed a simplified MLOPs integration from an end-to-end ML flow perspective, that can be implemented in three steps:

1. Adding Data Extraction & Transformation Governance.

2. Productionalizing ML code.

3. Model Experiment Tracking and Management.

Data integrity vs. Data quality | 5 min | Data Engineering | Saeed Mohajeryami, PhD | Personal Blog

Data integrity and data quality are related yet distinct concepts in data engineering. Saeed digs deeper into them and explains each one of them and highlights their differences. Here are a few techniques that are used to ensure the integrity and important checks that should be considered regarding data quality.

Data Lineage is Broken — Here Are 5 Ways to Fix It | 7 min | Data Engineering | Barr Moses | Personal Blog

Why does data lineage matter? As data stacks grow more complex, mapping lineage becomes more challenging. But when done right, data lineage is incredibly useful. Whilst reading you can explore the signals that lineage may be broken, and the ways data teams can find a better approach.

1.Focus on quality over quantity through lineage

2.Surface what matters through field-level data lineage

3.Organize data lineage for clearer interpretation

…and more, so do not hesitate to read Barr's article.

Big Data is Dead | 10 min | Big Data | Jordan Tigani | MotherDuck

Is this clickbait?

This text will make the case that the era of Big Data is over. It had a good run, but now we can stop worrying about data size and focus on how we’re going to use it to make better decisions. Big Data is real, but most people may not need to worry about it. After reading this, answer some questions and decide if you are a good candidate for the new generation of data tools that will help you handle data at the size you actually have, not the size that people try to scare you into thinking that you might have someday.

Just being curious, do you agree with his views?

Adding Zonal Resiliency to Etsy’s Kafka Cluster: Part 1 | 8 min | Cloud & Kafka | Andrey Polyakov, Kamya Shethia | Etsy Blog

How etsy made our Kafka cluster resilient to zonal failures. For such an important production service, the migration had to be accomplished with zero downtime in production. This post will discuss how they accomplished that feat, and where they're looking to optimize costs following a successful rollout.

The technology behind GitHub’s new code search | 8 min | GitHub | Timothy Clem | GitHub Blog

How does the new code search system work? What about the architecture and technical underpinnings of the product?

TUTORIAL

This one introduces you to MLOps, demonstrates the usefulness, tells more about what processes / flows / pipelines are. What's more, you can find here a tutorial on how to build your own pipeline and consolidate each job.

Flink SQL: How to detect patterns with MATCH_RECOGNIZE | 3 min | Flink | Ververica Blog

You’ll learn about detecting patterns with the MATCH_RECOGNIZE function. And how to use Flink SQL to write queries for this type of problem.

Rebuilding a Cassandra cluster using Yelp’s Data Pipeline | 9 min | Cassandra | Muhammad Junaid Muzammil | Yelp Engineering Blog

How Yelp rebuilt one of their Cassandra(C*) clusters by removing malformed data using Yelp’s Data Pipeline.

Yelp orchestrates Cassandra clusters on Kubernetes with the help of operators. They tend to use multiple smaller clusters based on the data, traffic and business requirements. This strategy assists in containing the blast radius in case of failure events.

The primary driver for this effort was the discovery of data corruption across multiple nodes inside one of our Cassandra clusters. This corruption was widespread to different tables including those in the system keyspace.

NEWS

Microsoft launches Teams Premium with features powered by OpenAI | 2 min | AI | Tom Warren | The Verge Blog

OpenAI’s GPT-3.5 model got Microsoft Teams to the next level. Notes, mentions and a full transcript are all available, with each speaker’s contributions highlighted in a neat timeline of topics and chapters. Are you hungry for more? Dive into the text to read about AI-powered intelligent recap features, some existing Teams features and better meeting protections.

An important next step on our AI journey | 4 min | AI | Sundar Pichai | Google Blog

Everybody, let's meet Bard. Google has been working on an experimental conversational AI service, powered by LaMDA. Bard seeks to combine the breadth of the world’s knowledge with the power, intelligence and creativity of our large language models. It draws on information from the web to provide fresh, high-quality responses. What does this mean? For example, you can use Bard to simplify complex topics. What will it change? Read and find out.

Code in the Cloud With Anaconda—for Free! | 2 min | Cloud | Anna Ng | Anaconda Blog

Anaconda's fully-loaded and ready-to-code cloud notebook is now available for free.

dbt Labs signs definitive agreement to acquire Transform | 4 min | dbt | Tristan Handy | dbt Blog

dbt Labs has signed a definitive agreement to acquire Transform, the original innovators behind the semantic layer in the modern data stack.

PODCAST

Data Update - The best managers look for evidence in data - business intuition is no longer enough when making decisions. Two stories on how data-driven approach helped solve different business problems | 27 min | MLOps | Adrian Dembek, Piotr Menclewicz | Radio DaTa Podcast

If you hear that your company should be data-driven, but you're not sure what this means in practice, in this episode we share two stories of data driven companies. Both of them are examples of data literate companies from different perspectives. In the first story, you can learn how the big tech company were allowed to detect problems and start solving the right ones. The second one is how an e-commerce company prepared more effective promotions and increased revenue.

Using AI to Improve Data Quality in Healthcare | 40 min | AI | Host: Richie Cotton, Guests: Sunna Jo, Nate Fox | DataFramed Podcast

Nate Fox, the CTO, Co-Founder and President of Ribbon Health, and Sunna Jo, a former pediatrician who is now a data scientist at Ribbon Health, share how AI can scale data quality in healthcare.

Takeaways:

Data Engineering is very valuable when it comes to the scalability of data cleaning. It’s essential to think creatively about how to solve data quality challenges, so that your solutions work reliably at scale.

Having a strong and clear operating definition for what is considered good quality data can help you more effectively work with messy data, transform it into usable data, and draw meaningful insights from it.

CONFS EVENTS AND MEETUPS

GoDataFest 2023 | 15-17 Feb | In-person, Amsterdam

GoDataFest brings together conversations about the latest data technology into an event experience. Engage with product specialists and experienced practitioners of leading tech. Learn from the experiences of industry-leading enterprises. GoDataFest is proudly powered by Xebia.

Optimizing data in Apache Iceberg: Performance strategies & Foundations of Data Teams | 16 Feb | Double Webinar

Last reminder before next week’s webinar:

Optimizing data in Apache Iceberg: Performance strategies with Dipankar Mazumdar
Foundations of Data Teams with Jesse Anderson

________________________

Have any interesting content to share in the DATA Pill newsletter?

➡ Join us on GitHub

➡ Dig previous editions of DataPill