DATA Pill #049 - 91% of ML Models Degrade in Time, Introducing MLflow 2.3, and Secrets of Deep Reinforcement Learning

ARTICLES

91% of ML Models Degrade in Time | 10 min | ML | Santiago Víquez | nannyML

The study by Vela et al. showed that ML models' performance doesn't remain static, even when they achieve high accuracy at deployment time, and that different ML models age at different rates even when trained on the same datasets. Another relevant remark is that not all temporal drifts cause performance degradation. Therefore, the choice of model and its stability becomes one of the most critical factors in dealing with temporal performance degradation.
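To make "aging" concrete, here is a minimal sketch of one way to track it: compare the error in each time window after deployment with the error measured at deployment. The function names and the 1.5 threshold are hypothetical choices for illustration, not the method used in the study.

```python
def mse(y_true, y_pred):
    """Mean squared error for two equal-length sequences."""
    errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


def relative_errors(baseline_mse, windows):
    """Error of each post-deployment window relative to the deployment-time error.

    `windows` is a list of (y_true, y_pred) pairs, one per time period.
    A value of 1.0 means "as good as at deployment"; larger means worse.
    """
    return [mse(y_true, y_pred) / baseline_mse for y_true, y_pred in windows]


def first_degraded_window(rel_errors, threshold=1.5):
    """Index of the first window whose relative error exceeds the
    (arbitrary, assumed) threshold, or None if none does."""
    for index, value in enumerate(rel_errors):
        if value > threshold:
            return index
    return None
```

Running the same check for several models trained on the same data is a simple way to compare how quickly each one ages.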

Use dbt and Duckdb instead of Spark in data pipelines | 7 min | Data Engineering | Niels Claeys | datamindedbe Blog

Niels presents several reasons to consider using dbt and DuckDB instead of Spark. He also highlights some limitations and challenges of using dbt and DuckDB.

The article provides a comprehensive overview of dbt and DuckDB and how they can be used in data pipelines. It encourages readers to explore these tools as alternatives to Spark.

Fivetran Puts the Customer Last | 10 min | Data Engineering | Lauren Balik | Personal Blog

Lauren strikes back, this time with a conspiracy theory about Modern Data Stack vendors and what Fivetran's long-awaited S3 connector has to do with it. As usual, the narration style is provocative, but it is still good food for thought. If you start looking at your cloud spend, human capital, and products as a portfolio of investments that generate returns, you will develop habits that lead you away from these Modern Data Stack games.

The road to running Apache Flink applications on AWS KDA | 6 min | Cloud | Duc Anh Khu | Deliveroo Engineering blog

This article describes Deliveroo's road to running Apache Flink applications on AWS KDA. Why did the team choose AWS KDA, and what lessons have they learned? Dive into the text to find out, along with their plans for the future.

How We Performed ETL on One Billion Records For Under $1 With Delta Live Tables | 8 min | Data Engineering | Dillon Bostwick, Shannon Barrow, Franco Patano, Rahul Soni | Databricks Blog

Check out how you can optimize your Databricks deployment significantly using Delta Live Tables (DLT). It is a new feature that enables real-time change data capture (CDC) with transactional consistency, enabling analytics on data in motion. Dive into a detailed description of the process and tools used.

SELECT BigQuery_dataset WHERE physical_storage_cost < logical_storage_cost | 4 min | Data Engineering | Robert Sahlin | Data Engineering on GCP Blog

Robert shares a way that may save a few thousand dollars a year.
It's actually quite easy to find the datasets with the biggest cost-saving potential: you can query the information schema. If your data compresses well and is partitioned and append-only, there is a good chance that you will save money by switching the billing model to physical storage. For example, in our project used as a bronze layer, we could see as much as 80% cost savings potential!
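As a rough illustration of the comparison behind that switch, here is a minimal Python sketch. It assumes you have already pulled per-dataset byte totals from the information schema. The `DatasetStorage` fields, the dataset name, and the per-GiB prices are illustrative placeholders rather than values from the article, so substitute the actual rates for your region. Time-travel and fail-safe storage, which are billed under the physical model, are left out for simplicity.

```python
from dataclasses import dataclass

GIB = 1024 ** 3

# Assumed monthly prices per GiB (placeholders, not taken from the article).
ASSUMED_PRICES = {
    "active_logical": 0.02,
    "long_term_logical": 0.01,
    "active_physical": 0.04,
    "long_term_physical": 0.02,
}


@dataclass
class DatasetStorage:
    """Byte totals for one dataset (hypothetical container for this sketch)."""
    name: str
    active_logical_bytes: int
    long_term_logical_bytes: int
    active_physical_bytes: int
    long_term_physical_bytes: int


def monthly_costs(ds, prices=ASSUMED_PRICES):
    """Return (logical_cost, physical_cost) per month under the given prices."""
    logical = (
        ds.active_logical_bytes * prices["active_logical"]
        + ds.long_term_logical_bytes * prices["long_term_logical"]
    ) / GIB
    physical = (
        ds.active_physical_bytes * prices["active_physical"]
        + ds.long_term_physical_bytes * prices["long_term_physical"]
    ) / GIB
    return logical, physical


def savings_ratio(ds, prices=ASSUMED_PRICES):
    """Fraction of the logical bill saved by switching to physical billing.

    Negative values mean physical billing would cost more.
    """
    logical, physical = monthly_costs(ds, prices)
    if logical == 0:
        return 0.0
    return (logical - physical) / logical


def best_candidates(datasets, prices=ASSUMED_PRICES):
    """Datasets that would get cheaper, sorted by largest savings ratio first."""
    ranked = [(savings_ratio(ds, prices), ds.name) for ds in datasets]
    return [name for ratio, name in sorted(ranked, reverse=True) if ratio > 0]
```

Under these assumed prices, physical storage costs twice as much per byte, so a dataset only benefits when its compressed size is less than half of its logical size.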

DATA LIBRARY

Artificial Intelligence Index Report 2023 | takes time to dig in | AI | Stanford University Human-Centered Artificial Intelligence

The sixth edition of the AI Index Report is here, featuring more original data than any previous version. A few takeaways for you:

Industry races ahead of academia.
The world’s best new scientist… AI?
AI is both helping and harming the environment.
The number of incidents concerning the misuse of AI is rapidly rising.

TUTORIAL

Managing Multiple BigQuery Projects With One dbt Cloud Project | 9 min | GCP | Lucas Ortiz | Xebia Blog

This one provides a step-by-step guide to setting up a BigQuery connection in a dbt Cloud project, enabling the BigQuery API, and creating a service account for the project. It concludes with a workflow for managing and executing dbt projects across multiple BigQuery projects in dbt Cloud.

Introducing MLflow 2.3: Enhanced with Native LLM Support and New Features | 9 min | Machine Learning | Ben Wilson, Harutaka Kawamura, Liang Zhang, Corey Zumar, Jin Zhang, Sunish Sheth | Databricks Blog

Databricks announced MLflow 2.3. This open-source ML platform has been enhanced with several features that expand its capabilities in managing and deploying LLMs. One of the main highlights of this update is improved LLM support, which now includes three new model flavors: Hugging Face Transformers, OpenAI functions, and LangChain. Additionally, model files now download and upload faster when using cloud services, thanks to the introduction of multi-part download and upload functionality.

DATA ODDITIES

You Can Try Auto-GPT, the Next Generation of ChatGPT, Right Now | 4 min | AI | Jake Peterson | Lifehacker

Auto-GPT is a complex system relying on multiple components. It connects to the internet to retrieve specific information and data (something ChatGPT’s free version cannot do), features long-term and short-term memory management, uses GPT-4 for OpenAI’s most advanced text generation, and GPT-3.5 for file storage and summarization.

NEWS

Releasing Ververica Cloud - A Fully Managed Cloud Native Service | 3 min | Cloud | Vladimir Jandreski | Ververica Blog

Ververica has announced the beta release of Ververica Cloud. It is a fully-managed service for deploying, operating, and monitoring Apache Flink applications, including stream processing and real-time analytics. Ververica Cloud offers several benefits, including:

Simplified deployment and management of Apache Flink clusters
Efficient resource utilization and automatic scaling
Integration with popular data sources and sinks
Powerful monitoring and alerting capabilities

AWS announces Amazon Bedrock and multiple generative AI services and capabilities | 3 min | Cloud | About Amazon Blog

AWS announced Amazon Bedrock, a generative AI service that can help developers build applications such as conversational agents and chatbots. Rather than relying on GPT-3, Bedrock gives access through an API to foundation models from AI21 Labs, Anthropic, Stability AI, and Amazon's own Titan family. At the time of the announcement, the service was available in limited preview.

Introducing AI Functions: Integrating Large Language Models with Databricks SQL | 5 min | Databricks SQL | Patrick Wendell, Xiangrui Meng, Eric Peter, Nicolas Pelaez, Jianwei Xie, Vinny Vijeyakumaar, Linhong Liu & Shitao Li | Databricks Blog

Databricks announces the public preview of AI Functions. AI Functions is a built-in DB SQL function, allowing you to directly access Large Language Models (LLMs) from SQL.

PODCASTS

Data and analytics for an audience engagement platform | 45 min | host: Adam Kawa | guest: Ludwig Holmstrom | Radio DaTa Podcast

Ludwig works as a Product Analytics Director at Mentimeter. Before joining Mentimeter, he worked with data & analytics for over a decade at various companies such as Kry, Spotify, and Google.

Discussed subjects:

What is an audience engagement platform
Analytics use-cases at Mentimeter e.g. real-time visualization, customer journey
Autonomous teams at Mentimeter
Analytics stack at Mentimeter e.g. AWS, Redshift, Looker
KPIs and dashboards e.g. Pirate Metrics (AARRR), Viral loop, LTV (Customer lifetime value)
Unique aspects of working with data at Mentimeter

Secrets of Deep Reinforcement Learning | 2 h 47 min | host: Tim Scarfe | guest: Minqi Jiang | Machine Learning Street Talk

Dr. Tim Scarfe interviews Minqi Jiang on the impact of deep reinforcement learning on technology, startups, and research. Minqi shares his experiences in balancing serendipity and planning, explains the role of objectives and Goodhart's Law in decision-making, and discusses the differences between RL and supervised learning.

They also explore the possibilities of open-endedness and the intelligence explosion, as well as limitations of RL and interpretability concerns with software 2.0.

CONFS, EVENTS AND MEETUPS

Snowflake Summit 2023 | 26-29th June | Las Vegas

Attend Snowflake Summit 2023 to learn how to access, build, and monetize data, tools, models, and applications in ways that were previously unimaginable. Enable seamless alignment and collaboration across these crucial functions in the Data Cloud to transform nearly every aspect of your organization.

At the Summit, you’ll hear all about the latest innovations coming to the Data Cloud, and learn from hundreds of technical, data, and business experts about what’s possible for you and your organization in a world of data collaboration.

________________________

Have any interesting content to share in the DATA Pill newsletter?

➡ Join us on GitHub

➡ Dig into previous editions of DATA Pill