DATA Pill #04 - Kubernetes Cluster scaling, Training ML Models and 1 clickbait

ARTICLES

The end of Big Data | 10 min read | Databricks, Snowflake | Benn Stancil | Benn.Substact Blog

Ok, the title is clickbait, no need to change the job, but it still has some valid points.

Like:

Data for the middle class - if Big Data is commonly adopted, the tools must be simpler and focus on basic problems. If everyone wanted to use Big Data for complex stuff, there wouldn't be enough qualified people who would like to do such a job.

A Step-by-Step Guide to Training a Machine Learning Model using BigQuery ML (BQML) | 15 min read | BigQuery ML| 🎉 Michał Bryś | GetInData Blog

The process of training, validating and deploying a machine learning model using BigQuery ML (BQML) explained. The example model predicts the probability of adding an item to the shopping cart on an e-commerce website. GetInData used the Google Analytics 4 export to BigQuery to train a logistic regression classifier and then predicted the probability of a sample event (addedToCart) on the new sessions.

Dynamic Kubernetes Cluster Scaling at Airbnb | 10 min read | Kubernetes | Airbnb Blog

Over the past few years, Airbnb has shifted almost all online services from manually orchestrated EC2 instances to Kubernetes. Today, it runs thousands of nodes across nearly a hundred clusters to accommodate these workloads. In the blog post there are three stages of setting up a Kubernetes cluster.

Stage 1: Homogenous Clusters, Manual Scaling

Stage 2: Multiple Cluster Types, Independently Autoscaled

Stage 3: Heterogeneous Clusters, Autoscaled

Full-Spectrum ML Model Monitoring at Lyft | 11 min read | ML | Mihir Mathur | Lyft Blog

Discover how Lyft built a robust system for identifying and preventing model degradation and the variety of model monitoring approaches they developed. Plus conclusions:

Automate as much as possible. An even better solution to provide a seamless onboarding experience is to remove it entirely.
Monitoring will reduce shipping velocity slightly. By enforcing monitoring for all models, there is an additional prerequisite in the productionzation of models.

Monitoring Large-Scale Apache Flink Applications, Part 2: Metrics for Troubleshooting | 12 min read | Apache Flink | Nico Kruber | Ververica Blog

This post focuses on the metrics that are concentrated on expanding on these metrics and provides more insights into identifying resource bottlenecks and sources of errors.

You will also find some details about the Grafana dashboard.

NEWS

Amazon EMR Serverless Now Generally Available – Run Big Data Applications without Managing Servers | 6 min | Cloud Nativ | Anthony Alford | AWS News Blog

Another option to run Spark serverless (and many more Big Data tools).

PODCAST

Scaling Real-time Machine Learning at Chime | 24 min | MLOps.community

Peeyush Agarwal explains 2 key pieces of the ML infrastructure at Chime. He goes into detail about the current feature store design and feature monitoring process along with the ML monitoring setup.

A Multipurpose Database For Transactions And Analytics To Simplify Your Data Architecture With Singlestore | 41 min | Data Engineering Podcast

SVP of engineering Shireesh Thota describes the impact on overall system architecture that Singlestore can have and the benefits of using a cloud-native database engine.

CONFS AND MEETUPS

Berlin Buzzwords | 12-14 June | Berlin

What’s gonna happen?

Keynote by Fiona Coath, who will talk about opposition to surveillance capitalism and our responsibilities as technologists.
Nick Burch, who will explore what Wordle can teach us about Information Retrieval, Search and AI/ML.
Anshum Gupta will tell us what's new in Apache Solr 9.0

Open Data Science Conference | 15-16 June | London

Conferences focus areas:

Machine Learning & Deep Learning
Machine Learning for finance
NLP
Data Engineering & MLOps

Data Science Summit: Machine Learning | 21-22 June | Online and ONSITE

Reach agenda of Machine Learning, Data Mining & Exploration