DATA Pill #027 - what the SpaceX rocket has to do with ML models, and Camp Nou with Data Science

ARTICLES

Data Science at Spotify Camp Nou | 14 min | Data Science | Tomás Ravalli | Personal Blog

Tomás works at FC Barcelona’s Venue Business department, which is responsible for selling match day and season tickets to football matches at Spotify Camp Nou.

The goal is to optimize match day revenue and fan experience, maintaining a baseline attendance. In this article Tomás shares:

how they approach data science challenges at Ticketing
how the team is organized
how they prioritize tasks
real-world examples of both fan-facing and internal projects they have tackled
who to blame if you don’t manage to buy a ticket before they sell out (just kidding ;))

Monitoring ML models with Vertex AI | 12 min | ML | Olejniczak Łukasz | Google Cloud - Community Blog

This article shows how similar working on ML models is to working on SpaceX rocket launches. Of course, that is just a pretext for a more important topic: how to enable monitoring for deployed ML models, using Vertex AI.

Vertex AI Model Monitoring can help to detect:

Training-serving skew: occurs when the input data distribution after production launch differs from the input data distribution used to train the model.
Prediction drift: occurs when the input data distribution after production launch changes significantly over time. Here Vertex AI does not need to know anything about the training dataset. Instead, it collects statistics about the input data from prediction requests sent during previous monitoring windows.
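
To make the two definitions concrete, here is a minimal sketch of the underlying idea in plain Python: compare the distribution of one categorical feature in a baseline sample against a current sample, and flag it when a distance score exceeds a threshold. For training-serving skew the baseline would be the training data; for prediction drift it would be an earlier monitoring window. The function names, the two distance measures and the 0.3 threshold are illustrative assumptions, not Vertex AI's API or a statement of what it computes internally.

```python
import math
from collections import Counter


def categorical_distribution(values):
    """Return {category: share of values} for a non-empty list of values."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def l_infinity_distance(p, q):
    """Largest absolute difference in share over all categories."""
    categories = set(p) | set(q)
    return max(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in base 2, so the result lies in [0, 1]."""
    divergence = 0.0
    for c in set(p) | set(q):
        pc, qc = p.get(c, 0.0), q.get(c, 0.0)
        mc = (pc + qc) / 2  # share of category c in the mixture distribution
        if pc > 0:
            divergence += 0.5 * pc * math.log2(pc / mc)
        if qc > 0:
            divergence += 0.5 * qc * math.log2(qc / mc)
    return divergence


def exceeds_threshold(baseline_values, current_values, threshold=0.3):
    """Hypothetical alert rule: flag the feature if the score passes the threshold."""
    baseline = categorical_distribution(baseline_values)
    current = categorical_distribution(current_values)
    return jensen_shannon_divergence(baseline, current) > threshold


# Baseline: training data (skew) or a previous monitoring window (drift).
baseline_sample = ["PL"] * 8 + ["ES"] * 2
current_sample = ["PL"] * 5 + ["ES"] * 5

score = l_infinity_distance(
    categorical_distribution(baseline_sample),
    categorical_distribution(current_sample),
)
print(f"L-infinity distance: {score:.2f}")
print("Alert:", exceeds_threshold(baseline_sample, current_sample))
```

In practice the threshold would be tuned per feature, and numerical features would first be bucketed into histogram bins before the same comparison is applied.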

Uber Freight Near-Real-Time Analytics Architecture | 2 min | ML & AI | Claudio Masolo | InfoQ Blog

The Uber platform architecture uses Kafka, Flink and Pinot. Kafka events generated by backend services are aggregated by Flink.

Uber noticed a statistically significant boost across all key metrics since it started sharing performance information with Freight drivers: -0.4% late cancellations, +0.6% on-time pick-ups, +1% on-time drop-offs and +1% auto-tracking performance.

Your ML prototype doesn't have to be messy. A few words about the GetInData Machine Learning Framework | 12 min | ML | Piotr Chaberski | GetInData Blog

In today’s blog post, Piotr Chaberski looks at the Machine Learning prototype.

After reading it, you will see that a prototype does not have to be a tangle of spaghetti code.

You can find out how GetInData created a complete blueprint, which gives you a clear example of how to:

build reproducible, containerized working environments
track experiments and version your model prototypes
develop clean, production-quality machine learning code starting from the PoC phase
transfer your local sample-based implementations into a full-scale cloud environment

MLOps: why and how to build end-to-end product teams | 7 min | MLOps | Daniel Willemsen | GoDataDriven Blog

As of 2022 it’s estimated that less than 20% of machine learning models are brought into production. Why do so few companies bring ML models to production, and even fewer do so reliably and efficiently? Why do you need data science product teams to do MLOps?

Short communication lines between data scientists and engineers.
Easy to find common ground among all experts.
Easy to monitor, maintain and iterate on a model once it has been “shipped” to production.

Also, what are the requirements?

The right roles & skills to take ownership of ML products
A platform team that can enable the data science product team
The right organizational environment to make data science in production a success.

Revamping the Apache Airflow-based workflow orchestration platform at Coinbase | 10 min | Apache Airflow | Data Platform & Services Team | Coinbase Blog

Nice lessons learned about Apache Airflow from Coinbase.

How Coinbase revamped its Apache Airflow-based orchestration platform to ensure operational efficiency and development velocity, plus the experience gained in migrating pipelines and onboarding users to the revamped platform via crowdsourcing.

For the love of god, stop using CPU limits on Kubernetes (updated) | 6 min | Kubernetes | Natan Yellin | Robusta.dev Blog

Many people think you need CPU limits on Kubernetes, but this isn't true. In most cases, Kubernetes CPU limits do more harm than good.

Natan explains why CPU limits are harmful with three analogies between CPU-starved pods and thirsty explorers lost in a desert. In this article, CPU will be water and CPU starvation will be… death.

Google launches Dataform for BigQuery | 4 min | BigQuery | Christian Lauer | CodeX Blog

At last, native integration with GCP. This is quite handy for managing SQL pipelines in BigQuery. It could be a useful addition, especially for Data Engineers and Data Analysts who only work with BigQuery.

10 New DevOps Tools to Watch in 2023 | 9 min | DevOps | Tiexin Guo | 4th Coffee Blog

A quick categorization of all the tools mentioned in this article:

Infrastructure-as-Code: Pulumi
Security: SOPS, Trivy
K8s/multi-cluster: Cluster API, Linkerd
CI/CD: GitHub Actions, Tekton, Harness
Monitoring: Thanos
Policy-as-Code: HashiCorp Sentinel

NEWS

Snowpark for Python GA and Snowpark-optimized warehouses in public preview | 6 min | ML & MLOps | Sri Chintala, Julian Forero & Samartha Chandrashekar | Snowflake Blog

It's great that they are moving forward with this. The combination of Snowpark + Kedro (an open-source Python framework for creating reproducible, maintainable, and modular data science code) will be a really powerful tool for MLOps, in terms of both ease of data processing and model training.

It’s HERE! Say Hello to Column-Level Lineage in DataHub | 6 min | GitHub | Maggie Hays | DataHub Blog

OK, fine, we may have missed it a bit, but it's still pretty fresh news.

DataHub got column-level lineage!

PODCAST

Data Journey with Arunabh Singh (Willa) - Building robust ML & Analytics capability very early, data & analytics at a FinTech, skills & competences for data scientists, ML/AI predictions for the next decades | 49 min | ML & AI | hosts: Adam Kawa; guests: Arunabh Singh | Radio DaTa

Arunabh Singh works as a director of data science at Willa, a FinTech that helps professional freelancers, influencers, and social media content creators get paid immediately by brands for their freelance work and paid collaborations.

Topics of the podcast:

Data that is used at Willa and business use cases that are developed using this data
The most important ML models implemented at Willa
The ML(Ops) stack at Willa and a decision to build ML & Analytics capability very early
The most important skills and competencies that data scientists should have these days
The main trends and predictions for ML/AI for the next decades
Plans for 2023 at Willa.

Growing and Supporting The Data Science Community At Anaconda | 55 min | AI | hosts: Tobias Macey; guests: Kevin Goldsmith | The Python Podcast.__init__

A very interesting podcast episode with Kevin Goldsmith (CTO at Anaconda, Inc.) about the technical and social challenges that data science is facing, plus the impact that open-source software has on this industry. Recorded a year ago, but it's still worth listening to.

The relationship between data-centric AI and knowledge-first AI.

CONFS, EVENTS AND MEETUPS

BUILD.local Warsaw: Writing Data Directly to Snowflake with the Streaming API | 17 November | Warsaw

During this meeting Rafal Stryjek, Data Superhero, will demo Snowflake's Streaming API and show how to use it to write data directly to Snowflake. There will be time dedicated to Q&A and networking.

Big Data Tech Warsaw Summit | 29-30 March 2023 | online and onsite | Call for Presentations until 15 November

A chance to speak in front of an audience of Big Data professionals.

More than 500 professionals will attend the conference to hear dozens of technical presentations. One of them could be yours ;)

And since we're already talking Snowflake and CFP…

Snowflake Summit 2023 | CFP | 26-29 June | Las Vegas

The Snowflake Summit is still a while away, but now it's time for a call to arms. If you want to take part in Vegas with a story of migration, transformation or innovation, you can submit your presentation.

________________________

Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub

Sylwia from the GetInData Content Machine