The conclusion from the ManyChat (marketing startup) use case: a proper data architecture lets you escape spending that grows linearly with data volume and switch to almost constant costs.
Nikolay presents an approach to building an analytical platform for a startup whose business model is constantly changing.
ML Observability is a relatively new field, focusing on monitoring the distribution of input features, predictions and performance metrics. Kyle explains why Etsy decided to implement centralized ML observability.
At Etsy, most models are retrained every 24 hours on a sliding window of data (say, the last 3 months). Since the models are updated with the latest data every day, there's little chance of poor performance due to model drift, one of the most frequently discussed subtopics of ML observability.
Through ML observability, we aimed to save on long-term cost by setting up the building blocks we'd need for future features like intelligent retraining, which would retrain our large, expensive models only when needed.
Lack of observability has also led to more than one production incident with our ML models. In the past, these incidents were caused by things like upstream data quality issues, some of which were identified quite late and resolved only through tedious, manual debugging.
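The post is prose-only, but the core idea of monitoring input feature distributions is easy to sketch. Below is a minimal, hypothetical example (not Etsy's code): it compares a training-time baseline against recent serving data with SciPy's two-sample Kolmogorov-Smirnov test; the feature name and p-value threshold are made-up assumptions.

```python
# Minimal sketch of feature-distribution monitoring (illustrative, not Etsy's code).
# Compares a training-time baseline against recent serving data per feature.
import numpy as np
from scipy.stats import ks_2samp

def drift_report(baseline, recent, p_threshold=0.01):
    """Return {feature: drifted?} by comparing each feature's distribution."""
    report = {}
    for feature, base_values in baseline.items():
        result = ks_2samp(base_values, recent[feature])
        report[feature] = result.pvalue < p_threshold  # small p-value => distributions differ
    return report

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    baseline = {"price": rng.lognormal(3.0, 0.5, 10_000)}
    recent = {"price": rng.lognormal(3.3, 0.5, 10_000)}  # simulated upstream shift
    print(drift_report(baseline, recent))  # {'price': True}
```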
How is multitask learning different from traditional learning? All is explained in this article.
At LinkedIn, applying multitask learning has been shown to improve model performance with significant product impact. Engineers from LinkedIn share the assumptions they made and how they solved the challenges that arose.
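For readers who prefer code to prose, here is a generic sketch of the idea, not LinkedIn's actual architecture: a shared encoder feeds two task-specific heads, and the losses are summed so the shared layers learn from both tasks at once. The tasks, layer sizes, and data are illustrative assumptions, written in PyTorch.

```python
# Generic multitask-learning sketch: shared encoder + two task heads, joint loss.
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, n_features=32, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.click_head = nn.Linear(hidden, 1)   # task 1: click probability (logit)
        self.dwell_head = nn.Linear(hidden, 1)   # task 2: dwell-time regression

    def forward(self, x):
        h = self.shared(x)
        return self.click_head(h), self.dwell_head(h)

model = MultiTaskModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
bce, mse = nn.BCEWithLogitsLoss(), nn.MSELoss()

x = torch.randn(128, 32)
click_labels = torch.randint(0, 2, (128, 1)).float()
dwell_labels = torch.rand(128, 1)

click_logits, dwell_pred = model(x)
loss = bce(click_logits, click_labels) + mse(dwell_pred, dwell_labels)  # joint loss
opt.zero_grad()
loss.backward()
opt.step()
```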
Rishit explains the Kubeflow architecture and shares some of his experiences and tips, such as:
Monitoring for skew, prediction quality, data drift and concept drift after deploying a model is just as important, and you should be prepared to iterate based on these real-world interactions (a rough sketch of such a check follows below).
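As a rough illustration of that tip (my own sketch, not from the article), the snippet below keeps a rolling window of labelled production predictions and flags possible concept drift when accuracy drops noticeably below the offline baseline; the window size and tolerance are arbitrary choices.

```python
# Post-deployment monitoring sketch: flag possible concept drift when rolling
# accuracy on labelled production traffic falls below the offline baseline.
from collections import deque

class ConceptDriftMonitor:
    def __init__(self, baseline_accuracy, window=500, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # 1 = correct prediction, 0 = wrong

    def record(self, prediction, label):
        """Record one labelled prediction; return True if drift is suspected."""
        self.outcomes.append(int(prediction == label))
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data yet
        rolling_acc = sum(self.outcomes) / len(self.outcomes)
        return rolling_acc < self.baseline - self.tolerance

monitor = ConceptDriftMonitor(baseline_accuracy=0.92)
# In a serving loop: monitor.record(pred, true_label) and trigger an alert
# or retraining whenever it returns True.
```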
Users have a year to migrate away from the Google Cloud Platform IoT Core service…
We prefer generic tools like draw.io, but this one, specialized for GCP, looks really nice and productive.
Cloudera is still in the game.
This looks nice, so I'll just leave it here…
Diffusers offer:
George shares real-world use cases of real-time analytics, explains why reducing data complexity is key to improving the customer experience, covers the common problems that slow data-driven decision-making, and more.
Marcin presents a new Kedro Azure ML plugin that enables running Kedro pipelines on the Azure ML Pipelines service. He discusses and demonstrates:
Large-scale geospatial processing on Databricks using Apache Sedona. Fernando Ayuso Palacios & Alihan Zihna share their experience and the benefits of using it, and present their solutions for setting up Apache Sedona on Databricks, avoiding the common pitfalls, troubleshooting, and implementing a GIS data pipeline.
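The talk covers the actual setup; as a taste of what Sedona SQL looks like once registered on a Spark/Databricks cluster, here is a minimal, hypothetical sketch. The tables, column names, and the point-in-polygon query are illustrative assumptions, and the cluster is assumed to already have the Sedona jars installed and configured.

```python
# Minimal Apache Sedona SQL sketch (illustrative, not the speakers' pipeline).
# Assumes the apache-sedona package and its Spark jars are available on the cluster.
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator

spark = SparkSession.builder.appName("sedona-sketch").getOrCreate()
SedonaRegistrator.registerAll(spark)  # registers ST_* functions for Spark SQL

# Tiny made-up inputs: event points and district polygons (WKT).
events = spark.createDataFrame(
    [(4.90, 52.37), (4.48, 51.92)], ["longitude", "latitude"])
districts = spark.createDataFrame(
    [("Amsterdam", "POLYGON((4.7 52.2, 5.1 52.2, 5.1 52.5, 4.7 52.5, 4.7 52.2))")],
    ["district_name", "wkt_polygon"])
events.createOrReplaceTempView("events")
districts.createOrReplaceTempView("districts")

# Point-in-polygon join: count events per district.
spark.sql("""
    SELECT d.district_name, COUNT(*) AS events
    FROM events e
    JOIN districts d
      ON ST_Contains(ST_GeomFromWKT(d.wkt_polygon),
                     ST_Point(e.longitude, e.latitude))
    GROUP BY d.district_name
""").show()
```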
This conference is focused on open source data tools. Presentations and discussions will cover everything from sharing best practices to building and operating complex systems at scale.