A really powerful case study: How Canva scaled data ingestion with Change Data Capture (CDC)
How Instacart built their MLOps platform using a hybrid approach of open-source and AWS services (MLflow, Airflow, AWS SageMaker, and their own framework, MLCLI).
Simple yet clear advice on how to handle slowly changing dimension (SCD) data in a modern data warehouse/stack. If you accept that “computing is cheap, storage is cheap, engineering time is expensive,” then the simple idea of snapshotting the dimension table every day sounds like the best one!
Slightly different big data:
“Today, Pinterest’s memcached fleet spans over 5000 EC2 instances across a variety of instance types optimized along compute, memory, and storage dimensions. Collectively, the fleet serves up to ~180 million requests per second and ~220 GB/s of network throughput over a ~460 TB active in-memory and on-disk dataset, partitioned among ~70 distinct clusters.”
Pinterest dives deep into practical optimizations running in the production environment along dimensions of hardware selection strategy, compute efficiency, and networking performance.
“Initially, we had manual agents review a statistically significant sample from resolved support interactions. They would manually verify and label the resolved support issues and assign root cause attribution to different categories and subcategories of issue types. We wanted to build a proof-of-concept (POC) that automates and scales this manual process by applying ML and NLP algorithms on the semi-structured or unstructured data from all support interactions, on a daily basis.”
In this blog post, you will discover Uber's approach and the end-to-end design of its data processing and ML pipelines.
Dev vs. Ops - “we” vs. “they”. How to change it to “we” & “we”?
If people don’t trust each other and don’t feel safe, they invest their energy and time into securing themselves using various corporate approaches that we are all aware of.
Takeaways:
Cloudera adopts Apache Iceberg as a main data & table format!
LinkedIn Engineering recently open-sourced its feature store, Feathr, which helps engineers develop machine learning products by simplifying feature management and usage in production.
New capabilities announced:
Walk through how to build a real-time dashboard with Cloud Run and Firestore.
Changes and trends that resulted in the creation of so-called Modern Data Platforms.
Big Data top trends summary and review of two conference tracks: