DATA Pill #021 - serverless Lock-in, real-time AI and a lot from the open-source giants

ARTICLES

Concerned about Serverless Lock-in? Consider Patterns! | 12 min | Cloud | Gregor Hohpe | The Architect Elevator Blog

A Lock-in to the cloud is maybe unavoidable, but the risk is strongly reduced if you introduce good architecture patterns as an abstraction layer. The author nicely describes this concept here.

Enabling real-time AI with Streaming Ingestion in Vertex AI | 7 min read | AI & ML | Erwin Huizenga & Kaz Sato | Google Cloud Blog

It’s difficult to set up the infrastructure needed to support high-throughput updates and low-latency retrieval of data.

Starting this month, the Vertex AI Matching Engine and Feature Store will support real-time Streaming Ingestion as Preview features. With Streaming Ingestion for Matching Engine, a fully managed vector database for a vector similarity search and items in an index are updated continuously and reflected in the similarity search results immediately.

This blog post covers how these new features can improve predictions and enable near real-time use cases, such as recommendations, content personalization and cybersecurity monitoring.

BTW, last week I recommended the ebook about Building a Feature Store (with an introduction to Vertex AI). That was a coincidence, but… right on time ;)

Upgrading Data Warehouse Infrastructure at Airbnb | 10 min read | Cloud | Ronnie Zhu, Edgar Rodriguez, Jason Xu, Gustavo Torres, Kerim Oktay & Xu Zhang | Airbnb Tech Blog

Airbnb’s experience with upgrading their Data Warehouse infrastructure to Spark and Iceberg.

In our data ingestion framework, we found that we could take advantage of Iceberg’s flexibility to define multiple partition specs to consolidate ingested data over time. Ingested tables write new data with an hourly granularity (ds/hr), and a daily automated process compresses the files on a daily partition (ds), without losing the hourly granularity, which later can be applied to queries as a residual filter.

No, you don’t need MLOps | 5 min read | MLOps | Lak Lakshmanan | Personal Blog

A bit of a provocative title, but the content features a concrete proposition of Keep it Simple alternatives to complex MLOps solutions.

It also provides some sort of a rule of thumb when complexity is actually necessary, so it's not all hype.

Evolution of Streaming Pipelines in Lyft’s Marketplace | 6 min read | Streaming | Rakesh Kumar | Lyft Engineering Blog

Lyft’s journey of evolving our streaming platform and pipeline to better scale and support new use cases. Each iteration provided a better scale, but also exposed shortcomings.

Timestone: Netflix’s High-Throughput, Low-Latency Priority Queueing System with Built-in Support for Non-Parallelizable Workloads | 9 min | Workloads | Kostas Christidis | Netflix Blog

Timestone is a high-throughput, low-latency priority queueing system which Netflix built in-house, to support the needs of their media encoding platform, Cosmos. Over the past 2.5 years, its usage has increased, and Timestone is now also the priority queueing engine backing the general-purpose workflow orchestration engine (Conductor) and the scheduler for large-scale data pipelines (BDP Scheduler). All in all, millions of critical workflows within Netflix now flow through Timestone on a daily basis.

In this article you can dive into the architecture and concept.

Super Tables: The road to building reliable and discoverable data products | 9 min | Big Data | Cliff Leung | LinkedIn Blog

Super Table is LinkedIn's idea on how to solve these Big Data issues that have been causing problems for the last decade:

1) multiple similar datasets often led to inconsistent results and wasted resources

2) a lack of standards in data quality and reliability made it hard to find a trustworthy dataset among the long list of potential matches

3) complex and unnecessary dependencies among datasets led to poor and difficult maintainability

Super Tables (ST) are pre-computed, denormalized, and consistently consolidated attributes and insights of entities or events that are optimized for common and efficient analytic use cases. STs have well-defined service level agreements (SLAs) and simplify data discovery and downstream data processing.

The Super Tables idea seems to suit the Data Mesh.

How Apache Druid becomes a game changer in Big Data Pipelines? | 6 min | Big Data | Teepika R M | Personal Blog

Apache Druid is a real-time analytics database designed for fast slice-and-dice analytics (“OLAP” queries) on large data sets. It is one of the most popular open source solutions for OLAP. It is designed for serving both real-time (streaming sources like Kafka, Kinesis) and historical data (batch sources like HDFS, S3).

A use case example:

Used for IoT and device metrics — Druid is often used as a time series solution. Data generated from devices can be ingested in real-time and perform ad-hoc analytics. Druid lets you search and filter on tag values faster than traditional time series databases.

TUTORIALS

LoadBalancer Services using Kubernetes in Docker (kind) | 11 min read | Kubernetes | Owain Williams | Groupon Blog

Tutorial to multi-node kind cluster with extraPortMappings to forward requests from your host to an NGINX ingress controller, which uses the path to send your request to the appropriate service, rewriting the target so it can recognise the request.

NEWS

OpenTest: McDonald’s debut into open-source software | 4 min | Adrian Theodorescu | McDonald’s Technical Blog

A short insight into why McDonald's open sourced OpenTest.

The open sourcing for us led to another significant benefit by reducing the unnecessary friction involved in getting the software onto people’s machines. No more approvals required and no more dependencies on other teams for the actual binaries and updates.

PODCAST

What Data Visualization Means for Data Literacy | 41 min | AI | Host: Ben Lorica, Guest: Yashar Behzadi | The Data Exchange

how data visualization increases organizational data literacy
the best practices for visual storytelling

Synthetic data technologies can enable more capable and ethical AI | 40 min | Data Visualization | Andy Cotgreave | DataFramed

Yashar Behzadi is the CEO & Founder of Synthesis AI, a startup that uses synthetic data technologies to enable teams to build AI applications, as well as gaming and metaverse applications.

CONFS AND MEETUPS

Data Driven Innovation | 12 0ctober | Online

The third edition of the Big Data, AI, ML and Data Science conference organized by Computerworld Magazine.

Product updates and a lot about accelerating app development.

Building Machine Learning pipelines with Kedro and Vertex AI on GCP | 25 October | MLOps | Michał Bryś | Free Webinar

Michał Bryś - Senior ML Engineer and Technical Product Owner will cover:

Why we need a pipeline for machine learning models
Kedro, an open-source Python framework for creating reproducible, maintainable and modular data science code
Q&A session

Firebase Summit 2022 | 18 0ctober | Hybrid: online and onsite in NY

Product updates and a lot about accelerating app development.