Key points:
A good read on how the issues previously associated with ETL can also plague the ELT approach, unless proper data modeling and governance are introduced.
"Modern Data Platform" is a common buzzword nowadays. However…
👉What does it take to call a data platform a modern one?
👉What are the key components and characteristics of an MDP?
👉What is completely new about an MDP, and what is just an adaptation of well-known best practices from this field?
You'll find out how Ververica used Apache Flink as the primary stream processing framework for a Smart City project in the city of Warsaw. (The goal of the VaVeL project was to provide a general-purpose framework for mining and managing multiple heterogeneous urban data streams, so that cities can become more efficient.)
Additionally, they used Apache Flink to build the two main components of the project: the Vehicle Movement Analyser and the Vehicle Delay Prediction System.
Lakehouses unify the capabilities of data lakes and data warehouses under a single architecture; this simplification is made possible by open formats and APIs that power both types of data workloads. Analogously, for MLOps, we offer a simpler architecture by building MLOps around open data standards.
Abstract: data modeling techniques (dimensional modeling, data vault, anchor modeling) were originally created to solve a certain set of problems. Most of these problems are not relevant anymore: they can be resolved out of the box using a modern data stack. However, the issue of analyzing your data, its structure, its quality and its dependencies from an analytical point of view, is still applicable. Data modeling techniques can be very useful here, despite the fact that they were originally created for a slightly different purpose.
Researchers at Stanford University have open-sourced Diffusion-LM, a non-autoregressive generative language model that allows for the fine-grained control of a model's output text. When evaluated on controlled text generation tasks, Diffusion-LM outperforms existing methods.
Diffusion-LM is a generative language model that uses a plug-and-play control scheme, where the language model is fixed, and its generation is steered by an external classifier that determines how well the generated text matches the desired parameters. Users can specify several features of the desired output, including required parts of speech, syntax trees, or sentence length.
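For intuition, here is a minimal, hypothetical sketch of such classifier-steered sampling in PyTorch. `denoise` and `classifier_logp` are illustrative stand-ins for the frozen language model and the external attribute classifier; they are not the actual Diffusion-LM API.

```python
import torch

def guided_denoise_step(x_t, t, denoise, classifier_logp, scale=1.0):
    """One denoising step, nudged by an external classifier."""
    x_t = x_t.detach().requires_grad_(True)
    # Score how well the current latent matches the desired attribute
    # (e.g. a target parse, part of speech, or sentence length).
    logp = classifier_logp(x_t).sum()
    grad = torch.autograd.grad(logp, x_t)[0]
    x_prev = denoise(x_t, t)           # the frozen LM proposes the next latent
    return x_prev + scale * grad       # steer it toward the attribute

# Toy usage with dummy stand-ins for the model and classifier:
x = torch.randn(1, 16, 32)                      # (batch, seq, embedding)
step = lambda x_t, t: x_t * 0.9                 # stand-in denoiser
clf = lambda x_t: -(x_t ** 2).mean(dim=-1)      # stand-in attribute scorer
x = guided_denoise_step(x, t=10, denoise=step, classifier_logp=clf)
```

The key property is that only the gradient of the classifier touches the sampling loop, so the language model itself never needs retraining to support a new control attribute.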
How leveraging graph information lets us learn more about users and build more contextual representations of them. This article covers specific graph machine learning methods, such as Graph Convolutional Networks, that are being used at Airbnb to improve existing machine learning models.
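As a refresher on the core building block, a single GCN layer mixes each node's features with its degree-normalized neighbors'. A minimal PyTorch sketch (illustrative only, not Airbnb's implementation):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One Kipf & Welling-style graph convolution."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, adj, features):
        # adj: (N, N) adjacency with self-loops; features: (N, in_dim)
        deg = adj.sum(dim=1)
        d_inv_sqrt = deg.pow(-0.5)
        # Symmetric normalization: D^{-1/2} A D^{-1/2}
        norm_adj = adj * d_inv_sqrt.unsqueeze(1) * d_inv_sqrt.unsqueeze(0)
        # Each node aggregates its neighbors' normalized features.
        return torch.relu(self.linear(norm_adj @ features))

# Toy graph: 3 nodes, edges 0-1 and 1-2, plus self-loops.
adj = torch.tensor([[1., 1., 0.],
                    [1., 1., 1.],
                    [0., 1., 1.]])
feats = torch.randn(3, 8)
out = GCNLayer(8, 4)(adj, feats)   # (3, 4) neighborhood-aware embeddings
```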
#Streaming - Databricks finally takes a serious shot at streaming for Spark!!
Structured Streaming was introduced in Apache Spark 2.0. It is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine, and users can express their logic using SQL or the Dataset/DataFrame API.
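The streaming code reads almost exactly like a batch job. A minimal sketch along the lines of the canonical word-count example from the Spark documentation:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Read a socket source as an unbounded, continuously growing DataFrame.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Exactly the same DataFrame API as batch: split lines into words and count.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Emit the full updated counts to the console after every micro-batch.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```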
Databricks has launched a "new initiative codenamed Project Lightspeed to meet these requirements, which will take Spark Structured Streaming to the next generation". Seems more than interesting!
For us, the most exciting development is the Change Data Feed.
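For a sense of how it is consumed, here is a minimal PySpark sketch of reading Delta Lake's Change Data Feed. The table name `events` is illustrative, and it assumes the table was created with CDF enabled (delta.enableChangeDataFeed = true):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read all row-level changes since table version 0.
changes = (spark.read
           .format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 0)
           .table("events"))

# Each row carries _change_type (insert / update_preimage /
# update_postimage / delete), _commit_version and _commit_timestamp
# alongside the table's own columns.
changes.filter("_change_type = 'update_postimage'").show()
```

This makes incremental downstream pipelines straightforward: consumers pick up only what changed between versions instead of rescanning the whole table.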
Google has introduced gcpdiag, an open-source tool for detecting configuration issues in Google Cloud projects. gcpdiag is a command-line tool that runs many automated checks, called rules, and produces a report of all the issues it detects.
In this article, you will find out who the winners of "ML in Action" are, including:
Persona Labs' Digital Personas - enable users to interact with life-like Personas in a format similar to a Zoom call: speaking to them and seeing them respond in real time, just as a human would.
Brian Richardi shares his experience as a data science leader who transitioned into IT from Finance. He offers insights into using collaboration and effective communication to drive value, and into the future of data science in Finance.
Join Corey Prowse from Citigroup as he discusses how Citi architected a Market Data System (Scala, Java, TypeScript) that stores billions of data points and scales to over a thousand concurrent connections, all within an Agile DevOps environment.