Data Mesh + Open source components + data contracts
In this article, Luis explains the "data contract" concept, which ensures that information spread across different data products can be shared and reused. He also walks through a couple of technical implementations, built from open source components, for one fundamental process in the data contract lifecycle: contract evaluation.
With the ultimate goal of building trust in "someone else's" data products, data contracts are artifacts that sit at the intersection of (a) a business glossary providing rich semantics, (b) a metadata catalog providing information about the structure, and (c) a data quality repository setting expectations about the content across different dimensions. Together, these ease and promote data sharing.
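To make the evaluation step concrete, here is a minimal sketch of what evaluating a data contract can look like: a contract combining a schema (structure) with quality expectations (content), checked against incoming rows. The contract format and field names are illustrative assumptions, not taken from any specific open source tool.

```python
# Illustrative data contract: schema (structure) + quality rules (content).
contract = {
    "dataset": "orders",
    "schema": {
        "order_id": int,
        "amount": float,
        "country": str,
    },
    "quality": {
        "amount": lambda v: v >= 0,        # no negative amounts
        "country": lambda v: len(v) == 2,  # two-letter country codes
    },
}

def evaluate(contract, rows):
    """Return a list of (row_index, field, problem) violations."""
    violations = []
    for i, row in enumerate(rows):
        # structural checks against the schema
        for field, expected_type in contract["schema"].items():
            if field not in row:
                violations.append((i, field, "missing"))
            elif not isinstance(row[field], expected_type):
                violations.append((i, field, "wrong type"))
        # content checks against the quality expectations
        for field, check in contract["quality"].items():
            if field in row and isinstance(row[field], contract["schema"][field]) \
                    and not check(row[field]):
                violations.append((i, field, "quality check failed"))
    return violations

rows = [
    {"order_id": 1, "amount": 9.99, "country": "PL"},
    {"order_id": 2, "amount": -5.0, "country": "Poland"},
]
print(evaluate(contract, rows))
```

An empty result means the producer honored the contract; a non-empty one is the trigger for alerting or blocking downstream consumers.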
The State of AI Report 2022 has been released. Just wow - so much interesting content and so many recent developments summarized and analyzed in this report (though not that new for someone who follows the AI field). There is also an investor's view on AI, which is especially interesting.
An initiative that aims to further the field of machine learning security by identifying the top 10 most common vulnerabilities in the machine learning life cycle. It also includes a set of practical hands-on examples of each of these vulnerabilities, as well as best practices for addressing them - all of the content is available as open source.
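One classic vulnerability in this space is loading untrusted serialized models: Python pickle files can execute arbitrary code on load. A minimal mitigation sketch (my own illustration, not one of the initiative's examples) is to verify a model artifact against a pinned SHA-256 digest before deserializing it:

```python
import hashlib
import pickle

def load_model_verified(path, expected_sha256):
    """Unpickle a model file only after its digest matches a pinned value."""
    with open(path, "rb") as f:
        data = f.read()
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"model file digest mismatch: {digest}")
    # Only unpickle after the artifact's integrity has been confirmed.
    return pickle.loads(data)

# Pin the digest at publication time...
model = {"weights": [0.1, 0.2]}
blob = pickle.dumps(model)
pinned = hashlib.sha256(blob).hexdigest()
with open("model.pkl", "wb") as f:
    f.write(blob)

# ...and verify it at load time.
print(load_model_verified("model.pkl", pinned))
```

Digest pinning catches tampering in transit, but not a malicious publisher - for that, safer serialization formats than pickle are the real fix.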
Renting computers is (mostly) a bad deal for medium-sized companies with stable growth, like Basecamp. The savings promised in reduced complexity never materialized.
The cloud excels at the two ends of the spectrum: very simple, low-traffic applications, and those with wildly irregular load.
Is the modern data stack even modern?
Isn’t it just a patchwork of components from solutions we have known forever, like SAP or Informatica?
Isn’t it just an unbundled version of Airflow?
All-in-one data stacks are on the rise.
Ben shares examples of all-in-one solutions: Incorta, Keboola, Nexla, Mozart Data, Rivery.
The ad-tech firm partnered with Snowflake and announced a plan to launch a dedicated clean room solution called OpenAP data hub.
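The core clean room idea is letting two parties measure audience overlap without exposing raw identifiers to each other. A toy sketch of that idea (illustrative only - real clean rooms like Snowflake's rely on governed, secure query execution, not bare hashing): both sides contribute salted hashes of identifiers, and only the intersection size is revealed.

```python
import hashlib

SHARED_SALT = b"agreed-out-of-band"  # hypothetical salt both parties pre-agree on

def hashed(ids):
    """Salted hash of each identifier, so raw values never leave a party."""
    return {hashlib.sha256(SHARED_SALT + i.encode()).hexdigest() for i in ids}

advertiser = {"alice@example.com", "bob@example.com"}
publisher = {"bob@example.com", "carol@example.com"}

overlap = hashed(advertiser) & hashed(publisher)
print(len(overlap))  # audience overlap size, no raw emails exchanged
```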
A very nice open-source semantic layer tool. Top features:
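The core promise of a semantic layer is defining a metric once and compiling it to SQL for any consumer. A toy sketch of that idea (the metric format and names here are hypothetical, not taken from the tool covered in the article):

```python
# Hypothetical single-source metric definition.
metric = {
    "name": "total_revenue",
    "table": "orders",
    "expression": "SUM(amount)",
    "dimensions": ["country", "order_date"],
}

def compile_metric(metric, group_by):
    """Compile a metric definition plus requested dimensions into SQL."""
    unknown = [d for d in group_by if d not in metric["dimensions"]]
    if unknown:
        raise ValueError(f"unknown dimensions: {unknown}")
    cols = ", ".join(group_by)
    return (
        f"SELECT {cols}, {metric['expression']} AS {metric['name']} "
        f"FROM {metric['table']} GROUP BY {cols}"
    )

print(compile_metric(metric, ["country"]))
```

Every dashboard and notebook asking for `total_revenue` then gets the same definition, instead of each re-deriving its own SUM.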
To support model scaling on TPUs, we implemented the widely-adopted Fully Sharded Data Parallel (FSDP) algorithm for XLA devices as part of the PyTorch/XLA 1.12 release. This FSDP interface allowed us to easily build models with e.g. 10B+ parameters on TPUs and has enabled many research explorations.
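The FSDP memory pattern can be shown in miniature: each worker permanently stores only a shard of the parameters, all-gathers the full set just before computing, and keeps only its own shard resident afterwards. A single-process simulation of that pattern (illustrative only - the real implementation lives in `torch_xla.distributed.fsdp`):

```python
def shard(params, world_size):
    """Split a flat parameter list into one shard per worker."""
    n = -(-len(params) // world_size)  # ceiling division
    return [params[i * n:(i + 1) * n] for i in range(world_size)]

def forward_on_worker(rank, shards):
    # all-gather: temporarily materialize the full parameter set
    full_params = [p for s in shards for p in s]
    activation = sum(full_params)  # stand-in for the real forward computation
    # after compute, only the worker's own shard stays resident
    resident = shards[rank]
    return activation, resident

params = [0.5, 1.5, 2.0, 4.0]
shards = shard(params, world_size=2)
activation, resident = forward_on_worker(0, shards)
print(activation, resident)
```

The point of the pattern is that peak per-worker parameter memory scales with the shard size plus one transiently gathered copy, which is what makes 10B+ parameter models fit on TPU slices.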
Python is now officially a second dbt language (v1.3, support for BigQuery, Databricks and Snowflake).
A Python model is similar to a model expressed in SQL - a series of data transformations on dataframe objects that returns a single data object to be persisted in the platform. The article also covers the typical problems that are better solved in Python. It's worth remembering that the set of intended uses is narrow compared to the set of possible uses.
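In shape, a dbt Python model is a `model(dbt, session)` function that refs upstream models as dataframes and returns a single dataframe for dbt to persist. Below, a tiny stand-in context with plain dicts replaces the real dbt runtime and warehouse session, so only the shape of the API is shown:

```python
def model(dbt, session):
    """A dbt-style Python model: ref upstream, transform, return one object."""
    orders = dbt.ref("stg_orders")  # upstream model (a dataframe in real dbt)
    # transformation step: keep completed orders only
    completed = [row for row in orders if row["status"] == "completed"]
    return completed                # single object, persisted by dbt

# --- stand-in runtime, not part of a real dbt project ---
class FakeDbt:
    def ref(self, name):
        return [
            {"order_id": 1, "status": "completed"},
            {"order_id": 2, "status": "returned"},
        ]

print(model(FakeDbt(), session=None))
```

In a real project this function lives in a `.py` file under `models/`, and the returned dataframe is materialized as a table in BigQuery, Databricks, or Snowflake.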
The new version of one of the most important databases in the big data ecosystem.
A report from the DoK Community. Insights from over 500 executives and technology leaders on how data on Kubernetes has a transformative impact on organizations, regardless of size or tech maturity.
Respondents see a direct link between running DoK and making big gains: the majority of them (83%) attribute over 10% of their revenue to running data on Kubernetes, and one-third of organizations saw their productivity double.
41 minutes about faster and simpler tools for new streaming applications.
A non-commercial conference organized by Scala enthusiasts for Scala engineers.
This conference has passed, but this review offers many takeaways: the crème de la crème of DataMass 2022.