ARTICLES
Evolution of ML Fact Store 🕐 11 min read | ML Platform
Netflix walks through how it implemented its ML Fact Store, explaining how the design evolved and how the team monitors data quality.
The article closes with lessons learned, for example to avoid premature optimization in design.
Author: Vivek Kaushal | Netflix Technology Blog
Kubernetes Decision path 🕐 1 min read | Kubernetes
Something we like the most: a reference chart.
Author: Sid Palas | DevOps Directive
Introducing Databricks Workflows 🕐 5 min read + video | Workflows
Reliable orchestration for data, analytics, and AI.
Author: Databricks Blog
Supercharge Your Machine Learning Projects With Databricks AutoML
🕐 5 min read + video tutorials | AutoML
AutoML automatically trains models on a data set and generates customizable source code, significantly reducing the time-to-value of ML projects.
Beginner and expert data scientists can get their ML models to production faster.
Author: Steve Swoyer | Databricks Blog
dbt Constraints: Automatic Primary Keys, Unique Keys, and Foreign Keys for Snowflake
🕐 5 min read + video tutorials | dbt & Snowflake
dbt Constraints is a new package that generates database constraints based on the tests in a dbt project. Compatible with Snowflake and PostgreSQL.
Author: Dan Flippo | Snowflake
Project Tardigrade delivers ETL at Trino speeds to early users
🕐 5 min read + video tutorials | Case Study
The goal of Project Tardigrade is to provide an “out of the box” solution for the problems of running ETL workloads on Trino.
The team designed a new fault-tolerant execution architecture that allows them to implement advanced resource-aware scheduling with granular retries.
Author: Trino Blog
Understanding Twitter conversations: A Wordle case study
🕐 10 min read | Data Science
This case study shows how and when Wordle became the number one mainstream topic on Twitter.
Author: Lauren Fratamico | Twitter Blog
Google debuts Cloud Run jobs for containerized, scripted tasks 🕐 1 min read | Google Cloud Platform
During a developer keynote at Google I/O 2022, Google unveiled Cloud Run jobs, an extension of Google Cloud’s service for developing and deploying containerized apps using languages including Go, Python and Java.
Author: Kyle Wiggers | TechCrunch
TUTORIALS
Slim CI/CD with Bitbucket Pipelines 🕐 10 min read
A tutorial on setting up CI/CD for dbt projects on Bitbucket.
Author: Simon Podhajsky - Data Lead at iLife Technologies | dbt Developer Blog
Plural 🕐 1-30 min to dig in docs
It may simplify the deployment of many open-source projects.
Steampipe - tool to instantly query your cloud services 🕐 3 min read
A tool that lets you write SQL-based queries to explore dynamic data.
Mods extend Steampipe's capabilities with dashboards, reports, and controls built with simple HCL.
Checkov - Policy-as-code for everyone 🕐 3 min read
A pretty smart tool that prevents cloud misconfigurations at build time for Terraform, CloudFormation,
Kubernetes, Serverless Framework, and other infrastructure-as-code languages.
It takes time to tune, but it’s useful.
Apache Airflow 2.3 — Everything You Need to Know 🕐 7 min read
Airflow 2.3 might be the most important release of Airflow to date. Why?
- Dynamic task mapping lets Airflow trigger tasks based on unpredictable input conditions.
- A new LocalKubernetesExecutor you can use to balance tradeoffs between the LocalExecutor and KubernetesExecutor.
- A new REST API endpoint that lets you bulk-pause/resume DAGs.
Author: Steve Swoyer | Astronomer Blog
CDC in conjunction with Delta Live Tables on Databricks 🕐 7 min read
This article describes how to update tables in your Delta Live Tables pipeline based on changes in source data.
Mosaic - library for geospatial analysis 🕐 1-30 min to dig in docs
An extension to the Apache Spark framework that allows easy and fast processing of very large geospatial datasets.
DISCUSS
How is analytics engineering different from data engineering? 🕐 3 min read
Zach Wilson, Tech Lead at Airbnb, started a discussion about the differences between analytics engineers and data engineers.
- That distinction between DE and AE is pretty clear...
- Analytics engineer is kind of a new name for ETL / BI developers, agree?
- The distinction between data analysts and analytics engineers is blurry, do you see it more clearly?
BIG DATA CONFS AND MEETUPS
Allegro Tech Live #28 | 19 May | online
This time the topic will be: the architecture of mobile solutions.
There will be two presentations:
- Why M1 is so fast?
- Payment module. How to gather bricks and build the whole construction.
Airflow Summit 2022 | 25 May | Warsaw
Talks:
- Airflow in the Cloud: Lessons from the Field.
- How DAG Became a Test - Airflow System Tests Redefined.
- OpenLineage & Airflow - data lineage has never been easier.
- Running +150 production Airflow on Kubernetes, is that HARD ?
Rumor has it that Apache Airflow swag will be available for attendees.
Berlin Buzzwords | 12-14 June | Berlin
What’s gonna happen?
- Keynote by Fiona Coath, who will talk about opposition to surveillance capitalism and our responsibilities as technologists.
- Nick Burch, who will explore what Wordle can teach us about Information Retrieval, Search and AI/ML.
- Anshum Gupta will tell us what's new in Apache Solr 9.0.