DATA Pill feed

DATA Pill #01 - weekly Big Data, Cloud, AI/ML overview


Evolution of ML Fact Store 🕐 11 min read | ML Platform

Netflix describes how it implemented its ML Fact Store, how the design evolved, and how the team monitors data quality.
It closes with lessons learned, for example avoiding premature optimization in design.

Author: Vivek Kaushal | Netflix Technology Blog

Kubernetes Decision path 🕐 1 min read | Kubernetes

Something we like the most: a reference chart.

Author: Sid Palas | DevOps Directive

Introducing Databricks Workflows 🕐 5 min read + video | Workflows

Reliable orchestration for data, analytics, and AI.

Author: Databricks Blog

Supercharge Your Machine Learning Projects With Databricks AutoML
🕐 5 min read + video tutorials | AutoML

AutoML automatically trains models on a dataset and generates customizable source code, significantly reducing the time-to-value of ML projects.
Beginner and expert data scientists can get their ML models to production faster.

Author: Steve Swoyer | Databricks Blog

dbt Constraints: Automatic Primary Keys, Unique Keys, and Foreign Keys for Snowflake
🕐 5 min read + video tutorials | dbt & Snowflake

dbt Constraints is a new package that generates database constraints based on the tests in a dbt project. It is compatible with Snowflake and PostgreSQL.

Author: Dan Flippo | Snowflake

Project Tardigrade delivers ETL at Trino speeds to early users
🕐 5 min read + video tutorials | Case Study

Project Tardigrade aims to provide an “out of the box” solution for running ETL workloads on Trino.
It introduces a new fault-tolerant execution architecture that enables advanced resource-aware scheduling with granular retries.

Author: Trino Blog

Understanding Twitter conversations: A Wordle case study
🕐 10 min read | Data Science

This case study shows what happened when (and how) Wordle became the number 1 mainstream topic.

Author: Lauren Fratamico | Twitter Blog

Google debuts Cloud Run jobs for containerized, scripted tasks 🕐 1 min read | Google Cloud Platform

During a developer keynote at Google I/O 2022, Google unveiled Cloud Run jobs,
an extension of Google Cloud’s service for developing and deploying containerized apps using languages including Go,
Python and Java.

Author: Kyle Wiggers | TechCrunch


Slim CI/CD with Bitbucket Pipelines 🕐 10 min read

A tutorial on CI/CD for dbt projects on Bitbucket.

Author: Simon Podhajsky, Data Lead at iLife Technologies | dbt Developer Blog

Plural 🕐 1-30 min to dig in docs

It may simplify the deployment of many open-source projects.

Steampipe - tool to instantly query your cloud services 🕐  3 min read

A tool that lets you write SQL queries to explore dynamic data from your cloud services.
Mods extend Steampipe's capabilities with dashboards, reports, and controls built with simple HCL.

Checkov - Policy-as-code for everyone 🕐  3 min read

A pretty smart tool that prevents cloud misconfigurations at build time for Terraform, CloudFormation,
Kubernetes, Serverless Framework and other infrastructure-as-code languages.
It takes time to tune, but it's useful.

Apache Airflow 2.3 — Everything You Need to Know 🕐  7 min read

Airflow 2.3 might be the most important release of Airflow to date. Why?
  • Dynamic task mapping lets Airflow trigger tasks based on unpredictable input conditions.
  • A new LocalKubernetesExecutor you can use to balance tradeoffs between the LocalExecutor and KubernetesExecutor.
  • A new REST API endpoint that lets you bulk-pause/resume DAGs.
There is more in the full article.

Author: Steve Swoyer | Astronomer Blog
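
To make the first point concrete: with dynamic task mapping, the number of task instances is decided at run time from the output of an upstream task, not when the DAG is written. In Airflow 2.3 this is expressed by calling `.expand()` on a task. The snippet below is only a plain-Python sketch of that idea and does not use Airflow; `expand`, `list_files`, and `process_file` are hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def expand(task: Callable[[T], R], inputs: Iterable[T]) -> List[R]:
    """Run one 'mapped task instance' per input element.

    Stand-in for what a scheduler does at run time; this is NOT Airflow's API.
    """
    return [task(item) for item in inputs]


def list_files() -> List[str]:
    # Hypothetical upstream task: how many files it returns is unknown
    # until it actually runs.
    return ["a.csv", "b.csv", "c.csv"]


def process_file(name: str) -> str:
    # Hypothetical mapped task: handles exactly one file.
    return f"processed {name}"


# The fan-out width (three here) comes from the upstream result.
results = expand(process_file, list_files())
print(results)
```

If `list_files` returned ten names instead of three, ten instances of `process_file` would run without any change to the pipeline definition.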

CDC in conjunction with Delta Live Tables on Databricks 🕐  7 min read

This article describes how to update tables in your Delta Live Tables pipeline based on changes in source data.
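
As a rough illustration of what applying change data to a table means, the sketch below replays a batch of change events against a keyed table in plain Python. It does not use the Delta Live Tables API; the event fields (`id`, `op`, `seq`, `data`) and the function name `apply_changes` are assumptions made for this example only.

```python
from typing import Any, Dict, Iterable, Optional

Row = Dict[str, Any]
# Hypothetical change-event shape: {"id", "op", "seq", "data"}.
ChangeEvent = Dict[str, Any]


def apply_changes(target: Dict[Any, Row], events: Iterable[ChangeEvent]) -> Dict[Any, Row]:
    """Return a new table with the change events applied.

    Events are replayed in ascending 'seq' order, so a late-arriving
    older event cannot overwrite a newer one from the same batch.
    """
    result = dict(target)  # leave the caller's table untouched
    for event in sorted(events, key=lambda e: e["seq"]):
        key = event["id"]
        if event["op"] == "delete":
            result.pop(key, None)
        else:
            # "insert" and "update" are both treated as an upsert.
            data: Optional[Row] = event["data"]
            result[key] = dict(data or {})
    return result


table = {1: {"name": "Ann"}}
changes = [
    {"id": 2, "op": "insert", "seq": 1, "data": {"name": "Bob"}},
    {"id": 1, "op": "update", "seq": 3, "data": {"name": "Anna"}},
    {"id": 1, "op": "update", "seq": 2, "data": {"name": "Anne"}},  # arrives late
    {"id": 2, "op": "delete", "seq": 4, "data": None},
]
updated = apply_changes(table, changes)
print(updated)
```

Sorting by a sequence column is the design choice that matters here: it makes the result independent of the order in which events happen to arrive within a batch.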

Mosaic - library for geospatial analysis 🕐  1-30 min to dig in docs

An extension to the Apache Spark framework that allows easy and fast processing of very large geospatial datasets.


How is analytics engineering different from data engineering? 🕐  3 min read

Zach Wilson, Tech Lead at Airbnb, started a discussion about the differences between analytics engineers and data engineers.

  • The distinction between DE and AE is pretty clear...
  • Analytics engineer is a kinda new name for ETL / BI developers, agree?
  • The distinction between data analysts and analytics engineers is blurry, do you see it more clearly?


Allegro Tech Live #28 | 19 May | online

This time the topic is the architecture of mobile solutions.
There will be two presentations:
  • Why M1 is so fast?
  • Payment module. How to gather bricks and build the whole construction.

Airflow Summit 2022 | 25 May | Warsaw

  • Airflow in the Cloud: Lessons from the Field.
  • How DAG Became a Test - Airflow System Tests Redefined.
  • OpenLineage & Airflow - data lineage has never been easier.
  • Running +150 production Airflow on Kubernetes, is that HARD ?

Rumor has it that Apache Airflow swag will be available for attendees.

Berlin Buzzwords | 12-14 June | Berlin

What’s gonna happen?
  • Keynote by Fiona Coath, who will talk about opposition to surveillance capitalism and our responsibilities as technologists.
  • Nick Burch, who will explore what Wordle can teach us about Information Retrieval, Search and AI/ML.
  • Anshum Gupta, who will tell us what's new in Apache Solr 9.0.