DATA Pill #036 - ML in video games, dbt vs Delta Live Tables and Snowflake Data Engineering Best Practices

ARTICLES

Top 14 Snowflake Data Engineering Best Practices | 13 min | Data Engineering | John Ryan | Personal Blog

A whole article with tips on how to make you deliver faster, more efficient and simpler data pipelines. You'll also find here the overall data transformation landscape on Snowflake, explained steps and the options available, and finally a summary of the best lessons learned from over 50 engagements with Snowflake customers.

Automate data lineage on Amazon MWAA with OpenLineage | 8 min | Cloud | Stephen Said, Paul Villena, Vishwanatha Nayak | Amazon Web Services Blog

If you have ever wondered how to get started with data lineage on AWS using OpenLineage, then this is a must-read for you. Check out the step-by-step configuration guide for the openlineage-airflow plugin on Amazon MWAA. Here you will also find the AWS Cloud Development Kit project that deploys a pre-configured demo environment for evaluating and experiencing OpenLineage first-hand.

Recycling Kubernetes Nodes | 5 min | Data Engineering | Ilkin Mammadzada | Yelp Engineering Blog

After reading this one, you will learn what problems the team encountered whilst administering Yelp’s clusters. The team decided to tackle them in two parts:

Protecting workloads from disruptions.
Node replacement automation.

After solving these problems it is easy to deploy new versions of Ubuntu or keep nodes fresh. The team can create migration manifests using a CLI tool and Clusterman will gradually replace all the instances whilst ensuring that the workloads are not disrupted and that new nodes are running correctly.

Meshing MLOPS on Azure with MLFlow | 6 min | MLOps | Keshav Singh | DevGenius Blog

In this blog you will see establishing the ML life cycle leveraging MLFlow – an open source machine learning platform and framework for managing the ML life cycle. This is a short step-by-step hands-on demonstration of the MLOPs standardization on a Mesh Platform.

Build an end to end JSON logging system for clients apps | 5 min | Data Engineering | Liang Ma, Core Eng, Wei Zhu | Pinterest Blog

Read the story on how the Pinterest team decided to create an end-to-end pipeline with the following characteristics:

It’s built with the least resistance
Has ready to use logging APIs on each platform
Developers don’t need to touch any backend stuff
It’s easy to query and visualize logs

Five Software Engineering Principles for Collaborative Data Science | 10 min | Data Science | Jo Stichbury | Towards Data Science Blog

A traditional software engineer sets out rules in code. In contrast, a data scientist identifies with learning algorithms that analyze patterns in data. But analytics projects are still bound together with conventional code, and as a data scientist, you can benefit from best practices first pioneered by software engineering.

When you start to cut code on a prototype, you may not prioritize maintainability and consistency, but adopting a culture and way of working that is already proven can get your prototype production-ready faster.

Use a standard and logical project structure
Make your environment reproducible with dependency management
Make your code reusable by making it readable

dbt vs Delta Live Tables | 7 min | Data Engineering | Rahul Sharma | Personal Blog

What is better? As always, we can say that it depends on your needs. But if you know exactly what you need, then you should consider the pros and cons of dbt and Delta Live Tables with this article. After reading, you will know where dbt and DLT shine and where not, and in the end - what Rahul likes and doesn't in both.

Introduction to The kedro-mlflow Implementation: Path Setting, Artifact Storing, and Metrics Saving | 7 min | MLOps | M Hamid Asn | Personal Blog

In this article, you will learn how you can seamlessly track your machine learning experiments by integrating Kedro with MLflow via the kedro-mlflow plugin (which we extensively use in our projects too!).

You will also find:

An MLflow and Kedro overview,
Kedro-MLflow introduction,
The author's learning experience with this toolset.

Cloud data warehouses have become extremely popular in recent years. Their low cost and fully-managed services make it easy for businesses to get started and scale their data analysis efforts as needed. However, the pricing models for these services can be complicated, with a lot of factors affecting cost. The choice between Snowflake and BigQuery will depend on the organization's specific needs and usage patterns. Discover which solution you should choose.

NEWS

Announcing dbt-fal adapter | 3 min | dbt | Features & Labels Blog

A new adapter that makes the dbt Python experience 10x better is available now. It is the easiest way to run your dbt Python models. Read how it can run your Python code locally and in the cloud, lets you run code in the same environment and provides easy environment management and isolation between models.

General availability of Azure OpenAI Service expands access to large, advanced AI models with added enterprise benefits | 4 min | Cloud | Eric Boyd | Microsoft Azure Blog

Azure OpenAI Service is generally available now. Now businesses can apply for access to the most advanced AI models in the world—including GPT-3.5, Codex, and DALL•E 2. Customers will also be able to access ChatGPT—a fine-tuned version of GPT-3.5 that has been trained and runs inference on Azure AI infrastructure—through Azure OpenAI Service soon.

PODCAST

Machine Learning for Video Games | 1 h 15 min | ML | host: John Cron; guest: Carly Taylor | Super Data Science Podcast

The relationship between data science and cyber security
The importance of low-latency for an optimal gaming experience
The future of gaming
How to transition from a quantitative academic background into data science

DATA TUBE

How to work with customers on Big Data / ML / Analytics Projects using Scrum Framework? | 27 min | Agile | Rafał Zalewski | GetInData

How do you work with our customers?
How does it look?
What do the meetings look like?
How do you structure the cooperation?
Who does what and when?

Making sure your client knows the answers to these questions before you start the project together. Why? Rafał explains in this video.

CONFS EVENTS AND MEETUPS

Finding the Right Things with OpenSearch | 24 Jan | Online

As our data grows it’s become increasingly challenging to find what you need. In comes OpenSearch to help out. Learn how OpenSearch uses search techniques to find the data that is relevant to you.

Big Data Tech Warsaw Summit | 29-30 March 2023 | Online and onsite

This year you can choose whether you prefer to watch the conference online or attend in person in Warsaw. Over 90 outstanding speakers who work with Big Data and top data-driven companies are waiting to share their knowledge with you. Use the code DataPill15 and get a 15% discount.

________________________

Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub
➡ Dig in previous editions DataPill