DATA Pill #046 - Is the Data Engineer dead? And how do Fivetran + dbt fail?

ARTICLES

How Fivetran + dbt actually fail: Part I, Part II, Part III, Part IV | 37 min | Data Engineering | Lauren Balik | Personal Blog

Do you remember the saga mentioned in DATA Pill #23? The last article argues that Fivetran's business practices are part of a broader trend in the tech industry of prioritizing growth and revenue over long-term customer relationships and sustainable business practices.

Generative AI for banks: opportunities, risks and precautions | 10 min | AI | Jesper Nordström | Personal Blog

Let’s discuss the opportunities, risks, and precautions associated with generative AI in the banking industry. Some possibilities include improving customer experience, detecting fraud, and automating repetitive tasks. On the other hand, risks include data security and privacy concerns, biased algorithms, and potential job displacement. To mitigate these risks, Jesper suggests implementing ethical AI principles, ensuring transparency and explainability of AI models, and involving multiple stakeholders in the decision-making process.

The Data Engineer is dead, long live the (Data) Platform Engineer | 7 min | Data Engineering | Robert Sahlin | Personal Blog

Robert discusses how the responsibilities of data engineers have shifted. He argues that the traditional data engineer role has become outdated given advances in technology and the growing need for data-driven decision-making. In his opinion, data engineers should focus on developing and implementing scalable data architectures while also having a strong understanding of business needs and a wider range of technical skills. He also touches on the importance of collaboration between data engineers and data scientists.

TUTORIAL

A Modern Replacement For Airflow | 3 min | Software Development | april | Personal Blog, published in Bootcamp

A quick tutorial on how to replace Airflow. It is time to introduce you to Mage, an open-source data pipeline tool for transforming and integrating data.

What does it give you?

Integrate and synchronize data from 3rd party sources.
Build real-time and batch pipelines using Python, SQL, and R to transform data.
Run, monitor, and orchestrate thousands of pipelines without losing sleep.

DATA LIBRARY

The Data Leader's guide to Deep Data Observability | 25 min | Data Engineering | Jake Noble | validio.io

Data leaders should measure five pillars of data quality: freshness, volume, schema, (lack of) anomalies, and distribution.

Obtaining a higher degree of Data Observability can help improve these five pillars of data quality, but not all Data Observability tooling is created equal.

We distinguish between “Shallow” Data Observability and “Deep” Data Observability, and data leaders should aim for the latter in order to fully measure the five pillars of data quality and to get full confidence in their data.

Deep Data Observability is different from Shallow, because it is fully comprehensive in terms of data sources, data formats, data granularity, validator configuration, cadence, and user focus.
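
To make the pillars more concrete, here is a minimal Python sketch of three of them (freshness, volume, and schema) checked against a small batch of records. It is only an illustration: the function name check_batch, the updated_at field, and the thresholds are hypothetical and do not come from the guide or from any Data Observability tool.

```python
from datetime import datetime, timedelta, timezone


def check_batch(rows, expected_columns, min_rows, max_age, now=None):
    """Return pass/fail flags for three illustrative data quality checks.

    Hypothetical helper: assumes each row is a dict and that rows carry
    a timezone-aware 'updated_at' datetime.
    """
    now = now or datetime.now(timezone.utc)
    timestamps = [r["updated_at"] for r in rows if "updated_at" in r]
    newest = max(timestamps, default=None)
    return {
        # Freshness: the newest record is not older than max_age.
        "freshness": newest is not None and now - newest <= max_age,
        # Volume: the batch has at least the expected number of rows.
        "volume": len(rows) >= min_rows,
        # Schema: every row has exactly the expected set of columns.
        "schema": all(set(r) == set(expected_columns) for r in rows),
    }


# Example batch with a fixed "now" so the result is reproducible.
now = datetime(2023, 4, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"id": 1, "amount": 10.0, "updated_at": now - timedelta(minutes=5)},
    {"id": 2, "amount": 12.5, "updated_at": now - timedelta(hours=3)},
]
report = check_batch(
    rows,
    expected_columns={"id", "amount", "updated_at"},
    min_rows=2,
    max_age=timedelta(hours=1),
    now=now,
)
print(report)
```

Anomaly and distribution checks usually need historical baselines, so they are left out of this sketch.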

NEWS

dbt-excel | dbt | Xebia Data

Tired of people telling you that you cannot build a data warehouse on top of Excel? Look no further! dbt-excel seamlessly integrates Excel into dbt, so you can take advantage of dbt's rigor and Excel's flexibility. MUST WATCH about this release here.

Announcing the Release of Apache Flink 1.17 | Streaming | Leonard Xu | Apache Flink

The Apache Flink community has announced Apache Flink 1.17. The release had 172 contributors and saw the completion of 7 FLIPs and 600+ issues, bringing many exciting new features and improvements to the community. What's new?

Towards streaming warehouses:

Streaming Warehouse API
Batch Execution Improvements
SQL Client/Gateway

For stream processing:

Streaming SQL Semantics
Checkpoint Improvements
Watermark Alignment Enhancement
StateBackend Upgrade

Google announces AI features in Gmail, Docs, and more to rival Microsoft | AI | James Vincent | The Verge

Google will make your work in Google Docs, Gmail, Sheets, and Slides easier with the help of AI. Google says the following features will be coming to Workspace apps in the future:

Draft, reply, summarize, and prioritize your Gmail
Brainstorm, proofread, write, and rewrite in Docs
Bring your creative vision to life with auto-generated images, audio, and video in Slides
Go from raw data to insights and analysis via auto-completion, formula generation, and contextual categorization in Sheets
…and way more!

Microsoft announces Copilot: the AI-powered future of Office documents | AI | Tom Warren | The Verge

First Google announced AI features for its suite, and now Microsoft has followed. A revolution in how we work is coming. Copilot is an AI assistant that will help Microsoft 365 users create Office documents. Office users can summon it to generate text in documents, create PowerPoint presentations based on Word documents, or even help with features like PivotTables in Excel.

DATA TUBE

How to use Python with Apache Kafka | 32 min | Stream Processing | host: Kris Jenkins, guest: Dave Klein | Confluent

Listen to a talk with Dave, who has been an active member of the Kafka community for many years and noticed that there were a lot of Kafka resources for Java but few for Python. Kris and Dave discuss all things Kafka and Python: the libraries, the tools, and the pros and cons. What's more?

What are the Python clients for Apache Kafka?
What are the use cases for using Python with Kafka?
Tips for getting started in Python with Kafka
Are there options similar to Kafka Streams for Python?

PODCASTS

Data Journey with Liudmyla Taranenko (Metadata.io) - Using data science to automate B2B marketing tasks | 45 min | host: Adam Kawa, guest: Liudmyla Taranenko | Radio DaTa

Liudmyla is the Head of Data Science at Metadata.io, a company building the first Marketing OS for B2B.
Mentioned topics:

Data sources collected and used by Metadata.io
Data-driven features and product analytics at Metadata.io
ML algorithms at Metadata.io, e.g. Neural Networks
Tech stack used at Metadata.io, e.g. AWS, Databricks, (Py)Spark, MLflow

CDO Spotlight: The Value and Journey of Implementing a Data Product Mindset with Sebastian Klapdor of Vista | 36 min | host: Brian O’Neill, guest: Sebastian Klapdor | Experiencing Data

Listen to a talk with Sebastian, who developed and grew a successful Data Product Management team at Vista. He explains what that process was like and what he learned. Sebastian shares valuable insights on how he implemented a data product orientation at Vista, what makes a good data product manager, and why technology usage isn’t the only metric that matters when measuring success. He also shares what he would do differently if he had to do it all over again.

CONFS, EVENTS AND MEETUPS

Data & AI Leadership Day | 21st June | Zurich

The ultimate conference for data and AI leaders, innovators and enthusiasts. Get ready for an immersive experience with industry experts and thought leaders. Don’t miss out on this exciting opportunity to lead the way in data and AI innovation.

Lakehouse Day | 18th April 2023 | Zurich

Learn lakehouse best practices from leading global organizations, and how the platform has become the destination for businesses of every size, in every industry and region around the world. There will also be opportunities to participate in technical quickstart training, product walkthroughs, and in-depth customer breakout sessions.

________________________

Have any interesting content to share in the DATA Pill newsletter?

➡ Join us on GitHub

➡ Dig into previous editions of DATA Pill