DATA Pill #119 - Data Platform, ETL; new Flink connector

ARTICLES

Build an interface to your data platform | 7 min | Data Architecture | Kris Peeters

Modern data platforms are complex. Reference architectures, like the one from A16Z, contain 30+ boxes. Each box can be one or more tools, depending on how you design it. You might not need every box in your specific data platform, but most data platforms we see in the real world contain 10+ tools.

Meta's Research SuperCluster for Real-Time Voice Translation AI Systems | 3 min | AI, ML & Data Engineering | Vinod Goje | InfoQ Blog

Meta highlights the need for powerful computing systems capable of performing quintillions of operations per second to drive forward the development of advanced AI technologies. To achieve this objective, Meta expanded its AI infrastructure by building two 24k-GPU clusters.

ETL development life-cycle with Dataflow | 17 min | Data Engineering | Rishika Idnani & Olek Gorajek | Netflix Technology Blog

Dataflow provides a robust testing framework inside the Netflix data pipeline ecosystem. This is especially valuable for Spark SQL code, which used to be hard to unit test. All these test features, whether for unit testing, integration testing, or data audits, come in the form of Dataflow commands or Python libraries, which makes them easy to set up and easy to run, and leaves no excuse not to instrument all your ETL workflows with robust tests. And the best part is that, once created, all these tests run automatically during standard Dataflow command calls or during the CI/CD workflows, allowing for automated checking of code changes made by folks who may be unaware of the whole setup.

OneTwo and Vertex AI Reasoning Engine: exploring advanced AI agent development on Google Cloud | 5 min | GenAI | Jared Chavez Personal Blog

This one demonstrates the effortless deployment of OneTwo agents, intelligent applications that leverage advanced language models, onto the Vertex AI Reasoning Engine. The key to this simplicity lies in a custom template that streamlines the development process, integrates seamlessly with Reasoning Engine's infrastructure, and eliminates the need for manual Dockerfile creation, image building, etc.

TUTORIALS

Building an Ayurveda Healthcare Multi-PDF Agent with SingleStore and LlamaIndex | 9 min | GenAI | Akriti Upadhyay

Developing AI agents and RAG applications with a single PDF document is easy, but we encounter some challenges when dealing with multiple PDFs. To address this, we have implemented a query pipeline that optimizes retrieval using HyDE.
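
For readers new to HyDE (Hypothetical Document Embeddings), here is a minimal sketch of the idea, not the tutorial's actual pipeline: instead of embedding the question, you embed a hypothetical answer drafted by an LLM and retrieve the chunks closest to it. Everything below is an assumption made for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, `fake_llm` is a hard-coded stand-in for an LLM call, and the chunks are made-up placeholder text.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fake_llm(question):
    """Hypothetical stand-in for an LLM that drafts a plausible answer."""
    return "herbs such as ashwagandha and tulsi"


def hyde_retrieve(question, chunks, generate=fake_llm, top_k=1):
    """Rank chunks by similarity to a generated hypothetical answer."""
    hypothetical_answer = generate(question)
    query_vec = embed(hypothetical_answer)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]


# Placeholder chunks, imagined as coming from different PDFs.
chunks = [
    "herbs chapter: ashwagandha tulsi neem",
    "diet chapter: rice lentils ghee",
    "yoga chapter: breathing postures",
]

question = "Which plants are mentioned?"
print(hyde_retrieve(question, chunks))
```

The question shares no words with any chunk, so embedding it directly would score every chunk the same here. The hypothetical answer does overlap with the herbs chunk, which is the gap HyDE is meant to close.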

New Microsoft Fabric Domain for Data Governance | 5 min | Data Governance | Abiola A. David Personal Blog

The new Microsoft Fabric domain feature, now available in the admin portal, revolutionizes data governance by aligning data with specific business needs, adhering to a data mesh architecture.

NEWS

New Flink Connector to BigQuery | GitHub

This connector lets you stream data from BigQuery tables to Flink, process it in real-time, and then write the results back to BigQuery.

PODCAST

Unpacking the 2024 Developer Survey results | 23 min | dev | Stack Overflow

Most popular technologies
Most admired and desired programming languages
Feelings about/use of AI coding tools

From Hype to Reality: The Current State of Enterprise Generative AI Adoption | 45 min | GenAI | The Data Exchange

The episode discusses key challenges facing adoption, including data quality, privacy, and integration into existing workflows. It also covers various use cases and applications, implementation strategies, and the AI startup landscape.

CONFS, EVENTS AND MEETUPS

AWS User Group 3city Meetup #8: GenAI edition | Gdańsk | 28 August

Topics:

Autistic Children Mood Recognition on AWS + OpenAI
AI tools for programming
Prompt engineering best practices for foundation models

Data Expo | Jaarbeurs Utrecht | 11-12 September

From agenda:

Knowledge graphs & metadata, your main ingredients for data democratization and data governance in a distributed model
ING in the Cloud: How we built a data platform at a bank
Applying machine learning to enable the creation of robust and efficient flight schedules at KLM

2024-08-23 00:12