DATA Pill feed

DATA Pill #119 - Data Platform, ETL; new Flink connector

ARTICLES

Build an interface to your data platform | 7 min | Data Architecture | Kris Peeters
Modern data platforms are complex. Reference architectures, like the one from A16Z, contain 30+ boxes. Each box can be one or more tools, depending on how you design it. You might not need every box in your specific data platform, but most data platforms we see in the real world contain 10+ tools.
Meta's Research SuperCluster for Real-Time Voice Translation AI Systems | 3 min | AI, ML & Data Engineering | Vinod Goje | InfoQ Blog
Meta highlights the need for powerful computing systems capable of performing quintillions of operations per second to drive forward the development of advanced AI technologies. To achieve this objective, Meta expanded its AI infrastructure by building two 24k-GPU clusters.
ETL development life-cycle with Dataflow | 17 min | Data Engineering | Rishika Idnani & Olek Gorajek | Netflix Technology Blog
Dataflow provides a robust testing framework inside the Netflix data pipeline ecosystem. This is especially valuable for Spark SQL code, which used to be hard to unit test. All these test features, whether for unit testing, integration testing, or data audits, come in the form of Dataflow commands or Python libraries, which makes them easy to set up and easy to run, and leaves no excuse not to instrument all your ETL workflows with robust tests. And the best part is that, once created, all these tests run automatically during standard Dataflow command calls or during CI/CD workflows, allowing for automated checking of code changes made by folks who may be unaware of the whole setup.
This one demonstrates the effortless deployment of OneTwo agents, intelligent applications that leverage advanced language models, onto the Vertex AI Reasoning Engine. The key to this simplicity lies in a custom template that streamlines the development process and integrates with Reasoning Engine's infrastructure. The template also eliminates the need for manual Dockerfile creation, image building, etc.

TUTORIALS

Developing AI agents and RAG applications with a single PDF document is easy, but challenges arise when dealing with multiple PDFs. To address this, we have implemented a query pipeline that optimizes retrieval using HyDE.
New Microsoft Fabric Domain for Data Governance | 5 min | Data Governance | Abiola A. David Personal Blog
The new Microsoft Fabric domain feature, now available in the admin portal, revolutionizes data governance by aligning data with specific business needs, adhering to a data mesh architecture.

NEWS

This connector lets you stream data from BigQuery tables to Flink, process it in real time, and then write the results back to BigQuery.

PODCAST

Unpacking the 2024 Developer Survey results | 23 min | dev | Stack Overflow
  • most popular technologies
  • most admired and desired programming languages
  • feelings about/use of AI coding tools
This episode covers the key challenges facing adoption, including data quality, privacy, and integration into existing workflows, as well as various use cases and applications, implementation strategies, and the AI startup landscape.

CONFS, EVENTS AND MEETUPS

Topics:
  • Autistic Children Mood Recognition on AWS + OpenAI
  • AI tools for programming
  • Prompt engineering best practices for foundation models
Data Expo | Jaarbeurs Utrecht | 11-12 September
From agenda:
  • Knowledge graphs & metadata, your main ingredients for data democratization and data governance in a distributed model
  • ING in the Cloud: How we built a data platform at a bank
  • Applying machine learning to enable the creation of robust and efficient flight schedules at KLM
2024-08-23 00:12