A recent data software project required extensive testing of transformed data in AWS Glue. PySpark handled the data transformations, and the test automation framework combined Pytest with PySpark. This cohesive approach enforced high data quality standards, which are crucial for accurate analysis and for ingestion by third-party tools. This blog shares four key takeaways from that experience for improving data quality testing in similar projects.
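To give a flavor of the Pytest-style checks described above, here is a minimal, stdlib-only sketch. The helper name `validate_records` and the `REQUIRED_FIELDS` schema are illustrative assumptions, not taken from the project; in the project itself the equivalent checks would run against PySpark DataFrames rather than plain dicts.

```python
# Illustrative data-quality checks in the pytest style; names are hypothetical.
REQUIRED_FIELDS = {"id", "event_time", "amount"}

def validate_records(records):
    """Return (index, problem) tuples for rows failing basic quality rules."""
    problems = []
    for i, row in enumerate(records):
        missing = REQUIRED_FIELDS - row.keys()
        if missing:
            problems.append((i, f"missing fields: {sorted(missing)}"))
        elif row["amount"] is not None and row["amount"] < 0:
            problems.append((i, "negative amount"))
    return problems

def test_transformed_records_are_clean():
    # Pytest collects and runs any test_* function; an empty problem
    # list means every transformed row passed the quality rules.
    records = [
        {"id": 1, "event_time": "2024-01-01T00:00:00Z", "amount": 10.0},
        {"id": 2, "event_time": "2024-01-01T00:01:00Z", "amount": 0.0},
    ]
    assert validate_records(records) == []
```

The same shape scales to Spark: collect a small sample (or aggregate counts) from the transformed DataFrame and assert on the result inside a pytest test.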
Traditional databases use monolithic designs that bundle storage, compute, SQL, and APIs. There has recently been a shift toward open standards, with vendors like Snowflake and Databricks adopting formats like Apache Iceberg. This blog explores the benefits of composable data systems and of integrating large language models into data infrastructure.
Maciej shares expert tips on optimizing Apache Flink SQL jobs for better performance and reliability. He covers strategies for efficient joins, state management, and checkpointing, providing practical advice to enhance data processing workflows.
This blog demystifies agents, distinguishing domain-specific implementations from general AGI. It explores leveraging LLMs and AI agents for practical applications, addresses tooling, cost, and latency, and discusses their impact on various industries.
Do you remember the article "How Twitter processes 4 billion events in real-time daily" from DATA Pill 107? This time, Vu explores LinkedIn's approach to handling trillions of real-time events daily using Apache Beam. He highlights the transition from a lambda architecture to a more unified model, enhancing efficiency and transparency in the data processing pipeline.
Learn how to turn opaqueness into transparency by using Astronomer Cosmos to automatically render a dbt project into an Airflow DAG while running dbt on Azure Container Instances.
This tutorial explores how text-to-SQL has revolutionized database interactions and demonstrates its implementation using SingleStore Helios, LlamaIndex, and Groq.
Various paradigms coexist in the evolving world of real-time data access, from S3 and Snowflake to REST APIs. This post explores why REST APIs dominate and demonstrates bridging them with data streaming, using the OpenSky Network and Apache Kafka®.
We chatted with Cassandra Shum, VP of Field Engineering at RelationalAI, about her company’s efforts to create what is called the industry’s first coprocessor for data clouds and language models. The goal is to allow companies to keep all their data where it is today while still tapping into the capabilities of the latest generation of AI tools.
This video demonstrates the entire process of reproducing GPT-2 (124M) from scratch. It covers building the GPT-2 network, optimizing its training speed, setting up the training run with GPT-2 and GPT-3 hyperparameters, and reviewing the results the following day. Note that this video builds on knowledge from earlier Zero to Hero Playlist videos. It closely resembles the creation of my nanoGPT repo, which is about 90% similar by the end.
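For reference, the "124M" in the title follows from the standard GPT-2 small hyperparameters (12 layers, 12 heads, 768-dim embeddings, 50,257-token vocabulary, 1,024-position context, tied input/output embeddings). A quick sketch of the arithmetic, with the layer-by-layer breakdown as comments:

```python
from dataclasses import dataclass

@dataclass
class GPT2Config:
    # Standard GPT-2 small (124M) hyperparameters
    vocab_size: int = 50257
    block_size: int = 1024
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768

def param_count(cfg: GPT2Config) -> int:
    # Token + position embeddings; the lm_head weight is tied to the
    # token embedding, so it adds no extra parameters.
    emb = cfg.vocab_size * cfg.n_embd + cfg.block_size * cfg.n_embd
    ln = 2 * cfg.n_embd  # one LayerNorm: weight + bias
    # Attention: fused QKV projection plus output projection (with biases)
    attn = (cfg.n_embd * 3 * cfg.n_embd + 3 * cfg.n_embd) \
         + (cfg.n_embd * cfg.n_embd + cfg.n_embd)
    # MLP: 4x expansion and contraction (with biases)
    mlp = (cfg.n_embd * 4 * cfg.n_embd + 4 * cfg.n_embd) \
        + (4 * cfg.n_embd * cfg.n_embd + cfg.n_embd)
    block = 2 * ln + attn + mlp  # two LayerNorms per transformer block
    return emb + cfg.n_layer * block + ln  # + final LayerNorm

print(param_count(GPT2Config()))  # 124,439,808 ≈ 124M
```

This is only a parameter-count sanity check, not the model itself; the video builds the full network and training loop.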
Azure & AI Lowlands is a single-day event with five tracks around the Microsoft Azure platform, aimed at cloud engineers, Azure developers, AI engineers, and AI enthusiasts.