DATA Pill feed

DATA Pill #066 - Powering the Latest LLM Innovation, Data contracts and schema enforcement with dbt.

ARTICLES

Three challenges in deploying generative models in production | 9 min | Data Engineering | Aliaksei Mikhailiuk | Towards Data Science
This post covers the key challenges of deploying generative models in production. It focuses on the latest developments in diffusion and GPT-based models, while also touching on broader applications across other model types.
Where’s My Data — A Unique Encounter with Flink Streaming’s Kinesis Connector | 12 min | Data Engineering | Seth Saperstein | Lyft Engineering Blog
Read how Lyft persevered with a Flink job that moved massive data streams from Kinesis to S3 and kept hitting issues that strained Flink's capabilities. The investigation revealed complex challenges, including CPU throttling, event time alignment and subtask interactions, which led to a 5-day deadlock state that blocked data emission. Lyft addressed the problem by enhancing Flink's functionality and adding better monitoring and mitigation strategies to prevent similar incidents in the future.
Securely Scaling Big Data Access Controls At Pinterest | 14 min | Data Engineering | Soam Acharya, Keith Regier | Pinterest Engineering Blog
The article discusses how Pinterest has achieved secure scalability for big data access controls. By implementing a custom authorization system, Pinterest ensures that data access is tightly controlled and audited, preventing unauthorized access. This solution leverages fine-grained access policies, scalable infrastructure and rigorous auditing to maintain data security and integrity in a growing environment.

TUTORIALS

Use Amazon Athena to query data stored in Google Cloud Platform | 6 min | Cloud | Jonathan Wong | AWS Blog
This article shows how to use Amazon Athena to query data stored in Google Cloud Platform, streamlining data access across the two clouds. Data connectors make a multi-cloud setup practical, and the resulting insights can feed BI applications and the organization's data analysis workflows.
Data contracts and schema enforcement with dbt | 8 min | Data Engineering | Lucas Ortiz | Xebia Tech Blog
This article delves into data contracts and their implementation with dbt. It covers the basics of data contracts as agreements between data producers and consumers, then walks through implementing them in dbt, including model versioning and constraints. It concludes by showing how dbt Cloud's CI/CD features help prevent disruptive changes and protect data integrity.
Powering the Latest LLM Innovation, Llama v2 in Snowflake, Part 1 | 5 min | LLM | Justin Langseth | Snowflake Blog
It's worth taking a look at this first part of the newest series by Snowflake. This blog series covers how to run, train, fine-tune and deploy large language models securely inside your Snowflake Account with Snowpark Container Services.

TOOL

Singer.io | ETL
Singer is an open-source standard for writing scripts that move data between databases, web APIs, files, queues, and just about anything else you can think of.

Singer describes how data extraction scripts, called “taps”, and data loading scripts, called “targets”, should communicate, allowing them to be used in any combination to move data from any source to any destination.
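To make the tap/target idea concrete, here is a minimal sketch in Python. It assumes the message shapes from the Singer specification: JSON lines of type SCHEMA, RECORD and STATE. The stream name "users", the field names and the functions run_tap, run_target and demo are hypothetical, chosen only for illustration. Real taps and targets are separate programs joined by a shell pipe (tap stdout into target stdin). An in-memory buffer stands in for the pipe here.

```python
import io
import json


def run_tap(out):
    """Toy tap: write one SCHEMA, two RECORD and one STATE message as JSON lines."""
    messages = [
        {
            "type": "SCHEMA",
            "stream": "users",  # hypothetical stream name
            "schema": {
                "properties": {
                    "id": {"type": "integer"},
                    "name": {"type": "string"},
                }
            },
            "key_properties": ["id"],
        },
        {"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "Ada"}},
        {"type": "RECORD", "stream": "users", "record": {"id": 2, "name": "Linus"}},
        # STATE lets a later run resume where this one stopped.
        {"type": "STATE", "value": {"users": 2}},
    ]
    for message in messages:
        out.write(json.dumps(message) + "\n")


def run_target(lines):
    """Toy target: collect records per stream and remember the last STATE value."""
    rows, state = {}, None
    for line in lines:
        message = json.loads(line)
        if message["type"] == "RECORD":
            rows.setdefault(message["stream"], []).append(message["record"])
        elif message["type"] == "STATE":
            state = message["value"]
    return rows, state


def demo():
    """Connect the toy tap to the toy target through an in-memory buffer."""
    buffer = io.StringIO()
    run_tap(buffer)
    buffer.seek(0)
    return run_target(buffer)
```

Because both sides only agree on the message format, any tap that emits these lines could in principle be paired with any target that reads them, which is the combination property described above.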

DATA TUBE

The conspiracy to make AI seem harder than it is! | 1 h 30 min | AI | Gustav Söderström | Spotify R&D
An educational talk originally given inside Spotify by one of its executives, now shared publicly. The talk demystifies AI and makes it accessible to everyone. Topics covered:

  • What is an LLM?
  • What about creativity?
  • How do you steer it?
  • Why did no one see it coming?
  • Intelligence is compression!
  • Diffusion Models - Generating images, video and music
  • Conditioning on text
Unlocking the Power of Data Science in the Cloud | 41 min | Data Science | Host: Richie, Guests: John Knieriemen, Solongo Guzman | DataCamp
In this episode, Richie, Solongo and John cover the motivation for moving analytics to the cloud, economic triggers for migration, success stories from organizations that have migrated to the cloud, the challenges and potential roadblocks in migration, the importance of flexibility and open-mindedness and much more.

REPLAY THE EVENT

AWS Storage Day | Virtually | 8 hours
Catch up on this event hosted by AWS. It is ideal for anyone eager to learn more about:

• How to prepare for AI/ML with the storage decisions you make now
• How to deliver holistic data protection for your organization, including recovery planning to help protect against ransomware
• How to do more with your budget by optimizing storage costs for on-premises and cloud data

CONFS EVENTS AND MEETUPS

Google Cloud Summit Poland | On-site | 26th October | Warsaw
Join the biggest Google Cloud event of the year, organized for the first time in Poland at the Palace of Culture and Science in Warsaw. Google Cloud Summit Poland is a free event gathering the cloud community.

Make valuable connections, chat with Google Cloud leaders and partners, visit the tradeshow zone and discover technology trends in artificial intelligence, application modernization, collaboration, data cloud, open infrastructure and security, all to help you accelerate your digital transformation and improve your business results.
________________________
Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub
➡ Dig into previous editions of DATA Pill