DATA Pill feed

DATA Pill #057 - LLM in Molecular Biology, Flink Flame Graph & real-time personalization at Pinterest


The Complex Data Models Behind Shopify's Tax Insights Feature | 9 min | Data Science | Siraj Ali | Shopify Engineering Blog
A business’s taxes can be difficult to manage, especially in the United States. The Shopify team launched the Tax Insights feature as part of Shopify Tax, which has helped merchants stay more on top of their tax compliance than ever before. The entire product entailed intensive data work behind the scenes. It included modifying several existing data models, creating four new ones, building in functionality to handle dynamically changing data and publishing insights to a key-value store, from which they are surfaced to the end user. A nice case to explore in more depth.
Large Language Models in Molecular Biology | 40 min | LLM | Serafim Batzoglou | Towards Data Science Blog
Dive into an overview of some recent breakthroughs of deep learning-based language models in molecular biology. Read about how these advances will converge with the direct training of LLMs on large-scale biomolecular and population health data in the coming years and propel the field forward.

It includes a brief overview of LLMs and a more detailed introduction to molecular biology, then describes a few recent LLM advances in molecular biology, and finally glances into the future.
Extracting Flink Flame Graph data for offline analysis | 7 min | Data Analysis | Krzysztof Chmielewski | GetInData | Part of Xebia Blog
This is a must-read if you want to know how to extract data from the Flink UI and plot a Flame Graph from it for offline analysis. It solves the problem of the Flink Flame Graph being continuously updated during job execution, or no longer being available after a job terminates.
Deep Multi-task Learning and Real-time Personalization for Closeup Recommendations | 10 min | Data Engineering | Haomiao Li, Travis Ebesu, Fan Jiang, Jay Adams, Olafur Gudmundsson, Yan Sun, Huizhong Duan | Pinterest Engineering Blog
At Pinterest, Closeup recommendations are an important feed of recommended content shown in pin closeups. They generate the highest number of impressions and play a crucial role in inspiring users. To provide high-quality recommendations, the Closeup relevance team uses advanced machine learning techniques. They have developed deep neural network models that predict user outcomes and incorporate sequential features and personalized blending to create real-time rankings. This blog post covers how the Pinterest team:

  • got started on multi-task prediction
  • further improved multi-task prediction in their DNN architecture using the Multi-gate Mixture of Experts (MMoE)
  • introduced teacher-student regularization to stabilize ranking model predictions

and lots more.


Productionizing SQL-based workflows in Google Cloud | 10 min | Cloud | Kash Arcot | Google Cloud Blog
Dataform is a tool that enables cross-team collaboration on SQL-based pipelines. By pairing SQL data transformations with configuration-as-code, data engineers can collectively create an end-to-end workflow within a single repository.

The purpose of this article is to demonstrate how to set up a repeatable and scalable ELT pipeline in Google Cloud using Dataform and Cloud Build. The overall architecture discussed here can be scaled across environments and developed collaboratively by teams, ensuring a streamlined and scalable production-ready setup.
Airbyte Column Selection: Control over the exact data to sync | 4 min | Data Engineering | Malik Diarra | Airbyte Blog
Read about how to use column selection, which is now available to the community on both Airbyte Open Source and Airbyte Cloud.


Extending Databricks Unity Catalog with an Open Apache Hive Metastore API | 4 min | Data Engineering | Todd Greenstein, Junlin Zeng, Vihang Karajgaonkar, Zeashan Pappa, Abhishek Pratap Singh, Sachin Thakur and Matei Zaharia | Databricks Blog
Databricks announces the preview of a Hive Metastore (HMS) interface for the Databricks Unity Catalog. This feature lets organizations centralize their data management, discovery and governance in the Unity Catalog and connect to it from a wide range of computing platforms. It also ensures consistent data governance across these platforms.
Announcing Dataform in GA: Develop, version control, and deploy SQL pipelines in BigQuery | 5 min | Cloud | Guillaume-Henri Huon, Lewis Hemens | Google Cloud Blog
Announcement of the general availability of Dataform, which lets data teams develop, version control and deploy SQL pipelines in BigQuery. A tutorial on setting up a repeatable and scalable ELT pipeline in Google Cloud with it is listed in the tutorial section.
Mayo Clinic teams up with Google Cloud to boost patient care. AI-powered tools will make it simpler for doctors to find important info and improve clinical workflows. This collaboration ensures HIPAA compliance for secure data access and informed decision-making.


The economic potential of generative AI | Takes some time to read | AI | McKinsey
To grasp what lies ahead requires an understanding of the breakthroughs that have enabled the rise of generative AI, which were decades in the making. ChatGPT, GitHub Copilot, Stable Diffusion, and other generative AI tools that have captured current public attention are the result of significant levels of investment in recent years that have helped advance machine learning and deep learning.

McKinsey created a great report that dives deeper into the potential of generative AI.
What will you find here?

  • Generative AI as a technology catalyst.
  • Generative AI use cases across functions and industries.
  • The generative AI future of work: Impacts on work activities, economic growth and productivity.
  • Considerations for businesses and society.


How to be a Tech Optimist | 41 min | AI | Host: Steve Hamm; Guest: Bob Muglia | The Data Cloud Podcast by Snowflake
In this podcast, Bob Muglia, an Enterprise, Builder and Author of The Datapreneurs: The Promise of AI and the Creators Building Our Future, answers every question you may have about the current and future state of generative AI.

BTW - here is a nice Snowflake related position in a very interesting project.


Data Mass | Call for Presentation | 5th October 2023
The Summit is aimed at people who use the cloud in their daily work to solve Data Engineering, Big Data, Data Science, Machine Learning and AI problems. The main idea of the conference is to promote knowledge and experience in designing and implementing tools for solving difficult and interesting challenges. If you have something to share with the community in this area - submit your presentation!
LLMs & The Generative AI Revolution | 14th September | Online
Join a deep dive into the revolutionary new world of LLMs, agents, auto-healing code, image generators, personalized tutors and more. Learn how to take advantage of these cutting-edge new tools and how to do it consistently and reliably.

These talks will help you embrace the age of ambient intelligence and start putting these powerful new programs to work for you today.
Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub
➡ Dig previous editions of DataPill