DATA Pill feed

DATA Pill #072 - Hosted LLMs, Gen AI Lifecycle Patterns, Instacart’s Internal AI Assistant


Training Foundation Improvements for Closeup Recommendation Ranker | 8 min | ML | Fan Jiang, Liyao Lu, Laksh Bhasin, Chen Yang, Shivin Thukral, Travis Ebesu, Kent Jiang, Yan Sun, Huizhong Duan | Pinterest Engineering Blog
This blog post takes a deeper look into three areas for the closeup ranking model, specifically:

  • Training data logging and generation
  • Various sampling configurations and learnings
  • Periodic and automatic model refreshes with an in-house auto-retraining framework
Generative AI Lifecycle Patterns | 17 min | AI | Ali Arsanjani | Personal Blog
Ali discusses strategies for scaling Generative AI adoption from research and prototypes to enterprise-scale production. He explores techniques commonly used to tackle the challenges that come up along the way, helping the development process mature.
Scaling Productivity with Ava — Instacart’s Internal AI Assistant | 6 min | AI | Zain Adil, Kevin Lei, Ada Cohen | Instacart Tech Blog
This one dives into the development journey of Ava, an internal AI assistant powered by OpenAI's GPT-4 and GPT-3.5 models at Instacart. Ava has expanded its user base beyond engineering, added features like conversation search and Slack integration, introduced the Ava Prompt Exchange for task-specific templates, and has plans to enhance its capabilities and expose APIs for broader company-wide use.
You don’t need hosted LLMs, do you? | 6 min | LLM | Sergei Savvov | Better Programming
Sergei compares two approaches to using LLMs: making API calls to OpenAI versus deploying your own model. The article discusses cost, text generation quality, development speed and privacy.


This tutorial provides a step-by-step guide for onboarding Lake Formation permissions in hybrid access mode for specific users, even when the database is already accessible to other users via IAM and S3 permissions. It covers the setup process for hybrid access mode within an AWS account and between two AWS accounts, offering detailed instructions.
Read about the end-to-end generative AI pipeline using Teradata Vantage tools. It covers problem definition, data exploration, preparation, training and model operationalization, all presented for easy replication. Access to Teradata Vantage is provided through ClearScape Analytics Experience™, a web-based platform with full capabilities.
Building a ShopifyQL Code Editor | 6 min | SQL | Trevor Harmon | Shopify Blog
In October 2022, Shopify introduced ShopifyQL Notebooks, a first-party app that lets merchants analyze their shop data using ShopifyQL. A custom adapter integrates ShopifyQL with the CodeMirror code editor framework, providing syntax highlighting, code completion, linting and tooltips, which improves the overall ShopifyQL code editing experience.


Announcing DuckDB 0.9.0 | 3 min | Data Engineering | Mark Raasveldt and Hannes Mühleisen | DuckDB Blog
This DuckDB update introduces various exciting features, including out-of-core hash aggregates, storage and index improvements, DuckDB-WASM extensions, extension auto-loading, improved AWS and Azure support, Iceberg support and a PySpark-compatible API. Notably, it enhances performance, memory management and storage efficiency, making DuckDB a more powerful and user-friendly database.
So long data silos: Announcing BigQuery Omni cross-cloud joins | 3 min | Data Analytics | Vidya Shanmugam | Google Cloud Blog
Google introduces BigQuery Omni's new cross-cloud join feature, enabling seamless data querying and analysis across multiple cloud platforms in a single SQL statement. This eliminates the need for data copying, simplifies operations and reduces costs, allowing direct joins across clouds without creating intermediate tables or ETL pipelines.


How Apache Flink Delivers for Deliveroo | 23 min | Data Streaming | host: Alex Williams; guests: Felix Angell and Duc Anh Khu | The New Stack
TNS host Alex Williams chats with Deliveroo engineers Felix Angell and Duc Anh Khu about their shift to Apache Flink for real-time data streaming. They stress the importance of flexible data modeling for rapid product development, yet acknowledge its complexity. They also ask for a self-serve configuration feature in Amazon Managed Service for Apache Flink (MSF) that would offer customizable low-level settings and auto-scaling based on metrics. The move to Flink and MSF lets Deliveroo prioritize core tasks like continuous integration and delivery while still handling data processing effectively.


AI and the Future of Speech Technologies | 37 min | AI | host: Ben Lorica; guest: Yishay Carmiel | The Data Exchange
Interview highlights:

  • Generative AI for Audio (text-to-speech; text-to-music; speech synthesis)
  • Speech Translation
  • Automatic Speech Recognition and other models that use audio inputs
  • Speech Emotion Recognition
  • Restoration
  • Similarities in recent trends in NLP and Speech
  • Diarization (speaker identification), and implementation challenges
  • Voice cloning and risk mitigation


GoDataFest | Amsterdam | 24-26th October 2023
Take part in a multitude of sessions focused on various data & AI technologies and platforms. From fireside chats and presentations to ask-me-anything sessions and workshops, each session is hosted by seasoned experts.

You can expect insights, developments and tutorials about the latest and greatest data technology. Topics include modern data platforms, analytics engineering, data democratization, AI, MLOps, pipeline orchestration and much, much more.
Kubeflow Summit 2023 | Hybrid Event | 6th October 2023
Get ready to dive into the world of Kubeflow, the open-source machine learning platform built on Kubernetes. Our summit will feature engaging sessions, hands-on workshops and networking opportunities to connect with like-minded individuals.
Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub
➡ Dig into previous editions of DATA Pill