DATA Pill feed

DATA Pill #055 - Microsoft builds the bomb and queues for Kafka


Microsoft builds the bomb | 7 min | Data Analytics | Benn Stancil | Personal Blog
Last week we mentioned that Microsoft released Fabric, a data platform that combines its data services into one suite. Now it is time to dig into some very interesting commentary on it. Highly recommended reading, even if you don’t agree in the end.
DoorDash identifies Five big areas for using Generative AI | 6 min | AI | Alok Gupta | DoorDash Engineering Blog
Generative AI creates new content from existing data; OpenAI's GPT models, including ChatGPT, are popular examples. DoorDash sees five categories where it can apply Generative AI using anonymized and pseudonymized data:

  1. Assisting customers in completing tasks
  2. Better tailored and interactive discovery
  3. Generating personalized content and merchandising
  4. Extracting structured information
  5. Enhancing employee productivity

Let’s explore each of these in more detail.
Queues for Kafka | 9 min | Data Engineering | Andrew Schofield | Personal Blog
This post compares two types of queues in Kafka, producer-consumer and consumer-group, and covers the benefits of using queues in Kafka, such as improved performance, better fault tolerance and easier scalability.
Applying LLMs to Enterprise Data: Concepts, Concerns, and Hot-Takes | 11 min | LLM | Sam Stone | Towards Data Science Blog
This article provides an overview of applying LLMs to enterprise data, highlighting their potential benefits and addressing ethical and practical considerations. Responsible navigation of these challenges is crucial for effectively leveraging the power of language models in enterprise settings.
Understanding LLMOps: Large Language Model Operations | 12 min | LLM | Leonie Monigatti | Personal Blog
Another one about LLMs. Take a look at this comprehensive overview of LLMOps, covering the training, fine-tuning, deployment, monitoring and maintenance of large language models. It highlights the challenges and considerations involved in managing LLMs and offers insights into future directions.


Introducing the Apache Iceberg Catalog Migration Tool | 8 min | Data Engineering | Dipankar Mazumdar, Ajantha Bhat | Dremio Blog
In this one, the Dremio team introduces the iceberg-catalog-migrator tool as a simple and efficient solution for transferring Iceberg tables from one catalog to another. They discuss various situations where this tool can be advantageous and provide practical instructions on how to use it for performing such migrations.
Latency goes subsecond in Apache Spark Structured Streaming | 10 min | Data Engineering | Jerry Peng, Pranav Anand, Sourav Gulati, Karthik Ramasamy, Michael Armbrust, Matei Zaharia | Databricks Blog
This one focuses on the improvements the Databricks team has made to offset management to lower the inherent processing latency of Structured Streaming. These improvements primarily target simple, stateless operational use cases such as real-time monitoring and alerting.


Karapace | Data Engineering
A tool worth knowing for data platform engineers.

Karapace stores schemas in a central repository, which clients can access to serialize and deserialize messages. Karapace REST provides a RESTful interface to your Apache Kafka cluster, allowing you to produce and consume messages and perform administrative cluster work, all using the language of the web.
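
To make the central schema repository idea concrete, here is a minimal sketch of how a client might prepare a schema registration request. It assumes that Karapace exposes a Confluent-style Schema Registry endpoint (`/subjects/<subject>/versions`) on `localhost:8081`; the URL, the `orders-value` subject and the `Order` schema are illustrative assumptions, not taken from the Karapace announcement. The request is only built, never sent, so the snippet runs without a live registry.

```python
import json
import urllib.request

# Assumption: a Karapace schema registry listening locally on port 8081.
REGISTRY_URL = "http://localhost:8081"

# Hypothetical Avro schema, used only for illustration.
ORDER_SCHEMA = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}


def build_register_request(
    subject: str, schema: dict, base_url: str = REGISTRY_URL
) -> urllib.request.Request:
    """Build (but do not send) a POST that registers `schema` under `subject`."""
    # The schema travels as a JSON-encoded string inside the JSON body.
    body = json.dumps({"schema": json.dumps(schema)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/subjects/{subject}/versions",
        data=body,
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )


request = build_register_request("orders-value", ORDER_SCHEMA)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen(request)` would submit it to a running registry; check the Karapace documentation for the exact endpoints and response format before relying on this.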


An insane case for any of you using open-source LLMs on normal GPUs. Itamar fine-tuned a 33B-parameter LLM on Google Colab in just a few hours.


Google's new feature enables developers to manage multiple containers in a single Cloud Run service, making it possible to build and deploy more complex applications using serverless technology.

Check out how this feature works in detail, including how it simplifies the deployment process, how it allows for better container resource allocation, and how it enables easier application scaling.
Check out Frank Slootman's opinion on whether generative AI will bring about the obsolescence of entire industries. Will AI replace human labor in many industries, leading to job losses and the need for workers to acquire new skills?

Read why, in his opinion, companies must adapt to the changes brought about by AI to remain competitive in the global market.
Announcing the General Availability of Databricks SQL Serverless! | 3 min | SQL | Cyrielle Simeone, Shant Hovsepian, Gaurav Saraf | Databricks Blog
This blog post will highlight the advantages of DB SQL Serverless and update you on the newest features to enhance your data warehousing performance.

Databricks announces that serverless computing for Databricks SQL is now Generally Available on AWS and Azure. It guarantees unparalleled stability, support, and enterprise readiness for your mission-critical workload on the Databricks Lakehouse Platform.


Building AI Products with ChatGPT | 56 min | AI | Host: Richie Cotton; Guest: Joaquin Marques | DataFramed Podcast
During this talk, Joaquin and Richie discuss AI's development, its current state and future possibilities. Joaquin shares information on ongoing projects at Kanayma and tips for building AI products, including ChatGPT's impact. They explore the practical consequences, technical details and real-world applications of AI, offering guidance for incorporating AI into your company's products.


Berlin Buzzwords | 18-20th June | Berlin
Join the conference focused on open source software projects in the field of data analysis, machine learning, scalability, storage and searchability. Not only can you discover the latest trends in the world of search with projects like Elasticsearch, OpenSearch, Solr and different vector search engines, but also learn more about projects such as Apache Flink, Spark, Kafka, MongoDB and many others.
Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub
➡ Dig into previous editions of DATA Pill