DATA Pill feed

DATA Pill #035 - ChatGPT3, recommendation system from YouTube & hype for Rust


Using ChatGPT3 as a Data Engineer | 4 min | Data Engineering | Abhijit Menon | Personal Blog

A tool which has become a great meme, but is also useful for day-to-day data engineering use cases. Read ​​some of the really helpful use cases Abhijit found that might kickstart some of you into using it regularly.

Can ChatGPT Write Better SQL than a Data Analyst? | 6 min | Data Engineering | Marie Truong | Personal Blog

It isn't rocket science, plenty of people went pretty crazy about ChatGPT3. A different article but from another perspective. Should Data Analysts be worried about this tool? Let’s play the game with Marie and check who wrote better SQL.

A Practical Guide to Building an Online Recommendation System | 12 min | Recsys | Jake Noble | Blog

How can TikTok recommend videos to you that were uploaded minutes ago?
How can YouTube pick up on your brand-new interest immediately after you watched one video about it?
How can Amazon recommend products based on what you currently have in your shopping cart?

From the pen of Jake who worked on the YouTube recommendation system:
An overview of online recommendation systems, the various approaches for building different subcomponents and offer some guidance to help you reduce costs, manage complexity and enable the team to ship ideas.

Self-serve feature platforms: architectures and APIs | 13 min | Data Platform | Chip Huyen | Personal Blog

In this one you will read about the evolution of feature platforms and self-serve feature engineering. This two-part text compares feature platform vs. a feature store or model platform and shows slow interaction speed for streaming features. Feature API and Functionality for fast experimentation are also mentioned. It takes quite some time to read the whole text, but without a doubt it's worth it.

Moving from Data Science to Model Science | 5 min | Data Science | Steve Jones | Data & AI Masters Blog

Steve writes about the data science process. Data Science focuses on deriving insight, often using models, from data. In Model Science it is a combination of both Data and models that becomes an important evolution of Data Science. Will model generation change the job? 

Improving Video Voice Dubbing Through Deep Learning | 6 min | Deep Learning | Paul McCartney, Vivek Kwatra, Yu Zhang, Brian Colonna & Mor Miller | Google Blog

Google shares their research for increasing voice dubbing quality using deep learning, providing a viewing experience closer to that of a video produced directly for the target language. 
How they work with technologies for cross-lingual voice transfer and lip reanimation, which keeps the voice similar to the original speaker and adjusts the speaker’s lip movements in the video to better match the audio generated in the target language. Both capabilities were developed using TensorFlow, which provides a scalable platform for multimodal machine learning.

Rust for Data Engineering—what's the hype about? | 10 min | Engineering | Sara Landfors | Heroes of Data

Rust is a “blazingly fast and memory efficient” programming language. It is highly reliable with a rich type system and an ownership model that together guarantee memory-safety and thread-safety. And it’s now hype for Rust.

When is Rust a great option for data engineering and when is it a terrible choice? 

+ Rust might be the best choice for data engineering if speed, performance and reliability are your top priorities, you’re working with Apache Arrow, if safety and security are highly prioritized, and when you might want to have rigidly defined data types.
--Rust might not be the best choice for data engineering if Rust as a language isn’t compatible with the rest of your data ecosystem, if you want to prototype something quickly, if your team doesn’t already know Rust, and if you expect to have a hard time finding and sourcing Rust talent.


Accelerate orchestration of an ELT process using AWS Step Functions and Amazon Redshift Data API | 10 min | AWS | Poulomi Dasgupta, Tahir Aziz & Raks Khare | AWS Big Data Blog

How to use AWS Step Functions, Amazon DynamoDB, and Amazon Redshift Data API to orchestrate the different steps in your ELT workflow and process data within the Amazon Redshift data warehouse.

Flink SQL: Joins Series 3 (Lateral Joins, LAG aggregate function) | 7 min | Flink SQL | Ververica Blog

You can think of lateral join as a foreach loop in SQL that iterates through a collection, applies some transformation on each iteration and produces an output. Lateral join is very useful in processing data that is stored in a hierarchical or nested format. This blog describes how to perform a lateral table join.



Deep Learning Pioneer Geoffrey Hinton Publishes New Deep Learning Algorithm | 2 min | Deep Learning | Anthony Alford| InfoQ Blog

Geoffrey Hinton recently published a paper on the Forward-Forward algorithm (FF), a technique for training neural networks that uses two forward passes of data through the network, instead of backpropagation, to update the model weights.



Data Trends & Predictions for 2023 | 39 min | Data Science | host: Richie Cotton; guest: Martijn Theuwissen & Jonathan Cornelissen | DataFramed Podcast

You may remember that last week’s edition was ALL about 2023 predictions, and you can read it here: DATA PILL 34

Today one more trends&prediction piece of content:

Here's several companies now investing in cloud native notebooks on top of the Jupyter ecosystem. Some of them are completely rewriting it and some of them are building on top of it. I think that will be extremely powerful because it takes away the friction for people to get started with notebooks and it enables easier collaboration.

A.I. for Medicine | 1 h 20 min | AI | host: Jon Krohn; guest: Charlotte Deane | Super Data Science Podcast 

AI prediction tools for antibodies and using statistics to prepare healthcare systems for pandemics. What is the variety of potential partnerships between medicine and machine learning?

Data Science at Shopify and Stitch Fix | 37 min | Data Science | host: Ben Lorica; guest: Wendy Foster & Olivia Liao | Super Data Science Podcast 

Talk about building Data Teams and Data Products with two data science leaders in e-commerce: Wendy Foster from Shopify and Olivia Liao from Stitch Fix.


Streamlining Large-Scale Java Development Using Error Prone | 40 min | Analysis | Sander Mak | GOTO Conference

Nowadays using static analysis to spot bugs in your code is a staple of modern Java development. During this presentation Sander shows how instead of finding issues you can fix them automatically. What are tools that are making that possible and how can you effectively use them? Watch and learn now.



Big Data Minds Europe | 26-28 Feb | Berlin

Smart Data Analytics and best ways to generate value from data.
Efficient data management, data analytics and data governance concepts, networking with data experts & decision makers.  

Let's talks about Kubernetes management at scale | 24 Jan | Online & Bydgoszcz

Over two years ago, OpenX infrastructure was migrated to Google Cloud to save time, increase scalability and be available in more regions worldwide. Every application was migrated to GKE. Now you have the chance to hear about it from SRE Ninja Anna Yemets. 
Let's talk about load balancing, scaling and how to put out every fire you face while running GKE at scale!


Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub
➡ Dig in previous editions DataPill 

Made on