DATA Pill feed

DATA Pill #013 - why Facebook pays so much for AWS and how black swan events affect ML in the eCommerce sector

It looks like the summer slowdown is a good time for analyzing recent trends and financial reports, and for wondering what lies ahead.

In today's newsletter, you'll find some of the most interesting predictions and analyses.
We'll look into the reports and pockets of some of the companies.
Let’s have a look.


Building Scalable Real Time Event Processing with Kafka and Flink | 15 min read | Data | Allen Wang | DoorDash Engineering Blog

Ad-hoc solutions, where different technologies are piled on top of each other, are not only inefficient but also difficult to scale and operate. Picking the right frameworks and creating the right building blocks is crucial to success. Apache Kafka, Apache Flink, Confluent REST Proxy and Schema Registry proved to be both scalable and reliable. Researching and leveraging the sweet spots of those frameworks dramatically reduced the time needed to develop and operate this large-scale event processing system.

Data Mesh — A Data Movement and Processing Platform @ Netflix | 7 min read | Data Mesh | Netflix Blog

Data Mesh is now a general purpose data movement and processing platform for moving data between Netflix systems at scale. It also has a growing number of use cases. This article provides an overview of the system.

Scaling Kafka Consumer for Billions of Events | 5 min read | GCP & Kafka | Archit Agarwal | PayPal Technology Blog 

PayPal is in the process of migrating its analytical workloads to the Google Cloud Platform (GCP). As part of this migration, the team designed a streaming application that consumes data from Kafka and streams it directly to BigQuery. It reduces the time for readouts from 12 hours to a few seconds while handling approximately 30–35 billion events daily. The article describes an approach to testing such applications and how the team increased the application's performance by tuning a few parameters.

AWS: Powering the Internet and Amazon’s Profits | 10 min read | Cloud, AWS | Aran Ali | Visual Capitalist

There are some interesting facts that you can discover in this article, e.g.:
  • Facebook (Meta) pays a lot for AWS (compared with, e.g., Netflix, which is its best-known user)
  • LinkedIn is in the top 3 (they announced their migration to Azure after being taken over by Microsoft, yet they are still one of AWS's main clients)

It's also interesting that other companies acquired by Microsoft (not only LinkedIn) still use their previous cloud provider. For example, GitHub was acquired by Microsoft and is still on AWS too.

The moral of the story: Usually, cloud migration for the sake of migration doesn’t justify the costs.

2022 Big Data Trends: Retail and eCommerce become one of the hottest sectors for AI/ML | 11 min read | ML/AI | Adam Kawa | GetInData Blog

Nowadays, AI/ML is visible everywhere, including advertising, healthcare, education and many other sectors. Adam, CEO of GetInData, shares his conclusions, based on the data/ML-related projects that GetInData is running and on internal market research, that Retail and eCommerce have become one of the hottest sectors for AI/ML. The article covers:

  • the factors that favor the development of this Big Data trend,
  • the specifics of AI/ML in the retail and eCommerce sector,
  • how “black swan” events and the introduction of regulations affect it,
  • how this trend will behave in the future.

Can we expect this Big Data trend to grow bigger?


Announcing Photon Engine General Availability on the Databricks Lakehouse Platform | 4 min read | Databricks Blog 

Photon is now generally available on Databricks across all major cloud platforms.

Open Sourcing All of Delta Lake | 7 min read |  Databricks Blog 

Today, Delta Lake is the most comprehensive Lakehouse format, used by over 7,000 organizations and processing exabytes of data per day.
The article tells Delta Lake's story from day one and its genesis at Apple, through to Delta Lake 2.0 and bringing all Delta Lake APIs to open source.



Interpretable Machine Learning | 50 min | ML | Serg Masis | DataFramed

This episode covers how bias can produce harmful outcomes in machine learning systems, the different types of technical and non-technical solutions for tackling bias, the future of machine learning interpretability and much more.

One of the takeaways: The best way to assess risk is to view machine learning models as systems with different factors that interact with each other. This prioritizes experimentation, not just inference or prediction, to determine how different aspects of the model impact each other and the outcome.

The A.I. Platforms of the Future | 38 min | AI | Ben Taylor | SuperDataScience

Ben Taylor, Chief AI Strategist at DataRobot, shares his predictions about what we can expect from industry platforms 10 years from now.

  • Data type and source expansions
  • Democratizing A.I. via more approachable systems that don't require coding
  • Voice-activated systems will make A.I. even more accessible 



Artificial Intelligence for Business Leaders | 1h | Pedram Mokrian | Stanford Online

Why AI has become such a high priority and how business leaders can think about developing and adopting AI solutions.


GCP Summer of Learning 2022 | 6 weeks from 20 August | GCP | Google

It may not be exactly a conference or a meetup, but it is a free, 6-week training course about GCP with 5 role-based tracks (like Infrastructure or Data Management), at both business and technical levels. Seems worth considering.
