DATA Pill #047 - Leaving Amazon after 7.5 years, a new method for Kubernetes integration, and more

ARTICLES

Top 5 Data Streaming Trends for 2023 | 9 min | Data Streaming | Kai Waehner| Personal Blog

Let’s see how Kai identifies five trends in the data streaming space that are expected to gain momentum over the next few years. What are they?

Cloud-native lakehouses
Decentralized data mesh
Data sharing in real-time
Improved developer and user experience
Advanced data governance and policy enforcement

Kai’s point of view on these trends is waiting for you.

We all know personalization is the key to creating a competitive advantage in today's market, where customers expect tailored experiences based on their preferences, needs, and behaviors.

But what happened in 2012 at Spotify, or how did Kcell use real-time streaming to provide help?

Read about the importance of real-time context and persona. Real-time context refers to using data and analytics to understand a customer's current situation and needs.

What use cases can you find there?

Advertise a relevant product
Personalized gifts
Emergency situation without a delay
Fast and Smart Decision-Making,
Increasing rare chances for a highly-profitable conversion

Tencent Data Engineer: Why We Go from ClickHouse to Apache Doris? | 13 min | Data Science | Jun Zhang, Kai Dai | Geek Culture Blog

Meaty article with cheers and tears, lessons learned and practical tips that will be helpful during transition from ClickHouse to Doris. Read about problems with ClockHouse and the step-by-step transition to Apache Doris.

Vault Secrets Operator: A new method for Kubernetes integration | 6 min | Data Engineering | Rich Dubose, Tom Chwojko - Frank | HashiCorp Blog

With the Vault Secrets Operator, businesses can automatically generate new credentials and distribute them to pods running in the Kubernetes cluster. It makes it easier to rotate secrets continuously and reduces the likelihood of data breaches. However, the authors also explain the importance of secret management, particularly in Kubernetes clusters, as unauthorized users can easily access secrets, and this can lead to data exposure or application compromise.

DATA PRO

Speaking at conferences: A complete guide | 7 min | Self-Development | Cassie Kozyrkov | Hackernoon Blog

This is an excellent article for those who want to be better speakers. The mistakes beginners make, how to avoid them, how to act on the stage, some practical details, and a little about equipment you might need to make it all easier.

Left Amazon after 7.5+ years; Here is my honest review. | 9 min | Careers | Pooya Amini | Personal Blog

How was it to work at Amazon for 7.5 years? Read a balanced view of working at Amazon and highlights the pros and cons of working for the tech giant. Pooya writes about the job roles available at Amazon and how they can sometimes be draining, the company's work culture and how it operates, and the perks of working there. The article also touches upon Amazon's hiring process, which the author points out is extensive, with multiple rounds of interviews.

TUTORIAL

Terraform with YAML: Part 1 | 5 min | Data Engineering | Chris ter Beke | Xebia Tech Blog

In the first part of the tutorial, you will find a basic understanding of the benefits of using YAML configuration files in your Terraform code. How you can replace as much HCL code as possible with YAML, what the benefits are of doing so, and a simple example is waiting for you.

TOOLS

Fugue project | Data Engineering | Fugue

Fugue provides users with a cohesive platform to perform distributed computing, enabling the execution of Python, Pandas, and SQL code on Spark, Dask, and Ray with minimal need for adjustment.

It’s used for:

Parallelizing or scaling existing Python and Pandas code by bringing it to Spark, Dask, or Ray with minimal rewrites.
Using FugueSQL to define end-to-end workflows on top of Pandas, Spark, and Dask DataFrames.
FugueSQL is an enhanced SQL interface that can invoke Python code.

Amazon DataZone (Preview) | 2 min | Data Analytics | Amazon Web Services Blog

Use Amazon DataZone to share, search, and discover data at scale across organizational boundaries. Collaborate on data projects through a unified data analytics portal that gives you a personalized view of all your data while enforcing your governance and compliance policies.

K8sGPT gives Kubernetes SRE superpowers to everyone | 3 min | AI | K8sGPT

K8sgpt is a tool for scanning your kubernetes clusters, diagnosing and triaging issues in simple english. It has SRE experience codified into its analyzers and helps to pull out the most relevant information to enrich it with AI.

NEWS

Announcing new BigQuery inference engine to bring ML closer to your data | 5 min | Data Analytics | Amir Hormati, Abhinav Khushraj | Google Cloud Blog

Google announces BigQuery ML inference engine, which allows you to run predictions on a broad range of models hosted across multiple locations.

With this new feature, you can run ML inferences across:

Imported custom models trained outside of BigQuery with a variety of formats (e.g., ONNX, XGBoost, and TensorFlow)
Remotely-hosted models on Vertex AI Prediction
State-of-the-art pretrained Cloud AI models (e.g., Vision, NLP, Translate, and more)

DATA TUBE

Have you ever wondered how to scale scrum to multiple teams working on the same projects? If so, do not hesitate to watch a video about Nexus Framework! Well explained what Nexus Framework is and how it helps to integrate and coordinate work in multiple teams. Everything is based on a real example from one of the GetInData | Part of Xebia projects, where they had to scale scrum to 30+ experts in different groups.

Introducing AlloyDB Omni | 3 min | Databases | Andi Gutmans | Google Cloud

The technology preview of AlloyDB Omni, a downloadable edition of AlloyDB designed to run on-premises, at the edge, across clouds, or even on developer laptops. AlloyDB Omni offers the AlloyDB benefits you’ve come to love, including high performance, PostgreSQL-compatibility, and Google Cloud support, all at a fraction of the cost of legacy databases.

PODCASTS

Data Journey with Jonas Björk (Acast) - Data & analytics at Acast, AI & trends in the podcasting industry | 56 min | host: Adam Kawa guest: Jonas Björk | Radio DaTa

Listen to the data journey with Adam Kawa and his guest Jonas Björk, the CTO at Acast, in the latest episode of Radio Data. Acast, the Swedish-born podcast hosting and monetization platform, is revolutionizing the podcasting industry by allowing creators to distribute their content across various podcasting apps and monetize their shows through advertising and listener support.

This episode includes:

the data collection and utilization process at Acast,
how measuring podcasts differs from measuring songs on platforms like Spotify Real-life analytics use cases implemented at Acast,
the cutting-edge cloud-managed data tech stack used by Acast, featuring AWS, Snowflake, Airflow, Python, Rust, and more,
how AI/ML is being used to revolutionize the podcasting industry today and tomorrow.

#132 The Past, Present, and Future, of the Data Science Notebook | 56 min | hosts: Adel Nehme and Richie Cotton guest: Jodie Burchell | DataFramed by datacamp

In this episode, Jodie discusses the evolution of data science notebooks over the last few years, noting how the move to remote-based notebooks has allowed for the seamless development of more complex models straight from the notebook environment. The conversation also covers the following:

tooling challenges that have led to modern IDEs and notebooks,
how data science notebooks have evolved to help democratize data for the wider organization,
the tradeoffs between engineering-led approaches to tooling compared to data science approaches,
what generative AI means for the data profession,
Jodie’s predictions for data science, and more.

CONFS EVENTS AND MEETUPS

DevOps Enterprise Summit Amsterdam 2023 | 15-17th May | Amsterdam

Learn through experience reports from technology leaders helping their organizations win, rapidly disseminate winning tools and techniques and ways of thinking, and bring in the best experts for the problems identified by the community.

BTW, we have a great job offer for DevOps, check it out here.

________________________

Have any interesting content to share in the DATA Pill newsletter?

➡ Join us on GitHub

➡ Dig previous editions of DataPill