Skyplane is an open-source developer tool for transferring data across cloud object stores. Skyplane is 164x faster than rsync and 113x faster than AWS DataSync.
It's amazing how a simple concept can provide such a boost. The VMs start so fast that there is no need to think twice when copying a large volume.
Based on a presentation by Sonja Ericsson from Spotify.
Most of the data pipelines at Spotify relied on a tool called Luigi, which was built in-house by Spotify and open-sourced in 2012. In essence, it is a client-side orchestration framework (with a server scheduler) used to build data pipelines in Python. As orchestration demands grew with scale, Spotify decided to move to Flyte, which was built by Lyft and open-sourced in 2020.
This article shows why and how.
This is not one article but three (there is also material on Airbyte, dbt, and the MDS in general).
Get some 🍿 ready, as Fivetran's CEO definitely got on the wrong side of the author (check Part 3).
The saga started by the author is not only about Fivetran; more generally, it is about things that are easily forgotten when deciding on an MDS. Note, for example, how "over-normalization" and a subtle change in the pricing basis, from the number of connectors to Monthly Active Rows, affected the price. There is a 'T' hidden before the 'L' in ELT: the EL provider's decision to create landing tables that are as highly normalized as possible.
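The interaction between normalization and row-based pricing can be illustrated with a deliberately simplified sketch. It is not any vendor's actual billing formula: the function name, the fan-out factor, and every number below are hypothetical and only show the mechanism, namely that one changed source record touching several landing-table rows multiplies the monthly active row count.

```python
def monthly_active_rows(records_changed, rows_per_record):
    """Rows touched in a month when each changed source record fans out
    into `rows_per_record` rows across the landing tables.
    Simplified, hypothetical model; not a real billing formula."""
    return records_changed * rows_per_record


# Hypothetical workload: 100,000 source records change during the month.
records_changed = 100_000

# One landing row per record versus a normalized layout where each
# record is split across five tables (both figures are assumptions).
flat = monthly_active_rows(records_changed, rows_per_record=1)
normalized = monthly_active_rows(records_changed, rows_per_record=5)

print(f"flat layout:       {flat:,} active rows")
print(f"normalized layout: {normalized:,} active rows")
print(f"ratio:             {normalized // flat}x")
```

Under these assumptions the same source activity counts five times as many active rows, which is why the landing-table design chosen by the EL provider matters for the bill.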
Based on a report by Wanclouds.
The report reveals that IT decision-makers are taking action to rein in costs, with 39% noting they’ve decided to move or leave significant cloud consumption and high-performance workloads on premises, and a further 29% noting they’ve switched public cloud providers in the first half of 2022 due to high costs.
Based on Ji Krochmal's experience of building a new ETL pipeline, basically from scratch, at Budbee.
In the old pipeline, the ETL logic ran in Airflow. It was quite slow because it was implemented in Python and because it duplicated a lot of data (10K-100K times). As a consequence, it used a lot of AWS resources, and the resulting data was then queried by Athena without partitioning, which was expensive. Ji identified the opportunity to migrate all of this logic wholesale to Databricks, which ended up being a smart move. As a result, the old pipeline became much faster, used fewer resources, and was much easier to improve, while leveraging the logic and toolset of the new pipeline. The new system can be seen in the picture below, with the red line showing the new connections.
Piotr describes the 3 steps through the data journey you can take to become a more data-driven company.
PySpark UDFs offer flexibility since they enable users to run arbitrary Python code on top of the Apache Spark™ engine. Users only have to state "what to do", whereas PySpark, as a sandbox, encapsulates "how to do it". That makes PySpark easier to use, but it can be difficult to identify performance bottlenecks and apply custom optimizations.
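One way to look for bottlenecks in the Python side of a UDF is to profile the plain function locally before registering it with Spark. The sketch below uses only the standard-library cProfile and pstats modules; it is not PySpark's own profiling facility, and the function slow_normalize and the helper profile_call are hypothetical names invented for illustration.

```python
import cProfile
import io
import pstats


def slow_normalize(text):
    # Hypothetical UDF body: builds the result by repeated string
    # concatenation, a common hidden cost in per-row Python code.
    out = ""
    for ch in text:
        out += ch.lower() if ch.isalpha() else " "
    return " ".join(out.split())


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats()
    return result, buffer.getvalue()


result, report = profile_call(slow_normalize, "Hello,  PySpark UDFs!")
print(result)
print(report)
```

The report lists call counts and cumulative time per function, which helps decide whether the UDF body itself is worth optimizing before it is run at scale on the cluster.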
Elasticsearch architecture is evolving. The new architecture enables many immediate and future improvements, including:
Meta introduces Velox, an open source unified execution engine aimed at accelerating data management systems and streamlining their development. Velox helps consolidate and unify data management systems.
JetBrains has released the first public preview of Fleet, accessible to everyone. Fleet is the company's new distributed polyglot editor and IDE.
A new advanced system for character animation from Ubisoft in France and Concordia University in the United States. The team developed a deep learning model for in-game character animation that allows developers to automatically generate natural character animation, reducing time costs.
This is a review of “Software Engineering at Google” curated by Titus Winters, Tom Manshreck and Hyrum Wright with takeaways and personal conclusions from the book, like:
Hyrum’s law - ubiquitous law, referenced many times in the book.
With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.
For example, Linux wraps process identifiers (PIDs) when they exceed 32768 (2^15) by default. Even though you can set a higher limit, doing so can break things, because many libraries assume that the maximum is 32768.
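Hyrum's law can also be shown with a small, self-contained sketch. Everything here is hypothetical (the function names and the tag data are invented for illustration): a library promises only "the unique tags", a caller quietly relies on the order they happen to come back in, and a refactor that keeps the contract still changes the caller's result.

```python
def unique_tags_v1(tags):
    # Contract: return the unique tags. Order is NOT promised, but this
    # implementation happens to preserve first-seen order.
    return list(dict.fromkeys(tags))


def unique_tags_v2(tags):
    # A later refactor with the same contract; the observable order
    # is now alphabetical instead of first-seen.
    return sorted(set(tags))


def primary_tag(unique_tags, tags):
    # Hypothetical caller that depends on an unpromised behavior:
    # it assumes the first element is the first tag seen.
    return unique_tags(tags)[0]


tags = ["spark", "airflow", "spark", "dbt"]
v1 = primary_tag(unique_tags_v1, tags)
v2 = primary_tag(unique_tags_v2, tags)
print(v1, v2)  # spark airflow
```

Both versions return the same set of tags, so the documented contract holds, yet the caller's answer changes because it depended on an observable detail rather than on the promise.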
Purvi Shah, the VP of Enterprise Big Data Platforms at American Express, explains how they have invested in the cloud to power this visibility and the complex suite of integrations they have built and maintained across legacy and modern systems to make it possible.
An InfoQ conference that gathers software leaders from early-adopter companies to present current trends and practices.
Speak at The Open Lakehouse Conference.
The conference is related to data lakes and lakehouses, as well as data and analytics more broadly (e.g., last time, Wayfair spoke about their streaming analytics architecture). For data practitioners.