DATA Pill feed

DATA Pill #02 - weekly Big Data, Cloud, AI/ML overview

ARTICLES 


1. MLOps: 5 Machine Learning problems resulting in ineffective use of data
| 10 min read | ML & MLOps | 🎉 Jakub Jurczak | GetInData 
5 Machine Learning areas at risk of inefficiency that could be solved by MLOps, including:
  • Data silos trap - mismatching IDs between warehouses can make joining data between different sources difficult and sometimes even impossible.
  • Time goes by. So does data. - You can never be sure whether the data being processed is fresh or stale, so records need TTL (time to live) information that says how long they remain valid.
  • Skewing data - If the value of a feature changes significantly over time, then the model performance could suffer.
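The staleness and skew points above can be sketched in a few lines of Python. This is a minimal illustration, not the approach from the article: the function names, the 7-day TTL and the mean-shift measure are assumptions chosen only to make the idea concrete.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, pstdev


def is_fresh(event_time, ttl, now=None):
    """Return True if a record is still within its TTL (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    return now - event_time <= ttl


def mean_shift(reference, current):
    """Distance between the two means, in units of the reference std deviation.

    A crude skew signal: large values suggest the feature has drifted.
    """
    spread = pstdev(reference)
    if spread == 0:
        return 0.0 if mean(current) == mean(reference) else float("inf")
    return abs(mean(current) - mean(reference)) / spread


# Assumed example values: a record from 1 May checked on 5 May and on 10 May.
recorded = datetime(2022, 5, 1, tzinfo=timezone.utc)
ttl = timedelta(days=7)
fresh_on_may_5 = is_fresh(recorded, ttl, now=datetime(2022, 5, 5, tzinfo=timezone.utc))
fresh_on_may_10 = is_fresh(recorded, ttl, now=datetime(2022, 5, 10, tzinfo=timezone.utc))

# Assumed feature samples: training-time values vs. values seen in production.
shift = mean_shift([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

In practice the TTL and the drift threshold would be set per feature; a shift of several standard deviations is usually a reason to look at retraining.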

 

2. Scaling data access by moving an exabyte of data to Google Cloud
| 7 min | GCP, BigQuery | Wini Tran, Di Zhao | Twitter Blog
Technical dive into how Twitter approached migration to BigQuery, conclusions and results:
  • Decrease the development time required for new dataset ingestion down from two weeks to one hour.
  • Reduce the maintenance required for data engineers by leveraging managed services, including Airflow.

3. Enabling engineering best practices into data workflows | 6 min read | dbt | Eetu Huhtala | If Technology 
Code, not graphical user interfaces, is the best abstraction to express complex analytic logic. — What, exactly, is dbt?
Using dbt to automate deployments, explained with an If Technology case study. Why did they start using dbt? What did the process and the result look like?
Some of the key benefits of using dbt:
  • Automated dependency management within a data pipeline
  • Modular code: single query per file
  • Automate deployments, both for test and production, enabling continuous integration
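To make "automated dependency management" concrete, here is a rough sketch of the idea: models reference each other with `{{ ref('...') }}`, and a run order is derived from those references. This is an illustration only, not dbt's actual implementation; the model names and SQL are hypothetical.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def build_order(models):
    """Return model names ordered so that every model runs after its refs."""
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical project: one query per model, as in the "modular code" bullet.
models = {
    "customer_orders": (
        "select * from {{ ref('stg_orders') }} "
        "join {{ ref('stg_customers') }} using (customer_id)"
    ),
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
}

order = build_order(models)
```

Both staging models end up before `customer_orders` in `order`, without anyone having to declare the run sequence by hand.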

 

4. Orchestrate big data jobs on on-premises clusters | 5 min | AWS | AWS Blog
Step Functions enables thousands of workflows to run in parallel. Additionally, Lambda provides flexibility in implementing arbitrary interfaces to the on-premises infrastructure and its compute resources. With additional steps in the orchestration, the solution also allows operations to monitor thousands of parallel jobs in a visual interface for better debugging.

 

5. My Journey to Analytics Engineering: How I Got Started and You Can, Too | 10 min | dbt | Emily Hawkins - Data Engineering Manager
Drizly's data stack, and the case that Analytics Engineering is an empowering, fun and lucrative career.



 

NEWS 


1. Announcing General Availability of Databricks Feature Store | 6 min | ML & MLOps | Databricks Blog
The first feature store co-designed with a data and MLOps platform is now generally available (GA).

 

2. Google Cloud launches AlloyDB, a new fully managed PostgreSQL database service | 5 min | Techcrunch Blog
Google announced the launch of AlloyDB, a new fully managed PostgreSQL-compatible database service that the company claims to be twice as fast for transactional workloads as AWS’s comparable Aurora PostgreSQL (and four times faster than standard PostgreSQL for the same workloads and up to 100 times faster for analytical queries).

 

3. Extending BigQuery Functions beyond SQL with Remote Functions, now in preview | 5 min + tutorial | From Google
With Remote Functions, you can now write custom SQL functions in Node.js, Python, Go, Java, .NET, Ruby, or PHP. This means you can personalize BigQuery for your company and leverage the same management and permission models without having to manage a server.
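As a rough idea of what the Python side of a remote function could look like: BigQuery sends a JSON body with a `calls` array (one list of arguments per row) and expects a JSON body with a `replies` array of the same length. The sketch below covers only that payload handling, under the assumption that it is wrapped in an HTTP endpoint such as a Cloud Function; `mask_email` and `handle_request` are hypothetical names, not part of any Google API.

```python
import json


def mask_email(email):
    """Hypothetical per-row logic: hide most of the local part of an address."""
    if email is None:
        return None
    local, sep, domain = email.partition("@")
    if not sep:
        return email  # not an email address, return unchanged
    return local[:1] + "***@" + domain


def handle_request(body):
    """Turn a remote-function request body into a response body.

    Assumes the documented shape: {"calls": [[arg1, ...], ...]} in,
    {"replies": [result, ...]} out, one reply per call, in order.
    """
    payload = json.loads(body)
    replies = [mask_email(*args) for args in payload["calls"]]
    return json.dumps({"replies": replies})


# Assumed sample batch of two rows.
sample = json.dumps({"calls": [["alice@example.com"], [None]]})
response = json.loads(handle_request(sample))
```

Rows arrive in batches, so the handler maps over `calls` rather than processing a single value.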

 

PODCAST

 

1. 5 current trends in the data and AI landscape (H1 2022) | 22 min | Radio DaTa
  • Retail becomes a very hot sector for AI/ML (plus new data sources, Metaverse, MLOps, Responsible AI)
  • Modern Data Platforms (plus SQL, hiring, open-source, data engineering pipelines)
  • Public Cloud (plus cloud-native, platform unification, data residency)
  • Data quality and data auditing
  • Data access (data cataloging, data discovery, and data mesh).
All explained and with ideas on how to follow such trends.

 

2. Machine Learning for Optimization | 26 min | The Data Exchange
How machine learning can be used to learn constraints in optimization problems. Use cases and trends in the use of machine learning for optimization problems.

 

3. Insights And Advice On Building A Data Lake Platform From Someone Who Learned The Hard Way | 58 min | Data Engineering Podcast
Srivatsan Sridharan has had the opportunity to design, build, and run data lake platforms for both Yelp and Robinhood, with many valuable lessons learned from each experience. In this episode he shares his insights and advice on how to approach such an undertaking in your own organization.

 

4. Automating ML Model Deployment | 1h | Super Data Science 
Dr. Doris Xin joins Jon Krohn to discuss how automating ML model deployment delivers groundbreaking change to data science productivity:
  • How Linea reduces ML model deployment down to a couple of lines of Python code
  • Linea use cases

 

DATAtube 

 

1. The future of Cloud databases  | 28 min | Google Cloud Tech 
75% of all databases are expected to be in the cloud this year. How is AlloyDB going to meet this trend?

 

2. Build End-To-End Data Pipelines With Snowflake | 40 min | Snowflake
How can you build faster, more performant, and smarter data pipelines in the language of your choice with Snowpark? See some of the latest capabilities in action.


CONFS AND MEETUPS

 

Airflow Summit 2022 | 23-27 May
You can still register for the biggest Airflow event of the year! A reunion of the global community of Apache Airflow practitioners and data leaders such as 🎉 GetInData! 
Dig into sessions like:
OpenLineage & Airflow - data lineage has never been easier by Maciej Obuchowski and Paweł Leszczyński 

 

5 Reasons NOT to Attend Snowflake Summit 2022 | 13-16 June (now lower price registration)
  1. You’re satisfied with how you currently share and collaborate on data
  2. You don’t want to be the first to hear about exciting new Snowflake product features