Data silos trap - mismatched IDs between warehouses can make joining data from different sources difficult, and sometimes even impossible.
Time goes by. So does data. - You can never be sure whether the data being processed is fresh or stale, so you need TTL (time to live) information that states how long data remains valid.
Skewing data - If the value of a feature changes significantly over time, then the model performance could suffer.
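The last two pitfalls can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the one-hour TTL, and the sample values are all hypothetical, and the drift measure (mean shift in units of the reference standard deviation) is just one simple choice among many.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, pstdev


def is_fresh(event_time: datetime, ttl: timedelta, now: datetime) -> bool:
    """Return True while the record is no older than its TTL."""
    return now - event_time <= ttl


def mean_shift(reference: list[float], current: list[float]) -> float:
    """Shift of the current mean, in units of the reference standard deviation."""
    spread = pstdev(reference)
    if spread == 0:
        return 0.0 if mean(current) == mean(reference) else float("inf")
    return abs(mean(current) - mean(reference)) / spread


# Hypothetical values: a fixed "now", a one-hour TTL, and two feature samples.
now = datetime(2022, 5, 20, 12, 0, tzinfo=timezone.utc)
fresh = is_fresh(now - timedelta(minutes=30), timedelta(hours=1), now)
stale = is_fresh(now - timedelta(hours=2), timedelta(hours=1), now)
shift = mean_shift([10.0, 12.0, 11.0, 9.0], [20.0, 22.0, 21.0, 19.0])
```

A record older than its TTL would be dropped or refreshed before use, and a large `shift` would be a hint to look at retraining the model.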
Decrease the development time required for new dataset ingestion from two weeks to one hour.
Reduce the maintenance required for data engineers by leveraging managed services, including Airflow.
3. Enabling engineering best practices into data workflows | 6 min read | dbt | Eetu Huhtala | If Technology
Code, not graphical user interfaces, is the best abstraction to express complex analytic logic. What, exactly, is dbt? Using dbt to automate deployments, explained with an If Technology case study: why they started using dbt, and what the process and the result looked like. Some of the key benefits of using dbt:
Automated dependency management within a data pipeline
Modular code: single query per file
Automated deployments, for both test and production, enabling continuous integration
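To make the first two benefits concrete, here is a rough sketch of how a run order can be derived from one-query-per-file models that point at each other with `{{ ref('...') }}`. It is not dbt's actual implementation: the model names, the SQL, and the `build_order` helper are hypothetical, and the standard-library `graphlib` module stands in for dbt's own dependency graph.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical models: one query per file, keyed by model name.
MODELS = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "customer_orders": (
        "select * from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_customers') }} c on o.customer_id = c.id"
    ),
    "revenue_report": "select * from {{ ref('customer_orders') }}",
}

# Matches {{ ref('model_name') }} and captures the model name.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def build_order(models: dict[str, str]) -> list[str]:
    """Return model names so that every model comes after the models it refs."""
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


order = build_order(MODELS)
```

Because the dependencies are read from the queries themselves, adding a new model file is enough to place it correctly in the run order.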
4. Orchestrate big data jobs on on-premises clusters | 5 min | AWS | AWS Blog
Step Functions enables thousands of workflows to run in parallel. Additionally, Lambda provides flexibility in implementing arbitrary interfaces to the on-premises infrastructure and its compute resources. With additional steps in the orchestration, the solution also allows operations to monitor thousands of parallel jobs in a visual interface for better debugging.
2. Google Cloud launches AlloyDB, a new fully managed PostgreSQL database service | 5 min | TechCrunch Blog
Google announced the launch of AlloyDB, a new fully managed PostgreSQL-compatible database service that the company claims is twice as fast for transactional workloads as AWS’s comparable Aurora PostgreSQL (and four times faster than standard PostgreSQL for the same workloads, and up to 100 times faster for analytical queries).
3. Extending BigQuery Functions beyond SQL with Remote Functions, now in preview | 5 min + tutorial | From Google
With Remote Functions, you can now write custom SQL functions in Node.js, Python, Go, Java, .NET, Ruby, or PHP. This means you can personalize BigQuery for your company and leverage the same management and permission models, without having to manage a server.
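For a feel of what sits behind a Remote Function, here is a minimal sketch of the handler logic in Python. It assumes the documented contract in which BigQuery sends a JSON body with a `calls` array (one argument list per row) and expects a `replies` array of the same length; the `mask_email` logic and the `handle_request` name are hypothetical, and the HTTP wiring of the serving endpoint is left out.

```python
import json


def mask_email(email: str) -> str:
    """Hypothetical per-row logic: keep the first character and the domain."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}" if domain else "***"


def handle_request(body: str) -> str:
    """Turn a batch of calls into a batch of replies, one reply per call."""
    request = json.loads(body)
    replies = [mask_email(args[0]) for args in request["calls"]]
    return json.dumps({"replies": replies})


# A hypothetical batch of two rows, each with a single argument.
sample_body = json.dumps({"calls": [["alice@example.com"], ["bob@example.org"]]})
response = handle_request(sample_body)
```

The batch shape is the main design point: one HTTP request carries many rows, so the function should stay stateless and return replies in the same order as the calls.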
Retail becomes a very hot sector for AI/ML (plus new data sources, Metaverse, MLOps, Responsible AI)
Modern Data Platforms (plus SQL, hiring, open-source, data engineering pipelines)
Public Cloud (plus cloud-native, platform unification, data residency)
Data quality and data auditing
Data access (data cataloging, data discovery, and data mesh).
All explained, with ideas on how to follow these trends.
2. Machine Learning for Optimization | 26 min | The Data Exchange
How machine learning can be used to learn constraints in optimization problems. Use cases and trends in the use of machine learning for optimization problems.
3. Insights And Advice On Building A Data Lake Platform From Someone Who Learned The Hard Way | 58 min | Data Engineering Podcast
Srivatsan Sridharan has had the opportunity to design, build, and run data lake platforms for both Yelp and Robinhood, with many valuable lessons learned from each experience. In this episode he shares his insights and advice on how to approach such an undertaking in your own organization.
4. Automating ML Model Deployment | 1h | Super Data Science
Dr. Doris Xin joins Jon Krohn to discuss how automating ML model deployment delivers groundbreaking change to data science productivity.
How Linea reduces ML model deployment down to a couple of lines of Python code
Linea use cases
DATAtube
1. The future of Cloud databases | 28 min | Google Cloud Tech
75% of all databases are expected to be in the cloud this year. How is AlloyDB going to meet this trend?
2. Build End-To-End Data Pipelines With Snowflake | 40 min | Snowflake
How to build faster, more performant, and smarter data pipelines in the language of your choice with Snowpark. See some of the latest capabilities in action.
CONFS AND MEETUPS
Airflow Summit 2022 | 23-27 May
You can still register for the biggest Airflow event of the year! A reunion of the global community of Apache Airflow practitioners and data leaders such as 🎉 GetInData! Dig into sessions like: OpenLineage & Airflow - data lineage has never been easier, by Maciej Obuchowski and Paweł Leszczyński