We couldn’t be more excited to announce the first sessions for our second annual Data Engineering Summit, co-located with ODSC East this April. Join us for two days of talks and panels from leading experts and data engineering pioneers. In the meantime, take a look at the first group of sessions below.

How to Practice Data-Centric AI and Have AI Improve its Own Dataset

Jonas Mueller | Chief Scientist and Co-Founder | Cleanlab

Data-centric AI is poised to be a game changer for machine learning projects. Manual labor is no longer the only option for improving data. Instead, data-centric AI introduces systematic techniques that use the baseline model to find and fix dataset issues, enabling you to improve your model’s performance without changing the code.

In this session, you’ll learn how to operationalize fundamental data-centric AI ideas across a wide variety of datasets. With an exploration of real-world data, this session will equip you with the knowledge to immediately retrain better models.
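The core idea of using a model to audit its own training data can be illustrated in a few lines. The sketch below is a minimal, hypothetical example (not the speaker's actual method): it flags examples whose predicted probability for their assigned label is low, a simplified version of the "self-confidence" signal used in confident-learning approaches.

```python
import numpy as np

def find_likely_label_issues(labels, pred_probs, threshold=0.5):
    """Flag examples whose model-assigned probability for their
    given label falls below a threshold (likely mislabeled)."""
    self_confidence = pred_probs[np.arange(len(labels)), labels]
    return np.where(self_confidence < threshold)[0]

# Toy data: 4 examples, 2 classes.
labels = np.array([0, 1, 0, 1])
pred_probs = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.1, 0.9],   # labeled class 0, but the model is confident it's class 1
    [0.4, 0.6],
])
print(find_likely_label_issues(labels, pred_probs))  # → [2]
```

The flagged indices would then be reviewed or relabeled before retraining, which is the loop the session promises to make systematic.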

Tutorial: Introduction to Apache Arrow and Apache Parquet, using Python and Pyarrow

Andrew Lamb | Chair of the Apache Arrow Program Management Committee | Staff Software Engineer | InfluxData

Take a deep dive into the fundamentals of Apache Arrow and Apache Parquet with Andrew Lamb. You’ll learn how to load data to and from pyarrow arrays, CSV, and Parquet files, and how to use pyarrow to quickly perform analytic operations such as filtering, aggregation, joining, and sorting.

In completing these tasks you’ll experience the advantages of the open Arrow ecosystem firsthand, as well as see how Arrow facilitates fast and efficient interoperability with pandas, Polars, DataFusion, DuckDB, and other technologies that support the Arrow memory format.

Data Engineering within the Age of Data Regulations

Alex Gorelik | Distinguished Engineer | LinkedIn

As AI advances, so do data regulations like GDPR, CCPA, DMA, and many others. These regulations allow users to control their data and put limitations on what companies can do with it. In many cases, the ability to operate in a country depends on adhering to these restrictions.

This talk will illustrate a real-world example of how to convert these regulations into policy and, subsequently, how to integrate policy enforcement into data engineering practices.

The 12 Factor App for Data

James Bowkett | Technical Delivery Director | OpenCredo

To keep pace with an increasingly data-centric world, the 12-factor app helps define how to think about and design cloud-native applications. This session will take you through the 12 principles of designing data-centric applications, grouped into four categories: Architecture & Design, Quality & Validation (Observability), Audit & Explainability, and Consumption.

Engineering Knowledge Graph Data for a Semantic Recommendation AI System

Ethan Hamilton | Data Engineer | Enterprise Knowledge

This in-depth session will teach you how to design a semantic recommendation system. These systems represent data as knowledge graphs and implement graph traversal algorithms to help find content in massive datasets. These systems are not only useful across a wide range of industries, they’re also fun for data engineers to work on.
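The traversal idea can be sketched in a few lines. The following toy example (the graph, node names, and hop limit are all hypothetical, not from the session) uses breadth-first search to surface content within a fixed number of semantic hops of a starting item:

```python
from collections import deque

# Toy knowledge graph: nodes are content items, edges are semantic relations.
graph = {
    "intro_to_graphs": ["graph_theory", "networkx_tutorial"],
    "graph_theory": ["shortest_paths"],
    "networkx_tutorial": ["shortest_paths", "pagerank_explained"],
    "shortest_paths": [],
    "pagerank_explained": [],
}

def recommend(start, max_hops=2):
    """Breadth-first traversal: return items within max_hops of start."""
    seen, queue, results = {start}, deque([(start, 0)]), []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # don't expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                results.append(neighbor)
                queue.append((neighbor, depth + 1))
    return results

print(recommend("intro_to_graphs"))
# → ['graph_theory', 'networkx_tutorial', 'shortest_paths', 'pagerank_explained']
```

Production systems layer ranking, edge weights, and semantic typing on top of this basic traversal, but the graph-walk core is the same.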

Data Pipeline Architecture – Stop Building Monoliths 

Elliott Cordo | Founder, Architect, Builder | Datafutures 

Although common, data monoliths present several challenges, especially for larger teams and organizations that allow for federated data product development. 

In this session, you’ll explore possible solutions from microservices and event-based architecture, with a focus on multi-Airflow infrastructure, micro-DAG packaging and deployment, dbt multi-project implementation, rational use of containers, and data sharing/publication strategies.

Is Gen AI A Data Engineering or Software Engineering Problem?

Barr Moses | Co-Founder & CEO | Monte Carlo

At first, Gen AI seemed like a software engineering and API integration project. However, as production and talent become more accessible, the teams who got a head start on finding ways to use Gen AI will be ahead of the game. Join this session with Barr Moses to get her take on the question of whether Gen AI is a data engineering or software engineering problem.

Dive into Data: The Future of the Single Source of Truth is an Open Data Lake

Christina Taylor | Senior Staff Engineer | Catalyst Software

Join this session for an exploration of building a centralized data repository that ingests from a wide range of sources, including service databases, SaaS applications, unstructured files, and conversational data. Using real-world examples, you’ll see how you can reduce costs and vendor lock-in by migrating from proprietary data warehouses to an open data lake.

With the insights gained during this session, you’ll be better equipped to choose the most appropriate technology to accommodate diverse analytics, machine learning, and product use cases.

Tale of Apache Parquet reaching pinnacle of friendship with Data Engineers

Gokul Prabagaren | Engineering Manager | Capital One

Join this session to see how a 100% cloud-operated company runs its data processing pipeline and how Apache Parquet plays a pivotal role in each step of its processing. You’ll explore a wide range of design patterns implemented using Parquet and Spark, as well as how the company’s resiliency has increased with the use of Apache Parquet.

Conclusion

At the Data Engineering Summit on April 24th, co-located with ODSC East 2024, you’ll be at the forefront of all the major changes in data engineering before they hit. So get your pass today, and keep yourself ahead of the curve.

This article was originally published at summit.ai