This repository includes a set of presentations and hands-on tutorials for the Data Lake course of the Informatique pour la Science des Données Master program at Université Paris Sud.
I decided to open the content of this module to:
- make it available to anyone desiring to learn about data engineering
- improve the quality of the module by making it public and open for contributions
Presentations:
- Introduction to Data Systems covers:
  - What is Big Data
  - From Data Warehouse to Data Lake
  - Data processing architecture: Lambda architecture and Kappa architecture
- Introduction to HDFS
  - Design goals and concepts of HDFS
  - Description of data operations in HDFS
- Introduction to Data Storage formats: Avro, ORC and Parquet
- Data Streaming with Apache Kafka
  - Data integration problem
  - What is a write-ahead log
  - Apache Kafka concepts
- Distributed Databases - NoSQL
  - Discussion about the CAP Theorem
  - Classes of NoSQL databases
  - An Introduction to MongoDB
Tutorials: