Data-engineering-nanodegree

Projects completed for the Udacity Data Engineering Nanodegree.

Course 1: Data Modeling

Introduction to Data Modeling

➔ Understand the purpose of data modeling

➔ Identify the strengths and weaknesses of different types of databases and data storage techniques

➔ Create a table in Postgres and Apache Cassandra

Relational Data Models

➔ Understand when to use a relational database

➔ Understand the difference between OLAP and OLTP databases

➔ Create normalized data tables

➔ Implement denormalized schemas (e.g., star, snowflake)
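
As an illustration of the fact-and-dimension layout behind a star schema, here is a minimal sketch. It assumes Python's built-in `sqlite3` as a stand-in for Postgres so it runs without a server, and the table and column names (`dim_user`, `dim_song`, `fact_play`) are hypothetical, not taken from the course projects.

```python
import sqlite3

# In-memory database; sqlite3 is only a stand-in for Postgres here.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables hold descriptive attributes (hypothetical names).
cur.execute("CREATE TABLE dim_user (user_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE dim_song (song_id INTEGER PRIMARY KEY, title TEXT)")

# The fact table holds measurements plus foreign keys to the dimensions.
cur.execute(
    """
    CREATE TABLE fact_play (
        play_id    INTEGER PRIMARY KEY,
        user_id    INTEGER REFERENCES dim_user(user_id),
        song_id    INTEGER REFERENCES dim_song(song_id),
        duration_s INTEGER
    )
    """
)

cur.executemany("INSERT INTO dim_user VALUES (?, ?)", [(1, "ana"), (2, "ben")])
cur.executemany("INSERT INTO dim_song VALUES (?, ?)", [(10, "Intro"), (11, "Outro")])
cur.executemany(
    "INSERT INTO fact_play VALUES (?, ?, ?, ?)",
    [(1, 1, 10, 120), (2, 1, 11, 200), (3, 2, 10, 90)],
)

# An analytical query needs only one join from the fact table to a dimension.
listening_time = cur.execute(
    """
    SELECT u.name, SUM(f.duration_s)
    FROM fact_play AS f
    JOIN dim_user AS u ON u.user_id = f.user_id
    GROUP BY u.name
    ORDER BY u.name
    """
).fetchall()
print(listening_time)  # [('ana', 320), ('ben', 90)]
```

The same DDL shape carries over to Postgres, although types and constraint enforcement differ between the two engines.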

NoSQL Data Models

➔ Understand when to use NoSQL databases and how they differ from relational databases

➔ Select the appropriate primary key and clustering columns for a given use case

➔ Create a NoSQL database in Apache Cassandra
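
To make the roles of the partition key and clustering column concrete, here is a toy in-memory model written in plain Python. It is not the Cassandra driver or Cassandra's storage engine; the class `ToyTable` and its methods are hypothetical. It only mimics two behaviours: rows are grouped by partition key, and rows inside a partition are kept sorted by the clustering column.

```python
from bisect import insort
from collections import defaultdict


class ToyTable:
    """Toy model of a table with PRIMARY KEY ((partition_key), clustering_key)."""

    def __init__(self):
        # partition key -> list of (clustering_key, value), kept sorted
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        rows = self._partitions[partition_key]
        # Writing the same full primary key again overwrites the row (an upsert).
        rows[:] = [row for row in rows if row[0] != clustering_key]
        insort(rows, (clustering_key, value))

    def select(self, partition_key):
        # A query by partition key reads a single partition, already in order.
        return list(self._partitions.get(partition_key, []))


table = ToyTable()
table.insert("session_1", 2, "song B")
table.insert("session_1", 1, "song A")
table.insert("session_2", 1, "song C")
table.insert("session_1", 2, "song B (replayed)")

print(table.select("session_1"))
# [(1, 'song A'), (2, 'song B (replayed)')]
```

The design point the sketch tries to capture is that the table is shaped around the query: the value you filter on becomes the partition key, and the order you want the results in becomes the clustering column.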

Project: Data Modeling with Postgres and Apache Cassandra

Course 2: Cloud Data Warehouses

Introduction to Data Warehouses

➔ Understand Data Warehousing architecture

➔ Run an ETL process to denormalize a database (3NF to Star)

➔ Create an OLAP cube from facts and dimensions

➔ Compare columnar vs. row-oriented approaches
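
The columnar versus row-oriented comparison can be sketched with plain Python containers. This is an illustration of the two layouts only, with made-up sample data; real engines add compression, paging, and indexing on top.

```python
# Row-oriented layout: each record is stored together.
orders_by_row = [
    {"order_id": 1, "region": "EU", "amount": 20.0},
    {"order_id": 2, "region": "US", "amount": 35.5},
    {"order_id": 3, "region": "EU", "amount": 10.0},
]

# Column-oriented layout: each column is stored together.
orders_by_column = {
    name: [record[name] for record in orders_by_row]
    for name in orders_by_row[0]
}

# Fetching one whole record is natural in the row layout.
second_order = orders_by_row[1]

# Aggregating one column is natural in the columnar layout:
# only the "amount" list is scanned, the other columns are never touched.
total_amount = sum(orders_by_column["amount"])

print(second_order)
print(total_amount)  # 65.5
```

Row storage tends to suit transactional workloads that read and write whole records, while columnar storage tends to suit analytical workloads that aggregate a few columns over many rows.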

Introduction to the Cloud with AWS

➔ Understand cloud computing

➔ Create an AWS account and understand its services

➔ Set up Amazon S3, IAM, VPC, EC2, RDS PostgreSQL

Implementing Data Warehouses on AWS

➔ Identify components of the Redshift architecture

➔ Run ETL process to extract data from S3 into Redshift

➔ Set up AWS infrastructure using Infrastructure as Code (IaC)

➔ Design an optimized table by selecting the appropriate distribution style and sorting key
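
The effect of a distribution style can be illustrated with a toy simulation in plain Python. It does not call Redshift and does not reproduce Redshift's actual hashing; `NUM_SLICES`, the helper functions, and the sample rows are all hypothetical. It contrasts round-robin placement (in the spirit of EVEN distribution) with placement by a hash of one column (in the spirit of KEY distribution). Sort keys are not modelled.

```python
import zlib
from collections import defaultdict

NUM_SLICES = 4  # hypothetical number of slices in the cluster


def distribute_even(rows):
    """Round-robin placement: row i goes to slice i mod NUM_SLICES."""
    slices = defaultdict(list)
    for index, row in enumerate(rows):
        slices[index % NUM_SLICES].append(row)
    return slices


def distribute_by_key(rows, key):
    """Hash placement: equal key values always land on the same slice."""
    slices = defaultdict(list)
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        slices[digest % NUM_SLICES].append(row)
    return slices


def slices_holding(slices, key, value):
    """Return the ids of all slices containing a row with row[key] == value."""
    return {
        slice_id
        for slice_id, rows in slices.items()
        if any(row[key] == value for row in rows)
    }


plays = [{"play_id": i, "user_id": i % 3} for i in range(12)]

even_spread = slices_holding(distribute_even(plays), "user_id", 0)
key_spread = slices_holding(distribute_by_key(plays, "user_id"), "user_id", 0)

print(len(even_spread))  # 4: one user's rows are scattered over every slice
print(len(key_spread))   # 1: one user's rows are co-located
```

Co-locating rows that share a join key means a join on that key can proceed without moving rows between slices, which is the usual motivation for choosing a distribution key. The trade-off is skew: a key with few distinct values can leave some slices far fuller than others.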

Project 2: Data Infrastructure on the Cloud

Course 3: Data Lakes with Spark

The Power of Spark

➔ Understand the big data ecosystem

➔ Understand when to use Spark and when not to use it

Data Wrangling with Spark

➔ Manipulate data with SparkSQL and Spark Dataframes

➔ Use Spark for ETL purposes

Debugging and Optimization

➔ Troubleshoot common errors and optimize code using the Spark Web UI

Introduction to Data Lakes

➔ Understand the purpose and evolution of data lakes

➔ Implement data lakes on Amazon S3, EMR, Athena, and AWS Glue

➔ Use Spark to run ELT processes and analytics on data of diverse sources, structures, and vintages

➔ Understand the components and issues of data lakes

Project 3: Big Data with Spark

Course 4: Automate Data Pipelines

Data Pipelines

➔ Create data pipelines with Apache Airflow

➔ Set up task dependencies

➔ Create data connections using hooks
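
Task dependencies form a directed acyclic graph, and a scheduler must run each task only after its upstream tasks finish. The sketch below shows that ordering idea with the standard-library `graphlib` module (Python 3.9+). It is not Airflow code, and the task names are hypothetical; in Airflow the same relationships would typically be declared between operators, for example with the `>>` operator.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks that must finish before it can start.
dependencies = {
    "create_tables": set(),
    "stage_events": {"create_tables"},
    "stage_songs": {"create_tables"},
    "load_fact_table": {"stage_events", "stage_songs"},
    "run_quality_checks": {"load_fact_table"},
}

# static_order() yields one valid execution order; it raises
# graphlib.CycleError if the dependencies contain a cycle.
execution_order = list(TopologicalSorter(dependencies).static_order())

for position, task in enumerate(execution_order, start=1):
    print(position, task)
```

The two staging tasks have no dependency on each other, so their relative order is not fixed; a scheduler is free to run them in parallel.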

Data Quality

➔ Track data lineage

➔ Set up data pipeline schedules

➔ Partition data to optimize pipelines

➔ Write tests to ensure data quality

➔ Backfill data
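
A data quality test can be as simple as asserting that a loaded table is non-empty and that a required column has no NULLs. The following is a minimal sketch under stated assumptions: it uses the built-in `sqlite3` module instead of a warehouse connection, and the function `check_table` and the `users` and `events` tables are hypothetical.

```python
import sqlite3


def check_table(conn, table, required_column):
    """Return a list of failure messages; an empty list means all checks passed.

    The table and column names are interpolated into the SQL text, so they
    must come from trusted configuration, never from user input.
    """
    failures = []

    row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if row_count == 0:
        failures.append(f"{table} is empty")

    null_count = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {required_column} IS NULL"
    ).fetchone()[0]
    if null_count > 0:
        failures.append(f"{table}.{required_column} has {null_count} NULL value(s)")

    return failures


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE events (event_id INTEGER, user_id INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ana"), (None, "ben")])

users_failures = check_table(conn, "users", "user_id")
events_failures = check_table(conn, "events", "user_id")

print(users_failures)   # ['users.user_id has 1 NULL value(s)']
print(events_failures)  # ['events is empty']
```

In a pipeline, a task wrapping such a check would usually raise an exception when the list is non-empty so that the run is marked as failed.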

Production Data Pipelines

➔ Build reusable and maintainable pipelines

➔ Build your own Apache Airflow plugins

➔ Implement subDAGs

➔ Set up task boundaries

➔ Monitor data pipelines

Project: Data Pipelines with Airflow