Projects completed for the Data Engineering Nanodegree by Udacity.com. The program covers the following learning objectives:
➔ Understand the purpose of data modeling
➔ Identify the strengths and weaknesses of different types of databases and data storage techniques
➔ Create a table in Postgres and Apache Cassandra
➔ Understand when to use a relational database
➔ Understand the difference between OLAP and OLTP databases
➔ Create normalized data tables
➔ Implement denormalized schemas (e.g. star, snowflake)
➔ Understand when to use NoSQL databases and how they differ from relational databases
➔ Select the appropriate primary key and clustering columns for a given use case
➔ Create a NoSQL database in Apache Cassandra
➔ Understand Data Warehousing architecture
➔ Run an ETL process to denormalize a database (3NF to star schema)
➔ Create an OLAP cube from facts and dimensions
➔ Compare columnar vs. row-oriented approaches
➔ Understand cloud computing
➔ Create an AWS account and understand its services
➔ Set up Amazon S3, IAM, VPC, EC2, RDS PostgreSQL
➔ Identify components of the Redshift architecture
➔ Run an ETL process to load data from S3 into Redshift
➔ Set up AWS infrastructure using Infrastructure as Code (IaC)
➔ Design an optimized table by selecting the appropriate distribution style and sorting key
➔ Understand the big data ecosystem
➔ Understand when to use Spark and when not to use it
➔ Manipulate data with Spark SQL and Spark DataFrames
➔ Use Spark for ETL purposes
➔ Troubleshoot common errors and optimize code using the Spark Web UI
➔ Understand the purpose and evolution of data lakes
➔ Implement data lakes on Amazon S3, EMR, Athena, and AWS Glue
➔ Use Spark to run ELT processes and analytics on data from diverse sources, structures, and vintages
➔ Understand the components and issues of data lakes
➔ Create data pipelines with Apache Airflow
➔ Set up task dependencies
➔ Create data connections using hooks
➔ Track data lineage
➔ Set up data pipeline schedules
➔ Partition data to optimize pipelines
➔ Write tests to ensure data quality
➔ Backfill data
➔ Build reusable and maintainable pipelines
➔ Build your own Apache Airflow plugins
➔ Implement subDAGs
➔ Set up task boundaries
➔ Monitor data pipelines
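
As a rough illustration of two of the objectives above (denormalizing a 3NF database into a star schema, and writing a data quality test), here is a minimal sketch using Python's built-in `sqlite3` module rather than Postgres or Redshift. The table names, columns, and sample rows are hypothetical and are not taken from the Nanodegree projects.

```python
import sqlite3

# Hypothetical table names, used only for this illustration.
STAR_TABLES = ("dim_customer", "fact_sales")


def load_source(conn):
    """Create a small, made-up 3NF source schema with sample rows."""
    conn.executescript("""
        CREATE TABLE cities (city_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city_id INTEGER);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
        INSERT INTO cities VALUES (1, 'Berlin', 'Germany'), (2, 'Lyon', 'France');
        INSERT INTO customers VALUES (10, 'Ada', 1), (11, 'Bob', 2);
        INSERT INTO orders VALUES (100, 10, 25.0), (101, 10, 15.0), (102, 11, 40.0);
    """)


def build_star_schema(conn):
    """Denormalize the source tables into one dimension and one fact table."""
    conn.executescript("""
        -- Dimension: customer attributes flattened with the city lookup.
        CREATE TABLE dim_customer AS
            SELECT c.customer_id, c.name, ci.city, ci.country
            FROM customers c
            JOIN cities ci ON c.city_id = ci.city_id;

        -- Fact: one row per order, keyed to the customer dimension.
        CREATE TABLE fact_sales AS
            SELECT order_id, customer_id, amount FROM orders;
    """)


def check_not_empty(conn, table):
    """Simple data quality check: fail if a star-schema table has no rows."""
    if table not in STAR_TABLES:
        raise ValueError(f"Unknown table: {table}")
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if count == 0:
        raise ValueError(f"Data quality check failed: {table} is empty")
    return count


def revenue_by_country(conn):
    """Analytical query that needs only one join on the star schema."""
    return conn.execute("""
        SELECT d.country, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_customer d ON f.customer_id = d.customer_id
        GROUP BY d.country
        ORDER BY d.country
    """).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_source(conn)
    build_star_schema(conn)
    for table in STAR_TABLES:
        print(table, check_not_empty(conn, table))
    print(revenue_by_country(conn))
```

In a real pipeline the same three steps (load, transform, check) would typically run as separate tasks, for example as Airflow operators, so that a failed quality check stops downstream work.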