
IBM Data Engineering Capstone Project

Technology used: MySQL, MongoDB, PostgreSQL, DB2 on Cloud, IBM Cognos Analytics, Python, Apache Airflow, Hadoop, Spark

Abstract

This capstone project is part of the IBM Data Engineering Professional Certificate.

The task is to build a data platform for retailer data analytics by assuming the role of an Associate Data Engineer who has recently joined an e-commerce organization called SoftCart.

Objective

  • Design a data platform that uses MySQL as an OLTP database and MongoDB as a NoSQL database.
  • Design and implement a data warehouse and generate reports from the data.
  • Design a reporting dashboard that reflects the key metrics of the business.
  • Create an ETL pipeline that extracts data from the OLTP and NoSQL databases, transforms it, and loads it into the data warehouse.
  • Create a Spark connection to the data warehouse, and then deploy a machine learning model.

About the Company

  • SoftCart, an e-commerce company, uses a hybrid architecture with some of its databases on premises and some on the cloud.
  • SoftCart's online presence is primarily through its website, which customers access using a variety of devices like laptops, mobiles and tablets.
  • All product catalog data is stored on the MongoDB NoSQL server, and all transactional data such as inventory and sales is stored on the MySQL database server. SoftCart's web server is driven entirely by these two databases.
  • Data is periodically extracted from these two databases and put into the staging data warehouse running on PostgreSQL.
  • The production data warehouse is on the cloud instance of IBM DB2 server.
  • BI teams connect to IBM DB2 for operational dashboard creation. IBM Cognos Analytics is used to create dashboards.
  • SoftCart uses a Hadoop cluster as its big data platform, where all the data is collected for analytics purposes.
  • Spark is used to analyze the data on the Hadoop cluster.
  • ETL pipelines running on Apache Airflow move data between the OLTP and NoSQL databases and the data warehouse.

Project Assignment Outputs

OLTP Database: MySQL

Documentation: OLTP Database

MySQL is used to design the OLTP database for the e-commerce website. The sales OLTP database design is based on the OLTP data provided. The data is loaded into the sales_data table, and a datadump.sh bash script takes backups using mysqldump.
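A minimal sketch of what the table creation and backup step might look like from Python is shown below; the column names, credentials, and file names are assumptions rather than the project's actual values, and the backup simply shells out to mysqldump the way datadump.sh does:

```python
# Hypothetical sketch: create the sales_data table and take a backup with mysqldump.
# Column names, credentials, and file paths are assumptions for illustration.
import subprocess
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root",
                               password="secret", database="sales")
cur = conn.cursor()

# Assumed schema for the sales_data table
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales_data (
        product_id  INT,
        customer_id INT,
        price       INT,
        quantity    INT,
        timestamp   DATETIME
    )
""")
conn.commit()
conn.close()

# Rough equivalent of datadump.sh: dump the sales_data table to a SQL file
with open("sales_data_backup.sql", "w") as backup:
    subprocess.run(
        ["mysqldump", "-u", "root", "-psecret", "sales", "sales_data"],
        stdout=backup, check=True,
    )
```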

NoSQL Database: MongoDB

Documentation: NoSQL Database

MongoDB is used to set up a NoSQL database for the e-commerce website. The catalog NoSQL database is created, and the catalog data is loaded into the electronics collection using mongoimport. An electronics.csv file containing a subset of fields is then exported from the electronics collection using mongoexport.
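The project uses the mongoimport and mongoexport command-line tools; the pymongo sketch below illustrates an equivalent load and field-subset export. The input file name and the exported field list are assumptions:

```python
# Hypothetical pymongo equivalent of the mongoimport/mongoexport steps.
# The input file name and the exported field list are assumptions.
import csv
import json
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["catalog"]["electronics"]

# Load the catalog data (roughly what mongoimport does)
with open("catalog.json") as f:
    docs = [json.loads(line) for line in f]   # assumes one JSON document per line
collection.insert_many(docs)

# Export a subset of fields to electronics.csv (roughly what mongoexport does)
fields = ["_id", "type", "model"]             # assumed field subset
with open("electronics.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    for doc in collection.find({}, {field: 1 for field in fields}):
        writer.writerow({field: doc.get(field) for field in fields})

client.close()
```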

Data Warehouse: PostgreSQL

Documentation: Data Warehouse

PostgreSQL is used to design and create the data warehouse schema using pgAdmin's ERD Design Tool. The star schema design is based on the sample order data provided.

Data Warehouse ERD

Fact Table: softcartFactSales

Dimension Tables: softcartDimDate, softcartDimCategory, softcartDimCountry, softcartDimItem
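A hedged sketch of what the star-schema DDL could look like is shown below; only the table names come from the project, while the columns, types, and connection details are illustrative assumptions:

```python
# Hypothetical DDL for the star schema, executed through psycopg2.
# Only the table names are from the project; columns and types are assumptions.
import psycopg2

ddl = """
CREATE TABLE "softcartDimDate" (
    dateid   INT PRIMARY KEY,
    date     DATE,
    year     INT,
    quarter  INT,
    month    INT,
    day      INT
);
CREATE TABLE "softcartDimCategory" (
    categoryid INT PRIMARY KEY,
    category   VARCHAR(255)
);
CREATE TABLE "softcartDimCountry" (
    countryid INT PRIMARY KEY,
    country   VARCHAR(255)
);
CREATE TABLE "softcartDimItem" (
    itemid INT PRIMARY KEY,
    item   VARCHAR(255)
);
CREATE TABLE "softcartFactSales" (
    orderid    INT PRIMARY KEY,
    dateid     INT REFERENCES "softcartDimDate"(dateid),
    categoryid INT REFERENCES "softcartDimCategory"(categoryid),
    countryid  INT REFERENCES "softcartDimCountry"(countryid),
    itemid     INT REFERENCES "softcartDimItem"(itemid),
    amount     NUMERIC(10, 2)
);
"""

conn = psycopg2.connect(host="localhost", dbname="softcart_dwh",
                        user="postgres", password="secret")
with conn, conn.cursor() as cur:
    cur.execute(ddl)
conn.close()
```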

Data Warehouse Reporting: IBM DB2

Documentation: Data Warehouse Reporting

IBM DB2 is used to generate reports from the data in the data warehouse. The data warehouse is created from the schema designed above and the data is loaded into it.

Queries and a materialized query table (MQT) are then created for data analytics.
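As an illustration, an MQT that pre-aggregates sales per country might be created from Python with the ibm_db driver as sketched below; the MQT name, its definition, and the connection string are assumptions, not the project's actual objects:

```python
# Hypothetical example of creating an MQT in DB2 from Python with ibm_db.
# The MQT definition, table names, and credentials are assumptions.
import ibm_db

dsn = (
    "DATABASE=BLUDB;HOSTNAME=example.databases.appdomain.cloud;PORT=32459;"
    "PROTOCOL=TCPIP;UID=user;PWD=secret;SECURITY=SSL;"
)
conn = ibm_db.connect(dsn, "", "")

# An MQT that pre-computes total sales per country (assumed definition)
create_mqt = """
CREATE TABLE total_sales_per_country AS (
    SELECT co.country, SUM(f.amount) AS total_sales
    FROM "softcartFactSales" f
    JOIN "softcartDimCountry" co ON f.countryid = co.countryid
    GROUP BY co.country
)
DATA INITIALLY DEFERRED REFRESH DEFERRED MAINTAINED BY SYSTEM
"""
ibm_db.exec_immediate(conn, create_mqt)

# Populate the deferred-refresh MQT after creation
ibm_db.exec_immediate(conn, "REFRESH TABLE total_sales_per_country")
ibm_db.close(conn)
```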

Reporting Dashboard: IBM Cognos Analytics

IBM Cognos Analytics is used to design a reporting dashboard that reflects the key metrics of the business. Ecommerce.csv is loaded into the sales_history table in the DB2 warehouse and used to generate the following reports:

  • Month-wise total sales for the year 2020 (line chart)
  • Category-wise total sales (pie chart)
  • Quarterly sales of mobile phones (bar chart)

ETL: Python

Documentation: ETL

Python is used to perform an incremental data load from the staging data warehouse to the production data warehouse and keep the two in sync. The automation.py script connects to both warehouses and loads all records added to staging since the last load into production.
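A rough sketch of the incremental-load logic is shown below, assuming both warehouses expose a sales_data table keyed by a numeric rowid; the table and column names and connection details are assumptions:

```python
# Hypothetical sketch of the incremental load in automation.py.
# Table/column names and connection details are assumptions.
import ibm_db
import psycopg2

def get_last_rowid(db2_conn):
    """Return the highest rowid already present in the production warehouse."""
    stmt = ibm_db.exec_immediate(db2_conn, "SELECT MAX(rowid) FROM sales_data")
    row = ibm_db.fetch_tuple(stmt)
    return row[0] or 0

def get_latest_records(pg_conn, last_rowid):
    """Fetch records added to staging since the last load."""
    with pg_conn.cursor() as cur:
        cur.execute(
            "SELECT rowid, product_id, customer_id, quantity "
            "FROM sales_data WHERE rowid > %s ORDER BY rowid",
            (last_rowid,),
        )
        return cur.fetchall()

def insert_records(db2_conn, records):
    """Append the new records to the production warehouse."""
    insert_sql = ("INSERT INTO sales_data (rowid, product_id, customer_id, quantity) "
                  "VALUES (?, ?, ?, ?)")
    stmt = ibm_db.prepare(db2_conn, insert_sql)
    for record in records:
        ibm_db.execute(stmt, record)

staging = psycopg2.connect(host="localhost", dbname="staging",
                           user="postgres", password="secret")
production = ibm_db.connect(
    "DATABASE=BLUDB;HOSTNAME=example.databases.appdomain.cloud;PORT=32459;"
    "PROTOCOL=TCPIP;UID=user;PWD=secret;SECURITY=SSL;", "", "")

last_rowid = get_last_rowid(production)
new_records = get_latest_records(staging, last_rowid)
insert_records(production, new_records)
```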

Data Pipeline: Apache Airflow

Documentation: Data Pipeline

Airflow is used to create a data pipeline that analyzes the web server log. The process_web_log DAG performs the following steps (see the sketch after this list):

  • Extracts the ipaddress field from the web server log file and saves it into a text file
  • Filters out all occurrences of the IP address 198.46.149.143 from the text file and saves the output to a new text file
  • Loads the data by archiving the transformed text file into a TAR file
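A possible shape of such a DAG, using BashOperator tasks, is sketched below; the file paths, schedule, and shell commands are assumptions rather than the project's exact implementation:

```python
# Hypothetical sketch of the process_web_log DAG using BashOperator tasks.
# File names, paths, schedule, and shell commands are assumptions.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "softcart",
    "start_date": datetime(2023, 1, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="process_web_log",
    default_args=default_args,
    schedule_interval="@daily",
    description="Analyze the web server log",
)

# Extract the ipaddress field (assumed to be the first space-delimited field)
extract_data = BashOperator(
    task_id="extract_data",
    bash_command="cut -d' ' -f1 /home/airflow/accesslog.txt > /home/airflow/extracted_data.txt",
    dag=dag,
)

# Filter out all occurrences of 198.46.149.143
transform_data = BashOperator(
    task_id="transform_data",
    bash_command="grep -v '198.46.149.143' /home/airflow/extracted_data.txt > /home/airflow/transformed_data.txt",
    dag=dag,
)

# Archive the transformed file into a TAR file
load_data = BashOperator(
    task_id="load_data",
    bash_command="tar -cvf /home/airflow/weblog.tar -C /home/airflow transformed_data.txt",
    dag=dag,
)

extract_data >> transform_data >> load_data
```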

Big Data Analytics: Spark

Documentation: Big Data Analytics

Spark is used to analyze the search terms from the e-commerce web server, provided in searchterms.csv. Spark.ipynb loads the data into a Spark DataFrame and queries it to answer questions such as how many times the term "gaming laptop" was searched and what the top 5 most frequently used search terms are. A pre-trained sales forecasting model is also used to predict the sales for 2023.
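A rough sketch of how Spark.ipynb might perform these steps is shown below; the column name searchterm, the file paths, and the use of a Spark ML LinearRegressionModel for the forecast are assumptions:

```python
# Hypothetical sketch of the search-term analysis and sales forecast.
# Column names, file paths, and the model type are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegressionModel

spark = SparkSession.builder.appName("SearchTermAnalysis").getOrCreate()

# Load the search terms into a Spark DataFrame
df = spark.read.csv("searchterms.csv", header=True, inferSchema=True)

# How many times was "gaming laptop" searched? (assumes a 'searchterm' column)
gaming_laptop_count = df.filter(col("searchterm") == "gaming laptop").count()
print(f"'gaming laptop' was searched {gaming_laptop_count} times")

# Top 5 most frequently used search terms
df.groupBy("searchterm").count().orderBy(desc("count")).show(5)

# Load a pre-trained sales forecasting model and predict sales for 2023
# (model path and feature column are assumptions)
model = LinearRegressionModel.load("sales_prediction.model")
year_df = spark.createDataFrame([(2023,)], ["year"])
features_df = VectorAssembler(inputCols=["year"], outputCol="features").transform(year_df)
model.transform(features_df).select("prediction").show()
```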