The objective is to predict 3 months of item-level sales data at different store locations.
This competition is provided as a way to explore different time series techniques on a relatively simple and clean dataset.
You are given 5 years of store-item sales data, and asked to predict 3 months of sales for 50 different items at 10 different stores.
What's the best way to deal with seasonality? Should stores be modeled separately, or can you pool them together? Does deep learning work better than ARIMA? Can either beat xgboost?
This is a great competition to explore different models and improve your skills in forecasting.
The dataset comes from Kaggle and is available at: https://www.kaggle.com/c/demand-forecasting-kernels-only/
- train.csv - Training data
- test.csv - Test data (Note: the Public/Private split is time based)
- sample_submission.csv - a sample submission file in the correct format
- date - Date of the sale data. There are no holiday effects or store closures.
- store - Store ID
- item - Item ID
- sales - Number of items sold at a particular store on a particular date.
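As a minimal sketch of how the files and fields above could be read, the snippet below loads a CSV in the `train.csv` layout with pandas and parses the `date` column. The helper name `load_sales` and the three sample rows are assumptions made for illustration and are not taken from the project. In practice the path to `train.csv` would be passed instead of the in-memory sample.

```python
import io

import pandas as pd


def load_sales(source):
    """Read a store-item sales CSV (path or file-like) and parse the date column."""
    df = pd.read_csv(source, parse_dates=["date"])
    # Sort so that each (store, item) series is in chronological order.
    return df.sort_values(["store", "item", "date"]).reset_index(drop=True)


# Made-up rows in the train.csv layout, for illustration only.
sample_csv = io.StringIO(
    "date,store,item,sales\n"
    "2013-01-02,1,1,11\n"
    "2013-01-01,1,1,13\n"
    "2013-01-01,2,1,12\n"
)

train = load_sales(sample_csv)
print(train.dtypes)
print(train.groupby(["store", "item"])["sales"].sum())
```

Sorting by store, item, and date keeps each of the store-item series contiguous, which tends to simplify later lag or rolling-window features.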
This project follows the steps of CRISP-DM (Cross-Industry Standard Process for Data Mining), a project management method. The method was adapted to the reality of Data Science projects, and the adapted version is called CRISP-DS.
Its main principle is to work through the project in multiple cycles, repeating them as needed.
0.0 - IMPORTS
0.1 - Helper Function
0.2 - Loading Data
1.0 - DESCRIPTION OF DATA
1.1 - Rename Columns
1.2 - Data Dimensions
1.3 - Data Types
1.4 - Check NA
1.5 - Fill Out NA
1.6 - Change Types
1.7 - Descriptive Statistics
- 1.7.1 - Numerical Attributes
- 1.7.2 - Categorical Attributes
2.0 - FEATURE ENGINEERING
2.1 - Creation of Hypotheses
- 2.1.1 - Demographic Hypotheses
- 2.1.2 - Geographic Hypotheses
- 2.1.3 - Sociocultural Hypotheses
2.2 - Final list of Hypotheses
2.3 - Feature Engineering
3.0 - VARIABLE FILTERING
3.1 - Row Filtering
3.2 - Column Selection
4.0 - EXPLORATORY DATA ANALYSIS
4.1 - Univariate Analysis
- 4.1.1 - Response Variable
- 4.1.2 - Numerical Variable
- 4.1.3 - Categorical Variable
4.2 - Bivariate Analysis
- 4.2.1 - Summary of Hypotheses
4.3 - Multivariate Analysis
- 4.3.1 - Numerical Attributes
- 4.3.2 - Categorical Attributes
5.0 - DATA PREPARATION
5.1 - Normalization
5.2 - Rescaling
5.3 - Transformation
- 5.3.1 - Encoding
- 5.3.2 - Response Variable Transformation
- 5.3.3 - Nature Transformation
6.0 - FEATURE SELECTION
6.1 - Split Dataframe into Training and Test Datasets
6.2 - Boruta as Feature Selection
- 6.2.1 - Best Feature from Boruta
7.0 - MACHINE LEARNING MODELLING
7.1 - Average Model
7.2 - Linear Regression Model
- 7.2.1 - Linear Regression Model - Cross Validation
7.3 - Linear Regression Regularized Model
- 7.3.1 - Linear Regression - Lasso - Cross Validation
7.4 - Random Forest Regressor
- 7.4.1 - Random Forest Regressor - Cross Validation
7.5 - XGBoost Regressor
- 7.5.1 - XGBoost Regressor - Cross Validation
7.6 - Compare Models' Performance
- 7.6.1 - Single Performance
- 7.6.2 - Real Performance - Cross Validation
8.0 - HYPERPARAMETER FINE TUNING
8.1 - Random Search
8.2 - Final Model
9.0 - TRANSLATION AND INTERPRETATION OF THE ERROR
9.1 - Business Performance
9.2 - Total Performance
9.3 - Machine Learning Performance
10.0 - DEPLOY MODEL TO PRODUCTION
10.1 - Energy Consumption Class
10.2 - API Handler
10.3 - Tester
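To make two of the steps in the outline concrete, the sketch below shows one possible shape for a time-based split (step 6.1) and an average baseline model (step 7.1). It is not the notebook's actual code: the function names `time_split` and `average_baseline`, the 90-day holdout, the synthetic Poisson data, and the choice of MAE as the error measure are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def time_split(df, holdout_days=90):
    """Hold out the last `holdout_days` days as a validation set."""
    cutoff = df["date"].max() - pd.Timedelta(days=holdout_days)
    return df[df["date"] <= cutoff], df[df["date"] > cutoff]


def average_baseline(train, valid):
    """Predict, for every validation row, the mean training sales of its store-item pair."""
    means = (
        train.groupby(["store", "item"])["sales"]
        .mean()
        .rename("prediction")
        .reset_index()
    )
    return valid.merge(means, on=["store", "item"], how="left")


# Synthetic data standing in for train.csv (2 stores x 2 items, one year).
rng = np.random.default_rng(0)
dates = pd.date_range("2017-01-01", "2017-12-31", freq="D")
frames = [
    pd.DataFrame(
        {"date": dates, "store": s, "item": i, "sales": rng.poisson(20, len(dates))}
    )
    for s in (1, 2)
    for i in (1, 2)
]
df = pd.concat(frames, ignore_index=True)

train, valid = time_split(df)
scored = average_baseline(train, valid)
mae = (scored["sales"] - scored["prediction"]).abs().mean()
print(f"validation rows: {len(valid)}, MAE: {mae:.2f}")
```

Splitting by date rather than at random keeps future observations out of the training set, which mirrors the time-based Public/Private split noted for `test.csv`. A baseline like this gives the regression models in section 7 a reference error to beat.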