This is a basic machine learning project that predicts the next three months of demand for each item in a store, using the previous five years of historical sales data. The problem statement comes from the Kaggle competition "Store Item Demand Forecasting Challenge".
The dataset contains five years of store item sales data. This is time-series data as the sales are dependent on time. It is structured in a tabular format with four columns:
- 1st column: date
- 2nd column: store id
- 3rd column: item id (referring to the id of each item within the store)
- 4th column: the number of units sold (how many times a particular item was sold in a particular store on a particular date)
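As a minimal sketch of how a table with these four columns might be loaded, the snippet below reads a few made-up rows with pandas. The column names `date`, `store`, `item`, and `sales` are assumptions for illustration, as is the helper `load_sales`; in practice you would pass the path of the downloaded CSV file instead of the inline sample.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the downloaded CSV file.
# The column names (date, store, item, sales) are assumed for illustration.
SAMPLE_CSV = """date,store,item,sales
2013-01-01,1,1,13
2013-01-02,1,1,11
2013-01-01,2,1,12
"""


def load_sales(source):
    """Read the sales table and parse the date column as datetimes."""
    return pd.read_csv(source, parse_dates=["date"])


df = load_sales(io.StringIO(SAMPLE_CSV))
print(df.dtypes)
```

Parsing the date column up front makes later time-based operations, such as splitting by date, straightforward.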
Based on five years of sales data from 10 different stores, this project aims to predict the next three months of sales for each individual item in these stores. By analyzing historical trends, the model will forecast future demand to aid in inventory management and sales planning.
- CatBoost: An algorithm for gradient boosting on decision trees. It is a popular machine learning algorithm used in recommendation systems and forecasting.
- Upgini: A Python library that helps build more accurate forecasting models. The data we have is sparse, with only two main fields: the date of sale and the number of sales. That gives the model little information to learn how to predict the sales of various items. Upgini addresses this by automatically searching through thousands of public data sources for relevant features. It then integrates these features with our existing dataset, which can improve the model's performance.
- pandas: Used to load the downloaded CSV file into a DataFrame, which is then fed into our model.
- Google Colab
- Jupyter Notebook
- Any other IDE you like
- catboost: `pip install catboost`
- upgini: `pip install upgini` (only works with Python versions >= 3.7 and < 3.10)
- pandas: `pip install pandas`
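Because Upgini is stated above to require a specific Python version range, a quick check before installing can save a failed setup. The following is a small sketch; the function name `upgini_supported` is hypothetical, and the bounds simply restate the range given above.

```python
import sys


def upgini_supported(version_info=sys.version_info):
    """Return True if the interpreter is in the stated range (>= 3.7 and < 3.10)."""
    major_minor = tuple(version_info[:2])
    return (3, 7) <= major_minor < (3, 10)


if not upgini_supported():
    # Only a warning: the range comes from this README, not from a live check.
    print("Warning: this Python version is outside the range listed for upgini.")
```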
- Install the libraries.
- Download the dataset and prepare the input data.
- Split the dataset into train and test sets.
- Split each set into features (input values) and labels (what we want to predict).
- Enrich the features using the Upgini library to get relevant features and their corresponding SHAP values (a SHAP value indicates how much a feature contributes to the prediction).
- Define the CatBoost model.
- Add the new features to the original dataset.
- Evaluate the model's performance on:
  - the original dataset without any enrichment
  - the newly formed enriched dataset
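The two splitting steps above can be sketched with pandas alone. This is an illustration under assumptions, not the project's exact code: the toy data, the column names, the cutoff date, and the helpers `time_split` and `split_features_label` are all hypothetical, and a chronological split is assumed because the data is a time series.

```python
import pandas as pd

# Hypothetical toy data standing in for the real sales table.
df = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=10, freq="D").repeat(2),
    "store": [1, 2] * 10,
    "item": [1] * 20,
    "sales": range(20),
})


def time_split(frame, cutoff):
    """Rows dated before the cutoff go to train, the rest go to test."""
    cutoff = pd.Timestamp(cutoff)
    train = frame[frame["date"] < cutoff]
    test = frame[frame["date"] >= cutoff]
    return train, test


def split_features_label(frame, label="sales"):
    """Separate the input columns from the column we want to predict."""
    return frame.drop(columns=[label]), frame[label]


train, test = time_split(df, "2017-01-08")
X_train, y_train = split_features_label(train)
X_test, y_test = split_features_label(test)
print(len(X_train), len(X_test))
```

Splitting by date rather than at random keeps every test row later than every training row, which mirrors the forecasting task of predicting future months from past ones.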
You might encounter an error while adding new features to the original dataset if the random sample selected in step 2 is larger than 10,000 rows. This is due to the row limit imposed by the trial version of Upgini, which only allows you to enrich up to 10,000 rows. To resolve this, you have a few options:
- Reduce the number of rows to 10,000 or fewer: You can sample a smaller subset of your dataset for enrichment to stay within the trial limits.
- Upgrade your Upgini account: Consider upgrading to a paid plan if you need to enrich more than 10,000 rows.
- Bypass the enrichment for larger datasets: If you don't need the enrichment, you can proceed without it.
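The first option can be sketched as follows, assuming the data is already in a pandas DataFrame. The helper `cap_rows` and the constant `MAX_TRIAL_ROWS` are hypothetical names; the limit value is the one stated above.

```python
import pandas as pd

MAX_TRIAL_ROWS = 10_000  # row limit described above for the trial version


def cap_rows(frame, limit=MAX_TRIAL_ROWS, seed=0):
    """Return at most `limit` rows, sampled at random without replacement."""
    if len(frame) <= limit:
        return frame
    return frame.sample(n=limit, random_state=seed)


# Hypothetical oversized dataset used only to demonstrate the cap.
big = pd.DataFrame({"sales": range(25_000)})
small = cap_rows(big)
print(len(small))
```

Fixing `random_state` keeps the sample reproducible, so repeated runs enrich the same rows.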