A basic machine learning project to predict store item demand for the next three months using five years of historical data.

SnPreethi/Sales_Forecasting_And_Data_Enrichment

SALES FORECASTING & DATA ENRICHMENT

This is a basic machine learning project that predicts the demand for each item in a store over the next three months, using five years of historical sales data. The problem statement comes from the Kaggle competition "Store Item Demand Forecasting Challenge".

WHAT DOES THE DATASET CONTAIN?

The dataset contains five years of store item sales data. This is time-series data as the sales are dependent on time. It is structured in a tabular format with four columns:

  • 1st column: date
  • 2nd column: store id
  • 3rd column: item id (referring to the id of each item within the store)
  • 4th column: the number of units sold (that is, how many times item X was sold in store Y on date Z)
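
As a minimal sketch of how a table with this layout could be loaded, the snippet below reads a tiny inline sample with pandas. The column names `date`, `store`, `item`, and `sales`, the sample rows, and the helper name `load_sales` are assumptions made for illustration; the real data would come from the competition's CSV file.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the four-column layout described above.
# Column names and values are assumptions, not taken from the real file.
SAMPLE_CSV = """date,store,item,sales
2013-01-01,1,1,13
2013-01-02,1,1,11
2013-01-01,2,1,12
"""


def load_sales(source) -> pd.DataFrame:
    """Read the sales table and parse the date column as datetimes."""
    df = pd.read_csv(source, parse_dates=["date"])
    # Sort chronologically so that later time-based splits are straightforward.
    return df.sort_values(["date", "store", "item"]).reset_index(drop=True)


df = load_sales(io.StringIO(SAMPLE_CSV))
print(df.dtypes)
```

Passing a file path instead of the `io.StringIO` object should work the same way, since `pd.read_csv` accepts both.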

WHAT ARE WE GOING TO DO WITH THIS DATASET?

Based on five years of sales data from 10 different stores, this project aims to predict the next three months of sales for each individual item in these stores. By analyzing historical trends, the model will forecast future demand to aid in inventory management and sales planning.

WHAT ARE WE USING TO SOLVE THIS PROBLEM?

  1. CATBOOST: An algorithm for gradient boosting on decision trees. It is a popular machine learning algorithm, used for tasks such as recommendation systems and forecasting.
  2. UPGINI: Upgini is a Python library for data enrichment that helps improve the accuracy of forecasting models. The data we have is sparse, with only two main fields: the date of sales and the number of sales. That is not much information for a machine learning model to learn how to predict the sales of various items. Upgini addresses this by automatically searching through thousands of public data sources for the most relevant features. It then integrates these features with our existing dataset, which can improve the model's performance.
  3. PANDAS: Used to handle dataframes: we download the CSV file, load it into a pandas dataframe, and then feed it into our model.

IDE TO USE?

  • Google Colab
  • Jupyter Notebook
  • Any other IDE you like

HOW TO INSTALL THE LIBRARIES?

  • catboost - pip install catboost
  • upgini - pip install upgini (only works with Python versions >= 3.7 and < 3.10)
  • pandas - pip install pandas
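
Because the upgini entry above quotes a Python version range, a small check before installing may save some confusion. This is only a sketch: the helper name `python_supported_for_upgini` is hypothetical, and the bounds are copied from the list above and may be outdated for newer upgini releases.

```python
import sys


def python_supported_for_upgini(version=None) -> bool:
    """Return True if the (major, minor) version falls in [3.7, 3.10)."""
    major, minor = (version or sys.version_info)[:2]
    # Bounds copied from the install list above; they may be outdated.
    return (3, 7) <= (major, minor) < (3, 10)


if not python_supported_for_upgini():
    print("Warning: this Python version is outside the range quoted for upgini.")
```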

TASK TYPE - REGRESSION

STEPS

  1. Install the libraries.
  2. Download the dataset and prepare the input data.
  3. Split the dataset into test and train sets.
  4. Split each set into features (input values) and labels (what we want to predict).
  5. Enrich the features using the upgini library to get relevant new features and their corresponding SHAP values (a SHAP value indicates how influential a feature is for the prediction).
  6. Define the CatBoost model.
  7. Add the new features to the original dataset.
  8. Evaluate the model's performance on:
    • original dataset without any enrichment
    • newly formed enriched dataset
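
The steps above can be sketched roughly as follows. This is not the project's actual notebook. The column names (`date`, `sales`), the helper names, and the choice of a date cutoff for the split are assumptions. The upgini calls follow the pattern from Upgini's published quick start as best understood here, and their exact signatures may differ between versions. The enrichment part needs network access, so its imports are deferred into the function.

```python
import pandas as pd


def split_by_date(df: pd.DataFrame, cutoff: str):
    """Step 3: rows before `cutoff` form the train set, the rest the test set."""
    cutoff_ts = pd.Timestamp(cutoff)
    train = df[df["date"] < cutoff_ts]
    test = df[df["date"] >= cutoff_ts]
    return train, test


def to_features_and_label(df: pd.DataFrame, label: str = "sales"):
    """Step 4: separate the input columns from the column to predict."""
    return df.drop(columns=[label]), df[label]


def compare_with_enrichment(train_X, train_y, test_X, test_y):
    """Steps 5 to 8: enrich, define the model, and score both datasets."""
    # Deferred imports: these steps need catboost, upgini, and network access.
    from catboost import CatBoostRegressor
    from upgini import FeaturesEnricher, SearchKey
    from upgini.metadata import CVType

    # Step 5: search for external features keyed on the date column.
    enricher = FeaturesEnricher(
        search_keys={"date": SearchKey.DATE},
        cv=CVType.time_series,
    )
    enricher.fit(train_X, train_y, eval_set=[(test_X, test_y)])

    # Step 6: a CatBoost regressor with mostly default settings.
    model = CatBoostRegressor(
        verbose=False, allow_writing_files=False, random_state=0
    )

    # Steps 7 and 8: metrics reported with and without the new features
    # (assumed behaviour of calculate_metrics; check your upgini version).
    return enricher.calculate_metrics(
        train_X, train_y,
        eval_set=[(test_X, test_y)],
        estimator=model,
        scoring="mean_absolute_percentage_error",
    )
```

A date cutoff is used instead of a random split so that the test rows are always later than the training rows, which mirrors how a forecast would be used in practice.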

NOTE

One might encounter an error while adding new features to the original dataset if the random sample selected in step 2 contains more than 10,000 rows. This is due to the row limit imposed by the trial version of Upgini, which only allows you to enrich up to 10,000 rows. To resolve this, you have a few options:

  • Reduce the number of rows to 10,000 or fewer: You can sample a smaller subset of your dataset for enrichment to stay within the trial limits.
  • Upgrade your Upgini account: Consider upgrading to a paid plan if you need to enrich more than 10,000 rows.
  • Bypass the enrichment for larger datasets: If you don't need the enrichment, you can proceed without it.
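
For the first option, a minimal sketch of capping the sample size with pandas is shown below. The helper name `sample_for_trial` and the seed are hypothetical; the 10,000 figure is the limit quoted in the note above.

```python
import pandas as pd

MAX_TRIAL_ROWS = 10_000  # row limit quoted in the note above


def sample_for_trial(
    df: pd.DataFrame, max_rows: int = MAX_TRIAL_ROWS, seed: int = 0
) -> pd.DataFrame:
    """Return df unchanged if it fits the limit, else a random sample."""
    if len(df) <= max_rows:
        return df
    # sort_index() restores the original row order after random sampling.
    return df.sample(n=max_rows, random_state=seed).sort_index()
```

Fixing `random_state` keeps the sample reproducible between runs, which makes the comparison between the original and enriched datasets easier to repeat.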
