This is a customer segmentation project based on the Customer Segmentation Classification Dataset from Kaggle.
Customer segmentation is the process of dividing customers into groups based on common characteristics so that companies can market to each group effectively and appropriately. It is an important tool for businesses to identify their most valuable customers and develop targeted marketing campaigns to increase sales and customer loyalty.
Shown below is the organization of the project:
├── README.md <- The top-level documentation for this project.
├── data
│ ├── processed <- The final data sets for customer segmentation.
│ └── raw <- The original, immutable datasets.
│
├── images <- The images used in the README documentation
├── notebooks <- Jupyter notebooks containing the explorations performed in this project
├── requirements.txt <- The requirements file for reproducing the project
├── src <- Source code used in this project.
Following this structure, the code files and their descriptions are listed below:
- notebooks/handling_missing_values.ipynb: notebook containing data cleaning and preprocessing, specifically the handling of columns with missing values
- notebooks/eda_on_segments.ipynb: notebook containing exploratory data analysis and segment description generation
- notebooks/feature_engineering.ipynb: notebook containing the creation of new features for training
- src/app.py: Streamlit web application for the customer segmentation model
- src/utils.py: contains all helper functions for cleaning, preprocessing, training, and deployment
- src/train.py: code for training and selecting the best customer segmentation model
- models/model.joblib: best trained model (gradient boosting classifier)
The dataset contains the following features:
- ID: Customer ID
- Gender: Gender of the customer
- Age: Age of the customer
- Spending Score: Score assigned based on customer behavior and spending nature
- Family Size: Number of family members of the customer
- Graduated: Whether the customer has graduated or not
- Profession: Profession of the customer
- Work Experience: Work experience of the customer in years
- Var_1: Anonymised category for the customer
- Segmentation: (target) Customer segment
- Data cleaning: Missing values were dropped in columns where they made up less than 3% of the data, and KNN imputation was applied to columns with more than 3% missing data. The missingno library showed no apparent correlation between missing values across columns, so each column was handled independently.
- Feature engineering: The main change was the addition of an Other profession category that groups professions with less than 5% of total observations. The Other category also provides that input option in the Streamlit deployment, because in practice there are many different professions and the model should be able to handle them.
- Encoding and scaling: Numerical features were scaled with the MinMax Scaler because all of them have skewed distributions. Categorical features were encoded with three schemes: one-hot, label, and ordinal encoding. The column order after encoding and scaling was recorded, because inputs must follow the same order during deployment.
- Training and tuning: Six classifiers were trained: logistic regression, KNN, decision tree, naive Bayes, random forest, and gradient boosting. The best-performing model in terms of accuracy was selected and tuned via GridSearchCV, then saved using joblib.
- Deployment: Inputs are taken from the user, the same encoding and scaling schemes are applied, and a dataset with shape (1, 16) is generated for prediction. The model, loaded with joblib, then predicts the customer segment.
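The cleaning, feature engineering, and scaling steps above can be sketched as follows. This is a minimal illustration rather than the code in src/utils.py: the function names, the column names with underscores, the number of KNN neighbors, and the toy data are all assumptions. Only the 3% and 5% thresholds, KNN imputation, the Other category, and MinMax scaling come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

DROP_THRESHOLD = 0.03  # below 3% missing: drop the affected rows
RARE_THRESHOLD = 0.05  # below 5% of observations: group into "Other"
NUMERIC_COLS = ["Age", "Work_Experience", "Family_Size"]  # assumed column names


def handle_missing(df, numeric_cols, n_neighbors=5):
    """Drop rows for sparsely missing columns, then KNN-impute numeric columns."""
    missing_share = df.isna().mean()
    drop_cols = [c for c in df.columns if 0 < missing_share[c] < DROP_THRESHOLD]
    if drop_cols:
        df = df.dropna(subset=drop_cols)
    df = df.copy()
    if df[numeric_cols].isna().any().any():
        imputer = KNNImputer(n_neighbors=n_neighbors)  # n_neighbors is an assumption
        df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
    return df


def group_rare(series, threshold=RARE_THRESHOLD, other="Other"):
    """Replace categories whose share of observations is below the threshold."""
    share = series.value_counts(normalize=True)
    rare = share[share < threshold].index
    return series.where(~series.isin(rare), other)


def preprocess(df, numeric_cols):
    """Apply cleaning, rare-category grouping, and MinMax scaling in order."""
    df = handle_missing(df, numeric_cols)
    df["Profession"] = group_rare(df["Profession"])
    df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])
    return df


def demo_frame():
    """Toy data (not the Kaggle dataset) with 2% and 10% missing columns."""
    n = 100
    df = pd.DataFrame({
        "Age": np.linspace(18, 80, n),
        "Work_Experience": np.tile(np.arange(10, dtype=float), 10),
        "Family_Size": np.tile([1.0, 2.0, 3.0, 4.0], 25),
        "Profession": ["Artist"] * 60 + ["Doctor"] * 37 + ["Pilot"] * 3,
    })
    df.loc[[0, 1], "Family_Size"] = np.nan     # 2% missing: rows are dropped
    df.loc[10:19, "Work_Experience"] = np.nan  # 10% missing: values are imputed
    return df


clean = preprocess(demo_frame(), NUMERIC_COLS)
```

In a deployment setting, the fitted scaler and the final column order would need to be stored and reused for user inputs, so that a single input row is transformed exactly as the training data was.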
To run the application, first create a virtual environment. I used miniconda as my virtual environment manager and created an environment with the following commands:
conda create --name segmentation python=3.9
conda activate segmentation
The next step is to clone the repository in the virtual environment by running:
HTTPS: git clone https://github.com/gersongerardcruz/customer_segmentation.git
SSH: git clone git@github.com:gersongerardcruz/customer_segmentation.git
Then, move into the repository and install the requirements with:
cd customer_segmentation
pip install -r requirements.txt
Finally, deploy the app locally via Streamlit by moving into the src directory and running:
cd src
streamlit run app.py
The streamlit app should look like this:
If you want to train a new model after updating the training data, use the following command while still in the src folder:
python3 train.py
This command will train and tune a new model based on the best-scoring classifier and update the model.joblib file in the models directory.
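The select-then-tune flow can be sketched as below. It is a hedged illustration, not the contents of train.py: the parameter grids, the 3-fold cross-validation, the synthetic 16-feature data, and the temporary output path are assumptions made so the example runs on its own. The six classifier families, accuracy as the selection metric, GridSearchCV, and joblib come from the project description.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def candidates():
    """Six classifiers, each paired with a small, hypothetical tuning grid."""
    return {
        "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}),
        "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 7]}),
        "decision_tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [3, 6]}),
        "naive_bayes": (GaussianNB(), {"var_smoothing": [1e-9, 1e-7]}),
        "random_forest": (
            RandomForestClassifier(n_estimators=50, random_state=0),
            {"max_depth": [4, None]},
        ),
        "gradient_boosting": (
            GradientBoostingClassifier(n_estimators=50, random_state=0),
            {"learning_rate": [0.05, 0.1]},
        ),
    }


def select_and_tune(X, y, cv=3):
    """Pick the classifier with the best mean CV accuracy, then grid-search it."""
    models = candidates()
    scores = {
        name: cross_val_score(est, X, y, cv=cv, scoring="accuracy").mean()
        for name, (est, _) in models.items()
    }
    best_name = max(scores, key=scores.get)
    estimator, grid = models[best_name]
    search = GridSearchCV(estimator, grid, cv=cv, scoring="accuracy")
    search.fit(X, y)
    return best_name, search.best_estimator_, scores


# Synthetic stand-in for the encoded training data: 16 features, 4 segments.
X, y = make_classification(
    n_samples=200, n_features=16, n_informative=6, n_classes=4, random_state=0
)
best_name, model, scores = select_and_tune(X, y)

# Save and reload with joblib; a temporary path is used here instead of models/.
model_path = Path(tempfile.mkdtemp()) / "model.joblib"
joblib.dump(model, model_path)
reloaded = joblib.load(model_path)
prediction = reloaded.predict(X[:1])  # one input row of shape (1, 16)
```

Selecting on untuned cross-validation scores and tuning only the winner keeps the search cheap, at the cost of possibly overlooking a classifier that would have won after tuning.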
- Collect better and more distinguishing data on customer behavior and preferences to improve segmentation accuracy.
- Explore additional features that could provide valuable insights into customer behavior and preferences, such as purchase history, social media activity, and customer feedback.
Customer segmentation is an important tool for businesses to effectively target and engage their most valuable customers. This project demonstrates the use of machine learning algorithms to segment customers based on their demographic and behavioral characteristics. With more distinguishing and improved data, this model can be further refined to provide even more accurate customer segmentation.