Giftstore Shop needs our help to better understand its customers. Let's explore unsupervised machine learning techniques to analyze its customer segments and share helpful conclusions about those customers with Giftstore Shop's marketing team, who will use this information to improve their marketing campaigns. We clean the data and aggregate it to the customer level, then apply scaling so that features measured on very different scales do not dominate the models. Lastly, we compare KMeans, DBSCAN, and Spectral Clustering to see which model best fits the data. We find that DBSCAN detected the customer segmentation better than the other two models.
Customer retention & engagement is key in all business ventures. But in online retail, it is critical! In this repo, I pretend that I am a data scientist hired by UK-based Giftstore Shop's marketing team to analyze & understand customer insights that will help enhance their marketing campaigns. The marketing team wants to increase website traffic and, by proxy, purchases made on the online retail site.
We understand that Giftstore Shop is an online retailer located in the United Kingdom. They mainly sell "unique" gifts for all occasions: Birthdays, Holidays, etc. The data provided by Giftstore Shop's marketing team consists of online retail data describing each transaction and the customer tagged to it. Their marketing team has tasked me with identifying overall customer segments and the behaviors that these groups share so that the marketing team can tailor campaigns to their customers.
This repo is part of the work completed within UMBC's DATA602 Course: Intro to Data Analysis and Machine Learning.
In this project, I attempt to achieve the following:
- Data Preparation: Cleaning & aggregating data to ensure that the data provided does not contain any nulls or outliers before adding to unsupervised learning models.
- Deriving New Features From Dataset: Since this is a customer segmentation exercise, I created additional features to capture customer engagement with Giftstore Shop, including aggregations and categorical labels that describe customer purchasing behaviors.
- Comparing Different Unsupervised Learning Models: Comparing KMeans, DBSCAN, and Spectral Clustering and analyzing how each model differentiates cluster patterns in the dataset.
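The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the transactions are synthetic, the feature names (`n_invoices`, `total_quantity`, `total_spend`) are hypothetical, and the hyperparameters (`n_clusters=3`, `eps=0.8`, `min_samples=5`) are placeholder assumptions rather than the values used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN, KMeans, SpectralClustering
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for cleaned transactions (NOT the real dataset).
rng = np.random.default_rng(0)
n_rows = 600
tx = pd.DataFrame({
    "CustomerID": rng.integers(1, 61, size=n_rows),
    "InvoiceNo": rng.integers(1000, 1200, size=n_rows),
    "Quantity": rng.integers(1, 20, size=n_rows),
    "UnitPrice": rng.uniform(0.5, 10.0, size=n_rows).round(2),
})
tx["LineTotal"] = tx["Quantity"] * tx["UnitPrice"]

# Aggregate transaction rows to one row per customer.
# The feature names here are hypothetical examples.
customers = tx.groupby("CustomerID").agg(
    n_invoices=("InvoiceNo", "nunique"),
    total_quantity=("Quantity", "sum"),
    total_spend=("LineTotal", "sum"),
)

# Scale so that no single feature dominates distance computations.
X = StandardScaler().fit_transform(customers)

# Placeholder hyperparameters, chosen only so the example runs.
models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=0.8, min_samples=5),
    "SpectralClustering": SpectralClustering(n_clusters=3, random_state=0),
}

# Fit each model on the same scaled features and keep its labels.
labels = {name: model.fit_predict(X) for name, model in models.items()}
for name, lab in labels.items():
    print(name, "cluster sizes:", pd.Series(lab).value_counts().to_dict())
```

Note that DBSCAN labels noise points as `-1`, so its output is not directly comparable to the fixed three clusters requested from the other two models.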
UCI Machine Learning Repository: Online Retail Data Set
This dataset contains 541,909 transactions with 8 attributes describing each instance. The transactions occurred between 01/12/2010 and 09/12/2011 for a UK-based and registered, online-only retail company.
A few notes on the company:
- Mainly sells unique all-occasion gifts
- Many of its customers are wholesalers
Attribute Information:
- InvoiceNo: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
- StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated.
- UnitPrice: Unit price. Numeric. Product price per unit in sterling pounds (£).
- CustomerID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal. The name of the country where a customer resides.
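Given the attributes above, a cleaning pass might look something like the sketch below. It assumes the column names listed in the attribute information; the sample rows are made up for illustration, and the specific rules (dropping rows with no `CustomerID`, dropping cancellations, keeping only positive `Quantity` and `UnitPrice`) are assumptions about a reasonable cleaning step, not a record of what the notebook does. The helper name `clean_transactions` and the `LineTotal` column are hypothetical.

```python
import pandas as pd

def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning helper: drop anonymous rows and cancellations."""
    # Rows without a CustomerID cannot be aggregated to a customer.
    out = df.dropna(subset=["CustomerID"]).copy()

    # A leading 'c'/'C' on InvoiceNo marks a cancellation.
    # Cast to str first, since the column may be read with mixed types.
    is_cancelled = out["InvoiceNo"].astype(str).str.upper().str.startswith("C")
    out = out[~is_cancelled]

    # Assumption: treat non-positive quantities or prices as invalid rows.
    out = out[(out["Quantity"] > 0) & (out["UnitPrice"] > 0)]

    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"])
    out["LineTotal"] = out["Quantity"] * out["UnitPrice"]
    return out.reset_index(drop=True)

# Made-up sample rows, for illustration only.
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536366", "536367"],
    "StockCode": ["85123", "22000", "22633", "84879"],
    "Quantity": [6, -1, 2, 3],
    "InvoiceDate": ["2010-12-01 08:26", "2010-12-01 09:41",
                    "2010-12-01 08:28", "2010-12-01 08:34"],
    "UnitPrice": [2.55, 27.50, 1.85, 4.25],
    "CustomerID": [17850.0, 14527.0, None, 13047.0],
})

cleaned = clean_transactions(sample)
print(cleaned[["InvoiceNo", "CustomerID", "LineTotal"]])
```

In this sample, the cancelled invoice and the row with a missing `CustomerID` are removed, leaving two rows.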
After we fit our clustering algorithms to our engineered datasets, we can lower the dimensionality of the data onto a 2D plane to get a better understanding of our customer clusters.
We reduce dimensionality with PCA. Note that PCA is applied AFTER the models are fit, so it affects only the visualization and not how the models interpret the data. In the projection, most customers are clustered together, with some trailing noise to the right of the visual. This could possibly be interpreted as two separate perpendicular clusters, but in this analysis we treat it as one single cluster (for future analysis, we might want to look at other clustering algorithms that can detect perpendicular clusters):
- We observe that KMeans broke the single cluster into 3 distinct regions, splitting the primary cluster into 3 equal parts;
- DBSCAN understood that most customers belong to a single cluster (great job, DBSCAN!);
- Spectral Clustering identified that most customers belong to a single segment but still broke off the primary cluster into 2 larger groups to the right of the visual. I consider this the 'happy medium' between KMeans & DBSCAN, where there is still some presence of smaller clusters but not as much as with KMeans.
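The "cluster first, project afterwards" order described above can be illustrated with a short sketch. The feature matrix is random synthetic data and KMeans with `n_clusters=3` is used only as an example model; neither reflects the project's real features or settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, skewed features standing in for customer-level data.
rng = np.random.default_rng(42)
features = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 5))
X = StandardScaler().fit_transform(features)

# 1) Cluster in the full scaled feature space.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# 2) Project to 2D afterwards, purely for plotting.
#    The model never sees the PCA coordinates.
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(X)

explained = pca.explained_variance_ratio_.sum()
print(f"2D projection keeps {explained:.0%} of the variance")
```

The columns `coords[:, 0]` and `coords[:, 1]` can then be passed to a scatter plot colored by `labels`. Because the projection discards some variance, clusters that are well separated in the full space may still overlap on the 2D plane.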
Here are the outcomes of each clustering algorithm shown in the visuals below:
As we observe, KMeans & Spectral Clustering tried to split the large cluster of customers based on customers who are new, who bought few products, and who made minimal purchases on the site. DBSCAN detected that most customers generally have the same spending patterns and lumped almost all customers into a single cluster, which aligns with the visuals where we observe one distinct cluster. This also aligns with the general observation that most customers are wholesalers.
We can observe each cluster's behavior against our continuous features, demonstrating how each model interpreted customer segmentation. As stated above, KMeans & Spectral Clustering favored customers who did not buy a lot from the site & who were newer; DBSCAN determined that generally the spending patterns of customers are the same, which aligns with the primary customer cluster:
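One simple way to compare clusters against continuous features is a per-cluster summary table. The sketch below uses a tiny hand-made table; the column names (`total_spend`, `n_invoices`, `days_since_first_purchase`) and the cluster labels are hypothetical and do not come from the project's results.

```python
import pandas as pd

# Hand-made example: six customers with an already-assigned cluster label.
customers = pd.DataFrame({
    "total_spend": [120.0, 80.0, 4500.0, 5200.0, 60.0, 95.0],
    "n_invoices": [1, 1, 24, 30, 2, 1],
    "days_since_first_purchase": [30, 12, 350, 365, 20, 45],
    "cluster": [0, 0, 1, 1, 0, 0],
})

# Summarize each continuous feature per cluster.
profile = customers.groupby("cluster").agg(
    n_customers=("total_spend", "size"),
    mean_spend=("total_spend", "mean"),
    mean_invoices=("n_invoices", "mean"),
    mean_tenure_days=("days_since_first_purchase", "mean"),
)
print(profile)
```

In this toy table, cluster 0 holds the newer, low-spend customers (mean spend 88.75 over 4 customers) and cluster 1 the long-standing, high-spend ones (mean spend 4850.0 over 2 customers), which is the kind of contrast the paragraph above describes.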
DBSCAN is recommended for further analysis; I would like to compare DBSCAN to other clustering algorithms that can detect clusters that are perpendicular to one another on a 2D plane.
So what's next? KMeans singled out newer customers who made few purchases, so I suggest that the marketing team run a campaign to draw more attention to the online retail site in general, so that they can continue to attract new customers and additional revenue streams. However, the clusters also indicate that retention could be improved, so I recommend an additional campaign aimed at their long-term wholesale customers. Perhaps provide discounts for these returning customers so that they come back to purchase again?
The marketing team is now equipped with information to help drive new campaigns that will hopefully meet their primary customer targets! This way, Giftstore Shop can continue to find new customers while also focusing on current customer retention (since these folks spend a ton!)
This assignment contains 3 primary areas:
- Dataset in Repo: Local copy of the original dataset from the UCI Machine Learning Repository.
- Summary and Report: Jupyter Notebook including a detailed abstract on problems in assignment, code relevant to project, and visualizations supporting the completion of the project.
- Code: Area to perform testing of dataset, functions, and implement models before final project output.
Contributors : Lee Whieldon
Languages : Python
Tools/IDE : Anaconda
Libraries : pandas, numpy, matplotlib, seaborn, sklearn, yellowbrick, os
Assignment Submitted : November 2020