Analysing survey results can be challenging. Customers often have similar experiences, which leads groups of individuals to select the same survey responses. However, they won't all answer the same way, because people care about different things. For example, some go to a restaurant for the atmosphere whilst others go primarily for the food. Clustering can help us find the general consensus within each group of customers.
Clustering is an unsupervised machine learning method. Unsupervised techniques don't require labelled data (we humans don't need to teach the model directly). This notebook focuses on clustering customer survey responses from an airline. The same analysis could be used on any survey that uses a numerical scale.
A real-life scenario makes the randomly generated data more meaningful and easier to follow.
An airline company would like to analyse survey results at scale. This could help the company identify loyal customers and improve the retention of non-loyal customers.
An airline has been collecting feedback from its customers through its app. The executive team needs to understand the responses in order to improve the airline. Please note that every response must be on the same scale:
0 - Strongly Disagree
1 - Disagree
2 - Neutral
3 - Agree
4 - Strongly Agree
Lastly, you must include the fundamental question: "I would return to "ABC" in the future"
I randomly generated the data because I wanted to test whether this analysis would yield insights before carrying out an actual survey on a large number of people.
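As a minimal sketch of how such data could be generated (not necessarily how the notebook does it), the snippet below draws uniform random answers on the 0 to 4 scale. The question labels other than the "return" question, the function name `make_survey` and the respondent count are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical question labels; only the last one comes from this README.
QUESTIONS = [
    "The staff were friendly",
    "The seat was comfortable",
    "The flight departed on time",
    'I would return to "ABC" in the future',
]


def make_survey(n_respondents: int = 500, seed: int = 0) -> pd.DataFrame:
    """Return random answers on the 0 (Strongly Disagree) to 4 (Strongly Agree) scale."""
    rng = np.random.default_rng(seed)
    # integers() excludes the upper bound, so high=5 yields values 0..4.
    answers = rng.integers(low=0, high=5, size=(n_respondents, len(QUESTIONS)))
    return pd.DataFrame(answers, columns=QUESTIONS)


survey = make_survey()
print(survey.head())
```

Uniform random answers have no built-in group structure, so any clusters found in data like this are a test of the pipeline rather than a finding about customers.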
To make this step dynamic in the future, we should write code that automatically selects a suitable number of clusters. Currently we plot a graph (the elbow plot) and manually select the best number of clusters. Instead, the code would approximate the correct value of K from the same curve, removing the manual intervention.
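One possible way to automate the choice is sketched below. It is an assumption, not something the notebook currently implements: fit KMeans for a range of K, then pick the point on the inertia curve that lies furthest from the straight line joining the curve's two ends. The helper names `inertia_curve` and `elbow_k` are hypothetical, and the demo data are three synthetic groups chosen so that an elbow exists.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_curve(X, k_max=10, seed=0):
    """Fit KMeans for k = 1..k_max and return (ks, inertias)."""
    ks = np.arange(1, k_max + 1)
    inertias = np.array([
        KMeans(n_clusters=int(k), n_init=10, random_state=seed).fit(X).inertia_
        for k in ks
    ])
    return ks, inertias


def elbow_k(ks, inertias):
    """Return the k whose point is furthest from the chord joining the curve's ends."""
    # Rescale both axes to [0, 1] so neither dominates the distance.
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (inertias - inertias.min()) / (inertias.max() - inertias.min())
    # Perpendicular distance from each point to the line through the end points.
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    distance = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(ks[np.argmax(distance)])


# Synthetic demo: three well-separated groups of respondents (made-up centres).
rng = np.random.default_rng(42)
centres = np.array([[0, 0, 4, 4], [2, 2, 2, 2], [4, 4, 0, 0]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.3, size=(100, 4)) for c in centres])

ks, inertias = inertia_curve(X)
best_k = elbow_k(ks, inertias)
print("Suggested number of clusters:", best_k)
```

This heuristic is only an approximation of what the eye does on the elbow plot. On data without clear groups the curve has no sharp bend, so it is worth keeping the plot as a sanity check.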
An example dashboard I have created can be seen below:
- Import relevant packages
- Create the raw data (dataframe)
- Unsupervised Learning (KMeans) - Elbow Method to determine K
- Unsupervised Learning (KMeans) - Predict Cluster
- Output clusters into dataframe
- Visualise results
- Export to Power BI and visualise results
- Discussion
- Conclusion
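The clustering and output steps listed above could look roughly like the sketch below. The column names, the choice of `K = 3` and the helper `export_for_power_bi` are assumptions for illustration; the notebook itself may differ.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical short column names for four survey questions.
columns = ["staff", "comfort", "punctuality", "would_return"]

# Random answers on the 0..4 scale stand in for real survey responses.
rng = np.random.default_rng(0)
responses = pd.DataFrame(rng.integers(0, 5, size=(300, 4)), columns=columns)

K = 3  # assumed here; in practice taken from the elbow plot
model = KMeans(n_clusters=K, n_init=10, random_state=0)

# Predict a cluster for every respondent and store it in the dataframe.
responses["cluster"] = model.fit_predict(responses[columns])

# Average answer per question within each cluster, useful for describing groups.
profile = responses.groupby("cluster")[columns].mean().round(2)
print(profile)


def export_for_power_bi(df: pd.DataFrame, path: str) -> None:
    """Write the clustered responses to a CSV file that Power BI can import."""
    df.to_csv(path, index=False)
```

Calling `export_for_power_bi(responses, "survey_clusters.csv")` would write the file; the filename is only an example.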
To view the notebook in your browser, follow this link: https://github.com/VirajVaitha123/Survey-Analysis---Powered-by-Unsupervised-Learning/blob/master/Customer%20Feedback%20Analytics%20-%20Unsupervised%20Learning.ipynb
Alternatively, to interact with the code please follow the steps below:
Step 1: Download required files
git clone https://github.com/VirajVaitha123/Survey-Analysis---Powered-by-Unsupervised-Learning.git
Step 2: Create the virtual environment
- Run the following command from the cloned repository directory to create the environment with the relevant dependencies
conda env create -f DataScience.yml
Step 3: Access notebook in Jupyter Notebook
jupyter notebook
- Open and edit the notebook
Copy and paste the notebook link below into the nbviewer website: https://github.com/VirajVaitha123/Survey-Analysis---Powered-by-Unsupervised-Learning/blob/master/Customer%20Feedback%20Analytics%20-%20Unsupervised%20Learning.ipynb
TO DO
- Data is randomly generated and not representative of a real situation; this should be adjusted
- One question asks "Were there any delays?". This is the one question where a positive score reflects negatively. Each question should be on a scale from 0 = negative to 4 = positive.
- Plotly box plot would look more attractive
- Discussion and conclusion: there are no comments on my analysis, so readers would not be able to see the outcomes of the algorithm. It's important to add this!
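For the delays item above, one common fix is to reverse-score the negatively worded question before clustering, so that a high value always means a positive experience. The sketch below assumes the 0 to 4 scale listed earlier; the column names and sample answers are made up for illustration.

```python
import pandas as pd

SCALE_MAX = 4  # top of the 0..4 agreement scale

# Made-up sample answers; "any_delays" is the negatively worded question.
responses = pd.DataFrame({
    "staff_friendly": [4, 3, 1],
    "any_delays": [0, 4, 2],
})


def reverse_score(series: pd.Series, scale_max: int = SCALE_MAX) -> pd.Series:
    """Flip a 0..scale_max item so that 0 becomes scale_max and vice versa."""
    return scale_max - series


# After flipping, a high score means "no delays", matching the other questions.
responses["no_delays"] = reverse_score(responses["any_delays"])
print(responses)
```

The original column can then be dropped before fitting KMeans, so the same information is not counted twice.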