CST_categorizer

A project that sorts customer support tickets into the relevant category.

Inspiration:

This project was inspired by my frustration when logging an online complaint with a service provider: I had to scroll through many complaint categories to find the one that fit best. Some of them sounded too similar, and the whole process took far too much of my time. I wanted to build a feature where a customer can simply describe the complaint and have the system categorize and log it automatically.

Challenges

This looks like a standard classification problem, but the input is a free-text description. The descriptions first have to be converted into embeddings, which are then classified. The solution has to capture context from those embeddings, which becomes harder as their dimensionality grows.

Dataset

We will use this customer support ticket dataset from Kaggle: https://www.kaggle.com/datasets/suraj520/customer-support-ticket-dataset

EDA

The 'Ticket Description' column holds the free text describing the customer's issue. The 'Ticket Subject' column provides the class label for the type of ticket. There are 16 categories, each with roughly 480 to 580 tickets (8,469 in total):

Ticket Subject	Count
Refund request	576
Software bug	574
Product compatibility	567
Delivery problem	561
Hardware issue	547
Battery life	542
Network problem	539
Installation support	530
Product setup	529
Payment issue	526
Product recommendation	517
Account access	509
Peripheral compatibility	496
Data loss	491
Cancellation request	487
Display issue	478
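
A per-category count like the table above can be produced with a few lines of standard-library Python. This is only a sketch: the inline rows are hypothetical stand-ins for the Kaggle CSV, and the function name `count_subjects` is mine, not something from the notebooks. The only thing taken from the dataset is the 'Ticket Subject' column name.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the Kaggle CSV; the real file has many more rows and columns.
SAMPLE_CSV = """Ticket Subject,Ticket Description
Refund request,I was charged twice and want my money back.
Software bug,The app crashes when I open settings.
Refund request,Please refund my last order.
Battery life,The battery drains within two hours.
"""


def count_subjects(csv_file):
    """Count how many tickets fall under each 'Ticket Subject' value."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Ticket Subject"] for row in reader)


counts = count_subjects(io.StringIO(SAMPLE_CSV))
for subject, n in counts.most_common():
    print(f"{subject}\t{n}")
```

To run it on the real data, pass an opened file handle for the downloaded CSV instead of the `io.StringIO` sample.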

Other observations are captured in EDA_CST.ipynb. The approaches below can be found in Ticket_Topic.ipynb.

Approach 1

I first converted the ticket descriptions into embeddings using four techniques:

TF-IDF
GloVe
BERT CLS
Sentence Transformers

After generating the embeddings, I applied clustering, an unsupervised learning approach, to look for natural groupings in the data. I used K-means.

Results

None of these techniques produced clearly separable clusters. Metrics such as the Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Fowlkes-Mallows Index (FMI) also indicate poor performance. This could be caused by poor embeddings or by noisy descriptions.
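
For reference, the embed, cluster, and score loop described in this approach can be sketched with scikit-learn as below. It is a minimal illustration using TF-IDF only, not the notebook's actual code: the toy tickets, the two-class labels, and the parameter values are all assumptions chosen to keep the example small.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import (
    adjusted_rand_score,
    fowlkes_mallows_score,
    normalized_mutual_info_score,
)

# Hypothetical toy tickets; the real input would be the 'Ticket Description' column.
texts = [
    "please refund my payment for the cancelled order",
    "I want a refund, the payment was charged twice",
    "refund request for a duplicate payment",
    "battery drains fast and the phone gets hot",
    "the battery life dropped after the update",
    "battery will not charge past fifty percent",
]
# Integer-encoded 'Ticket Subject' labels (0 = refund, 1 = battery), used only for scoring.
true_labels = [0, 0, 0, 1, 1, 1]

# Text -> sparse TF-IDF vectors -> K-means cluster ids.
vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(vectors)

# All three metrics compare a clustering against known labels and
# ignore how the cluster ids happen to be numbered.
scores = {
    "ARI": adjusted_rand_score(true_labels, cluster_ids),
    "NMI": normalized_mutual_info_score(true_labels, cluster_ids),
    "FMI": fowlkes_mallows_score(true_labels, cluster_ids),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

An ARI close to 0 means the clusters agree with the labels no better than chance, while a value close to 1 means they match almost perfectly.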

Approach 2

Next, I tried training a supervised model to predict the label. After cleaning and preprocessing, the text was converted into numerical representations using:

TF-IDF Vectorizer: Term frequency-inverse document frequency
Sentence Transformers (BERT-based): Contextual embeddings
GloVe Embeddings: Global vectors for word representation

Then I applied different combinations of machine learning models:

TF-IDF + Naive Bayes
TF-IDF + Logistic Regression
TF-IDF + Random Forest
GloVe + Random Forest
Sentence Transformers + Logistic Regression
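
As an illustration of one of these combinations, TF-IDF + Logistic Regression can be wired together as a scikit-learn pipeline. This is a sketch under assumptions, not the notebook's code: the toy tickets, the two labels, the split size, and `max_iter` are placeholders I picked for a small runnable example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the real inputs are 'Ticket Description' and 'Ticket Subject'.
texts = [
    "please refund my payment for the cancelled order",
    "I want a refund, the payment was charged twice",
    "refund request for a duplicate payment",
    "still waiting for the refund you promised",
    "refund the amount for the returned item",
    "my refund has not arrived after two weeks",
    "battery drains fast and the phone gets hot",
    "the battery life dropped after the update",
    "battery will not charge past fifty percent",
    "the battery dies within two hours of use",
    "battery percentage jumps around randomly",
    "replacement battery lasts even less than before",
]
labels = ["Refund request"] * 6 + ["Battery life"] * 6

# Stratify so both classes appear in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, stratify=labels, random_state=0
)

# The pipeline fits TF-IDF on the training text only, then trains the classifier on it.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"accuracy: {accuracy:.2f}")
```

Swapping `LogisticRegression` for `MultinomialNB` or `RandomForestClassifier` gives the other TF-IDF combinations listed above with the same pipeline shape.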

I also tried deep learning models for improved performance:

BERT for Text Classification
LSTM with GloVe embeddings

Results

Even the deep learning approaches gave poor results: accuracy of around 6-7%, with precision, recall, and F1-score at 0 for most classes. With 16 roughly balanced classes, random guessing would score about 1/16 ≈ 6.25%, so the models are performing at chance level. Possible causes include data issues such as noisy descriptions, an underfitted model, or a dataset that is too small.
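
To put the 6-7% accuracy in context, it helps to compute what a model that learns nothing would score. The sketch below uses the class counts from the EDA table; the variable names are mine, and the two baselines (uniform guessing and always predicting the largest class) are standard reference points rather than something taken from the notebooks.

```python
# Class counts copied from the EDA table in this README.
class_counts = {
    "Refund request": 576,
    "Software bug": 574,
    "Product compatibility": 567,
    "Delivery problem": 561,
    "Hardware issue": 547,
    "Battery life": 542,
    "Network problem": 539,
    "Installation support": 530,
    "Product setup": 529,
    "Payment issue": 526,
    "Product recommendation": 517,
    "Account access": 509,
    "Peripheral compatibility": 496,
    "Data loss": 491,
    "Cancellation request": 487,
    "Display issue": 478,
}

total = sum(class_counts.values())

# Expected accuracy when picking one of the classes uniformly at random.
uniform_guess = 1 / len(class_counts)

# Accuracy when always predicting the most frequent class.
majority_guess = max(class_counts.values()) / total

print(f"tickets: {total}, classes: {len(class_counts)}")
print(f"uniform random baseline: {uniform_guess:.2%}")
print(f"majority-class baseline: {majority_guess:.2%}")
```

Both baselines land between 6% and 7%, the same range as the reported accuracy, which suggests the trained models extracted little or no usable signal from the descriptions.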

Approach 3

I then approached the problem from an unsupervised point of view. Instead of using predefined topics, which are typical across various industries, we can define customized topics based on a company's own dataset. BERTopic addresses this problem. It uses BERT or other transformer models to create dense vector representations (embeddings) of each document in the corpus. The embeddings are then clustered using HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), and each cluster represents a potential topic. The advantages are that we don't have to specify the number of clusters in advance and that the number of topics can be reduced after fitting the model.
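
The "no need to specify the number of clusters" property comes from density-based clustering. The sketch below illustrates it with scikit-learn's DBSCAN, a simpler relative of HDBSCAN that I am using here as a stand-in; it is not BERTopic and not the notebook's code. The 2-D points are hypothetical placeholders for real embeddings, and `eps` and `min_samples` are values chosen for this toy data.

```python
from sklearn.cluster import DBSCAN

# Hypothetical 2-D "embeddings": two dense groups plus one isolated point.
points = [
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],
    [10.0, -10.0],
]

# No n_clusters argument: groups are discovered from point density alone.
# Points that belong to no dense region receive the label -1 (noise).
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points).tolist()

n_clusters = len(set(labels) - {-1})
n_noise = labels.count(-1)

print("cluster ids:", labels)
print("clusters found:", n_clusters, "noise points:", n_noise)
```

In the topic-modeling setting, each discovered cluster would correspond to a candidate topic, and the noise label would collect tickets that do not fit any topic well.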

Results

I got satisfactory results with this approach when I tested it on a new ticket.


SnehaR26/CST_categorizer
