A text classification project covering everything from data collection through model training to deployment. The model classifies questions into 69 different categories.
The keys of `deployment\category_types_encoded.json` contain the question categories.
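For illustration, the category names can be loaded from that file like so (a minimal sketch; it only assumes the categories are the JSON object's keys, as described above):

```python
import json

# Load the encoded category mapping; the keys are the category names.
with open("deployment/category_types_encoded.json", encoding="utf-8") as f:
    category_types = json.load(f)

categories = list(category_types.keys())  # the 69 question categories
print(len(categories), categories[:5])
```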
Data was collected from the questions section of the Stack Overflow website: https://stackoverflow.com/questions. The data collection process is divided into two steps:
- Question URL Scraping: The question URLs were scraped with `scraper\questions_url_scraper.py`, and the URLs are stored along with the question titles in `data\questions_urls.csv`.
- Question Details Scraping: Using those URLs, the full question text and its categories/tags were scraped with `scraper\questions_details_scraper.py` and stored in `data\questions_details.csv`.
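The scrapers themselves live in the `scraper` folder; as a rough illustration, the URL-scraping step could be sketched like this (the CSS selector, pagination parameters, and output format are assumptions for the sketch, not the repo's actual code):

```python
import csv
import requests
from bs4 import BeautifulSoup

BASE = "https://stackoverflow.com/questions"

def scrape_question_urls(pages=5):
    """Collect (title, url) pairs from the first few listing pages.

    The selector below is an assumption about Stack Overflow's markup
    and may need adjusting; this is a sketch, not the repo's scraper.
    """
    rows = []
    for page in range(1, pages + 1):
        resp = requests.get(BASE, params={"page": page, "pagesize": 50}, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        for link in soup.select("a.s-link"):
            href = link.get("href", "")
            if href.startswith("/questions/"):
                rows.append((link.text.strip(), "https://stackoverflow.com" + href))
    return rows

if __name__ == "__main__":
    with open("data/questions_urls.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "url"])
        writer.writerows(scrape_question_urls())
```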
In total, I scraped 22,257 question URLs and 22,124 question details. Some URLs didn't point to a valid page; those entries were ignored.
Initially, there were 10,634 distinct categories in the dataset. After some analysis, I found that 10,565 of them were rare (they contained too few related questions), so I removed those categories, leaving only 69. After dropping the samples that belonged solely to rare categories, 17,011 samples remained. Fortunately, the dataset didn't have any null values.
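A pandas sketch of this filtering step might look like the following (the column names and the rarity threshold are assumptions for illustration):

```python
import pandas as pd

df = pd.read_csv("data/questions_details.csv")

# Assume the 'tags' column holds pipe-separated category names, e.g. "python|pandas".
tag_counts = df["tags"].str.split("|").explode().value_counts()

# Keep only the frequent categories; the threshold here is illustrative.
frequent = set(tag_counts[tag_counts >= 100].index)

def keep_frequent(tags):
    return [t for t in tags.split("|") if t in frequent]

df["tags"] = df["tags"].map(keep_frequent)
df = df[df["tags"].map(len) > 0]  # drop samples left with no category
print(len(frequent), "categories,", len(df), "samples remaining")
```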
Fine-tuned `distilroberta-base` and `distilbert-base-uncased` models from Hugging Face Transformers using fastai and Blurr. The model training notebooks can be viewed in the `notebooks` folder of this branch.
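The notebooks use fastai and Blurr; as a rough point of reference, here is what an equivalent multi-label fine-tune looks like with the plain Hugging Face Transformers `Trainer` API instead (a sketch under assumed column names and hyperparameters, not the repo's training code):

```python
import numpy as np
import pandas as pd
import torch
from sklearn.preprocessing import MultiLabelBinarizer
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "distilroberta-base"

# Assumed columns: 'text' (question body) and 'tags' (pipe-separated categories).
df = pd.read_csv("data/questions_details.csv")
mlb = MultiLabelBinarizer()
labels = mlb.fit_transform(df["tags"].str.split("|")).astype(np.float32)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

class QuestionDataset(torch.utils.data.Dataset):
    """Tokenized questions paired with multi-hot label vectors."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(list(texts), truncation=True, padding=True, max_length=256)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=labels.shape[1],
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args,
        train_dataset=QuestionDataset(df["text"], labels)).train()
```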
The table below shows the multilabel accuracy and the F1 score (micro & macro) for the two models.
| Model | Accuracy_multi (%) | F1 Score (Micro, %) | F1 Score (Macro, %) |
|---|---|---|---|
| distilroberta-base | 98.4 | 67.03 | 53.34 |
| distilbert-base-uncased | 98.3 | 64.44 | 52.74 |
From the table above, we can see that the multilabel accuracy is very close for both models, but both F1 scores (micro & macro) of `distilroberta-base` are higher than those of `distilbert-base-uncased`. So, we can say that `distilroberta-base` performed slightly better on the given dataset.
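For reference, `Accuracy_multi` here is fastai's element-wise multilabel accuracy. It and the F1 scores can be reproduced roughly as follows (a scikit-learn sketch assuming sigmoid probabilities thresholded at 0.5):

```python
import numpy as np
from sklearn.metrics import f1_score

def accuracy_multi(y_true, y_prob, thresh=0.5):
    """Element-wise accuracy over the label matrix (fastai-style)."""
    return ((y_prob > thresh) == y_true.astype(bool)).mean()

# Toy example: y_true is a multi-hot matrix, y_prob the sigmoid outputs.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_prob = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.4]])

y_pred = (y_prob > 0.5).astype(int)
print(accuracy_multi(y_true, y_prob))             # multilabel accuracy
print(f1_score(y_true, y_pred, average="micro"))  # F1 (micro)
print(f1_score(y_true, y_pred, average="macro"))  # F1 (macro)
```

Note that element-wise accuracy also counts the many true negatives across 69 labels, which is why it sits near 98% while the F1 scores are much lower; the F1 scores are the more informative comparison here.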
The trained model takes up 300+ MB on disk. I compressed it using ONNX quantization, bringing it down to ~78.8 MB.
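The compression step can be reproduced with onnxruntime's dynamic quantization (the file names here are placeholders):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic quantization converts the exported model's weights to int8,
# shrinking the 300+ MB export to well under 100 MB.
quantize_dynamic(
    model_input="model.onnx",         # the exported (unquantized) model
    model_output="model-quant.onnx",  # the compressed model for deployment
    weight_type=QuantType.QInt8,
)
```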
The compressed model is deployed as a Gradio app on Hugging Face Spaces. The implementation can be found in the `deployment` folder, or see it live here.
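A minimal version of such a Gradio app looks like the following (a sketch; `predict` is a placeholder for the actual ONNX inference code in `deployment`):

```python
import gradio as gr

def predict(question: str) -> dict:
    """Placeholder for the real ONNX inference; returns {category: probability}."""
    # Run the quantized ONNX model here and map outputs to category names.
    return {"python": 0.91, "pandas": 0.45}

demo = gr.Interface(
    fn=predict,
    inputs=gr.Textbox(lines=5, label="Question"),
    outputs=gr.Label(num_top_classes=5, label="Predicted categories"),
    title="Stack Overflow Question Classifier",
)

if __name__ == "__main__":
    demo.launch()
```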
Deployed a Flask app that takes a question description and shows the predicted categories as output. Check the `flask` branch for the details. The website is live here.
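A stripped-down sketch of the Flask route might look like this (not the actual app from the `flask` branch; `classify` is a hypothetical helper wrapping the model):

```python
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    categories = []
    if request.method == "POST":
        question = request.form.get("question", "")
        # Call the same ONNX model used by the Gradio app (inference omitted).
        categories = classify(question)  # hypothetical helper returning category names
    return render_template("index.html", categories=categories)

if __name__ == "__main__":
    app.run(debug=True)
```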