A text classification project covering everything from data collection through model training to deployment. The model classifies questions into 69 different categories.
The keys of `deployment\category_types_encoded.json` contain the question categories.
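For illustration, the category names can be loaded from that file like so (a minimal sketch; it only assumes the categories are the JSON object's keys, as described above):

```python
import json

# Load the encoded category mapping; the keys are the category names.
with open("deployment/category_types_encoded.json", encoding="utf-8") as f:
    category_types = json.load(f)

categories = list(category_types.keys())  # the 69 question categories
print(len(categories), categories[:5])
```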
Data was collected from the questions section of the Stack Overflow website: https://stackoverflow.com/questions. The data collection process is divided into two steps:
- Question URL Scraping: The question URLs were scraped with `scraper\questions_url_scraper.py`, and the URLs are stored along with the question titles in `data\questions_urls.csv`.
- Question Details Scraping: Using those URLs, the full question text and its categories/tags were scraped with `scraper\questions_details_scraper.py` and stored in `data\questions_details.csv`.
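The scrapers themselves live in the `scraper` folder; as a rough illustration, the URL-scraping step could be sketched like this (the CSS selector, pagination parameters, and output format are assumptions for the sketch, not the repo's actual code):

```python
import csv
import requests
from bs4 import BeautifulSoup

BASE = "https://stackoverflow.com/questions"

def scrape_question_urls(pages=5):
    """Collect (title, url) pairs from the first few listing pages.

    The selector below is an assumption about Stack Overflow's markup
    and may need adjusting; this is a sketch, not the repo's scraper.
    """
    rows = []
    for page in range(1, pages + 1):
        resp = requests.get(BASE, params={"page": page, "pagesize": 50}, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        for link in soup.select("a.s-link"):
            href = link.get("href", "")
            if href.startswith("/questions/"):
                rows.append((link.text.strip(), "https://stackoverflow.com" + href))
    return rows

if __name__ == "__main__":
    with open("data/questions_urls.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "url"])
        writer.writerows(scrape_question_urls())
```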
In total, I scraped 22,257 question URLs and 22,124 question details. Some URLs didn't point to a valid page; those entries were ignored.
Initially, there were 10,634 distinct categories in the dataset. After some analysis, I found that 10,565 of them were rare (they contained too few related questions), so I removed those categories, leaving only 69. After dropping the samples that belonged solely to rare categories, 17,011 samples remained. Fortunately, the dataset didn't have any null values.
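A pandas sketch of this filtering step might look like the following (the column names and the rarity threshold are assumptions for illustration):

```python
import pandas as pd

df = pd.read_csv("data/questions_details.csv")

# Assume the 'tags' column holds pipe-separated category names, e.g. "python|pandas".
tag_counts = df["tags"].str.split("|").explode().value_counts()

# Keep only the frequent categories; the threshold here is illustrative.
frequent = set(tag_counts[tag_counts >= 100].index)

def keep_frequent(tags):
    return [t for t in tags.split("|") if t in frequent]

df["tags"] = df["tags"].map(keep_frequent)
df = df[df["tags"].map(len) > 0]  # drop samples left with no category
print(len(frequent), "categories,", len(df), "samples remaining")
```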
Fine-tuned `distilroberta-base` and `distilbert-base-uncased` models from Hugging Face Transformers using fastai and Blurr. The model training notebooks can be viewed in the `notebooks` folder of this branch.
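The notebooks use fastai and Blurr; as a rough point of reference, here is what an equivalent multi-label fine-tune looks like with the plain Hugging Face Transformers `Trainer` API instead (a sketch under assumed column names and hyperparameters, not the repo's training code):

```python
import numpy as np
import pandas as pd
import torch
from sklearn.preprocessing import MultiLabelBinarizer
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "distilroberta-base"

# Assumed columns: 'text' (question body) and 'tags' (pipe-separated categories).
df = pd.read_csv("data/questions_details.csv")
mlb = MultiLabelBinarizer()
labels = mlb.fit_transform(df["tags"].str.split("|")).astype(np.float32)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

class QuestionDataset(torch.utils.data.Dataset):
    """Tokenized questions paired with multi-hot label vectors."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(list(texts), truncation=True, padding=True, max_length=256)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=labels.shape[1],
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args,
        train_dataset=QuestionDataset(df["text"], labels)).train()
```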
The table below shows the multilabel accuracy and the F1 score (micro & macro) for the two models.
| Model | Accuracy_multi (%) | F1 Score (Micro, %) | F1 Score (Macro, %) |
|---|---|---|---|
| distilroberta-base | 98.4 | 67.03 | 53.34 |
| distilbert-base-uncased | 98.3 | 64.44 | 52.74 |
From the table above, we can see that the multilabel accuracy is very close for both models, but both F1 scores (micro & macro) of `distilroberta-base` are higher than those of `distilbert-base-uncased`. So, we can say that `distilroberta-base` performed slightly better on the given dataset.
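For reference, `Accuracy_multi` here is fastai's element-wise multilabel accuracy. It and the F1 scores can be reproduced roughly as follows (a scikit-learn sketch assuming sigmoid probabilities thresholded at 0.5):

```python
import numpy as np
from sklearn.metrics import f1_score

def accuracy_multi(y_true, y_prob, thresh=0.5):
    """Element-wise accuracy over the label matrix (fastai-style)."""
    return ((y_prob > thresh) == y_true.astype(bool)).mean()

# Toy example: y_true is a multi-hot matrix, y_prob the sigmoid outputs.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_prob = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.4]])

y_pred = (y_prob > 0.5).astype(int)
print(accuracy_multi(y_true, y_prob))             # multilabel accuracy
print(f1_score(y_true, y_pred, average="micro"))  # F1 (micro)
print(f1_score(y_true, y_pred, average="macro"))  # F1 (macro)
```

Note that element-wise accuracy also counts the many true negatives across 69 labels, which is why it sits near 98% while the F1 scores are much lower; the F1 scores are the more informative comparison here.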
The trained model takes up 300+ MB on disk. I compressed it using ONNX quantization, bringing it down to ~78.8 MB.
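The compression step can be reproduced with onnxruntime's dynamic quantization (the file names here are placeholders):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic quantization converts the exported model's weights to int8,
# shrinking the 300+ MB export to well under 100 MB.
quantize_dynamic(
    model_input="model.onnx",         # the exported (unquantized) model
    model_output="model-quant.onnx",  # the compressed model for deployment
    weight_type=QuantType.QInt8,
)
```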
The compressed model is deployed as a Gradio app on Hugging Face Spaces. The implementation can be found in the `deployment` folder, or see it live here.
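A minimal version of such a Gradio app looks like the following (a sketch; `predict` is a placeholder for the actual ONNX inference code in `deployment`):

```python
import gradio as gr

def predict(question: str) -> dict:
    """Placeholder for the real ONNX inference; returns {category: probability}."""
    # Run the quantized ONNX model here and map outputs to category names.
    return {"python": 0.91, "pandas": 0.45}

demo = gr.Interface(
    fn=predict,
    inputs=gr.Textbox(lines=5, label="Question"),
    outputs=gr.Label(num_top_classes=5, label="Predicted categories"),
    title="Stack Overflow Question Classifier",
)

if __name__ == "__main__":
    demo.launch()
```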
Deployed a Flask app that takes a question description and shows the predicted categories as output. Check the `flask` branch for the details. The website is live here.
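A stripped-down sketch of the Flask route might look like this (not the actual app from the `flask` branch; `classify` is a hypothetical helper wrapping the model):

```python
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    categories = []
    if request.method == "POST":
        question = request.form.get("question", "")
        # Call the same ONNX model used by the Gradio app (inference omitted).
        categories = classify(question)  # hypothetical helper returning category names
    return render_template("index.html", categories=categories)

if __name__ == "__main__":
    app.run(debug=True)
```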