NLP - Text category classifier

Marcos V. O. Assis (mvoassis@gmail.com)

Objective:

Build a model capable of classifying articles into categories based on their titles.

Language:

Development and testing used text in Brazilian Portuguese (pt-BR).

Approach:

Download the pre-trained Word2Vec CBOW model from NILC (cbow_s300).
Load the model using the Gensim library.
Vectorize the article titles using the NLTK library and the CBOW model.
Train a Logistic Regression classification model using the scikit-learn library.
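
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the file path `cbow_s300.txt`, the helper names (`load_cbow`, `title_to_vector`, `vectorize`), the choice to sum word vectors, and the toy vectors and titles are all assumptions made for the example. The default tokenizer is a plain whitespace split; an NLTK tokenizer's `tokenize` method could be passed in its place.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def load_cbow(path="cbow_s300.txt"):
    """Load word2vec-format vectors with Gensim (the path is an assumption)."""
    # Imported lazily so the toy demo below runs without the large model file.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path)


def title_to_vector(title, vectors, dim, tokenize=str.split):
    """Sum the vectors of the title's known tokens into one fixed-size vector."""
    total = np.zeros(dim)
    for token in tokenize(title.lower()):
        if token in vectors:  # skip out-of-vocabulary words
            total += vectors[token]
    return total


def vectorize(titles, vectors, dim, tokenize=str.split):
    """Turn a list of titles into a (n_titles, dim) feature matrix."""
    return np.array([title_to_vector(t, vectors, dim, tokenize) for t in titles])


# Toy 2-dimensional "embeddings" standing in for the 300-dimensional CBOW model.
toy_vectors = {
    "gol": np.array([1.0, 0.0]),
    "time": np.array([0.9, 0.1]),
    "dólar": np.array([0.0, 1.0]),
    "bolsa": np.array([0.1, 0.9]),
}
train_titles = ["Time marca gol", "Gol decide jogo", "Dólar sobe", "Bolsa cai com dólar"]
train_labels = ["esporte", "esporte", "mercado", "mercado"]

X_train = vectorize(train_titles, toy_vectors, dim=2)
clf = LogisticRegression(max_iter=200).fit(X_train, train_labels)
prediction = clf.predict(vectorize(["Gol do time"], toy_vectors, dim=2))[0]
print(prediction)
```

With the real model, `vectors = load_cbow()` and `dim=300` would replace the toy dictionary; both a plain dictionary and Gensim's `KeyedVectors` support the `token in vectors` and `vectors[token]` operations used here.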

Categories considered:

Mundo (World)
Cotidiano (Daily life)
Mercado (Market)
Esporte (Sports)
Ilustrada (Illustrated)
Colunas (Columns)

Files

Train dataset -> Link - 90000 entries
Test dataset -> Link - 20513 entries

Results

0.8 accuracy using CBOW embeddings (against 0.3 accuracy for a Dummy Classifier baseline).
Per-category F1-scores achieved by the proposed model:
1. colunas - 0.78
2. cotidiano - 0.69
3. esporte - 0.90
4. ilustrada - 0.23
5. mercado - 0.81
6. mundo - 0.79
CBOW performed slightly better than Skip-gram for this problem.
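
Metrics of this kind (overall accuracy, a dummy baseline, and per-category F1) can be computed with scikit-learn along the following lines. This is a hedged sketch on made-up toy labels, not the reported experiment: the `evaluate` helper, the `most_frequent` dummy strategy, and all data below are assumptions for illustration.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score


def evaluate(y_true, y_pred, labels):
    """Return overall accuracy and a {category: F1} dictionary."""
    accuracy = accuracy_score(y_true, y_pred)
    # average=None yields one F1 value per label, in the order given by `labels`.
    f1_values = f1_score(y_true, y_pred, labels=labels, average=None, zero_division=0)
    return accuracy, dict(zip(labels, f1_values))


labels = ["esporte", "mercado", "mundo"]

# Toy training data; the dummy baseline ignores the features entirely.
X_train = [[0], [1], [2], [3], [4], [5]]
y_train = ["esporte", "esporte", "esporte", "mercado", "mercado", "mundo"]
X_test = [[0], [1], [2], [3]]
y_test = ["esporte", "mercado", "esporte", "mundo"]

# Baseline: always predict the most frequent training category (assumed strategy).
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc, _ = evaluate(y_test, dummy.predict(X_test), labels)

# Hypothetical predictions standing in for the trained classifier's output.
y_pred = ["esporte", "mercado", "esporte", "mercado"]
model_acc, per_class_f1 = evaluate(y_test, y_pred, labels)

print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"model accuracy: {model_acc:.2f}")
for category, score in per_class_f1.items():
    print(f"{category}: F1 = {score:.2f}")
```

Reporting F1 per category alongside accuracy makes weak classes visible (such as ilustrada in the results above) that a single accuracy figure would hide.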

Repository contents

NLP_Word2Vec_Category_classifier.ipynb
README.md

mvoassis/nlp-category-classifier
