The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It has become a popular data set for experiments in text applications of machine learning, such as text classification and text clustering.
For more information and to download the data: http://qwone.com/~jason/20Newsgroups/
- BoW, TF-IDF, LDA
  - Gensim; sklearn
- Doc2Vec
  - Gensim
- Visualization
  - t-SNE [1] or PCA
  - matplotlib; seaborn; visdom; tensorboard
  - MATLAB is also a capable option
- Document clustering
  - sklearn.cluster.KMeans
  - sklearn.metrics
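
As a rough illustration of how the clustering tools listed above fit together, the sketch below builds TF-IDF features, clusters them with `sklearn.cluster.KMeans`, and scores the result with a function from `sklearn.metrics`. The six-sentence toy corpus, its labels, and all parameter values are assumptions made for this example; they are not taken from the project notebooks. On the real data, `sklearn.datasets.fetch_20newsgroups` could supply the documents instead.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

# Toy stand-in corpus (assumption): two obvious topics, three documents each.
docs = [
    "the rocket launch reached orbit around the moon",
    "nasa plans a new space shuttle launch to orbit",
    "the space station orbit was adjusted after launch",
    "the hockey team won the game in overtime",
    "the goalie saved the hockey game for the team",
    "fans cheered as the team scored in the hockey game",
]
labels_true = [0, 0, 0, 1, 1, 1]  # hypothetical ground-truth topic labels

# TF-IDF features: one sparse row per document, one column per vocabulary term.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# K-means with as many clusters as there are known topics.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels_pred = kmeans.fit_predict(X)

# Adjusted Rand index: 1.0 means the clusters match the labels exactly,
# values near 0 mean the agreement is no better than chance.
ari = adjusted_rand_score(labels_true, labels_pred)
print(f"Adjusted Rand index: {ari:.2f}")
```

The adjusted Rand index is only one of several options in `sklearn.metrics`; it is used here because it does not depend on how the cluster ids happen to be numbered.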
1. Preprocess the dataset
   - Clean the data and build the vocabulary
   - Visualize the statistics of the dataset
   - Baseline document features
     - Bag-of-words; TF-IDF model
2. Topic Modeling
3. Vector representation of documents
4. Comparison between different document representations
FinalProject_codes1.ipynb
- Vector Representations
FinalProject_codes2.ipynb
- Topic Modeling
Text Classification methods in NLP using Deep Learning.ipynb
- Convolutional Neural Network
Topic_modeling_and_clustering_Report.pdf
- PDF report that describes the approaches used for document representation and classification in Python and compares the different approaches, along with their visual representations using t-SNE
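
For readers who want to try a t-SNE comparison like the one in the report, the following sketch projects document features to two dimensions with both PCA and t-SNE. It is an illustrative assumption, not the report's code: the toy corpus, the TF-IDF features, and the `perplexity=2` setting (which must stay below the number of documents) were chosen only so that the example runs on six sentences.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Toy stand-in corpus (assumption for illustration only).
docs = [
    "the rocket launch reached orbit around the moon",
    "nasa plans a new space shuttle launch to orbit",
    "the space station orbit was adjusted after launch",
    "the hockey team won the game in overtime",
    "the goalie saved the hockey game for the team",
    "fans cheered as the team scored in the hockey game",
]

# Dense TF-IDF matrix; any other document representation could be used here.
X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# Linear projection onto the two directions of largest variance.
coords_pca = PCA(n_components=2, random_state=0).fit_transform(X)

# Non-linear embedding; perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=2, init="random", random_state=0)
coords_tsne = tsne.fit_transform(X)

print("PCA coordinates shape:", coords_pca.shape)
print("t-SNE coordinates shape:", coords_tsne.shape)
```

Either coordinate array can then be passed to a scatter plot, for example in matplotlib or seaborn, with points colored by newsgroup label.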