The goal of this project is to develop a framework that clusters news articles based on their text content. It applies several techniques: TF-IDF, cosine similarity, truncated SVD, and k-means clustering. The project consists of four parts:
- Process and tokenize the news articles
- Build a sparse TF-IDF matrix from all terms of the news articles
- Perform dimensionality reduction using truncated SVD
- Cluster the news articles using k-means clustering
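
The four parts above can be sketched end to end as follows. This is a minimal illustration that assumes scikit-learn is available, not the project's actual implementation: the function name `cluster_articles`, the sample headlines, and all parameter values are hypothetical. The reduced vectors are scaled to unit length so that Euclidean k-means approximates clustering by cosine similarity.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer


def cluster_articles(articles, n_clusters=2, n_components=2, seed=0):
    """Return one cluster label per article (hypothetical helper)."""
    # Parts 1 and 2: tokenize the text and build a sparse TF-IDF matrix.
    # TfidfVectorizer lowercases and splits on word boundaries by default.
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(articles)

    # Part 3: reduce dimensionality with truncated SVD, then rescale each
    # row to unit length so Euclidean distance tracks cosine similarity.
    reducer = make_pipeline(
        TruncatedSVD(n_components=n_components, random_state=seed),
        Normalizer(copy=False),
    )
    reduced = reducer.fit_transform(tfidf)

    # Part 4: cluster the reduced vectors with k-means.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return kmeans.fit_predict(reduced)


# Toy corpus (made up for illustration): three finance and three sports headlines.
sample_articles = [
    "stock market shares fall",
    "stock market shares rise",
    "market shares stock trading",
    "football team wins match",
    "football team loses match",
    "match football team coach",
]

labels = cluster_articles(sample_articles, n_clusters=2, n_components=2)
```

On a real corpus, `n_components` and `n_clusters` would need tuning, and the vectorizer would likely need stop-word removal and document-frequency cutoffs (for example the `stop_words`, `min_df`, and `max_df` parameters of `TfidfVectorizer`).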