Demo for News Article Collection and Volume Reduction Pipeline

Brief notebooks that walk through the following steps using a dataset of NYT Front Page articles:

Find efficient keywords with word embeddings (gensim)
Remove duplicate articles with cosine similarity on TF-IDF vectors (scikit-learn)
Remove duplicate articles with entity extraction and Jaccard similarity (spaCy)
Classify relevant articles (scikit-learn)
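
As a rough illustration of the Jaccard-based deduplication step, the sketch below compares articles by the overlap of their named-entity sets. It is not taken from the notebooks: the function names, the sample entity sets, and the 0.8 threshold are assumptions made for the example, and the entity sets are hard-coded here rather than extracted with spaCy.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union.

    Two empty sets are scored 0.0 here (an arbitrary choice for this sketch).
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def drop_near_duplicates(
    entities: Dict[str, Set[str]], threshold: float = 0.8
) -> List[str]:
    """Greedily keep an article only if it is below the threshold
    against every article already kept (O(n^2) comparisons)."""
    kept: List[str] = []
    for article_id, ents in entities.items():
        if all(jaccard(ents, entities[k]) < threshold for k in kept):
            kept.append(article_id)
    return kept


# Hypothetical entity sets, standing in for what an NER step might produce.
articles = {
    "a1": {"Senate", "Washington", "Biden"},
    "a2": {"Senate", "Washington", "Biden"},  # same entities as a1
    "a3": {"Tokyo", "Olympics", "IOC"},
}

print(drop_near_duplicates(articles))  # ['a1', 'a3']
```

The greedy pass keeps the first article in each near-duplicate group, so the order of the input decides which copy survives. The threshold would need tuning on real data.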
