Brief notebooks that run through the following processes using a dataset of NYT Front Page articles:
- Find efficient keywords with word embeddings (`gensim`)
- Remove duplicate articles with cosine similarity on TF-IDF vectors (`scikit-learn`)
- Remove duplicate articles with entity extraction and Jaccard similarity (`spacy`)
- Classify relevant articles (`scikit-learn`)
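
As a rough illustration of the Jaccard-based deduplication step listed above, here is a minimal sketch, not the notebooks' actual code. It assumes the named entities have already been extracted into one set per article (for example with spaCy). The function names `jaccard` and `drop_near_duplicates`, the `0.5` threshold, and the sample entity sets are all hypothetical, chosen only for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union.

    Defined here as 0.0 when both sets are empty (an assumption, to avoid
    dividing by zero).
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def drop_near_duplicates(entity_sets, threshold=0.5):
    """Return the indices of articles to keep, scanning in order.

    An article is dropped when its entity set has Jaccard similarity
    >= threshold with any article that has already been kept.
    """
    kept = []
    for i, entities in enumerate(entity_sets):
        if all(jaccard(entities, entity_sets[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical entity sets standing in for the output of an entity extractor.
articles = [
    {"Senate", "Washington", "Budget Office"},
    {"Senate", "Washington", "Budget Office", "White House"},
    {"Yankees", "Boston", "Fenway Park"},
]

kept = drop_near_duplicates(articles, threshold=0.5)
print(kept)
```

The pairwise scan is quadratic in the number of articles, which is usually acceptable for a front-page dataset of modest size; the threshold would need tuning against examples of known duplicates.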