- We first need to transform text to vectors
- String to vectors tutorial
- Create a dictionary first that maps words to ids
- Transform each tokenized text into a vector through `dictionary.doc2bow(text)`
- Corpus streaming tutorial (for very large corpora)
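The dictionary, bag-of-words and streaming steps above can be sketched without gensim, to show the shape of the data. This is a minimal stdlib-only stand-in, not gensim's implementation: `build_dictionary`, `doc2bow` and `StreamedCorpus` are hypothetical names chosen to mirror `corpora.Dictionary`, `Dictionary.doc2bow` and the streaming pattern, and the whitespace tokenizer is an assumption.

```python
from collections import Counter


def build_dictionary(texts):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc2bow(token2id, tokens):
    """Turn one tokenized document into sorted (token_id, count) pairs.

    Tokens that are not in the dictionary are silently ignored.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


class StreamedCorpus:
    """Yield one bag-of-words vector at a time instead of holding them all in memory."""

    def __init__(self, lines, token2id):
        self.lines = lines  # any re-iterable of strings, e.g. a list of lines
        self.token2id = token2id

    def __iter__(self):
        for line in self.lines:
            # Assumption: lowercase + whitespace split stands in for real tokenization.
            yield doc2bow(self.token2id, line.lower().split())


texts = [["human", "computer", "interface"], ["computer", "survey", "computer"]]
token2id = build_dictionary(texts)
bow = doc2bow(token2id, texts[1])
```

Because `StreamedCorpus` only defines `__iter__`, it can be passed anywhere a corpus is iterated more than once, provided `lines` itself can be re-iterated.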
- Models (e.g. LsiModel, Word2Vec) are built / trained from a corpus
- Transformation interface tutorial
- Docs, Source
- tf-idf scores are normalized (sum of squares of scores = 1)
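The normalization property above (sum of squared scores equals 1 per document) can be checked with a small hand-rolled version. This sketch assumes the common `count * log2(N / document_frequency)` weighting followed by L2 normalization; the function name `tfidf` is hypothetical and the code is illustrative, not gensim's `TfidfModel`.

```python
import math


def tfidf(corpus_bow):
    """Weight (token_id, count) pairs by count * log2(N / df), then L2-normalize each document."""
    n_docs = len(corpus_bow)

    # Document frequency: in how many documents does each token id occur?
    doc_freq = {}
    for doc in corpus_bow:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted = []
    for doc in corpus_bow:
        scores = [(tid, cnt * math.log2(n_docs / doc_freq[tid])) for tid, cnt in doc]
        # Tokens that occur in every document get weight 0 and are dropped.
        scores = [(tid, s) for tid, s in scores if s > 0.0]
        norm = math.sqrt(sum(s * s for _, s in scores))
        weighted.append([(tid, s / norm) for tid, s in scores] if norm else [])
    return weighted


corpus = [[(0, 1), (1, 2)], [(1, 1), (2, 1)], [(2, 3)]]
vectors = tfidf(corpus)
```

After normalization, every non-empty document vector has unit length, which is what makes cosine similarity between documents a plain dot product.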
- Detects words that belong in a phrase, useful for models like Word2Vec ("new", "york" -> "new york")
- Docs, Source (uses bigram detectors underneath)
- Phrases example on How I Met Your Mother
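The idea behind phrase detection can be illustrated with a count-based collocation score in the spirit of Mikolov et al. This is a simplified sketch under assumptions: `find_phrases` and `merge_phrases` are hypothetical helpers, the score formula and the `min_count` and `threshold` values are illustrative, and none of it is read off gensim's source.

```python
from collections import Counter


def find_phrases(sentences, min_count=1, threshold=2.0):
    """Score adjacent word pairs; keep those that co-occur more than chance suggests."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)

    phrases = {}
    for (a, b), n_ab in bigrams.items():
        # Assumed score: discounted pair count relative to the individual word counts.
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases[(a, b)] = score
    return phrases


def merge_phrases(sentence, phrases, delimiter=" "):
    """Join detected pairs into single tokens, scanning left to right."""
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in phrases:
            out.append(sentence[i] + delimiter + sentence[i + 1])
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out


sentences = [
    ["new", "york", "is", "big"],
    ["she", "moved", "to", "new", "york"],
    ["he", "is", "in", "new", "york"],
    ["the", "city", "is", "big"],
]
phrases = find_phrases(sentences)
merged = merge_phrases(sentences[1], phrases)
```

The merged sentences, not the raw ones, are what would then be fed to a model such as Word2Vec so that the phrase gets its own vector.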
- Docs, Source (very standard LSI implementation)
- How to interpret negative LSI values
- Random Projection (used as an option to speed up LSI)
- Colouring words by topic in a document, printing the words in a topic
- Topic Coherence, a metric that correlates with human judgement of topic quality.
- Compare topics and documents using Jaccard, Kullback-Leibler and Hellinger similarities
- America's Next Topic Model slides -- How to choose your next topic model, presented at PyData Berlin, 10 August 2016, by Lev Konstantinovskiy
- Classification of News Articles using Topic Modeling
- LDA: pre-processing and training tips
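The Jaccard, Kullback-Leibler and Hellinger measures named in the list above have short textbook definitions. The sketch below writes them out with the standard library only; the function names are hypothetical, topics are assumed to be either sets of top words (Jaccard) or probability vectors of equal length (the other two).

```python
import math


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def kullback_leibler(p, q):
    """KL(p || q) in nats. Assumes q[i] > 0 wherever p[i] > 0. Not symmetric."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hellinger(p, q):
    """Hellinger distance between two probability vectors; symmetric, in [0, 1]."""
    return math.sqrt(0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q)))


topic_a = [0.7, 0.2, 0.1]
topic_b = [0.1, 0.2, 0.7]
distance = hellinger(topic_a, topic_b)
```

Hellinger is often the easier one to work with for topic distributions because it is symmetric and tolerates zero probabilities, while Kullback-Leibler does not.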
- Tool to get the most similar documents for LDA, LSI
- Similarity queries tutorial
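A similarity query boils down to ranking indexed documents by cosine similarity to a query vector. The following is a minimal brute-force sketch over sparse `(id, weight)` vectors; `cosine` and `most_similar` are hypothetical helpers, not gensim's index classes, which are built to do the same job at scale.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as (id, weight) pairs."""
    du, dv = dict(u), dict(v)
    dot = sum(w * dv.get(i, 0.0) for i, w in du.items())
    norm_u = math.sqrt(sum(w * w for w in du.values()))
    norm_v = math.sqrt(sum(w * w for w in dv.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(query, index, topn=3):
    """Return (document position, similarity) pairs, best match first."""
    sims = [(doc_id, cosine(query, vec)) for doc_id, vec in enumerate(index)]
    return sorted(sims, key=lambda pair: -pair[1])[:topn]


index = [[(0, 1.0)], [(0, 1.0), (1, 1.0)], [(1, 1.0)]]
ranking = most_similar([(0, 1.0)], index)
```

The vectors in `index` can come from any transformation (bag-of-words, tf-idf, LSI or LDA topics), as long as the query is transformed the same way.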
- Model evolution of topics through time
- Easy intro to DTM. Evolution of Voldemort topic through the 7 Harry Potter books.
- Dynamic Topic Modeling and Dynamic Influence Model Tutorial
- Python Dynamic Topic Modelling Theory and Tutorial
- Docs, Source (very simple interface)
- Simple word2vec tutorial (examples of `most_similar`, `similarity`, `doesnt_match`)
- Comparison of FastText and Word2Vec
- Doc2vec Quick Start on Lee Corpus
- Docs, Source (Docs are not very good)
- Doc2Vec requires a non-standard corpus (each document needs a tag, such as a sentiment label)
- Great illustration of corpus preparation, Code (Alternative, Alternative 2)
- Doc2Vec on customer review (example)
- Doc2Vec on Airline Tweets Sentiment Analysis
- Doc2vec to predict IMDB review star rating. Reproducing the Google paper
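Corpus preparation for Doc2Vec amounts to pairing each token list with one or more tags. The sketch below shows that shape with a stdlib `namedtuple`; `TaggedDoc` and `tag_documents` are hypothetical stand-ins for illustration (gensim ships its own `TaggedDocument` with `words` and `tags` fields), and the `DOC_<n>` tag scheme and whitespace tokenizer are assumptions.

```python
from collections import namedtuple

# Hypothetical stand-in with the same two fields a tagged document carries.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def tag_documents(raw_docs, labels=None):
    """Yield one TaggedDoc per document: a unique 'DOC_<n>' tag plus an optional label tag."""
    for n, text in enumerate(raw_docs):
        tags = ["DOC_%d" % n]
        if labels is not None:
            tags.append(labels[n])
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        yield TaggedDoc(words=text.lower().split(), tags=tags)


reviews = ["Great food", "Terrible service"]
tagged = list(tag_documents(reviews, labels=["pos", "neg"]))
```

Giving every document a unique tag yields one vector per document, while a shared label tag such as `pos` yields one vector per class.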
- Tool to get the most similar documents for word2vec
- Word Movers Distance for Yelp Reviews tutorial
- Document Classification using Bayesian Inversion and several word2vec models (one for each class)
- Deep Inverse Regression with Yelp Reviews
- Extract most important keywords and sentences from the text
- Tutorial on TextRank summarisation
- Tutorial showing API for document classification with various techniques: TF-IDF, word2vec averaging, Deep IR, Word Movers Distance and doc2vec
- Movie plots by genre
- Radim Řehůřek - Faster than Google? Optimization lessons in Python.
- MLMU.cz - Radim Řehůřek - Word2vec & friends (7.1.2015)
- Making an Impact with NLP -- PyCon 2016 Tutorial by Hobson Lane
- NLP with NLTK and Gensim -- PyCon 2016 Tutorial by Tony Ojeda, Benjamin Bengfort, Laura Lorenz from District Data Labs
- Word Embeddings for Fun and Profit -- Talk at PyData London 2016 by Lev Konstantinovskiy. See accompanying repo
Based on a wonderful resource by Jason Xie.