Topic Modeling
Sean Gallagher edited this page Sep 28, 2015 · 3 revisions
We're looking into using topic modeling for several aspects of our project in the upcoming months. Specifically, we think it can give us the following features:
- Search the document corpus using topic modeling similarity.
- Find the similarity between Q-A, A-P, and Q-P.
But besides those things, we are also simply exploring the topic space (bit of a pun).
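As a rough illustration of searching by topic similarity, the sketch below ranks documents by the cosine similarity between their topic-weight vectors and a query's topic-weight vector. Everything in it is hypothetical: the function names, the document ids, and the three-topic weights are made up for the example, and a real run would take the vectors from a trained LSA or LDA model rather than hard-coding them.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_topic_similarity(query_topics, doc_topics):
    """Return (score, doc_id) pairs, most similar document first."""
    scored = [(cosine(query_topics, vec), doc_id)
              for doc_id, vec in doc_topics.items()]
    return sorted(scored, reverse=True)


# Hypothetical topic weights over three topics (invented for illustration).
DOC_TOPICS = {
    "doc_politics": [0.8, 0.1, 0.1],
    "doc_sports": [0.1, 0.8, 0.1],
    "doc_science": [0.1, 0.1, 0.8],
}
QUERY_TOPICS = [0.7, 0.2, 0.1]

ranking = rank_by_topic_similarity(QUERY_TOPICS, DOC_TOPICS)
for score, doc_id in ranking:
    print(f"{doc_id}: {score:.3f}")
```

The same scoring could in principle be applied to Q-A, A-P, and Q-P pairs by treating each side of the pair as a topic vector.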
- Gensim methods
  - LSA, available on Gensim, search index
  - LDA, same
  - word2vec, same
- GloVe vectors: we use the Wikipedia 300-dimensional vectors (GWP300) and the Common Crawl 840B-token 300-dimensional vectors (GCC840)
- Composes Best Predict Vectors (CBP)
The following sanity checks use the equation format from scripts/gensim/analog.py:
- compare(w('king') - w('man') + w('woman'), w('queen'))
  - Should be high (e.g. > 0.6)
- queenish = w('king') - w('man') + w('woman'); compare(queenish, w('queen')) - compare(queenish, w('king'))
  - Should be positive (low is ok)
- compare(w('putin') - w('russia') + w('usa'), w('obama'))
  - Should be high
- potusish = w('putin') - w('russia') + w('usa'); compare(potusish, w('obama')) - compare(potusish, w('putin'))
  - Should be positive
- compare(w('democrat'), w('republican'))
  - Should be high
- compare(w('party'), w('republican')) - compare(w('party'), w('democrat'))
  - Should have low magnitude
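
To make the equation format concrete, here is a minimal sketch of how `w` and `compare` might behave. This is not the code in scripts/gensim/analog.py, which is not shown on this page. It assumes that `w` looks up a word vector supporting `+` and `-`, and that `compare` is cosine similarity. The `Vec` class and the 3-dimensional toy vectors are invented for illustration; real checks would use the LSA, word2vec, GloVe, or CBP vectors listed above.

```python
import math


class Vec:
    """Tiny vector wrapper so expressions like w('a') - w('b') work."""

    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        return Vec(x + y for x, y in zip(self.values, other.values))

    def __sub__(self, other):
        return Vec(x - y for x, y in zip(self.values, other.values))


# Toy vectors, made up for this example only.
TOY_VECTORS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}


def w(word):
    """Assumed behavior: return the vector for a word."""
    return Vec(TOY_VECTORS[word])


def compare(a, b):
    """Assumed behavior: cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a.values, b.values))
    norm = (math.sqrt(sum(x * x for x in a.values))
            * math.sqrt(sum(y * y for y in b.values)))
    return dot / norm if norm else 0.0


queenish = w("king") - w("man") + w("woman")
similarity = compare(queenish, w("queen"))
margin = compare(queenish, w("queen")) - compare(queenish, w("king"))
print(f"similarity to queen: {similarity:.3f}")
print(f"margin over king:    {margin:.3f}")
```

With these toy vectors the first check passes the 0.6 threshold and the margin is positive, which is the pattern the list above expects from a good embedding.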