Zero probabilities in LDA model #2418
Comments
Hi @piskvorky, apparently it was one of my team members.
It is indeed a combination of a large vocabulary and many topics: both 500 and 1000 topics suffer the problem. Our dict size is > 300K. We also use online updates in chunks of 100K documents, with a target total corpus size of 50M.
Hi @enys,
Quick answer: no.
Quite a stupid question: how can topic probabilities be all zeros if
Hi @horpto, Sorry for the late reply.
@enys, sorry for the late response,
Ping @enys, are you able to share a reproducible example? We'd like to get to the bottom of this.
I have this issue (zero probabilities for words in show_topics) only when using gensim.models.LdaMulticore. The output of gensim.models.ldamodel.LdaModel is as expected.
@davidalbertonogueira The same comments as above apply.
I'm sorry, but my current dataset is proprietary. I reckon I could try to create a small example that triggers the same error, but I would have to do it with publicly available data, and therefore there's no point in doing that myself. I'll share the dimensions in case it helps someone trying to replicate the error:
@davidalbertonogueira that seems different from the issue reported here, which had a huge (500k) vocabulary and lots of topics (1000). In your case, you have only 14k vocab + 10 topics. Likely unrelated; a separate issue.
Should I open a new issue then? @piskvorky
Only if you're able to include the reproducing example :) Otherwise there isn't much we'll be able to do anyway. Thanks.
I have far less experience than the other reporters (i.e. it could be something I'm doing wrong), but I'm seeing the same thing: one or more topics with near-zero probabilities, and the terms are usually alphabetically contiguous. My corpus is derived from the Yelp Dataset Challenge, licensed for academic use. I may be able to share the contents, but I'm unsure; I'll have to read the terms closely. However, my corpus is also very small and I'm using a small number of topics (10-100), so again, it could be something naive I'm doing. My code looks like this... the very low
The latest run covers 74310 documents with 100000 features. I then dump the topics to a text file (among other things), and my "empty" topic looks like this:
Here's the requested version output; sorry, I missed that:
Problem description
A user reported "empty" topics (all probabilities zero) during LdaModel training:
https://groups.google.com/forum/#!topic/gensim/LuPD2VSouSQ
Apparently some of the recent optimizations in #1656 (and maybe elsewhere?) introduced numeric instabilities.
Steps/code/corpus to reproduce
Unknown. Probably related to large data size: a large vocabulary combined with a large number of topics, leading to float32 under/overflows.
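To see why float32 could zero out an entire topic, here is a minimal stdlib-only sketch of single-precision underflow; the magnitudes are illustrative, not taken from the actual model:

```python
import struct

def to_float32(x):
    """Round-trip a Python float (IEEE-754 double) through single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

# float32's smallest subnormal is ~1.4e-45; anything smaller collapses to 0.0.
# With ~1000 topics over a 300K+ term vocabulary, per-term topic weights can
# plausibly drop into this range during training.
print(to_float32(1e-40))   # still nonzero (subnormal) in float32
print(to_float32(1e-50))   # 0.0: underflows in float32
print(1e-50)               # 1e-50: perfectly representable in float64
```

Once such weights hit exactly zero, renormalizing the topic cannot recover them, which would produce the "empty topic" symptom described above.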
The user reported that changing the dtype back to float64 helped and the "empty topics" problem went away.
Versions
Please provide the output of: