Identical topics #416
Well, the sampler is not guaranteed to converge :) And the perplexity was high and oscillating a lot. I'll post back if it works next time.
This seems to be due to the previous divide-by-zero error. It's not limited to ldamulticore either; it also occurs in ldamodel, when simply trying to model Wikipedia with 1000 topics.
I have been running further tests, and it occurs with 750 topics, but not 500, when using a 100,000-word vocabulary on the English Wikipedia.
I received your log, I'm on it. Sorry this is taking so long, Brian. We're moving countries and I've only had time for "trivial" open source fixes lately. Debugging this one looks more substantial :)
Oh, no worries, I'm not trying to rush you or anything. I didn't even realize they were the same bug at first.
Experiencing the same issue, but only when adjusting the eta prior.
@brianmingus Is this resolved? If not, could you please post the link to the log gist? Thanks
I doubt this is resolved - it won't be resolved by accident.
@brianmingus OK, could you please turn this into a more tractable bug report?
This is a serious bug in gensim where it fails to converge when there are a certain number of topics. I think this bug is sufficiently spec'd out - @piskvorky seems to grok it.
I got the same bug when I set topics=1000, and I solved the problem by setting the parameters alpha=50/topic_num, eta=0.1, and iterations=500.
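In gensim terms, that workaround looks roughly like this (a minimal sketch, assuming gensim's LdaModel parameter names; corpus and dictionary are placeholders for your own data):

from gensim.models import LdaModel

topic_num = 1000
lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=topic_num,
    alpha=50 / topic_num,  # heuristic symmetric document-topic prior, scaled with topic count
    eta=0.1,               # a larger topic-word prior keeps probabilities away from zero
    iterations=500,
)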
@brianmingus @ocsponge Please attach concrete code and a dataset for reproducing your problem.
You do not "need info" for this bug. It is sufficiently spec'd out. Please stop asking for more info.
@brianmingus I don't agree with you, because I can't reproduce it now; that is why I asked for additional information (code and dataset).
I provided enough info to replicate; @piskvorky did not ask for more info. If you are interested in working on this ticket, the appropriate steps are to check out gensim from the date the ticket was posted, and a current one. If you can replicate on the old one but not the new one, it's fixed.
@menshikh-iv, @tmylk, @piskvorky, I'm having the same issue and am including my dataset, dictionary, and code. This is a corpus pulled from Project Gutenberg, split into 3.5M documents, using a rather clipped vocabulary of ~66,000 words. I did not have problems with a 400-topic version, but I did run into issues with 1000 topics. The dataset is 2GB zipped and can be downloaded from Google Drive; the dictionary and repro code are attached as zips. When I run the code, I get a numerical value for topic diff in the first tranche of documents, but later I get topic diff=inf. Here is the logging information:
Based on @ocsponge's post, I tried modifying alpha and eta. I have found that in my case the problem goes away if I set eta=0.01 but persists if I set eta=0.001. With 2000 topics, default alpha, and eta=0.01, my topics were converging fine.
Thank you very much @TC-Rudel for the additional information; the problem can now be reproduced.
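For anyone else wanting to try it, a hypothetical sketch of a repro script along the lines described above (the real dictionary, corpus, and code are the attachments; the file names here are placeholders):

import logging
from gensim.corpora import Dictionary, MmCorpus
from gensim.models import LdaMulticore

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

dictionary = Dictionary.load('gutenberg.dict')  # ~66,000-word vocabulary
corpus = MmCorpus('gutenberg.mm')               # ~3.5M documents

# 400 topics reportedly trains cleanly; with 1000 topics the log
# eventually shows "topic diff=inf".
lda = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=1000, workers=4)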
Are there any updates on this issue?
@stevemarin not yet
Same here, I am getting "topic diff=inf" in the log after the second merge (running multicore). What does "topic diff=inf" actually mean, and what are the potential causes? It would be good to understand this better in order to come up with strategies for avoiding it. Previous comments mentioned changing the number of topics, eta, or maybe alpha or the number of iterations, but I do not understand how those settings are related to the topic diff. Could the vocabulary size have an influence?
This means that an overflow happens somewhere (typically a division by an "almost-zero" value) -> the model breaks (produces inf/nan values). Related issue: #2115
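A plain NumPy illustration of that failure mode (not gensim code): a product that underflows to zero in float32 turns a division into inf, and the inf then propagates as nan through later arithmetic.

import numpy as np

tiny = np.float32(1e-30)
print(tiny * tiny)                   # 0.0 -- the product underflows in float32
x = np.float32(1.0) / (tiny * tiny)  # inf -- division by an "almost-zero" value
print(x)
print(x - x)                         # nan -- inf poisons everything downstream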
@johann-petrak we applied a "workaround" for this, see #2308; hope that helps
@menshikh-iv I don't think that "workaround" will solve this problem. I've had the same problem even after my patch. I can try to explore this a bit later.
This issue is caused by the width of the dtype. First of all, I got a warning on
It's not obvious at first glance (of course, everyone thinks that
The default dtype of LDA is np.float32. After I changed it to np.float64, the problem disappeared.
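For reference, recent gensim versions expose this as a constructor parameter, so the float64 fix can be tried without patching the library (a sketch; corpus and dictionary assumed defined):

import numpy as np
from gensim.models import LdaModel

# The default dtype is np.float32; float64 gives ~16 significant digits
# instead of ~7, at the cost of roughly doubling the parameter matrices' RAM.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=1000, dtype=np.float64)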
@horpto do you see a way to use float64 precision only where needed (internal calculations), but keep the big parameter matrices in float32 (less RAM)? IIRC the only reason for the float32 default was to save memory.
@piskvorky I guess we can change
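One possible shape of such a change (a hypothetical sketch only, not gensim's actual code): keep the big parameter matrix in float32, but upcast to a temporary float64 copy for the numerically sensitive step.

import numpy as np

def stable_normalize(row_f32):
    # Work in float64 so the sum cannot underflow to an "almost-zero"
    # value the way it can in float32.
    row = row_f32.astype(np.float64)
    row /= row.sum()
    return row.astype(np.float32)  # store back in float32 to save RAM

row = np.random.gamma(100.0, 1.0 / 100.0, 66000).astype(np.float32)
row = stable_normalize(row)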
Can you suggest what the value of topic diff should be in general?
Then is there any issue with np.float16? When I changed to np.float16, I got the same thing as with np.float32.
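np.float16 is unlikely to help here: its range is far narrower than float32's, so under- and overflow happen even sooner. A quick NumPy check:

import numpy as np

print(np.finfo(np.float16).max)               # 65500.0 -- largest float16
print(np.float16(300.0) * np.float16(300.0))  # inf -- 90000 overflows immediately
print(np.float16(0.01) * np.float16(0.001))   # ~1e-05, already below float16's smallest normal value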
This doesn't seem right. LDA training on enwiki with 1000 topics (unmodified gensim):
2015-08-02 12:09:07,550 : INFO : merging changes from 3750 documents into a model of 3831719 documents
2015-08-02 12:09:35,378 : INFO : topic #938 (0.001): 0.037*census + 0.034*population + 0.027*unincorporated + 0.020*community + 0.017*households + 0.016*landmarks + 0.016*$
2015-08-02 12:09:35,522 : INFO : topic #986 (0.001): 0.015*festival + 0.014*films + 0.013*documentary + 0.010*director + 0.009*award + 0.008*directed + 0.008*producer + 0.$
2015-08-02 12:09:35,666 : INFO : topic #492 (0.001): 0.066*kaunas + 0.048*davidson + 0.037*rosenberg + 0.034*kalamazoo + 0.026*blood + 0.024*sha + 0.023*thorpe + 0.022*vei$
2015-08-02 12:09:35,811 : INFO : topic #392 (0.001): 0.018*laser + 0.016*tucker + 0.015*optical + 0.014*forensic + 0.012*imaging + 0.011*pulse + 0.011*lab + 0.009*sample +$
2015-08-02 12:09:35,954 : INFO : topic #890 (0.001): 0.126*dutch + 0.116*van + 0.071*netherlands + 0.069*amsterdam + 0.034*holland + 0.027*hague + 0.022*der + 0.021*willem$
2015-08-02 12:09:36,098 : INFO : topic #769 (0.001): 0.064*icf + 0.053*cove + 0.050*newfoundland + 0.043*vancouver + 0.041*nunataks + 0.036*columbia + 0.030*labrador + 0.0$
2015-08-02 12:09:36,242 : INFO : topic #75 (0.001): 0.043*dong + 0.042*xu + 0.042*yi + 0.025*narayana + 0.024*tao + 0.023*bingham + 0.023*fei + 0.020*parr + 0.020*ren + 0.$
2015-08-02 12:09:36,386 : INFO : topic #742 (0.001): 0.040*peters + 0.031*leith + 0.030*kahn + 0.028*levy + 0.028*bart + 0.022*hedley + 0.019*bandit + 0.018*robyn + 0.017*$
2015-08-02 12:09:36,529 : INFO : topic #438 (0.001): 0.035*editor + 0.035*newspaper + 0.034*magazine + 0.021*published + 0.018*news + 0.016*daily + 0.014*journalism + 0.01$
2015-08-02 12:09:36,673 : INFO : topic #410 (0.001): 0.046*forest + 0.030*reserve + 0.028*forests + 0.024*species + 0.023*conservation + 0.020*habitat + 0.016*moist + 0.01$
2015-08-02 12:09:36,816 : INFO : topic #322 (0.001): 0.000*jawbone + 0.000*antiochus + 0.000*ddr + 0.000*gault + 0.000*noon + 0.000*fahey + 0.000*toth + 0.000*toto + 0.000$
2015-08-02 12:09:36,960 : INFO : topic #407 (0.001): 0.000*jawbone + 0.000*antiochus + 0.000*ddr + 0.000*gault + 0.000*noon + 0.000*fahey + 0.000*toth + 0.000*toto + 0.000$
2015-08-02 12:09:37,103 : INFO : topic #808 (0.001): 0.091*sf + 0.067*jensen + 0.066*isaac + 0.056*slater + 0.047*informatics + 0.045*hospice + 0.045*rot + 0.042*koblenz +$
2015-08-02 12:09:37,248 : INFO : topic #282 (0.001): 0.000*jawbone + 0.000*antiochus + 0.000*ddr + 0.000*gault + 0.000*noon + 0.000*fahey + 0.000*toth + 0.000*toto + 0.000$
2015-08-02 12:09:37,391 : INFO : topic #894 (0.001): 0.000*jawbone + 0.000*antiochus + 0.000*ddr + 0.000*gault + 0.000*noon + 0.000*fahey + 0.000*toth + 0.000*toto + 0.000$
2015-08-02 12:09:37,606 : INFO : topic diff=inf, rho=0.008998
2015-08-02 12:09:37,902 : INFO : PROGRESS: pass 0, dispatched chunk #12366 = documents up to #3091750/3831719, outstanding queue size 3
2015-08-02 12:09:55,582 : INFO : PROGRESS: pass 0, dispatched chunk #12367 = documents up to #3092000/3831719, outstanding queue size 2
2015-08-02 12:10:03,008 : INFO : PROGRESS: pass 0, dispatched chunk #12368 = documents up to #3092250/3831719, outstanding queue size 3
2015-08-02 12:10:17,426 : INFO : PROGRESS: pass 0, dispatched chunk #12369 = documents up to #3092500/3831719, outstanding queue size 3