Identical topics #416
Well, the sampler is not guaranteed to converge :) And the perplexity was high and oscillating a lot. I'll post back if it works next time.
This seems to be due to the previous divide-by-zero error. It's not limited to ldamulticore either; it also occurs in ldamodel, when simply trying to model Wikipedia with 1000 topics.
I have been running further tests, and it occurs with 750 topics, but not 500, when using a 100,000-word vocabulary on the English Wikipedia.
I received your log, I'm on it. Sorry this is taking so long, Brian. We're moving countries and I've only had time for "trivial" open source fixes lately. Debugging this one looks more substantial :)
Oh, no worries, I'm not trying to rush you or anything. I didn't even realize they were the same bug at first.
Experiencing the same issue, but only when adjusting the eta prior.
@brianmingus Is this resolved? If not, could you please post the link to the log gist? Thanks
I doubt this is resolved - it won't be resolved by accident.
@brianmingus OK, could you please turn this into a more tractable bug report?
This is a serious bug in gensim where it fails to converge when there are a certain number of topics. I think this bug is sufficiently spec'd out - @piskvorky seems to grok it.
I got the same bug when I set topics=1000, and I solved the problem by setting the parameters alpha=50/topic_num, eta=0.1, and iterations=500.
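In gensim terms, that workaround looks roughly like this (a minimal sketch, assuming gensim's LdaModel parameter names; corpus and dictionary are placeholders for your own data):

from gensim.models import LdaModel

topic_num = 1000
lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=topic_num,
    alpha=50 / topic_num,  # heuristic symmetric document-topic prior, scaled with topic count
    eta=0.1,               # a larger topic-word prior keeps probabilities away from zero
    iterations=500,
)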
@brianmingus @ocsponge Please attach concrete code and a dataset for reproducing your problem.
You do not "need info" for this bug. It is sufficiently spec'd out. Please stop asking for more info.
@brianmingus I don't agree with you, because I can't reproduce it now; that is why I asked for additional information (code and dataset).
I provided enough info to replicate; @piskvorky did not ask for more info. If you are interested in working on this ticket, the appropriate steps are to check out gensim from the date the ticket was posted, and a current one. If you can replicate on the old one but not the new one, it's fixed.
@menshikh-iv, @tmylk, @piskvorky, I'm having the same issue and am including my dataset, dictionary, and code. This is a corpus pulled from Project Gutenberg, split into 3.5M documents, using a rather clipped vocabulary of ~66,000 words. I did not have problems with a 400-topic version, but I did run into issues with 1000 topics. The dataset is 2GB zipped and can be downloaded from Google Drive; the dictionary and repro code are attached as zips. When I run the code, I get a numerical value for topic diff in the first tranche of documents, but later I get topic diff=inf. Here is the logging information:
Based on @ocsponge's post, I tried modifying alpha and eta. I have found that in my case the problem goes away if I set eta=0.01 but persists if I set eta=0.001. With 2000 topics, default alpha, and eta=0.01, my topics were converging fine.
Thank you very much @TC-Rudel for the additional information; the problem can now be reproduced.
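For anyone else wanting to try it, a hypothetical sketch of a repro script along the lines described above (the real dictionary, corpus, and code are the attachments; the file names here are placeholders):

import logging
from gensim.corpora import Dictionary, MmCorpus
from gensim.models import LdaMulticore

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

dictionary = Dictionary.load('gutenberg.dict')  # ~66,000-word vocabulary
corpus = MmCorpus('gutenberg.mm')               # ~3.5M documents

# 400 topics reportedly trains cleanly; with 1000 topics the log
# eventually shows "topic diff=inf".
lda = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=1000, workers=4)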
Are there any updates on this issue?
@stevemarin not yet
Same here, I am getting "topic diff=inf" in the log after the second merge (running multicore). What does "topic diff=inf" actually mean, and what are the potential causes? It would be good to understand this better in order to come up with strategies for avoiding it. Previous comments mentioned changing the number of topics, eta, or maybe alpha or the number of iterations, but I do not understand how those settings are related to the topic diff. Could the vocabulary size have an influence?
This means that an overflow happens somewhere (typically a division by an "almost-zero" value) -> the model breaks (produces inf/nan values). Related issue: #2115
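A plain NumPy illustration of that failure mode (not gensim code): a product that underflows to zero in float32 turns a division into inf, and the inf then propagates as nan through later arithmetic.

import numpy as np

tiny = np.float32(1e-30)
print(tiny * tiny)                   # 0.0 -- the product underflows in float32
x = np.float32(1.0) / (tiny * tiny)  # inf -- division by an "almost-zero" value
print(x)
print(x - x)                         # nan -- inf poisons everything downstream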
@johann-petrak we applied a "workaround" for this, see #2308; hope that helps
@menshikh-iv I don't think that "workaround" will solve this problem. I've had the same problem even after my patch. I can try to explore this a bit later.
This issue is caused by the width of the dtype. First of all, I got a warning on
It's not obvious at first glance (of course, everyone thinks that
The default dtype of LDA is np.float32. After I changed it to np.float64, the problem disappeared.
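For reference, recent gensim versions expose this as a constructor parameter, so the float64 fix can be tried without patching the library (a sketch; corpus and dictionary assumed defined):

import numpy as np
from gensim.models import LdaModel

# The default dtype is np.float32; float64 gives ~16 significant digits
# instead of ~7, at the cost of roughly doubling the parameter matrices' RAM.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=1000, dtype=np.float64)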
@horpto do you see a way to use float64 precision only where needed (internal calculations), but keep the big parameter matrices in float32 (less RAM)? IIRC the only reason for the float32 default was to save memory.
@piskvorky I guess we can change
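One possible shape of such a change (a hypothetical sketch only, not gensim's actual code): keep the big parameter matrix in float32, but upcast to a temporary float64 copy for the numerically sensitive step.

import numpy as np

def stable_normalize(row_f32):
    # Work in float64 so the sum cannot underflow to an "almost-zero"
    # value the way it can in float32.
    row = row_f32.astype(np.float64)
    row /= row.sum()
    return row.astype(np.float32)  # store back in float32 to save RAM

row = np.random.gamma(100.0, 1.0 / 100.0, 66000).astype(np.float32)
row = stable_normalize(row)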
Can you suggest what the value of topic diff should be in general?
Then is there any issue with np.float16? When I changed to np.float16, I got the same thing as with np.float32.
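np.float16 is unlikely to help here: its range is far narrower than float32's, so under- and overflow happen even sooner. A quick NumPy check:

import numpy as np

print(np.finfo(np.float16).max)               # 65500.0 -- largest float16
print(np.float16(300.0) * np.float16(300.0))  # inf -- 90000 overflows immediately
print(np.float16(0.01) * np.float16(0.001))   # ~1e-05, already below float16's smallest normal value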
This doesn't seem right. LDA training on enwiki with 1000 topics (unmodified gensim):
2015-08-02 12:09:07,550 : INFO : merging changes from 3750 documents into a model of 3831719 documents
2015-08-02 12:09:35,378 : INFO : topic #938 (0.001): 0.037*census + 0.034*population + 0.027*unincorporated + 0.020*community + 0.017*households + 0.016*landmarks + 0.016*$
2015-08-02 12:09:35,522 : INFO : topic #986 (0.001): 0.015*festival + 0.014*films + 0.013*documentary + 0.010*director + 0.009*award + 0.008*directed + 0.008*producer + 0.$
2015-08-02 12:09:35,666 : INFO : topic #492 (0.001): 0.066*kaunas + 0.048*davidson + 0.037*rosenberg + 0.034*kalamazoo + 0.026*blood + 0.024*sha + 0.023*thorpe + 0.022*vei$
2015-08-02 12:09:35,811 : INFO : topic #392 (0.001): 0.018*laser + 0.016*tucker + 0.015*optical + 0.014*forensic + 0.012*imaging + 0.011*pulse + 0.011*lab + 0.009*sample +$
2015-08-02 12:09:35,954 : INFO : topic #890 (0.001): 0.126*dutch + 0.116*van + 0.071*netherlands + 0.069*amsterdam + 0.034*holland + 0.027*hague + 0.022*der + 0.021*willem$
2015-08-02 12:09:36,098 : INFO : topic #769 (0.001): 0.064*icf + 0.053*cove + 0.050*newfoundland + 0.043*vancouver + 0.041*nunataks + 0.036*columbia + 0.030*labrador + 0.0$
2015-08-02 12:09:36,242 : INFO : topic #75 (0.001): 0.043*dong + 0.042*xu + 0.042*yi + 0.025*narayana + 0.024*tao + 0.023*bingham + 0.023*fei + 0.020*parr + 0.020*ren + 0.$
2015-08-02 12:09:36,386 : INFO : topic #742 (0.001): 0.040*peters + 0.031*leith + 0.030*kahn + 0.028*levy + 0.028*bart + 0.022*hedley + 0.019*bandit + 0.018*robyn + 0.017*$
2015-08-02 12:09:36,529 : INFO : topic #438 (0.001): 0.035*editor + 0.035*newspaper + 0.034*magazine + 0.021*published + 0.018*news + 0.016*daily + 0.014*journalism + 0.01$
2015-08-02 12:09:36,673 : INFO : topic #410 (0.001): 0.046*forest + 0.030*reserve + 0.028*forests + 0.024*species + 0.023*conservation + 0.020*habitat + 0.016*moist + 0.01$
2015-08-02 12:09:36,816 : INFO : topic #322 (0.001): 0.000*jawbone + 0.000*antiochus + 0.000*ddr + 0.000*gault + 0.000*noon + 0.000*fahey + 0.000*toth + 0.000*toto + 0.000$
2015-08-02 12:09:36,960 : INFO : topic #407 (0.001): 0.000*jawbone + 0.000*antiochus + 0.000*ddr + 0.000*gault + 0.000*noon + 0.000*fahey + 0.000*toth + 0.000*toto + 0.000$
2015-08-02 12:09:37,103 : INFO : topic #808 (0.001): 0.091*sf + 0.067*jensen + 0.066*isaac + 0.056*slater + 0.047*informatics + 0.045*hospice + 0.045*rot + 0.042*koblenz +$
2015-08-02 12:09:37,248 : INFO : topic #282 (0.001): 0.000*jawbone + 0.000*antiochus + 0.000*ddr + 0.000*gault + 0.000*noon + 0.000*fahey + 0.000*toth + 0.000*toto + 0.000$
2015-08-02 12:09:37,391 : INFO : topic #894 (0.001): 0.000*jawbone + 0.000*antiochus + 0.000*ddr + 0.000*gault + 0.000*noon + 0.000*fahey + 0.000*toth + 0.000*toto + 0.000$
2015-08-02 12:09:37,606 : INFO : topic diff=inf, rho=0.008998
2015-08-02 12:09:37,902 : INFO : PROGRESS: pass 0, dispatched chunk #12366 = documents up to #3091750/3831719, outstanding queue size 3
2015-08-02 12:09:55,582 : INFO : PROGRESS: pass 0, dispatched chunk #12367 = documents up to #3092000/3831719, outstanding queue size 2
2015-08-02 12:10:03,008 : INFO : PROGRESS: pass 0, dispatched chunk #12368 = documents up to #3092250/3831719, outstanding queue size 3
2015-08-02 12:10:17,426 : INFO : PROGRESS: pass 0, dispatched chunk #12369 = documents up to #3092500/3831719, outstanding queue size 3