different Phrases behavior for same code/corpus between 0.13.4.1 and 2.1.0 #1401

Closed
jeremybmerrill opened this issue Jun 8, 2017 · 30 comments · Fixed by #1853
Assignees
Labels
bug (Issue described a bug) · difficulty medium (Medium issue: required good gensim understanding & python skills)

Comments

@jeremybmerrill

Description

Hi friends -- I've been using gensim 0.13.4.1 to generate bigrams and trigrams (Phrases model), then to train a word2vec model on a stream of sentences run through the trigrams phrase model. Works great.
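
For context, a minimal sketch of the kind of pipeline described above, as it behaves in 0.13.4.1 (and again after the fix discussed later in this thread). The MySentences class and the corpus filename are illustrative, not from the original script; the 2.x-era size parameter is used to match the rest of the thread:

  import gensim

  class MySentences(object):
    """Restartable corpus: yields one tokenized sentence (list of str) per line."""
    def __init__(self, filename):
      self.filename = filename

    def __iter__(self):
      with open(self.filename) as f:
        for line in f:
          yield line.split()

  sentences = MySentences('corpus.txt')                 # illustrative filename
  bigrams = gensim.models.Phrases(sentences)            # learn bigram collocations
  trigrams = gensim.models.Phrases(bigrams[sentences])  # learn trigrams on the bigrammed stream

  # train word2vec on the phrase-combined stream of sentences
  model = gensim.models.Word2Vec(trigrams[bigrams[sentences]],
                                 workers=4, min_count=5, size=100)

  print(model.most_similar_cosmul(positive=['woman', 'king'], negative=['man']))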

But I've tried to upgrade to 1.0.0 and 2.1.0 and the results are way way worse -- with the exact same code. I've scoured the release notes but haven't found an explanation. Don't see deprecation warnings in 0.13.4.1 either.

I should clarify -- on gensim 2.1.0, everything "works", i.e. there are no exceptions. It's just that the answers it gives for similarity are incorrect. The classic man : king :: woman : __ analogy gives "chernow" in the 2.1.0-trained model, but "queen" (the expected answer) in the 0.13.4.1-trained model.

I'm not an expert on any of this, so this could be a silly mistake. (I posted about this in the gitter channel and was directed to open an issue.)

Steps/Code/Corpus to Reproduce

Snippet:

  model = gensim.models.Word2Vec(
                                  ngrams_models[ngrams_model](sentences) if ngrams_model else sentences, 
                                  workers=4, 
                                  min_count=min_count, 
                                  size=size, 
                                  sample=downsampling, 
                                  # sg=(1 if use_skipgrams else 0)
                                )

  model.init_sims(replace=True)
  try:
    model_name = "model_%s_%s_%s_min_count_%s_size_%s_downsampling_%s_%s.bin" % (sentences_filename.split("/")[-1].split(".")[0], "stemmed" if stemming else "raw_words", ngrams_model, min_count, size, downsampling, "sg" if use_skipgrams else "cbow")
  except:
    model_name = "model.bin"
  model.save(model_name)

  with open("most_recent_model_filename.txt", "w") as f:
    f.write(model_name)

  print(model.most_similar_cosmul(positive=['woman', 'king'], negative=['man']))

FULL training script (slightly sanitized): https://github.com/jeremybmerrill/ineffable_abracadabra_of_cleaning/blob/master/train_in_vain.py. The only change needed to get the different results is editing requirements.txt to specify gensim 2.1.0 and re-running pip install -r requirements.txt

Expected Results

print(model.most_similar_cosmul(positive=['woman', 'king'], negative=['man']))
should have queen as the first answer, e.g.
[(u'queen', 0.8925966620445251), (u'royal', 0.8144874572753906), (u'wales', 0.7882706522941589), (u'kingston_ontario', 0.7752816677093506), (u'helen_mirren', 0.7718634009361267), (u'bismarck', 0.7668762803077698), (u'princess', 0.7609482407569885), (u'queen_elizabeth_ii', 0.76094651222229), (u'prince_philip', 0.7592762112617493), (u'duchess', 0.7587381601333618)]

This is the result from 0.13.4.1.

Actual Results

gives chernow as the first answer and a bunch of other irrelevant stuff, ("chernow",0.8686840534210205),("anxiety_disorders",0.8328080177307129),("marlinga",0.8300415277481079), ...

(This is with a model trained in 2.1.0 and calling model.most_similar_cosmul in 2.1.0. I haven't tested what happens using a model trained in 0.13.4.1, but loading the model and calling model.most_similar_cosmul in 2.1.0.)

Versions

Linux-4.12.0-999-generic-x86_64-with-Ubuntu-17.04-zesty
>>> import sys; print("Python", sys.version)
('Python', '2.7.13 (default, Jan 19 2017, 14:48:08) \n[GCC 6.3.0 20170118]')
>>> import numpy; print("NumPy", numpy.__version__)
('NumPy', '1.12.1')
>>> import scipy; print("SciPy", scipy.__version__)
('SciPy', '0.19.0')
>>> import gensim; print("gensim", gensim.__version__)
('gensim', '2.1.0')
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
('FAST_VERSION', 1)

I appreciate any help y'all can offer! (Great software, love it, so much fun, :D etc. )

@gojomo
Collaborator

gojomo commented Jun 9, 2017

Certainly a concern and maybe a worrisome regression.

Without yet looking deeply into specifics of your process, some things worth checking:

  • do you see a similar discrepancy with plain most_similar() (as opposed to most_similar_cosmul())?
  • with logging on (say at INFO level), is there anything different/fishy about the reporting/progress-stats/timing when running the 2.1.0 variant – perhaps suggesting less training is happening, or that the same corpus is being interpreted as a different size? (A logging/comparison sketch follows this list.)
  • since you are able to see the error with a unigram test, what if you eliminate all the Phrases setup – do you see the same drop-off in results quality?
  • if it's easy to re-try your process changing just the version of gensim, do you see the same problem in intervening versions? (Especially: 1.0.1 and 2.0.0.)
  • it would be interesting to test the variant you parenthetically mention: whether a model trained in 0.13.4.1, save()d, but then load()ed in 2.1.0 shows the better or the worse results
  • I don't believe there have been any significant changes in defaults between those versions (though there were earlier); still, to be certain you may want to log the actual in-model values used, for each of the things that can be specified in the model initialization (Word2Vec.__init__() parameters as copied into model properties)
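
A sketch of the first two checks, assuming the standard logging setup used in gensim examples, with model being the trained model from the snippet above:

  import logging
  logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                      level=logging.INFO)

  # ...train the model as in the snippet above, watching the reported corpus
  # size and training progress in the log output, then compare both queries:
  print(model.most_similar(positive=['woman', 'king'], negative=['man'], topn=5))
  print(model.most_similar_cosmul(positive=['woman', 'king'], negative=['man'], topn=5))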

@jeremybmerrill
Author

jeremybmerrill commented Jun 9, 2017

Hi @gojomo; thanks.

it would be interesting to test the variant you parenthetically mention: if a model trained in 0.13.4.1, save()d, but then load()ed in 2.0.1 shows the better or worse results

This one is easy to check: loading a model in 2.1.0 that I saved in 0.13.4.1 gives the "right" answers, i.e. "queen", not "chernow".

I will check on the rest of the stuff.

@jeremybmerrill
Author

with logging on (say at INFO level), is there anything different/fishy about the reporting/progress-stats/timing when running the 2.1.0 variant – perhaps suggesting less training is happening, or that the same corpus is being interpreted as a different size?

Yes, it turns out.

2017-06-09 09:09:36,280 : WARNING : train() called with an empty iterator (if not intended, be sure to provide a corpus that offers restartable iteration = an iterable).

2017-06-09 09:09:36,282 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
2017-06-09 09:09:36,282 : WARNING : supplied example count (0) did not equal expected count (27454375)

Looks like quite the hint. I train the model like this (which presumably calls train() internally):

  model = gensim.models.Word2Vec(
                                  trigrams_model[bigrams_model[sentences]], 
                                  workers=4, 
                                  min_count=min_count, 
                                  size=size, 
                                  sample=downsampling, 
                                  # sg=(1 if use_skipgrams else 0)
                                )

I'm trying right now to train it without the ngram stuff, just the plain sentences iterator.
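
Not gensim code, just a toy illustration of the restartable-iteration requirement the first warning refers to: a generator (or any one-shot iterator) is consumed by the vocabulary-building pass, so the training pass that follows sees an empty iterator.

  def one_shot():
    # A generator is a one-shot iterator.
    for sent in [['the', 'quick', 'fox'], ['the', 'lazy', 'dog']]:
      yield sent

  gen = one_shot()
  print(sum(1 for _ in gen))   # 2 -- first pass (think: vocab scan) consumes it
  print(sum(1 for _ in gen))   # 0 -- second pass (think: training) is empty

  class Restartable(object):
    # A restartable iterable returns a fresh iterator from every __iter__() call,
    # which is what Word2Vec needs for its vocab scan plus training epochs.
    def __iter__(self):
      for sent in [['the', 'quick', 'fox'], ['the', 'lazy', 'dog']]:
        yield sent

  corpus = Restartable()
  print(sum(1 for _ in corpus), sum(1 for _ in corpus))   # 2 2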

@jeremybmerrill
Author

You are right about the ngrams -- removing that and throwing in the plain sentences iterator on 2.1.0 works fine.

Any idea what might have caused the change?

@gojomo
Collaborator

gojomo commented Jun 9, 2017

This is helpful narrowing. It might be a Phrases regression; we'll have to take a look at recent changes there, especially around 34759be. @tmylk @ELind77

A practical workaround could be to apply the (two-level) phrase-ification just once, writing the results to a new file – which is then iterated over multiple times as a simple text corpus. (This could offer performance benefits too - the repeated phrase-promotion on each training iteration can be expensive duplicate work.)
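
A sketch of that workaround, reusing the illustrative names from the snippets above (sentences, bigrams, trigrams; the output filename is hypothetical): phrase-ify once, persist the result, then train from the plain text file via LineSentence, which streams from disk and is restartable.

  import gensim
  from gensim.models.word2vec import LineSentence

  # `sentences`, `bigrams`, `trigrams` as in the earlier pipeline sketch.
  with open('phrased_corpus.txt', 'w') as out:
    for sent in sentences:
      out.write(' '.join(trigrams[bigrams[sent]]) + '\n')

  phrased = LineSentence('phrased_corpus.txt')   # one sentence per line, restartable
  model = gensim.models.Word2Vec(phrased, workers=4, min_count=5, size=100)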

@piskvorky
Owner

piskvorky commented Jun 10, 2017

Thanks a lot @jeremybmerrill ! And @gojomo .

Whatever the root cause is, it implies our tests are insufficient, if they allow something like this to slip through.

@jeremybmerrill
Author

jeremybmerrill commented Jun 18, 2017

Thanks @gojomo for that tip. I load the bigrams/trigrams models, then iterate through my raw corpus, saving the resulting phrase-ified tokens to a new clean file (space-delimited tokens, basically). Then I train w2v on that new clean file.

Now it works as expected (i.e. producing "queen" from "man : king :: woman :_____").

Oddly though, it took about 3 hours to train everything, compared to about an hour before. Not a huge problem, but still odd. Is there a quicker way than this that you were envisioning to write the output of the phrase-ification to disk?

  sentences = SentencesToPhraseify(raw_sentences_filename) # this runs the sentences through the `trigrams[bigrams[sentence]]` thing
  with open(ngrammed_sentences_filename, 'a') as f:
    for s in sentences:
      f.write(' '.join(s) + '\n')  # newline so each sentence lands on its own line

@gojomo
Collaborator

gojomo commented Jun 18, 2017

While the initial writing of the tokenized/phrase-combined corpus is an extra time-consuming step, I would expect each individual followup Word2Vec pass to be a little faster – because the computationally-expensive phrase-upgrading isn't being repeated each time, in a single bottleneck thread.

Can you track the timing of each individual step, by either watching logging or adding time/duration printouts, to see for sure if anything has slowed? (And, to be sure that in some way the 'faster' path isn't just inadvertently doing less, such as skipping training entirely because a non-restartable iterator only allows the vocab-discovery pass, not multiple training passes?)

That is roughly the way I'd write a tokenized corpus to disk. I would however tend to use the smart_open() utility method as used in gensim code/examples, and a filename that indicates compression is desired (ends in .gz or .bz2) so that smart_open() automatically compresses. For typical spinning-HD speeds, compression often saves more in IO time than it costs in CPU, so this step might also improve later training iteration speed. (It may have less of an impact if already using an SSD.)
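
A sketch of the compressed variant. The standard-library gzip is used here to keep the example dependency-free; smart_open, as suggested, would pick up compression from the .gz extension automatically and also handle remote paths. (Python 3 text mode shown; on the Python 2.7 setup in this thread, open with 'wb' and write encoded bytes instead.)

  import gzip
  from gensim.models.word2vec import LineSentence

  # `sentences` is the phrase-ified stream, as in the earlier snippets.
  with gzip.open('phrased_corpus.txt.gz', 'wt') as out:
    for s in sentences:
      out.write(' '.join(s) + '\n')

  # gensim's corpus readers open .gz/.bz2 files transparently, so the training
  # code stays the same apart from the filename.
  phrased = LineSentence('phrased_corpus.txt.gz')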

@jeremybmerrill
Author

Yes, it works now. Just took 16 minutes to train, with the pre-written phraseified corpus. And the results are as expected. Thanks for the suggestion!

I'll try re-doing the writing step with timers.

@gojomo
Collaborator

gojomo commented Jun 21, 2017

So the remaining issue here is that something in the Phrases behavior changed between 0.13.4.1 and 2.1.0, in a way that caused your otherwise working (and proper-seeming) code to stop working. Ideally we'll find a tiny test case to exhibit the difference. Leaving this issue open, with an updated title, for finding/fixing that issue.

@gojomo gojomo changed the title different analogy results for same code/corpus between 0.13.4.1 and 2.1.0 different Phrases behavior for same code/corpus between 0.13.4.1 and 2.1.0 Jun 21, 2017
@piskvorky
Owner

piskvorky commented Jun 22, 2017

CC @prakhar2b @menshikh-iv @jayantj -- let's make sure we don't transfer this newly introduced bug into the "optimized" Phrases too.

@jayantj
Contributor

jayantj commented Jun 22, 2017

@piskvorky okay, we'll make sure.

@ELind77
Contributor

ELind77 commented Jul 23, 2017

@gojomo Just to be clear, did I break something? It seems like I didn't, but I want to make sure so that I can fix it if I need to.

@gojomo
Collaborator

gojomo commented Jul 24, 2017

@ELind77 I'm not sure, but the evidence above makes it seem like something changed in Phrases-related behavior breaking @jeremybmerrill's code – and we should figure out for sure what explains the reported symptoms.

@gojomo
Collaborator

gojomo commented Aug 31, 2017

@piskvorky piskvorky added the bug Issue described a bug label Sep 1, 2017
@menshikh-iv menshikh-iv added the difficulty medium Medium issue: required good gensim understanding & python skills label Oct 2, 2017
@sj29-innovate
Contributor

@menshikh-iv As per your suggestion, I would like to take this up; can you provide some starting points? Thanks a lot.

@menshikh-iv
Contributor

menshikh-iv commented Jan 12, 2018

@sj29-innovate

  1. Carefully read the bug reports (all the links are here).
  2. Reproduce this bug (compare how it works in 0.13.4.1 and 3.2.0).
  3. Look into "candidates for regressions", for example, different Phrases behavior for same code/corpus between 0.13.4.1 and 2.1.0 #1401 (comment).
  4. Find out why this problem happens (and how it happens).
  5. Fix this bug & submit a PR; don't forget to add tests for this case (to prevent regression in the future).

@sj29-innovate
Contributor

@menshikh-iv I read the bug reports and tried to reproduce the error by comparing both versions, but I am getting unexpected results for both of them. Can you please suggest what corpus I should use for training the word2vec model?

@menshikh-iv
Contributor

menshikh-iv commented Jan 13, 2018

@sj29-innovate this is a bug in Phrases (w2v is used only for demonstration, you can ignore it).
Use any corpus (any will do).

Also, add 2.1.0 to the comparison.

@sj29-innovate
Contributor

sj29-innovate commented Jan 13, 2018

@menshikh-iv So basically, first I should find some test case such that the phrasing output is different for different versions, right? Thanks.

@menshikh-iv
Contributor

@sj29-innovate the problem is not with a "concrete" dataset; the problem is with the Phrases class.

The main "hint" for you here - #1401 (comment) ("train() called with an empty iterator").

Phrases is called as trigrams_model[bigrams_model[sentences]].

You can try to reproduce this behavior with w2v and any dataset (as in (1) from my list).

@sj29-innovate
Contributor

sj29-innovate commented Jan 18, 2018

@menshikh-iv @gojomo I think I have found the glitch: the "_is_single" function causes this behaviour. Returning "obj" instead of an iterator over obj ("iter(obj)") gives correct results.
Currently it exists as:

obj_iter = iter(obj)
try:
    peek = next(obj_iter)
    obj_iter = it.chain([peek], obj_iter)
except StopIteration:
    # An empty object is a single document
    return True, obj
if isinstance(peek, string_types):
    # It's a document, return the iterator
    return True, obj_iter
else:
    # If the first item isn't a string, assume obj is a corpus
    return False, obj_iter

Changing it to:

obj_iter = iter(obj)
try:
    peek = next(obj_iter)
    obj_iter = it.chain([peek], obj_iter)
except StopIteration:
    # An empty object is a single document
    return True, obj
if isinstance(peek, string_types):
    # It's a document, return the iterator
    return True, obj_iter
else:
    # If the first item isn't a string, assume obj is a corpus
    return False, list(obj_iter)

gives correct results, because the _is_single check is followed by "self._apply" later on, so returning an iterator when the object is a corpus ends up producing an empty iterator.
I used the eng_news_2015_30K.tar.gz corpus for this. My observation was that after the above change I got similar results whether I trained word2vec on the plain sentences iterator or on "trigrams_model[bigrams_model[sentences]]"; without the change I was getting quite different results for the two. The results themselves are not very satisfactory, perhaps because of the limited corpus, but with the above change the two inputs produce the same results, and those results now match version 0.13.4.1.
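
A hypothetical minimal check along these lines (not the actual test added in the PR): iterate the phrase-transformed corpus twice and verify the second pass is not empty. The expectation, per the explanation above, is that before the fix the second pass comes back empty because _is_single hands _apply a one-shot iterator.

  from gensim.models import Phrases

  corpus = [['graph', 'minors', 'survey'], ['graph', 'minors', 'trees']] * 10
  bigram = Phrases(corpus, min_count=1, threshold=1)

  transformed = bigram[corpus]
  first_pass = list(transformed)
  second_pass = list(transformed)
  assert len(first_pass) == len(corpus)
  assert len(second_pass) == len(corpus)   # expected to fail before the fix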

@gojomo
Collaborator

gojomo commented Jan 20, 2018

@sj29-innovate The lack of line formatting makes the code hard to compare/view; can you fix it in your comment? Is there a minimal synthetic corpus that illustrates the issue, and shows that the behavior after the proposed change is preferable?

@sj29-innovate
Contributor

@menshikh-iv @gojomo Should I submit a PR with appropriate tests for the same?

@menshikh-iv
Contributor

@sj29-innovate of course!

sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Jan 23, 2018
… versions , added tests for checking empty iterator
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Jan 23, 2018
… versions , added tests for checking empty iterator
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Jan 23, 2018
…versions , test added for empty iterator
@gojomo
Collaborator

gojomo commented Jan 23, 2018

I don't quite understand how the proposed test applies, but: does this exact test succeed in v. 0.13.4.1, then fail in v. 3.2.0, then succeed after the fix is applied?

@sj29-innovate
Contributor

Yes, it failed on the current version and worked after the fix was applied. What I deduced was that the "_is_single" function returned an iterator over the object when the object is a corpus, but this iterator then went through "self._apply()", which again made an iterator out of it; that is why the fix works.

sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Jan 31, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Jan 31, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Jan 31, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Jan 31, 2018
@menshikh-iv
Contributor

Very important comment from Gordon: #1853 (comment)

menshikh-iv pushed a commit that referenced this issue Feb 15, 2018
* bm25 scoring function updated

* Fixes #1401 , Phrases behavious now consistent on different versions , test added for empty iterator

* Fixes #1401 , IS_SINGLE Function updated

* Fixes #1401 , IS_SINGLE function updated

* Fixes #1401 , IS_SINGLE function updated

* Fixes #1401 , tests for phrasified sentences added
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 21, 2018
* bm25 scoring function updated

* Fixes piskvorky#1401 , Phrases behavious now consistent on different versions , test added for empty iterator

* Fixes piskvorky#1401 , IS_SINGLE Function updated

* Fixes piskvorky#1401 , IS_SINGLE function updated

* Fixes piskvorky#1401 , IS_SINGLE function updated

* Fixes piskvorky#1401 , tests for phrasified sentences added
@NikkiAdams

Sorry to be a dunce, but this conversation means this issue is now fixed, right? I have version 3.2.0 and still see this odd behavior with phrases. But perhaps I need to ask our systems admin to update, to either re-get version 3.2.0, or get a later version? In the meantime, I will work to implement the temporary fix discussed here.

@menshikh-iv
Contributor

@NikkiAdams the fix for this issue was released in 3.4.0; feel free to update to >=3.4.0.
