Loading and Saving LDA Models across Python 2 and 3. #913

anmolgulati · 2016-10-03T13:37:26Z

Modified load and save methods in ldamodel.py to manage compatibility issues when loading and saving models across Python 2 and 3.
This PR tackles Isssue #853

tmylk · 2016-10-03T13:48:41Z

Thanks. Could you please add tests for these methods? Maybe we need to add 2 pickled models to the test_data folder, one for Python 2 and one for Python 3?

anmolgulati · 2016-10-03T13:50:26Z

Yes sure, I was gonna ask for suggestions on testing this actually.

anmolgulati · 2016-10-04T12:36:41Z

So, I added two saved LDA models, one in Python 2.7 and other in Python 3.5 environments in test_data folder. The method I used to create these models is in test_ldamodel.py.
Also added a test to load these models and compare(both must be the same actually).
But right now, the test Fails in Python 2.7 environment and Gives an error (same as referred in FAQ).
@tmylk @jayantj Any suggestions?

anmolgulati · 2016-10-04T18:12:11Z

So I got this working. The LDA model loads fine now across both in Python 2 and 3.
Right now, the test fails in checking expElogbeta array equality in the two saved models. But this is maybe due to precision differences, they are different when the model was created itself, i.e., the arrays are different when created across python versions. (I printed the arrays for both environments at time of creation). It's not a problem of persistence. We should probably remove this check, or add other checks for equality of models.
Otherwise, the tests pass.

tmylk · 2016-10-05T04:13:21Z

How different are they? Can you add a comparison with an epsilon of each other?

anmolgulati · 2016-10-05T07:46:19Z

Well, the epsilon is big for some indexes, I'm adding the exact arrays I printed.
(Python 2.7 expElogbeta: [[ 0.00685637, 0.01305742, 0.00683859, 0.09222717, 0.09222721,
0.09235907, 0.00893736, 0.07625897, 0.00682716, 0.14222405, 0.14247683, 0.09542658],
[ 0.10647068, 0.09757566, 0.10649878, 0.01035072, 0.01035068, 0.01023777, 0.2084948 , 0.07635718, 0.10651685, 0.00796197, 0.00776243, 0.00771203]])

(Python 3.5 expElogbeta: [[ 0.00683436, 0.00685224, 0.01325946, 0.00895329, 0.09220773, 0.0764139, 0.09233712, 0.09220773, 0.006823, 0.14213993, 0.14239321, 0.09537042]
[ 0.10656816, 0.10653988, 0.09735885, 0.20859865, 0.01032817, 0.07618502,
0.01021732, 0.01032817, 0.10658615, 0.00796839, 0.00776821, 0.00771765]])

Also, the id2word dictionary saved in both Python formats is not in the same order.

tmylk · 2016-10-05T09:17:47Z

gensim/models/ldamodel.py

+        if (isinstance(self.eta, six.string_types) and self.eta == 'auto') or len(self.eta.shape) != 1:
+            separately_explicit.append('eta')
+        # Merge separately_explicit with separately.
+        if separately is not None and separately:


if separately is enough

Made changes as suggested.

tmylk · 2016-10-05T09:20:53Z

gensim/models/ldamodel.py

@@ -1037,6 +1072,18 @@ def load(cls, fname, *args, **kwargs):
        """
        kwargs['mmap'] = kwargs.get('mmap', None)
        result = super(LdaModel, cls).load(fname, *args, **kwargs)
+        # Load the separately stored id2word dictionary saved in json.
+        id2word_fname = utils.smart_extension(fname, '.json')


please make all files for one model in a special folder, so it is easy to keep track

tmylk · 2016-10-05T09:32:57Z

This solution looks really complicated. Let's try something simpler. Let's save the Python version with the pickle. If the versions differ on load than raise an exception to use Dill. (assuming dill works across versions)

anmolgulati · 2016-10-05T16:10:01Z

I created the LDAModels with the same seed now and the tests pass.

anmolgulati · 2016-10-05T20:27:17Z

@tmylk Actually, I think no more changes need to be made for compatibility in word2vec or doc2vec models. The load/save methods work fine now for them. We could probably use the present pickle methods itself.
I've added tests now to Check compatibility for Word2Vec Models. The models load without errors. But the tests fail though in testing equality of models across python versions. I also added the code I used to create the word2vec models in test_word2vec.py (commented code). Please review once.
I'll too dig deeper if something else might be an issue.

tmylk · 2016-10-26T09:30:21Z

Add an epsilon for equality comparison.
Please git fetch, git merge develop, so the PR becomes reviewable (there are unrelated commits).

…ving LDA models across Pythong verions

… compatibility

…on 3

…and 3.5

…s Python 2/3

tmylk · 2016-11-11T11:31:43Z

gensim/test/test_word2vec.py

@@ -474,11 +487,11 @@ def testRNG(self):

    def models_equal(self, model, model2):
        self.assertEqual(len(model.vocab), len(model2.vocab))
-        self.assertTrue(numpy.allclose(model.syn0, model2.syn0))
+        self.assertTrue(numpy.allclose(model.syn0, model2.syn0, atol=1e-4))


how big is the difference on your machine?

piskvorky · 2016-12-03T01:51:25Z

gensim/test/test_word2vec.py

+    #     logging.warning("Word2Vec model saved")
+
+    def testModelCompatibilityWithPythonVersions(self):
+        fname_model_2_7 = os.path.join(os.path.dirname(__file__), 'word2vecmodel_python_2_7')


Use module_path and datapath defined above.

piskvorky · 2016-12-03T01:51:50Z

gensim/models/ldamodel.py

+        # Load the separately stored id2word dictionary saved in json.
+        id2word_fname = utils.smart_extension(fname, '.json')
+        try:
+            with utils.smart_open(id2word_fname, 'r') as fin:


Open file as binary, decode as necessary (if necessary).

piskvorky · 2016-12-03T01:52:15Z

gensim/models/ldamodel.py

+        # If id2word is not already in ignore, then saving it separately in json.
+        id2word = None
+        if self.id2word is not None and 'id2word' not in ignore:
+            id2word = dict((k,v) for k,v in self.id2word.iteritems())


PEP8: space after comma.

piskvorky · 2016-12-03T01:55:52Z

gensim/models/ldamodel.py

+        id2word_fname = utils.smart_extension(fname, '.json')   
+        try:
+            with utils.smart_open(id2word_fname, 'w', encoding='utf-8') as fout:
+                json.dump(id2word, fout)


Better open the output as binary and write encoded utf8 to it.

Actually, the json module already produces binary strings in dump AFAIK, so what is this even for?

piskvorky · 2016-12-03T01:56:22Z

gensim/utils.py

        # Because of loading from S3 load can't be used (missing readline in smart_open)
-        return _pickle.loads(f.read())
-
+        if sys.version_info > (3,0):


PEP8: space after comma.

…o pickle-worker

tmylk · 2016-12-26T11:29:45Z

Merged in #1039

tmylk added bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills labels Oct 4, 2016

tmylk reviewed Oct 5, 2016

View reviewed changes

tmylk mentioned this pull request Oct 5, 2016

Switch to dill//cloudpickle #558

Closed

tmylk closed this Oct 26, 2016

tmylk reopened this Oct 26, 2016

anmolgulati force-pushed the pickle-worker branch from c9d5be7 to a15ef90 Compare October 29, 2016 15:39

anmolgulati added 7 commits October 29, 2016 21:18

Modified load/save methods to maitain compatibility in loading and sa…

a4d214f

…ving LDA models across Pythong verions

Added saved LDA models in Python 2.7 and 3.5 environments for testing…

04a4634

… compatibility

Added test for LDA Model compatibility between Python versions

aaae5ff

Modified unpickle method to allow unpickling python 2 objects in pyth…

8b2cc42

…on 3

Created and saved LDAModels with same random_seed in both Python 2.7 …

c4c1289

…and 3.5

Added Tests to check compatibility to Load/Save Word2Vec models acros…

96f8a4a

…s Python 2/3

Minor change in commented code

63963c0

anmolgulati force-pushed the pickle-worker branch from a15ef90 to 63963c0 Compare October 29, 2016 15:48

Reduced tolerance for checking model equality

fbd5d6d

tmylk reviewed Nov 11, 2016

View reviewed changes

piskvorky requested changes Dec 3, 2016

View reviewed changes

anmolgulati force-pushed the pickle-worker branch from ce6d8b0 to fbd5d6d Compare December 6, 2016 19:00

Merge branch 'develop' of https://github.com/anmol01gulati/gensim int…

101fed1

…o pickle-worker

piskvorky added feature Issue described a new feature and removed bug Issue described a bug labels Dec 7, 2016

anmolgulati mentioned this pull request Dec 7, 2016

Lda models load/save backward compatibility across Python versions #1039

Merged

tmylk closed this Dec 26, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Loading and Saving LDA Models across Python 2 and 3. #913

Loading and Saving LDA Models across Python 2 and 3. #913

anmolgulati commented Oct 3, 2016 •

edited

Loading

tmylk commented Oct 3, 2016

anmolgulati commented Oct 3, 2016

anmolgulati commented Oct 4, 2016 •

edited by tmylk

Loading

anmolgulati commented Oct 4, 2016 •

edited

Loading

tmylk commented Oct 5, 2016

anmolgulati commented Oct 5, 2016

tmylk Oct 5, 2016

anmolgulati Oct 5, 2016

tmylk Oct 5, 2016

tmylk commented Oct 5, 2016

anmolgulati commented Oct 5, 2016

anmolgulati commented Oct 5, 2016 •

edited

Loading

tmylk commented Oct 26, 2016

tmylk Nov 11, 2016

piskvorky Dec 3, 2016

piskvorky Dec 3, 2016

piskvorky Dec 3, 2016

piskvorky Dec 3, 2016

piskvorky Dec 3, 2016

tmylk commented Dec 26, 2016

Loading and Saving LDA Models across Python 2 and 3. #913

Loading and Saving LDA Models across Python 2 and 3. #913

Conversation

anmolgulati commented Oct 3, 2016 • edited Loading

tmylk commented Oct 3, 2016

anmolgulati commented Oct 3, 2016

anmolgulati commented Oct 4, 2016 • edited by tmylk Loading

anmolgulati commented Oct 4, 2016 • edited Loading

tmylk commented Oct 5, 2016

anmolgulati commented Oct 5, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tmylk commented Oct 5, 2016

anmolgulati commented Oct 5, 2016

anmolgulati commented Oct 5, 2016 • edited Loading

tmylk commented Oct 26, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tmylk commented Dec 26, 2016

anmolgulati commented Oct 3, 2016 •

edited

Loading

anmolgulati commented Oct 4, 2016 •

edited by tmylk

Loading

anmolgulati commented Oct 4, 2016 •

edited

Loading

anmolgulati commented Oct 5, 2016 •

edited

Loading