
Adding dtype to LDAModel to speed it up #1656

Merged: 21 commits into piskvorky:develop on Nov 14, 2017

Conversation

@xelez (Contributor) commented Oct 26, 2017

Started implementing #1576

Current state:

  • added dtype to LdaModel
  • added asserts about dtype everywhere, to be sure I haven't missed any conversion
  • I probably need help handling load/save; maybe link me to docs about how it works?

And I need to somehow rewrite test asserts like this:

    self.assertTrue(all(model.alpha == np.array([0.3, 0.3])))

because model.alpha is now converted to float32 (or whatever dtype), while np.array's default dtype is float64. Use np.allclose, maybe?
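For instance, a tolerance-based rewrite might look like this (just a sketch; the exact tolerance is an assumption):

    import numpy as np

    np.testing.assert_allclose(model.alpha, np.array([0.3, 0.3]), rtol=1e-5)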

And I'm not sure where to discuss things, here or in the issue.

@piskvorky (Owner) commented Oct 26, 2017

Good idea with the asserts!

I don't think save/load need any special handling at all. save() just saves the object, and then load() loads it back (using the same types as when the object was saved).

Maybe the only tricky part is how to handle backward compatibility: should loading models saved before this change still work?

I'd say yes. We need to test this explicitly: save an "old" model, then load it using your new code with dtypes and asserts, make sure everything continues to work as expected.

The other compatibility direction (load new model in old code) is not necessary.
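Something like this, as a sketch (the fixture name here is made up; the real test needs an actual old save file):

    import numpy as np
    from gensim.models import LdaModel

    # 'pre_dtype_lda.model' stands for a model saved with pre-change gensim
    model = LdaModel.load('pre_dtype_lda.model')
    assert model.dtype == np.float64  # old models should default to float64
    # and the loaded model should still be usable, e.g. for inspection
    print(model.show_topics(num_topics=1))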

@xelez (Contributor, Author) commented Oct 26, 2017

Do I need to do anything to make the code save my new dtype field as well?

And yes, there is a compatibility problem, as the tests showed me. To handle it, I'll need to set dtype to float64 if it's not present in the saved model. I'll need some time to wrap my head around the code in the load/save methods.

@menshikh-iv (Contributor) commented Oct 26, 2017

Nice @xelez! My suggestions:

  1. The default dtype must be np.float32.
  2. As I remember, you don't need to do anything else to save your dtype field (it will all be handled automatically). But you do need to modify load for old models (which lack this field).
  3. About backward compatibility:
  • add a check for whether the model is old or new (i.e. whether dtype is defined); if it's defined, nothing else needs to be done (the model is new)
  • if not, check which dtype was used for the matrices and fill up the new instance with that dtype. Also, add a warning about it ("dtype isn't loaded, we'll use ...")

P.S. Similar task: #1319

@xelez (Contributor, Author) commented Oct 28, 2017

Implemented setting of dtype for old models.

Two things left to do:

  • modify tests to work with float32
  • clean up the asserts and TODOs that I've added

@xelez (Contributor, Author) commented Oct 28, 2017

Looked at the failing tests: I'll also need to modify AuthorTopicModel, and maybe other classes based on LdaModel.

@xelez (Contributor, Author) commented Oct 28, 2017

Fixed tests and quick-fixed AuthorTopicModel.

 * replace assert with docstring comment
 * add test to check that it really saves dtype for different inputs
@@ -354,25 +373,25 @@ def init_dir_prior(self, prior, name):

     if isinstance(prior, six.string_types):
         if prior == 'symmetric':
-            logger.info("using symmetric %s at %s", name, 1.0 / prior_shape)
             init_prior = np.asarray([1.0 / self.num_topics for i in xrange(prior_shape)])
+            logger.info("using symmetric %s at %s", name, 1.0 / prior_shape)  # TODO: prior_shape?
Contributor (Author):
I have a feeling that it should be

    "using symmetric %s at %s", name, 1.0 / self.num_topics

am I right? @menshikh-iv

Contributor:

Yes @xelez, you are correct!
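Presumably the corrected call would then read:

    logger.info("using symmetric %s at %s", name, 1.0 / self.num_topics)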

@@ -600,19 +600,13 @@ def jaccard_distance(set1, set2):

 def dirichlet_expectation(alpha):
     """
     For a vector `theta~Dir(alpha)`, compute `E[log(theta)]`.

+    Saves dtype of the argument.
Owner:

What does this comment mean? Looks out of place.

Contributor (Author):

I wanted to add a note that this function returns an np.array with the same dtype as the input alpha. Well, probably it's not really needed.

@piskvorky (Owner) commented Oct 29, 2017:

If that's the intent, it's not really apparent from the text "Saves dtype of the argument."

@xelez changed the title from "[WIP] Adding dtype to LDAModel to speed it up" to "Adding dtype to LDAModel to speed it up" on Nov 1, 2017
@xelez (Contributor, Author) commented Nov 1, 2017

@piskvorky, @menshikh-iv I think I've finished.

@@ -1231,7 +1230,7 @@ def load(cls, fname, *args, **kwargs):

     # the same goes for dtype (except it was added later)
     if not hasattr(result, 'dtype'):
         result.dtype = np.float64  # float64 was used before as default in numpy
+        logging.warning("dtype was not set, so using np.float64")
Owner:

A more concrete message please. When reading this warning, users will be left scratching their heads: set where? Why? What does this mean to me?

How about "dtype not set in saved %s file %s, assuming np.float64" % (result.__class__.__name__, fname)?
And only log at INFO or even DEBUG level, since it's an expected state when loading an old model, nothing out of the ordinary.

Question: isn't it better to infer the dtype from the loaded object? Can it ever happen that it's something else, not np.float64?

Contributor (Author):

Fixed the message; decided the INFO level suits better.

About inferring: it's not clear how to do it. Infer from LdaState.eta and LdaState.sstats? We do have a test that their sum is np.float64, so it's safe to assume we don't lose precision when setting dtype to np.float64, and that np.float32 is not enough.

Anyway, imagine a situation where some of the nd.arrays somehow have different dtypes, e.g. some np.float32 and some np.float64. The right dtype is still np.float64.

@@ -538,7 +538,8 @@ def suggested_lda_model(self):

     The num_topics is m_T (default is 150) so as to preserve the matrix shapes when we assign alpha and beta.
     """
     alpha, beta = self.hdp_to_lda()
-    ldam = ldamodel.LdaModel(num_topics=self.m_T, alpha=alpha, id2word=self.id2word, random_state=self.random_state)
+    ldam = ldamodel.LdaModel(num_topics=self.m_T, alpha=alpha, id2word=self.id2word,
+                             random_state=self.random_state, dtype=np.float64)
Owner:

Code style: no vertical indent.

Contributor (Author):

fixed

 def __init__(self, eta, shape, dtype=np.float32):
     self.eta = eta.astype(dtype, copy=False)
-    self.sstats = np.zeros(shape)
+    self.sstats = np.zeros(shape, dtype)
Owner:

Using positional arguments can lead to subtle bugs with numpy. Better use explicit names for keyword parameters: dtype=dtype.

Contributor (Author):

fixed
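Presumably the fixed line now reads:

    self.sstats = np.zeros(shape, dtype=dtype)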

@@ -244,7 +245,8 @@ def lda_seq_infer(self, corpus, topic_suffstats, gammas, lhoods,

     vocab_len = self.vocab_len
     bound = 0.0

-    lda = ldamodel.LdaModel(num_topics=num_topics, alpha=self.alphas, id2word=self.id2word)
+    lda = ldamodel.LdaModel(num_topics=num_topics, alpha=self.alphas, id2word=self.id2word,
+                            dtype=np.float64)
Owner:

Code style: no vertical indent.

Contributor (Author):

fixed

@@ -419,7 +421,8 @@ def __getitem__(self, doc):

     """
     Similar to the LdaModel __getitem__ function, it returns topic proportions of a document passed.
     """
-    lda_model = ldamodel.LdaModel(num_topics=self.num_topics, alpha=self.alphas, id2word=self.id2word)
+    lda_model = ldamodel.LdaModel(num_topics=self.num_topics, alpha=self.alphas, id2word=self.id2word,
+                                  dtype=np.float64)
Owner:

Code style: no vertical indent.

Contributor (Author):

fixed

    # Check if `dtype` is set after main pickle load
    # if not, then it's an old model and we should set it to default `np.float64`
    if not hasattr(result, 'dtype'):
        result.dtype = np.float64  # float64 was used before as default in numpy
Contributor:

Old LDA used float64, really?

Contributor (Author):

Pretty much everything was using float64, because it's the default dtype when creating numpy arrays.
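A quick check of that default:

    import numpy as np

    print(np.zeros(3).dtype)        # float64
    print(np.asarray([0.3]).dtype)  # float64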

@@ -130,7 +130,8 @@ def __init__(self, corpus=None, time_slice=None, id2word=None, alphas=0.01, num_

     if initialize == 'gensim':
         lda_model = ldamodel.LdaModel(
             corpus, id2word=self.id2word, num_topics=self.num_topics,
-            passes=passes, alpha=self.alphas, random_state=random_state
+            passes=passes, alpha=self.alphas, random_state=random_state,
+            dtype=np.float64
Contributor:

Maybe it would be a good idea to change the default behaviour (to float32)?
CC @piskvorky @xelez

Contributor (Author):

Not now, LdaSeqModel will require modifications similar to those I made in LdaModel to handle dtype properly.

@@ -48,6 +48,7 @@ def test_get_topics(self):

     vocab_size = len(self.model.id2word)
     for topic in topics:
         self.assertTrue(isinstance(topic, np.ndarray))
-        self.assertEqual(topic.dtype, np.float64)
+        # Note: started moving to np.float32 as default
+        # self.assertEqual(topic.dtype, np.float64)
Contributor:

Need to enable this + switch it to float32.

Contributor (Author):

This will break the other topic models then.



    class TestMatUtils(unittest.TestCase):
        def test_dirichlet_expectation_keeps_precision(self):
Contributor:

I don't think making a new file is a good idea. Please move these tests to the LDA tests class.

Contributor:

Also, please add tests for loading old models with the new code, for all models that you changed.

@piskvorky (Owner) commented Nov 1, 2017:

Test not just loading the old models, but also using them.

The asserts that we newly sprinkled into the code may trigger errors in various places, if something is wrong.

Contributor (Author):

@menshikh-iv That test doesn't involve LDA at all, so that's the wrong place for it. I haven't found any file testing matutils functions, so I think a separate file is not that bad.

@xelez (Contributor, Author) commented Nov 6, 2017

@piskvorky, @menshikh-iv see the latest commit for the backwards-compatibility tests.

By the way, I think it's a good idea to remove my asserts before merging. They were used mostly during testing, to ensure that I hadn't missed any place where dtype should be added. That way we definitely won't break old code or models.

@xelez (Contributor, Author) commented Nov 7, 2017

By the way, why are .npy files ignored in .gitignore?

@menshikh-iv (Contributor) left a review comment:

Great @xelez, please fix the last changes. LGTM from me; wdyt @piskvorky?

@@ -820,6 +856,7 @@ def show_topics(self, num_topics=10, num_words=10, log=False, formatted=True):

     # add a little random jitter, to randomize results around the same alpha
     sort_alpha = self.alpha + 0.0001 * self.random_state.rand(len(self.alpha))
+    # random_state.rand returns float64, but converting back to dtype won't speed up anything
Contributor:

Maybe .astype (for consistency only)?

Contributor (Author):

Consistency vs one additional array copy. I'm not sure :)
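The consistency variant would be something like (just a sketch, at the cost of the extra copy):

    sort_alpha = (self.alpha + 0.0001 * self.random_state.rand(len(self.alpha))).astype(self.dtype)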

    # dtype could be absent in old models
    if not hasattr(result, 'dtype'):
        result.dtype = np.float64  # float64 was implicitly used before (cause it's default in numpy)
        logging.info("dtype was not set in saved %s file %s, assuming np.float64", result.__class__.__name__, fname)
Contributor:

Maybe warn?

Contributor (Author):

See #1656 (comment) for the discussion. Although it's an expected state when loading an old model, maybe a warning is still a good thing.

"""
if len(alpha.shape) == 1:
result = psi(alpha) - psi(np.sum(alpha))
else:
result = psi(alpha) - psi(np.sum(alpha, 1))[:, np.newaxis]
return result.astype(alpha.dtype) # keep the same precision as input
return result
Contributor:

Please keep the astype on return, because psi preserves

    np.float32 -> np.float32
    np.float64 -> np.float64

but promotes

    np.float16 -> np.float32

Contributor (Author):

Oh, my bad, you're right!
Then the tests I added in the separate file aren't needed.
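A quick illustration of the promotion (a sketch, assuming scipy's psi as used in matutils):

    import numpy as np
    from scipy.special import psi

    print(psi(np.ones(3, dtype=np.float32)).dtype)  # float32 -- preserved
    print(psi(np.ones(3, dtype=np.float16)).dtype)  # float32 -- float16 promoted, hence the astype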

@@ -242,7 +243,7 @@ def testGetDocumentTopics(self):

     self.assertTrue(isinstance(topic, list))
     for k, v in topic:
         self.assertTrue(isinstance(k, int))
-        self.assertTrue(isinstance(v, float))
+        self.assertTrue(np.issubdtype(v, float))
Contributor:

A simple isinstance is better here (and everywhere).

@xelez (Contributor, Author) commented Nov 7, 2017:

A simple isinstance fails because np.float32 is not an instance of float.
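A minimal check of why (sketch):

    import numpy as np

    v = np.float32(0.5)
    print(isinstance(v, float))        # False -- np.float32 is not a subclass of Python float
    print(isinstance(v, np.floating))  # True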


    class TestMatUtils(unittest.TestCase):
        def test_dirichlet_expectation_keeps_precision(self):
            for dtype in (np.float32, np.float64, np.complex64, np.complex128):
Contributor:

Add np.float16 and you'll see a problem.

@menshikh-iv merged commit 82c394a into piskvorky:develop on Nov 14, 2017
@menshikh-iv (Contributor):

Thanks a lot @xelez, nice work 👍

@piskvorky (Owner):

Great feature!

@xelez how would you summarize it, for lay people who just want "the gist"? The title says "Adding dtype to LDAModel to speed it up" -- what was the final speed-up?

@menshikh-iv (Contributor) commented Nov 15, 2017

@piskvorky with LdaMulticore (80k words, 100 topics) it's ~20% faster; I checked it yesterday.

@piskvorky (Owner):

That's neat :) Let's make sure this information makes it into the release notes / tweets etc.

@rmalouf (Contributor) commented Nov 18, 2017

There are a couple of places in ldamodel.py where 1e-100 gets added to phinorm to avoid division by zero. That constant will need to be adjusted depending on the precision being used.

@piskvorky (Owner):

Good point. I'd hope this would be caught by the unit tests though -- @menshikh-iv ?

@menshikh-iv (Contributor):

@piskvorky this isn't caught by the unit tests, because the + 1e-100 operation doesn't change the original dtype.
@rmalouf nice catch! Thanks for your remark 👍
@xelez can you fix this? (We need to scale 1e-100 according to the array's dtype: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/ldamodel.py#L476 and https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/ldamodel.py#L487 in LDA, plus the same fixes in the other topic models.)
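One possible shape of the fix, as a sketch only (the per-dtype epsilon choice here is a hypothetical suggestion, not a final decision):

    import numpy as np

    def epsilon_for(dtype):
        # smallest positive normal number representable in this dtype,
        # instead of a hard-coded 1e-100 (which underflows to 0.0 in float32)
        return np.finfo(dtype).tiny

    phinorm = np.zeros(3, dtype=np.float32)
    phinorm = phinorm + epsilon_for(phinorm.dtype)  # stays nonzero, so no division by zero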

@piskvorky (Owner) commented Nov 20, 2017

The unit tests should catch the operation of "add epsilon" not working, which (I presume) leads to some issues.

In other words, if the unit tests pass, what is the problem?

@menshikh-iv (Contributor):

For types smaller than float64, it can produce problems with zeros: if phinorm is zero -> division by zero -> NaN.

    In [1]: import numpy as np

    In [2]: np.float64(1e-100)
    Out[2]: 1e-100

    In [3]: np.float32(1e-100)
    Out[3]: 0.0

To me, it looks like a medium-severity bug.

@piskvorky (Owner) commented Nov 20, 2017

That's not my question. My question is: do unit tests catch it?

If not, is it an issue with the unit tests (=> update unit tests), or with the algorithm (=> update gensim code)?

If yes, how come we didn't discover the bug earlier?

@menshikh-iv (Contributor):

@piskvorky the unit tests don't catch it.

@piskvorky (Owner) commented Nov 20, 2017

Then the unit tests should be improved, as part of the solution here -- so that we catch similar bugs automatically in the future.

@menshikh-iv (Contributor):

The problem here is not in the tests at all; it's generally impossible to catch this bug in this code with unit tests. Perhaps we'd need to change the LDA code itself, but I don't think that's a good idea.

@piskvorky (Owner):

I don't understand -- if there's no way to catch a bug, then there is no bug.

@rmalouf (Contributor) commented Nov 20, 2017

It's definitely testable in principle: I noticed the bug because I started getting division-by-zero errors in a processing pipeline that used to work. I don't know how to create a minimal corpus that triggers it, though.

@menshikh-iv (Contributor):

Summarizing: this is a bug, we need to fix it.

VaiyeBe pushed a commit to VaiyeBe/gensim that referenced this pull request Nov 26, 2017
* Add dtype to LdaModel, assert about it everywhere

* Implement loading of old models without dtype

* Use assert_allclose instead of == in LdaModel tests. Use np.issubdtype when checking if something is float.

* Fix AuthorTopicModel

* Fix matutils.dirichlet_expectation

 * replace assert with docstring comment
 * add test to check that it really saves dtype for different inputs

* Change default to np.float32 and cleanup

* Fix wrong logging message

* Remove out-of-place comment

* Cleanup PEP8

* Add dtype to sklearn LdaTransformer

* Set precision explicitly in lda model converters

* Add dtype to LdaMulticore

* Set dtype to float64 explicitly to retain backward compatibility in models using LdaModel

* Cleanup asserts and fix another place calculating in float64

* Fix import

* Fix remarks by piskvorky

* Add backward compatibility tests

* Add missing .npy files

* Fix dirichlet_expectation not working with np.float16

* Fix path to saved model