Adding dtype to LDAModel to speed it up #1656
Conversation
Good idea with the asserts! I don't think save/load need any special handling at all. Maybe the only tricky part is how to handle backward compatibility: should loading models saved before this change still work? I'd say yes. We need to test this explicitly: save an "old" model, then load it using your new code with dtypes and asserts, and make sure everything continues to work as expected. The other compatibility direction (loading a new model in old code) is not necessary.
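For illustration, a minimal sketch of such a backward-compatibility test (the fixture file pre_dtype_lda.model is hypothetical -- it stands for a model saved by gensim before this change):

import numpy as np
from gensim.models import LdaModel

def test_loading_pre_dtype_model():
    # hypothetical fixture: an LdaModel saved before `dtype` existed
    model = LdaModel.load('pre_dtype_lda.model')
    # the load() fallback should fill in the old implicit default
    assert model.dtype == np.float64
    # the model must not just load -- it must remain usable
    assert isinstance(model.get_topics(), np.ndarray)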
Do I need to do anything to make the code save my new `dtype` attribute? And yes, there is a compatibility problem, as the tests showed me. To handle it, I'll need to set dtype to float64 if it's not present in the saved model. I'll need some time to wrap my head around the code in the load/save methods.
Nice @xelez, my suggestions about it:
P.S. Similar task: #1319
Implemented setting of `dtype` when loading old models. Two things left to do:
Looked up the failing tests: I'll also need to modify `AuthorTopicModel`.
…Use assert_allclose instead of == in LdaModel tests. Use np.issubdtype when checking if something is float.
Fixed tests and quick-fixed AuthorTopicModel.
* replace assert with docstring comment
* add test to check that it really saves dtype for different inputs
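A sketch of what that dtype round-trip test could look like, assuming gensim's standard test fixtures (common_corpus, common_dictionary, get_tmpfile from gensim.test.utils) and the dtype parameter this PR introduces:

import numpy as np
from gensim.models import LdaModel
from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile

def test_dtype_survives_save_load():
    # save/load round-trip for each supported input dtype
    for dtype in (np.float32, np.float64):
        model = LdaModel(common_corpus, id2word=common_dictionary,
                         num_topics=2, dtype=dtype)
        fname = get_tmpfile('gensim_lda_dtype.model')
        model.save(fname)
        assert LdaModel.load(fname).dtype == dtype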
gensim/models/ldamodel.py (Outdated)
@@ -354,25 +373,25 @@ def init_dir_prior(self, prior, name):
        if isinstance(prior, six.string_types):
            if prior == 'symmetric':
                logger.info("using symmetric %s at %s", name, 1.0 / prior_shape)
                init_prior = np.asarray([1.0 / self.num_topics for i in xrange(prior_shape)])
                logger.info("using symmetric %s at %s", name, 1.0 / prior_shape)  # TODO: prior_shape?
I have a feeling that it should be
"using symmetric %s at %s", name, 1.0 / self.num_topics
Am I right? @menshikh-iv
Yes @xelez, you are correct!
gensim/matutils.py (Outdated)
@@ -600,19 +600,13 @@ def jaccard_distance(set1, set2):
def dirichlet_expectation(alpha):
    """
    For a vector `theta~Dir(alpha)`, compute `E[log(theta)]`.

    Saves dtype of the argument.
What does this comment mean? Looks out of place.
I wanted to add a note that this function returns an np.array with the same dtype as the input alpha. Well, it's probably not really needed.
If that's the intent, it's not really apparent from the text "Saves dtype of the argument".
…Set dtype to float64 explicitly to retain backward compatibility in models using LdaModel
@piskvorky, @menshikh-iv I think I've finished.
gensim/models/ldamodel.py (Outdated)
@@ -1231,7 +1230,7 @@ def load(cls, fname, *args, **kwargs):

    # the same goes for dtype (except it was added later)
    if not hasattr(result, 'dtype'):
        result.dtype = np.float64  # float64 was used before as default in numpy
        logging.warning("dtype was not set, so using np.float64")
A more concrete message please. When reading this warning, users will be left scratching their heads: set where? Why? What does this mean to me?
How about "dtype not set in saved %s file %s, assuming np.float64" % (result.__class__.__name__, fname)?
And only log at INFO or even DEBUG level, since it's an expected state when loading an old model, nothing out of the ordinary.
Question: isn't it better to infer the dtype from the loaded object? Can it ever happen that it's something else, not np.float64?
Fixed the message; decided the info level suits better.
About inferring: it's not clear how to do it. Infer from LdaState.eta and LdaState.sstats? But then we'd have to test that their sum is np.float64, so that it's safe to assume we don't lose precision when setting dtype to np.float64 and that np.float32 is not enough.
Anyway, let's imagine a situation where some of the nd.arrays are somehow of different dtypes, like np.float32, and some are np.float64. The right dtype is still np.float64.
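For what it's worth, numpy's type promotion expresses exactly that "the right dtype is still np.float64" rule, if inferring were ever wanted; a minimal sketch with stand-in arrays (not gensim's actual state):

import numpy as np

# stand-ins for arrays loaded from an old model, with mixed precision
eta = np.zeros(10, dtype=np.float32)
sstats = np.zeros((10, 100), dtype=np.float64)

# np.result_type picks the smallest common dtype that loses no precision
print(np.result_type(eta, sstats))  # float64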
gensim/models/hdpmodel.py (Outdated)
@@ -538,7 +538,8 @@ def suggested_lda_model(self):
    The num_topics is m_T (default is 150) so as to preserve the matrice shapes when we assign alpha and beta.
    """
    alpha, beta = self.hdp_to_lda()
-   ldam = ldamodel.LdaModel(num_topics=self.m_T, alpha=alpha, id2word=self.id2word, random_state=self.random_state)
+   ldam = ldamodel.LdaModel(num_topics=self.m_T, alpha=alpha, id2word=self.id2word,
+                            random_state=self.random_state, dtype=np.float64)
Code style: no vertical indent.
fixed
gensim/models/ldamodel.py (Outdated)
-       self.sstats = np.zeros(shape)
+   def __init__(self, eta, shape, dtype=np.float32):
+       self.eta = eta.astype(dtype, copy=False)
+       self.sstats = np.zeros(shape, dtype)
Using positional arguments can lead to subtle bugs with numpy. Better to use explicit names for keyword parameters: dtype=dtype.
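The kind of subtle bug meant here is easy to demonstrate: the second positional argument is dtype for np.zeros but fill_value for np.full, so the same positional habit silently changes meaning (a minimal sketch):

import numpy as np

shape, dtype = (2, 3), np.float32

a = np.zeros(shape, dtype)  # ok: dtype is np.zeros' second parameter
b = np.full(shape, dtype)   # oops: here the second parameter is the fill value
print(a.dtype)              # float32
print(b.dtype)              # object -- an array filled with the dtype itself

c = np.full(shape, 0, dtype=dtype)  # the explicit keyword makes intent unambiguous
print(c.dtype)                      # float32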
fixed
gensim/models/ldaseqmodel.py (Outdated)
@@ -244,7 +245,8 @@ def lda_seq_infer(self, corpus, topic_suffstats, gammas, lhoods,
    vocab_len = self.vocab_len
    bound = 0.0

-   lda = ldamodel.LdaModel(num_topics=num_topics, alpha=self.alphas, id2word=self.id2word)
+   lda = ldamodel.LdaModel(num_topics=num_topics, alpha=self.alphas, id2word=self.id2word,
+                           dtype=np.float64)
Code style: no vertical indent.
fixed
gensim/models/ldaseqmodel.py (Outdated)
@@ -419,7 +421,8 @@ def __getitem__(self, doc):
    """
    Similar to the LdaModel __getitem__ function, it returns topic proportions of a document passed.
    """
-   lda_model = ldamodel.LdaModel(num_topics=self.num_topics, alpha=self.alphas, id2word=self.id2word)
+   lda_model = ldamodel.LdaModel(num_topics=self.num_topics, alpha=self.alphas, id2word=self.id2word,
+                                 dtype=np.float64)
Code style: no vertical indent.
fixed
gensim/models/ldamodel.py (Outdated)
    # Check if `dtype` is set after main pickle load
    # if not, then it's an old model and we should set it to default `np.float64`
    if not hasattr(result, 'dtype'):
        result.dtype = np.float64  # float64 was used before as default in numpy
Old LDA used float64, really?
Pretty much everything is using float64, because it's the default dtype when creating arrays.
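Easy to confirm -- any array created without an explicit dtype comes out as float64:

import numpy as np

print(np.zeros(3).dtype)             # float64
print(np.asarray([0.1, 0.2]).dtype)  # float64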
@@ -130,7 +130,8 @@ def __init__(self, corpus=None, time_slice=None, id2word=None, alphas=0.01, num_
    if initialize == 'gensim':
        lda_model = ldamodel.LdaModel(
            corpus, id2word=self.id2word, num_topics=self.num_topics,
-           passes=passes, alpha=self.alphas, random_state=random_state
+           passes=passes, alpha=self.alphas, random_state=random_state,
+           dtype=np.float64
Maybe it would be a good idea to change the default behaviour (to float32)?
CC @piskvorky @xelez
Not now; LdaSeqModel will require modifications similar to those I made in LdaModel to handle dtype properly.
@@ -48,6 +48,7 @@ def test_get_topics(self):
    vocab_size = len(self.model.id2word)
    for topic in topics:
        self.assertTrue(isinstance(topic, np.ndarray))
-       self.assertEqual(topic.dtype, np.float64)
+       # Note: started moving to np.float32 as default
+       # self.assertEqual(topic.dtype, np.float64)
need to enable + switch to float32
This will break other topic models then.
gensim/test/test_matutils.py (Outdated)

class TestMatUtils(unittest.TestCase):
    def test_dirichlet_expectation_keeps_precision(self):
I don't think that making a new file is a good idea. Please move these tests to the LDA tests class.
Also, please add tests for loading old models with the new code, for all the models that you changed.
Test not just loading the old models, but also using them.
The asserts that we newly sprinkled into the code may trigger errors in various places, if something is wrong.
@menshikh-iv That test doesn't involve LDA at all, so it's the wrong place for it. I haven't found any file testing matutils functions, so I think a separate file is not that bad.
@piskvorky, @menshikh-iv see the latest commit for backward-compatibility tests. By the way, I think it's a good idea to remove my asserts before merging. They were used mostly during testing, to ensure that I hadn't missed any place where dtype should be added. That way we definitely won't break old code or models.
By the way, why are .npy files ignored in .gitignore?
Great @xelez, please fix the last changes. LGTM for me, wdyt @piskvorky?
@@ -820,6 +856,7 @@ def show_topics(self, num_topics=10, num_words=10, log=False, formatted=True):

    # add a little random jitter, to randomize results around the same alpha
    sort_alpha = self.alpha + 0.0001 * self.random_state.rand(len(self.alpha))
    # random_state.rand returns float64, but converting back to dtype won't speed up anything
Maybe .astype (for consistency only)?
Consistency vs one additional array copy. I'm not sure :)
    # dtype could be absent in old models
    if not hasattr(result, 'dtype'):
        result.dtype = np.float64  # float64 was implicitly used before (cause it's default in numpy)
        logging.info("dtype was not set in saved %s file %s, assuming np.float64", result.__class__.__name__, fname)
Maybe warn?
See #1656 (comment) for discussion. Although it's an expected state when loading an old model, maybe a warning is still a good thing.
gensim/matutils.py (Outdated)
    """
    if len(alpha.shape) == 1:
        result = psi(alpha) - psi(np.sum(alpha))
    else:
        result = psi(alpha) - psi(np.sum(alpha, 1))[:, np.newaxis]
-   return result.astype(alpha.dtype)  # keep the same precision as input
+   return result
Please return astype, because:
np.float32 -> np.float32
np.float64 -> np.float64
but
np.float16 -> np.float32
Oh, my bad, you're right!
Then the tests that I added in a separate file aren't needed.
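The float16 case is easy to reproduce: scipy's psi has no float16 loop, so it upcasts the input to float32, and only the explicit astype restores the argument's precision (a quick check using the same expression as dirichlet_expectation):

import numpy as np
from scipy.special import psi

alpha = np.array([0.1, 0.2, 0.3], dtype=np.float16)
result = psi(alpha) - psi(np.sum(alpha))
print(result.dtype)                      # float32 -- psi upcast the float16 input
print(result.astype(alpha.dtype).dtype)  # float16 -- input precision restored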
@@ -242,7 +243,7 @@ def testGetDocumentTopics(self):
    self.assertTrue(isinstance(topic, list))
    for k, v in topic:
        self.assertTrue(isinstance(k, int))
-       self.assertTrue(isinstance(v, float))
+       self.assertTrue(np.issubdtype(v, float))
Simple isinstance is better here (and everywhere).
Simple isinstance fails because np.float32 is not an instance of float.
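A quick check of the behaviour described here (np.float64 inherits from Python's float; np.float32 does not):

import numpy as np

print(isinstance(np.float64(0.5), float))  # True
print(isinstance(np.float32(0.5), float))  # False -- this is what broke the assert

# a dtype-hierarchy check accepts both widths:
print(np.issubdtype(np.float32, np.floating))  # True
print(np.issubdtype(np.float64, np.floating))  # True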
gensim/test/test_matutils.py (Outdated)

class TestMatUtils(unittest.TestCase):
    def test_dirichlet_expectation_keeps_precision(self):
        for dtype in (np.float32, np.float64, np.complex64, np.complex128):
Add np.float16 and you'll see a problem
Thanks a lot @xelez, nice work 👍
Great feature! @xelez how would you summarize it, for laypeople who just want "the gist"? The title says it speeds LdaModel up -- by how much?
@piskvorky with LdaMulticore (80k words, 100 topics) it's ~20% faster; I checked it yesterday.
That's neat :) Let's make sure this information makes it into the release notes / tweets etc.
There are a couple of places in ldamodel.py where 1e-100 gets added to a value to guard against division by zero. With float32, that constant underflows to zero, so the guard has no effect.
Good point. I'd hope this would be caught by the unit tests though -- @menshikh-iv ? |
@piskvorky this isn't caught by the unit tests, because the operation of adding 1e-100 to a float32 value is a silent no-op.
The unit tests should catch the operation of "add epsilon" not working, which (I presume) leads to some issues. In other words, if the unit tests pass, what is the problem?
For float32, the added 1e-100 simply vanishes (it is below the smallest representable positive value).
For me, it looks like a medium bug.
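The underflow is straightforward to reproduce; np.finfo(...).tiny below is shown only as one possible dtype-aware alternative, not as the fix adopted in this PR:

import numpy as np

print(np.float32(1e-100))        # 0.0 -- underflows; float32 bottoms out around 1e-45

x = np.zeros(3, dtype=np.float32)
print(x + np.float32(1e-100))    # [0. 0. 0.] -- the "epsilon" changed nothing

# a dtype-aware epsilon is still representable:
print(np.finfo(np.float32).tiny)  # ~1.1754944e-38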
That's not my question. My question is: do the unit tests catch it? If not, is it an issue with the unit tests (=> update the unit tests), or with the algorithm (=> update the gensim code)? If yes, how come we didn't discover the bug earlier?
@piskvorky the unit tests don't catch it.
Then the unit tests should be improved, as part of the solution here -- so that we catch similar bugs automatically in the future.
The problem here is not in the tests at all; it's generally impossible to catch this bug in this code with unit tests. Here, perhaps, we need to change the LDA code itself, but I don't think that's a good idea.
I don't understand -- if there's no way to catch a bug, then there is no bug.
It's definitely testable in principle: I noticed the bug because I started getting division-by-zero errors in a processing pipeline that used to work. I don't know how to create a minimal corpus that triggers it, though.
Summarizing: this is a bug, we need to fix it.
* Add dtype to LdaModel, assert about it everywhere
* Implement loading of old models without dtype
* Use assert_allclose instead of == in LdaModel tests. Use np.issubdtype when checking if something is float.
* Fix AuthorTopicModel
* Fix matutils.dirichlet_expectation
* replace assert with docstring comment
* add test to check that it really saves dtype for different inputs
* Change default to np.float32 and cleanup
* Fix wrong logging message
* Remove out-of-place comment
* Cleanup PEP8
* Add dtype to sklearn LdaTransformer
* Set precision explicitly in lda model converters
* Add dtype to LdaMulticore
* Set dtype to float64 explicitly to retain backward compatibility in models using LdaModel
* Cleanup asserts and fix another place calculating in float64
* Fix import
* Fix remarks by piskvorky
* Add backward compatibility tests
* Add missing .npy files
* Fix dirichlet_expectation not working with np.float16
* Fix path to saved model
Started implementing #1576
Current state:
And I need to somehow rewrite test asserts like this:

self.assertTrue(all(model.alpha == np.array([0.3, 0.3])))

because model.alpha is now converted to float32 (or whatever dtype) and np.array's default dtype is float64. Use np.allclose, maybe?
And I'm not sure where to discuss things, here or in the issue.
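For the record, the commit list above shows the route eventually taken: assert_allclose instead of ==. A minimal sketch of the rewritten assert (the array is a stand-in for model.alpha):

import numpy as np

model_alpha = np.array([0.3, 0.3], dtype=np.float32)  # stand-in for model.alpha

# exact equality is brittle across dtypes; a tolerance-based check is not
np.testing.assert_allclose(model_alpha, np.array([0.3, 0.3]), rtol=1e-5)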