Add cython version for "hot" functions from gensim.models.LdaModel
#1767
Conversation
fyi, I reworked this change to mirror the implementation of word2vec_inner and doc2vec_inner. The cython functions now live in ldamodel_inner.pyx and LdaModel tries to use the fast cython versions if possible, but falls back to the "slower" alternatives if it can't. I also explicitly added separate cython functions for float32 and float64, so the faster code should work with both of these dtypes now (I don't know how to handle float16 nicely in the cython code, so for now that falls back to the slower code as well).
I used cython's fused types to make the cython code a little nicer, but the code still has a lot of duplication in it. I'm not familiar with other templating methods in cython, but I'm sure there must be a way to make the functions generic without requiring so much duplication.
The _attach_inner_methods() function in LdaModel tries to pick the right methods to use based on the availability of ldamodel_inner and the dtype. There's probably a nicer way of doing this as well. I haven't checked all the edge cases yet, but my simple tests so far seem to work.
This implementation has the benefit of allowing the multicore and distributed versions to use the faster cython code as well. I also left in the older "fastldamodel.py" files just to make comparisons between the two implementations easier.
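(A rough Python-level sketch of the selection logic described above; _attach_inner_methods and the _f32/_f64 names come from this PR's description, but the body below is only an illustration, not the actual diff.)

import numpy as np
from gensim import matutils

try:
    # compiled entry points from ldamodel_inner.pyx (names as described above)
    from gensim.models.ldamodel_inner import (
        dirichlet_expectation_1d_f32,
        dirichlet_expectation_1d_f64,
    )
    INNER_AVAILABLE = True
except ImportError:
    INNER_AVAILABLE = False

def _attach_inner_methods(self, dtype):
    # choose the implementation once, at construction time, so the hot loops
    # never pay for a per-call dtype check
    if INNER_AVAILABLE and dtype == np.float32:
        self.dirichlet_expectation_1d = dirichlet_expectation_1d_f32
    elif INNER_AVAILABLE and dtype == np.float64:
        self.dirichlet_expectation_1d = dirichlet_expectation_1d_f64
    else:
        # float16, or the compiled extension is not available: numpy fallback
        self.dirichlet_expectation_1d = matutils.dirichlet_expectation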
Great work @arlenk!
Can you show benchmark results please (LdaModel/LdaMulticore with the numpy/scipy functions vs LdaModel/LdaMulticore with the cythonized functions)?
Also, please add tests for your cython functions (so we can trust that everything works correctly).
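For example, a minimal test along these lines could compare the cython output against the existing numpy implementation (module and function names follow this PR's description and may differ in the final code; pytest is used here only for brevity):

import numpy as np
import pytest
from gensim import matutils

ldamodel_inner = pytest.importorskip("gensim.models.ldamodel_inner")

@pytest.mark.parametrize("dtype, func_name", [
    (np.float32, "dirichlet_expectation_2d_f32"),
    (np.float64, "dirichlet_expectation_2d_f64"),
])
def test_dirichlet_expectation_matches_numpy(dtype, func_name):
    fast = getattr(ldamodel_inner, func_name)
    alpha = np.random.RandomState(0).gamma(2.0, 2.0, size=(5, 10)).astype(dtype)
    # reference: the existing (slow) numpy implementation, computed in float64
    expected = matutils.dirichlet_expectation(alpha.astype(np.float64))
    # assumes the cython version returns a new array rather than working in place
    np.testing.assert_allclose(fast(alpha), expected, rtol=1e-4)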
gensim/models/ldamodel.py
Outdated
@@ -42,7 +42,7 @@
from collections import defaultdict

from gensim import interfaces, utils, matutils
from gensim.matutils import dirichlet_expectation
from gensim.matutils import dirichlet_expectation as dirichlet_expectation
Why import X as X?
gensim/models/ldamodel.py
Outdated
try:
    # try to load fast, cythonized code if possible
    from gensim.models.ldamodel_inner import dirichlet_expectation_1d_f32, dirichlet_expectation_1d_f64
ping @piskvorky: strict typing improves performance, but not by much. wdyt about distinct functions for each type?
Up to you. I'd be mindful of the added complexity -- what sort of performance difference are we talking about?
@arlenk can you compare these variants?
It will take me a few days, but I will include some benchmark results.
The typed versions (_f32 and _f64) are not entirely for performance reasons, though. The C code needs to be different for floats vs. doubles. I used cython's fused types to remove some of the duplication, but I don't think there's a way to have python call a single function that dispatches (statically) to the correct float vs. double cython function. We could check the dtypes at run time, but that does impose a pretty large penalty given how often these functions are called (e.g., dirichlet_expectation is called millions of times even for a small corpus).
I can include a benchmark with the run-time dtype check to see how big the impact is, though.
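A toy micro-benchmark of just the per-call dtype check (not this PR's benchmark; direct() below is only a stand-in for a typed cython entry point) gives a feel for the overhead being discussed:

import timeit
import numpy as np

alpha = np.ones(100, dtype=np.float32)

def direct(x):
    # stand-in for calling the f32-typed cython function directly
    return x

def dispatched(x):
    # extra work a generic wrapper would do on every single call
    if x.dtype == np.float32:
        return direct(x)
    raise TypeError("unsupported dtype: %s" % x.dtype)

print("direct:    ", timeit.timeit(lambda: direct(alpha), number=1000000))
print("dispatched:", timeit.timeit(lambda: dispatched(alpha), number=1000000))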
gensim/models/ldamodel.py
Outdated
try:
    # try to load fast, cythonized code if possible
    from gensim.models.ldamodel_inner import dirichlet_expectation_1d_f32, dirichlet_expectation_1d_f64
    from gensim.models.ldamodel_inner import dirichlet_expectation_2d_f32, dirichlet_expectation_2d_f64
You can import all of this with one import statement using line breaks (no need to repeat from X import several times).
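For example (same names as in the diff above), something like:

try:
    # one import statement, split across lines
    from gensim.models.ldamodel_inner import (
        dirichlet_expectation_1d_f32, dirichlet_expectation_1d_f64,
        dirichlet_expectation_2d_f32, dirichlet_expectation_2d_f64,
    )
except ImportError:
    pass  # fall back to the slow numpy versions, as elsewhere in this PR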
gensim/models/ldamodel.py
Outdated
    """
    return np.mean(np.abs(a - b))

try:
It's better to move this section to the head of the file (after the import section).
Not sure how to restructure this yet. The except clause needs to happen after the "slow" versions of logsumexp and mean_absolute_difference are defined, so that's why the code is where it is right now. I initially included the definition of the slow versions in the except block as well (mirroring what word2vec does), but that breaks if anyone uses the float16 dtype, which requires the slow functions even if the cython versions exist (since the cython versions can only handle float32 and float64).
It might make things easier to just move the "inner" functions (logsumexp, dirichlet_expectation, and mean_absolute_difference) into their own python module and have that module import the cython versions if available.
Open to any other suggestions as well.
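A sketch of that module layout (the file name and the *_f32/*_f64 names below are assumptions for illustration, not the actual files in this PR):

# hypothetical gensim/models/lda_inner.py
import numpy as np

def logsumexp(x):
    # slow numpy fallback; also the only option for float16 input
    x_max = np.max(x)
    return np.log(np.sum(np.exp(x - x_max))) + x_max

def mean_absolute_difference(a, b):
    # slow numpy fallback
    return np.mean(np.abs(a - b))

try:
    # replace only the float32/float64 entry points when the compiled
    # extension is available; the fallbacks above stay importable for float16
    from gensim.models.ldamodel_inner import (
        logsumexp_f32, logsumexp_f64,
        mean_absolute_difference_f32, mean_absolute_difference_f64,
    )
except ImportError:
    logsumexp_f32 = logsumexp_f64 = logsumexp
    mean_absolute_difference_f32 = mean_absolute_difference_f64 = mean_absolute_difference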
Ok, let's leave it as is for now.
gensim/models/ldamodel.py
Outdated
mean_absolute_difference_f32 = mean_absolute_difference_f64 = mean_absolute_difference

def _attach_inner_methods(self, dtype):
Probably better to define the needed functions at module level (not "patch" LdaModel at runtime); this avoids problems with backward compatibility.
That was my original intention (and that's what the original "fastldamodel.py" code did), but I am not sure how to make this work nicely with LdaModel, since a user could create multiple models with different dtypes. E.g.,
model0 = LdaModel(corpus=corpus, id2word=corpus.id2word, num_topics=10, dtype=np.float16)
model1 = LdaModel(corpus=corpus, id2word=corpus.id2word, num_topics=10, dtype=np.float32)
model2 = LdaModel(corpus=corpus, id2word=corpus.id2word, num_topics=10, dtype=np.float64)
Each of those objects would need to use a different implementation of the inner functions (unless we dispatched at run time based on dtypes, but as mentioned before, I think that would have a decent impact on speed).
Agree, you depend on the dtype of the LDA model.
Aha, you are right, dtype is a really fresh feature (for LDA), I forgot about it.
Ok, probably that variant is OK (another variant: add this as a method to LDA and call it in __init__).
I believe we should decide what dtype makes the most sense, and stick with it (drop support for all other dtypes).
It's not a parameter we can expose and ask users to choose for themselves (especially since we give no guidance, no pros/cons/recommendations about the implications of this choice).
@piskvorky probably float32 is the best choice (in terms of speed, memory consumption and quality).
I agree, but would like to see some numeric arguments (benchmarks, model quality & performance evals) to back that up.
Unless the other dtypes offer some dramatically different tradeoffs, I see no reason to complicate the code and support them.
Removing the other dtypes would certainly make this change easier.
I was running some tests with the different dtypes, and float16 (one of the allowed options in the existing code) seems to over/underflow with some reasonable input data. For example, the following code (with imports added for completeness) will crash with an overflow error:
import numpy as np
import gensim
import gensim.downloader as api
from gensim.models import LdaModel

np.seterr(all='raise')
text = api.load('text8')
id2word = gensim.corpora.Dictionary(text)
corpus = [id2word.doc2bow(t) for t in text]
LdaModel(corpus=corpus, id2word=id2word, num_topics=10, dtype=np.float16)
gensim/models/_fastldamodel.pyx
Outdated
@@ -0,0 +1,256 @@
from __future__ import division
Is this file still needed now (after the last changes, where you moved all of the "slow" functions to cython)?
Also, the .pyd file (Windows-compiled) should not be here (only .pyx and .c).
no, not needed. Left it in just for comparison, but removed it now.
setup.py
Outdated
@@ -13,7 +13,7 @@
import os
import sys
import warnings

import numpy as np
That's really weird, please don't use numpy in setup.py (this is the reason all the CI is broken).
Ok, removed this. I didn't realize that you could include the numpy header files for the cython code without importing numpy directly.
Here are the results for a couple of different settings (only using LdaModel for now, not LdaMulticore): the cython code takes around 60% of the time of the current code. Here's the code used for the benchmarks:
@arlenk your results look great! Thanks for benchmarking (but please post it as a table in this PR, for history). Can you run the same benchmark for the different dtype cases? Also, don't forget about tests (these are still missing).
@arlenk big thanks for the concrete example 👍, I reproduced this problem (stack trace attached).
@arlenk generally, we need to investigate why the "underflow" happens, fix it, and keep only this data type. Thanks for your great work! Do you have any ideas about why the underflow problem happens? Probably this is a flaw in the algorithm's design, but I have no idea how to debug this.
I don't think it's an algorithm issue, but I will try digging into this some more. In this particular case, I don't think it's necessarily a problem. The underflowed value is later used in a dot product, so the term involving the underflowed value is going to be ~zero in any case. If all the values in the dot product were underflows, that would be an issue, but I don't think having a few underflows will break the algorithm. Having said that, it still may make sense to default to float64 to avoid underflows (I have not run into any yet with the float64 dtype). Float32 does save memory, of course, but that could still be left as an option for anyone who is really running low on memory.
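A toy illustration of that point (plain numpy, not the actual LDA code path):

import numpy as np

small = np.exp(np.float64(-20.0))   # ~2.06e-09, fine in float64
print(np.float16(small))            # 0.0 -- below float16's smallest subnormal (~6e-08)

# the corresponding term contributes almost nothing to a dot product either way
v64 = np.array([0.5, 0.25, small], dtype=np.float64)
v16 = v64.astype(np.float16)
w = np.ones(3)
print(np.dot(v64, w), np.dot(v16, w))   # nearly identical results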
Thank you @arlenk!
Great job @arlenk 🔥
setup.py
Outdated
@@ -258,7 +257,10 @@ def finalize_options(self):
include_dirs=[model_dir]),
Extension('gensim.models.fasttext_inner',
    sources=['./gensim/models/fasttext_inner.c'],
    include_dirs=[model_dir])
    include_dirs=[model_dir]),
Extension('gensim.models.ldamodel_inner',
Don't forget to modify MANIFEST.in in the root of the repo (add your files).
setup.py
Outdated
include_dirs=[model_dir])
include_dirs=[model_dir]),
Extension('gensim.models.ldamodel_inner',
    sources=['./gensim/models/ldamodel_inner.c'],
Maybe a better name for it is _matutils (because this isn't a model at all, only concrete math functions)?
Also, maybe move this file to the same level as matutils.py? wdyt?
gensim/models/ldamodel.py
Outdated
@@ -56,32 +55,26 @@
np.float64: 1e-100,
}

try:
We can move this (the try/except section) to matutils too.
@arlenk please resolve the merge conflict & make the suggested fixes. I think we won't fix the underflow problem any time soon; what prevents us from merging this change now?
Ok, I made the suggested changes. I also removed the use of dirichlet_expectation_1d and _2d and instead just added a cython version of dirichlet_expectation() that calls the _1d or _2d version based on the input shape. It turns out, after some profiling, that the added ndim check adds less than a few percent of overhead, and having a single dirichlet_expectation function simplifies the code. I think this also means that hdpmodel and atmodel should benefit from the changes as well (as they call matutils.dirichlet_expectation).
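A rough numpy sketch of that wrapper (the _1d/_2d bodies below are the slow reference versions, standing in for the cython entry points):

import numpy as np
from scipy.special import psi

def dirichlet_expectation_1d(alpha):
    # E[log theta] for a single Dirichlet parameter vector
    return psi(alpha) - psi(np.sum(alpha))

def dirichlet_expectation_2d(alpha):
    # row-wise E[log theta] for a matrix of Dirichlet parameters
    return psi(alpha) - psi(np.sum(alpha, 1))[:, np.newaxis]

def dirichlet_expectation(alpha):
    # single public entry point; the ndim check adds only a few percent of overhead
    if alpha.ndim == 1:
        return dirichlet_expectation_1d(alpha)
    return dirichlet_expectation_2d(alpha)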
@arlenk can you fix the tests please? (you forgot to update the tests according to what was removed from gensim.models.LdaModel)
@arlenk UPD, I'll do it myself, don't worry (anyway, I need to fix the docstrings)
@arlenk beautiful! You made a very targeted and important change, which seriously improves the performance of LDA (and related models that use LDA as the base model). Both of your PRs (MmReader and this one) will be our main features in the next release 💥 Would you like to continue with gensim cythonization? We would be very happy if you have the time and the wish for it ❤️
Awesome job! Thanks so much @arlenk.
Good to hear. The LDA model is what I am most familiar with, but if there are other bottlenecks that have come up in the past, let me know and I am happy to take a look.
@arlenk I have several optimization tasks.
The last one needs a really long time, but the first looks simple and does not require much time.
Attempt at moving some of the hotspots for ldamodel into a cython module. From some profiling, most of the time in ldamodel.LdaModel is spent in the following few spots:
dirichlet_expectation (specifically the call to psi())
calculating the mean absolute difference to determine when gamma has converged
logsumexp (even with the inline version)
I broke these functions out into a cython module. For now, I just created a fastldamodel.py module (a copy of ldamodel.py but with the cython functions above) to see if there is any interest in pursuing this. Based on some profiling on my PC (attached), the speed-up is about 100% (meaning the new code takes half the time -- though this is dependent on the number of topics, number of passes, etc.).
I also noticed that the current ldamodel code uses the float32 datatype, but the user can change this if desired. If we go down the cython route, we'd have to decide if we want to allow the user to change the datatype, as this would require different cython functions for each data type (I believe). Alternatively, we could use the cython code when the datatype is float32 (and perhaps float64) and fall back to the slower versions otherwise.
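For anyone who wants to reproduce that kind of measurement, a minimal cProfile sketch along these lines (with a small synthetic corpus purely for illustration, not the data behind the numbers above) shows where the time goes:

import cProfile
import pstats
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# tiny synthetic corpus, just to keep the sketch self-contained
rng = np.random.RandomState(0)
texts = [["w%d" % rng.randint(500) for _ in range(40)] for _ in range(1000)]
id2word = Dictionary(texts)
corpus = [id2word.doc2bow(t) for t in texts]

cProfile.runctx(
    "LdaModel(corpus=corpus, id2word=id2word, num_topics=20, passes=2)",
    globals(), locals(), "lda.prof")
pstats.Stats("lda.prof").sort_stats("cumulative").print_stats(20)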