Fix memory consumption in AuthorTopicModel #2122
Conversation
gensim/models/atmodel.py (outdated)

```diff
-        train_corpus_idx.extend(doc_ids)
+# Collect all documents of authors.
+for doc_ids in self.author2doc.values():
+    train_corpus_idx.extend(doc_ids)
```
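For context (the excerpt above only shows a few lines), the code being replaced extended `train_corpus_idx` inside a loop nested over all authors. A minimal, self-contained sketch of the before/after behavior, using a toy `author2doc` mapping instead of the actual model state:

```python
# Toy stand-in for AuthorTopicModel.author2doc: author -> list of doc ids.
author2doc = {"alice": [0, 1], "bob": [1, 2], "carol": [2, 3]}

# Before (reconstructed from the diff context, not verbatim gensim source):
# the inner loop runs once per author, so every author's doc id list is
# copied len(author2doc) times before deduplication.
old_idx = []
for _author in author2doc:
    for doc_ids in author2doc.values():
        old_idx.extend(doc_ids)
old_idx = list(set(old_idx))  # dedupe only after the blow-up

# After: each author's doc id list is visited exactly once.
new_idx = []
for doc_ids in author2doc.values():
    new_idx.extend(doc_ids)
new_idx = list(set(new_idx))

assert sorted(old_idx) == sorted(new_idx)  # same result, far less memory
```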
Why keep `train_corpus_idx` as a `list` in the first place? It looks like it's converted to a set right below, so maybe better to make it a `set` from the start?
That is true, indeed. I would propose something like this:

```python
# Train on all documents of authors in input_corpus.
train_corpus_idx = set()

# Collect all documents of authors.
for doc_ids in self.author2doc.values():
    train_corpus_idx.update(doc_ids)

train_corpus_idx = list(train_corpus_idx)
```
I'm not familiar with the algo, but for the sake of reproducibility, I'd replace `list(train_corpus_idx)` with `sorted(train_corpus_idx)`. That will remove randomness from the ordering of values, while still producing a list.
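Combining both suggestions, the block would end up roughly like this (a sketch using the names from the diff, where `self.author2doc` maps each author to their document ids):

```python
# Train on all documents of authors in input_corpus.
train_corpus_idx = set()

# Collect all documents of authors, deduplicating as we go.
for doc_ids in self.author2doc.values():
    train_corpus_idx.update(doc_ids)

# Sort for a deterministic, reproducible ordering -- still a list.
train_corpus_idx = sorted(train_corpus_idx)
```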
Nice! The original code looks strange indeed -- @olavurmortensen any particular reason for that nested quadratic loop?
I updated the loop to use a set and to sort the resulting list. At least on my current dataset this is also a little bit faster :), and it reduces the temporary memory overhead even further.
I can't imagine there was a reason for that nested loop; it must have just slipped my mind. I only tested scalability w.r.t. running time; if I'd tested memory consumption as well, I would have caught this. There is some material in my thesis about the asymptotic complexity of memory consumption (pdf, section 2.4.2.6). The algorithm doesn't scale terribly well w.r.t. memory consumption. The empirical results showed that running time scaled as expected compared to the theoretical analysis, but as I said, I didn't test memory consumption. I'm glad this problem was caught; hopefully it fixes the issues people are having. Thanks @philipphager.
No problem! And awesome job on the implementation of the `AuthorTopicModel`, @olavurmortensen; the API is an absolute pleasure to work with 👍! And thank you both for being so responsive on this PR :)!
Thanks @philipphager. What happens next: we'll wait for @menshikh-iv to come back from holiday, so he can review and merge this 👍
Actually, let me merge right away. The fix is simple enough that hopefully @menshikh-iv won't be angry :)
@menshikh-iv I submitted this PR to address the memory consumption issue faced when using the `AuthorTopicModel`, as also recognized in #1947. We had to fix this issue for a research project, and maybe you are interested in using this simple fix in the main project. Concatenating the entire corpus for every author and then removing duplicates (which yields all unique values in the corpus anyway) can be reduced to the single loop in this diff, and the results did not change. This fix reduced the memory consumption on our project with ≈400,000 docs from 32 GB to 2 GB for the entire duration of the training.