
Fix memory consumption in AuthorTopicModel #2122

Merged · 3 commits · Jul 13, 2018

Conversation

@philipphager (Contributor) commented Jul 7, 2018

@menshikh-iv
I submitted this PR to address the memory consumption issue encountered when using the AuthorTopicModel, also reported in #1947. We had to fix this for a research project, and you may be interested in using this simple fix in the main project. Concatenating the document lists of all authors and only then removing duplicates (which yields every unique value in the corpus) can be reduced to the loop below, and the results did not change.

This fix reduced the memory consumption on our project (≈400,000 documents) from 32 GB to 2 GB for the entire duration of training.

```python
# Collect all documents of authors.
for doc_ids in self.author2doc.values():
    train_corpus_idx.extend(doc_ids)
```
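The effect of the fix can be sketched with toy data (the `author2doc` mapping below is made up, not from gensim, and `self.` is dropped to keep the example standalone): one `extend` per author replaces the old concatenate-everything approach, and duplicates are removed in a single pass afterwards.

```python
# Toy sketch of the fixed collection loop (hypothetical data).
author2doc = {
    "alice": [0, 1, 2],
    "bob": [1, 2, 3],   # doc ids shared between authors appear more than once
    "carol": [0, 3],
}

train_corpus_idx = []
for doc_ids in author2doc.values():
    train_corpus_idx.extend(doc_ids)   # one extend per author, no nested loop

# Duplicates are removed in a single pass afterwards.
train_corpus_idx = list(set(train_corpus_idx))
assert sorted(train_corpus_idx) == [0, 1, 2, 3]
```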
Owner commented:

Why keep `train_corpus_idx` as a list in the first place? It looks like it's converted to a set right below, so maybe better to make it a set from the start?

Contributor Author replied:

That is true, indeed. I would propose something like this:

```python
# Train on all documents of authors in input_corpus.
train_corpus_idx = set()
# Collect all documents of authors.
for doc_ids in self.author2doc.values():
    train_corpus_idx.update(doc_ids)

train_corpus_idx = list(train_corpus_idx)
```
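A quick toy check of this set-based variant (the `author2doc` data is made up): `set.update` drops duplicates as they arrive, so the temporary structure never grows beyond the number of unique document ids.

```python
# Toy check of the set-based collection (hypothetical data, self. dropped).
author2doc = {"alice": [0, 1, 2], "bob": [1, 2, 3]}

train_corpus_idx = set()
for doc_ids in author2doc.values():
    train_corpus_idx.update(doc_ids)   # duplicates removed immediately

train_corpus_idx = list(train_corpus_idx)
assert sorted(train_corpus_idx) == [0, 1, 2, 3]
```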

Owner replied:

I'm not familiar with the algo, but for the sake of reproducibility, I'd replace `list(train_corpus_idx)` with `sorted(train_corpus_idx)`. That will remove randomness from the ordering of values, while still producing a list.
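A toy illustration of the reproducibility point (values made up): `sorted` always yields the same ordering, whereas the iteration order of a set is an implementation detail and can vary across runs.

```python
# sorted() gives a deterministic list regardless of set iteration order.
doc_ids = {2, 0, 3, 1}
assert sorted(doc_ids) == [0, 1, 2, 3]
```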

@piskvorky (Owner) commented:

Nice! The original code looks strange indeed -- @olavurmortensen any particular reason for that nested quadratic loop?

@philipphager (Contributor Author) commented:

I updated the loop to use a set and to sort the resulting list. At least on my current dataset this is also a little bit faster :), and reduces the temporary memory overhead even further.

@olavurmortensen (Contributor) commented:

I can't imagine there was a reason for that nested loop; it must have just slipped my mind. I only tested scalability with respect to running time; if I had tested memory consumption as well, I would have caught this.

There is some stuff in my thesis about asymptotic complexity of memory consumption (pdf, section 2.4.2.6). The algorithm doesn't scale terribly well w.r.t. memory consumption. The empirical results showed that running time scaled as expected, compared to the theoretical scalability, but as I said I didn't test memory consumption.

I'm glad this problem was caught, hopefully it fixes the issues people are having. Thanks @philipphager.

@philipphager (Contributor Author) commented:

No problem! And also, awesome job done on the implementation of the AuthorTopicModel @olavurmortensen, the API is an absolute pleasure to work with 👍 ! And thank you both for being so responsive on this PR :)!

@piskvorky (Owner) commented:

Thanks @philipphager . What happens next: we'll wait for @menshikh-iv to come back from holiday, so he can review and merge this 👍

@piskvorky (Owner) commented:

Actually, let me merge right away. The fix is simple enough that hopefully @menshikh-iv won't be angry :)

@piskvorky piskvorky merged commit 96444a7 into piskvorky:develop Jul 13, 2018