Fix memory consumption in AuthorTopicModel #2122
Conversation
gensim/models/atmodel.py (outdated)

```diff
-        train_corpus_idx.extend(doc_ids)
+# Collect all documents of authors.
+for doc_ids in self.author2doc.values():
+    train_corpus_idx.extend(doc_ids)
```
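For context (the excerpt above only shows a few lines), the code being replaced extended `train_corpus_idx` inside a loop nested over all authors. A minimal, self-contained sketch of the before/after behavior, using a toy `author2doc` mapping instead of the actual model state:

```python
# Toy stand-in for AuthorTopicModel.author2doc: author -> list of doc ids.
author2doc = {"alice": [0, 1], "bob": [1, 2], "carol": [2, 3]}

# Before (reconstructed from the diff context, not verbatim gensim source):
# the inner loop runs once per author, so every author's doc id list is
# copied len(author2doc) times before deduplication.
old_idx = []
for _author in author2doc:
    for doc_ids in author2doc.values():
        old_idx.extend(doc_ids)
old_idx = list(set(old_idx))  # dedupe only after the blow-up

# After: each author's doc id list is visited exactly once.
new_idx = []
for doc_ids in author2doc.values():
    new_idx.extend(doc_ids)
new_idx = list(set(new_idx))

assert sorted(old_idx) == sorted(new_idx)  # same result, far less memory
```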
Why keep `train_corpus_idx` as a `list` in the first place? It looks like it's converted to a set right below, so maybe better to make it a `set` from the start?
That is true, indeed. I would propose something like this:

```python
# Train on all documents of authors in input_corpus.
train_corpus_idx = set()

# Collect all documents of authors.
for doc_ids in self.author2doc.values():
    train_corpus_idx.update(doc_ids)

train_corpus_idx = list(train_corpus_idx)
```
I'm not familiar with the algo, but for the sake of reproducibility, I'd replace `list(train_corpus_idx)` with `sorted(train_corpus_idx)`. That will remove randomness from the ordering of values, while still producing a list.
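Combining both suggestions, the block would end up roughly like this (a sketch using the names from the diff, where `self.author2doc` maps each author to their document ids):

```python
# Train on all documents of authors in input_corpus.
train_corpus_idx = set()

# Collect all documents of authors, deduplicating as we go.
for doc_ids in self.author2doc.values():
    train_corpus_idx.update(doc_ids)

# Sort for a deterministic, reproducible ordering -- still a list.
train_corpus_idx = sorted(train_corpus_idx)
```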
Nice! The original code looks strange indeed -- @olavurmortensen any particular reason for that nested quadratic loop?
I updated the loop to use a set and to sort the resulting list. At least on my current dataset this is also a little bit faster :), and it reduces the temporary memory overhead even further.
I can't imagine there was a reason for that nested loop; it must have just slipped my mind. I only tested scalability w.r.t. running time; if I'd tested memory consumption as well, I would have caught this. There is some material in my thesis about the asymptotic complexity of memory consumption (pdf, section 2.4.2.6). The algorithm doesn't scale terribly well w.r.t. memory consumption. The empirical results showed that running time scaled as expected compared to the theoretical analysis, but as I said, I didn't test memory consumption. I'm glad this problem was caught; hopefully it fixes the issues people are having. Thanks @philipphager.
No problem! And awesome job on the implementation of the `AuthorTopicModel`, @olavurmortensen; the API is an absolute pleasure to work with 👍! And thank you both for being so responsive on this PR :)!
Thanks @philipphager. What happens next: we'll wait for @menshikh-iv to come back from holiday, so he can review and merge this 👍
Actually, let me merge right away. The fix is simple enough that hopefully @menshikh-iv won't be angry :)
@menshikh-iv I submitted this PR to address the memory consumption issue faced when using the `AuthorTopicModel`, as also recognized in #1947. We had to fix this issue for a research project, and maybe you are interested in using this simple fix in the main project. Concatenating the entire corpus for every author and then removing duplicates (which yields all unique values in the corpus anyway) can be reduced to the single loop in this diff, and the results did not change. This fix reduced the memory consumption on our project with ≈400,000 docs from 32 GB to 2 GB for the entire duration of the training.