Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LsiModel.docs_processed attribute #763

Merged
merged 17 commits into from
Jun 30, 2016
Merged

LsiModel.docs_processed attribute #763

merged 17 commits into from
Jun 30, 2016

Conversation

hobson
Copy link
Contributor

@hobson hobson commented Jun 29, 2016

unittests in test_lsimodel.py and test_corpora.py
works for test MmCorpus in unittests
also works for large custom corpora in production app

@hobson hobson changed the title Populates unused LsiModel.docs_processed attribute LsiModel.docs_processed attribute Jun 29, 2016
@@ -395,6 +396,7 @@ def add_documents(self, corpus, chunksize=None, decay=None):
if self.dispatcher:
logger.info("reached the end of input; now waiting for all remaining jobs to finish")
self.projection = self.dispatcher.getstate()
self.docs_processed += len(corpus) if hasattr(corpus, '__len__') else doc_no
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we always use doc_no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I guess so, for this line.

@@ -16,14 +16,17 @@

import numpy

from gensim.utils import to_unicode, smart_extension
from gensim.utils import to_unicode # , smart_extension
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove import if no longer needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Side Note: There are many more unused imports throughout gensim. They can be dangerous to remove, though, for someone like me unfamiliar with the internals of those packages being imported. For example import seaborn has side-effects, and obviously from future import division does too.

@piskvorky
Copy link
Owner

Nice! These type of fixes are very valuable.

What was your motivation for this PR @hobson ?

@hobson
Copy link
Contributor Author

hobson commented Jun 30, 2016

docs_processed += getattr(corpus, '__len__', int)()

Seems less explicit, less pythonic to me, but happy to do it if you like.

I was debugging my training on an iterable QuerySetCorpus class for a
variable-length database query result. The existing attribute was always
zero, confusing. Need to know exactly how many records trained my model.

On Wed, Jun 29, 2016, 8:55 PM Radim Řehůřek notifications@github.com
wrote:

Nice! These type of fixes are very valuable.

What was your motivation for this PR @hobson https://github.com/hobson ?


You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub
#763 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AAR39z6KYm-0b1wlOw_ZjalaL2kPmaMiks5qQz4-gaJpZM4JBewY
.

@tmylk
Copy link
Contributor

tmylk commented Jun 30, 2016

Agree.
Thanks for the PR!

@tmylk
Copy link
Contributor

tmylk commented Jun 30, 2016

Just a line in the CHANGELOG and then will merge.

@tmylk tmylk merged commit 9f7fee2 into piskvorky:develop Jun 30, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants