Add word2vec.PathLineSentences for reading a directory as a corpus (#1364) #1423
Conversation
added models.word2vec.LineSentencePath to read an entire directory's files in the same style as models.word2vec.LineSentence
initial attempt at test, including files. The test splits the lee_background.cor file into two parts, puts them in a directory, then makes sure they match the unsplit file as loaded by word2vec.LineSentence
no longer sensitive to an input without a trailing os-specific slash
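For reference, the trailing-slash normalization this last commit relies on can be seen directly (paths below are hypothetical):

import os

# os.path.join(path, '') appends the os-specific separator only when it is missing,
# so inputs with and without a trailing slash normalize to the same string
print(os.path.join('/data/corpus', ''))   # '/data/corpus/'
print(os.path.join('/data/corpus/', ''))  # '/data/corpus/'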
Thanks for the contrib! The failing automatic check looks like some style issues - you can click the 'Details' link or red 'X' to view the failure logs for more hints. More substantive comments inline.
gensim/models/word2vec.py (Outdated)
@@ -1521,6 +1521,54 @@ def __iter__(self):
                i += self.max_sentence_length


class LineSentencePath(object):
Personally I'd consider the name PathLineSentences more typical and descriptive, but others may have an even better name.
Thank you for taking the time to comment. I'll get to these next week.
Change made, will be reflected in next pull request
gensim/models/word2vec.py (Outdated)
        self.limit = limit

        try:
            self.source = os.path.join(source, '')  # ensures os-specific slash is at end of path
Thoughts on coverage of all related needs:
- perhaps this should accept a path to a single file, too, and still work in that case?
- by deferring the actual resolution of initialization parameters to the beginning of __iter__(), the object might be more robust for cases where files are arriving in the target directory between instantiation and first iteration. OTOH, that would also mean repeated iterations (as in the common Word2Vec/Doc2Vec multi-pass training) could find different files each time. No strong opinion yet on which approach is better, just pointing out the choice (see the sketch after this list).
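For illustration only, a minimal sketch of the deferred-resolution variant described in the second point (the class name follows the PathLineSentences suggestion above; none of this is the merged gensim code):

import os

class PathLineSentences(object):
    def __init__(self, source):
        self.source = source  # no listing yet; the directory only needs to exist at iteration time

    def __iter__(self):
        # the directory is listed on every pass, so files that arrive after instantiation
        # are picked up -- but multi-pass training may see a different file set each epoch
        for fname in sorted(os.listdir(self.source)):
            with open(os.path.join(self.source, fname)) as fin:
                for line in fin:
                    yield line.split()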
Change made to accept a single file and still work, including an additional test case.
I think it is better to resolve the initialization parameters in __init__(). While there could be some use in not requiring the files to all be present when the object is initialized, I think that changing the set of files processed every time a new iteration starts is likely to cause confusion. It seems more natural for the default behavior to be to get a list of files once and not change it for as long as the object is used. This also matches the behavior of LineSentence--if you change the contents of the file between iterations you'll get different results, but you can't change the reference to the file after the object has been created.
I would personally be caught off guard if the files changed between iterations. While this could be useful in some cases, I think it is a risky default behavior. Adding a capability to do this may make sense, but I'd rather not do that unless a compelling use case is presented.
What I've done instead is log, at the info level, the list of files read when the object is created, so there's an explicit record of what the object is reading.
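A rough sketch of the approach described here: resolution in __init__(), a single-file fallback, and an info-level record of the files found. This is simplified relative to the actual gensim implementation; the tokenization and limit handling below are assumptions.

import logging
import os

logger = logging.getLogger(__name__)

class PathLineSentences(object):
    def __init__(self, source, limit=None):
        self.limit = limit
        if os.path.isdir(source):
            base = os.path.join(source, '')  # normalize the trailing os-specific separator
            self.input_files = [base + name for name in sorted(os.listdir(base))]
        elif os.path.isfile(source):
            logger.warning('single file read, better to use models.word2vec.LineSentence')
            self.input_files = [source]
        else:
            raise ValueError('input is neither a file nor a directory')
        # explicit, fixed record of what will be read for the lifetime of the object
        logger.info('files read into PathLineSentences: %s', '\n'.join(self.input_files))

    def __iter__(self):
        for fname in self.input_files:
            with open(fname) as fin:
                for line_no, line in enumerate(fin):
                    if self.limit is not None and line_no >= self.limit:
                        break
                    yield line.split()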
gensim/test/test_word2vec.py (Outdated)
        """Does LineSentencePath work with a path argument?"""
        logging.debug(word2vec)
        with utils.smart_open(datapath('lee_background.cor')) as orig:
            sentences = word2vec.LineSentencePath(datapath('LineSentencePath'))
Perhaps rather than creating a custom new split (and duplication) of lee_background.cor, a new test-data subdirectory lee could be added, with just the lee.cor and lee_background.cor files. The test would make sure this new class on the directory yields the same set of docs as the two files read individually. And eventually, the duplication of these files in two places could be eliminated by making all the other tests/demos/tutorials grab the lee* files from this new canonical place. (That is: aim for less duplication in the long run by re-using the same files everywhere.)
I thought about this, but I think making a lee directory may be worse than having a special PathLineSentences directory.
There are already quite a few lee* files in test_data. I could move the two files you mentioned, but then we'd still have a bunch of lee files in the main test_data directory. Those could be moved as well, except some of them aren't compatible with PathLineSentences (.bin, .vec), which means they can't be moved without breaking the test. Then we would have lee* files in two places, which could lead to confusion later on. There is also the issue that something that makes sense to add to this lee directory in the future could break the test.
Given that the PathLineSentences class operates on a full directory (which makes it somewhat unique), I think it makes more sense to keep its test dependent on a directory that isn't overloaded for use with other tests.
What I will do is use smaller files that aren't duplicated.
Change made: put in a new tiny test corpus, and changed the test to read the files, combine them manually, and compare to PathLineSentences, rather than having a single file duplicating the contents of the PathLineSentences folder in test_data.
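A sketch of what such a test might look like (the directory path is a placeholder; the merged test resolves it through the test-data helpers):

import os

from gensim.models import word2vec


def test_path_line_sentences(corpus_dir='gensim/test/test_data/PathLineSentences'):
    # read every file in the directory individually and concatenate the sentences
    expected = []
    for fname in sorted(os.listdir(corpus_dir)):
        with open(os.path.join(corpus_dir, fname)) as fin:
            expected.extend(line.split() for line in fin)
    # PathLineSentences over the same directory should yield the same sentences
    actual = list(word2vec.PathLineSentences(corpus_dir))
    assert actual == expected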
in word2vec.py. Test updated as well
in models.word2vec. Tests also updated
had only 1 space before an inline comment, flagged by travis CI build
Removed LineSentencePath directory, created PathLineSentences: the lee corpus duplicates in LineSentencePath were wasting space; made a new small corpus to test PathLineSentences and put it in that directory; changed the test to read both files manually, combine them, and compare to PathLineSentences (rather than having a separate single file matching the entire contents of the PathLineSentences test_data directory).
changed PathLineSentences to support a single file in addition to a directory; logs a warning to use LineSentence when a single file is given as a parameter. added corresponding test.
I think I've addressed all the comments. Please re-review. I'm having some trouble figuring out what travis-ci is angry about. I did find an inline comment missing a whitespace and fixed that, but I think the other complaints are due to extra spaces in the [i : i +...] line, and that's elsewhere in the files. I also see errors with the new corpus file--not sure how to tell travis-ci not to treat those files as code. I'll review after the current style check is done and see if I can get to the bottom of things.
Never mind, looks like travis just checks my changes -- not code already in the repo. But I did fix another style issue elsewhere in the code. I think the style check should pass now.
resolved test_word2vec.py manually
Cleaning up my noob git branching fail; no more dangling branches that aren't actually getting used.
@gojomo or other maintainers--Any other changes? Any questions? Please let me know. Thank you.
Looks good to me, thank you @michaelwsherman, congrats on your first PR :1st_place_medal:
Sorry, I know I'm late with the review. Adding some minor code style comments. @menshikh-iv
        self.limit = limit

        if os.path.isfile(self.source):
            logging.warning('single file read, better to use models.word2vec.LineSentence')
Should be logger (each module has its own logger in gensim). Applies here and elsewhere in this PR.
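For reference, the module-level logger pattern being suggested looks roughly like this (a minimal sketch, not a quote from gensim):

import logging

logger = logging.getLogger(__name__)  # one logger per module, named after the module

logger.warning('single file read, better to use models.word2vec.LineSentence')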
            self.source = os.path.join(self.source, '')  # ensures os-specific slash at end of path
            logging.debug('reading directory ' + self.source)
            self.input_files = os.listdir(self.source)
            self.input_files = [self.source + file for file in self.input_files]  # make full paths
file is a built-in name in Python 2, so better not to shadow it. Use filename or something like that.
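A possible rewrite along those lines (source stands for the directory path from the snippet above; purely illustrative):

import os

source = '/path/to/corpus'  # hypothetical directory

# `filename` avoids shadowing the built-in, and os.path.join handles the separator
# instead of relying on a trailing slash
input_files = [os.path.join(source, filename) for filename in os.listdir(source)]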
        else:  # not a file or a directory, then we can't do anything with it
            raise ValueError('input is neither a file nor a path')

        logging.info('files read into PathLineSentences:' + '\n'.join(self.input_files))
Better to pass the formatting arguments as arguments: logger.info("%s", y) instead of logger.info("%s" % y). Here and elsewhere.
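Applied to the logging line above, the suggestion would look roughly like this (values are placeholders):

import logging

logger = logging.getLogger(__name__)
input_files = ['a.cor', 'b.cor']  # placeholder values

# formatting is deferred to the logging framework and only performed if the record is emitted
logger.info('files read into PathLineSentences: %s', '\n'.join(input_files))

# the eager alternative always builds the string, even if the log level filters it out:
# logger.info('files read into PathLineSentences:' + '\n'.join(input_files))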
""" | ||
`source` should be a path to a directory (as a string) where all files can be opened by the | ||
LineSentence class. Each file will be read up to | ||
`limit` lines (or no clipped if limit is None, the default). |
no => not.
Fixes from @piskvorky in PR #1573.
* initial commit of fixes in comments of #1423
* removed unnecessary space in logger
* added support for custom Phrases scorers
* fixed Phrases.__getitem__ to support pluggable scoring #1533
* travisCI style fixes
* fixed __next__() to next() for python 3 compatibility
* misc fixes
* spacing fixes for style
* custom scorer support in sklearn api
* Phrases scikit interface tests for pluggable scoring
* missing line breaks
* style, clarity, and robustness fixes requested by @piskvorky
* check in Phrases init to make sure scorer is pickleable
* backwards scoring compatibility when loading a Phrases class
* removal of pickle testing objects in Phrases init
* switched to six for python 2/3 compatibility
* fix docstring
* replace open->smart_open in annoy tutorial
* style fixes for lda model diff
* fix for piskvorky#1390
* fix for piskvorky#1423
* fix doc in Phrases
word2vec.LineSentencePath(path) will read all the files in a directory in the same fashion as word2vec.LineSentence reads a file. This provides an easy way to use a corpus of multiple files when training a word2vec model (or any model compatible with word2vec.LineSentence).
Minimal exception handling, but some logging.
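For illustration, usage with the class under its final name, PathLineSentences (the directory path is hypothetical and training hyperparameters are left at their defaults):

from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

# every file in the directory must be readable by LineSentence:
# one sentence per line, tokens separated by whitespace, already preprocessed
sentences = PathLineSentences('/path/to/corpus_dir')
model = Word2Vec(sentences)  # defaults; any hyperparameters would be passed here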