new options for load/intersect_word2vec_format #817
Conversation
@piskvorky @tmylk Some small improvements for loading word2vec.c-format vectors, and a tiny fix for an issue I hit when comparing with fastText. Please ack for merge before it drifts into conflict?
```diff
@@ -1120,7 +1135,8 @@ def add_word(word, weights):
             weights = fromstring(fin.read(binary_len), dtype=REAL)
             add_word(word, weights)
         else:
-            for line_no, line in enumerate(fin):
+            for line_no in xrange(vocab_size):
```
Why this change?
Without it, any user-specified `limit` cap on `vocab_size` won't be respected for `binary=False`-format reads. (Without a `limit`, the only way this would prevent a line from being read is if the top-of-file declared count is smaller than the following number of lines... and in such a case, reading exactly the declared number of vectors is still arguably a reasonable thing to do, and would also match the behavior of the `binary=True` branch.)
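The `limit` behavior discussed above can be sketched as a minimal text-format reader. This is a hypothetical helper, not gensim's actual code: it clamps the declared `vocab_size` by `limit` and then reads exactly that many lines, as the `xrange(vocab_size)` loop in the diff does.

```python
from io import StringIO

def read_text_vectors(fin, limit=None):
    """Read word2vec text-format vectors, honoring an optional `limit` cap.

    Hypothetical sketch: clamp the declared vocab_size by `limit`, then
    read exactly that many lines, raising EOFError if the file ends
    before the declared count is reached.
    """
    header = fin.readline()
    vocab_size, vector_size = (int(x) for x in header.split())
    if limit:
        vocab_size = min(vocab_size, limit)
    vectors = {}
    for _ in range(vocab_size):
        line = fin.readline()
        if line == '':
            raise EOFError("file ended before declared vocab_size was read")
        parts = line.rstrip().split(' ')
        word, weights = parts[0], [float(x) for x in parts[1:]]
        vectors[word] = weights
    return vectors

# usage on a tiny in-memory "file" declaring 3 vectors of dimension 2
data = StringIO("3 2\ncat 0.1 0.2\ndog 0.3 0.4\nfish 0.5 0.6\n")
vecs = read_text_vectors(data, limit=2)  # only the first 2 vectors are read
```

Note that with a plain `for line in fin` loop, the declared count is never consulted, which is why capping `vocab_size` (rather than the file iterator) is what makes `limit` effective here.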
Yes; I meant why the line split. `islice` & `enumerate` seem more idiomatic and also safer (in case `vocab_size > len(fin)`).
In the moment, the `xrange` approach was to mirror the way the binary branch worked, where changing `vocab_size` to `limit` already had the desired effect. Considering it against `islice`, it's probably better for a mismatch to generate an error than to just silently accept fewer than the declared count of vectors.

But thinking of this made me realize the code wasn't yet handling a `limit` greater than the available count (so that's now been fixed), and the `binary` path could reach a busy-hang if it hit EOF while trying to read its token char-by-char (so both it and the text path now recognize when a read returns nothing, indicating file-end-before-declared-count-reached, and raise an `EOFError`).
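The busy-hang mentioned above can occur because a char-by-char token loop that only checks for a delimiter never terminates once `read(1)` starts returning empty bytes at EOF. A minimal sketch of the fix, assuming a hypothetical `read_token` helper rather than gensim's actual code:

```python
from io import BytesIO

def read_token(fin):
    """Read a space-delimited token char-by-char from a binary stream.

    An empty read() means EOF; raising immediately avoids the busy-hang
    a delimiter-only loop would enter when the file ends before the
    declared vector count is reached. (Hypothetical sketch.)
    """
    chars = []
    while True:
        ch = fin.read(1)
        if ch == b'':
            raise EOFError("unexpected end of input; is the declared count too large?")
        if ch == b' ':
            break
        if ch != b'\n':  # skip newlines between binary records
            chars.append(ch)
    return b''.join(chars).decode('utf8')

# usage: the token ends at the first space
word = read_token(BytesIO(b"hello 0.1 0.2"))
```

Without the `ch == b''` check, a truncated file would loop forever, since `read(1)` at EOF returns `b''` rather than blocking or raising.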
The change is desirable and the code looks fine to me, thanks. Final decision and merge are up to @tmylk (I know he's planning a new release; I don't know whether there's a "merge freeze").
Resolved conflicts, squashed trivial commits and merged in 15ee57f
`load_word2vec_format`:

- `limit` to only read the 1st N vectors from the file
- `datatype` to force a smaller datatype (risky/slow but can save memory)

`intersect_word2vec_format`:

- `lockf` argument to set whether imported vectors are locked-against-changes (0.0) or not (1.0)

Also: tiny fix in `Text8Corpus`, ensuring the final chunk fragment gets the same unicode-treatment as all other created lines.
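The `lockf` semantics described above can be illustrated with a toy SGD step, where each word carries a lock factor that scales its gradient updates. This is a hypothetical sketch of the idea, not gensim's training loop:

```python
def sgd_update(vector, gradient, lock_factor, lr=0.025):
    """Apply one SGD step, scaled by a per-word lock factor.

    lock_factor 0.0 freezes the vector (imported vectors locked against
    change); 1.0 lets training move it freely. (Hypothetical sketch of
    the lockf semantics, not gensim's actual code.)
    """
    return [v - lr * g * lock_factor for v, g in zip(vector, gradient)]

# a locked vector is unchanged; an unlocked one moves with the gradient
locked = sgd_update([0.5, 0.5], [1.0, 1.0], lock_factor=0.0)
free = sgd_update([0.5, 0.5], [1.0, 1.0], lock_factor=1.0)
```

Passing `lockf=0.0` when intersecting thus keeps the imported vectors fixed during further training, while `lockf=1.0` lets them be fine-tuned alongside the rest of the vocabulary.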