new options for load/intersect_word2vec_format #817
Conversation
@piskvorky @tmylk Some small improvements for loading word2vec.c-format vectors, and a tiny fix for an issue I hit when comparing with fastText. Please ack for merge before it drifts into conflict?
```diff
@@ -1120,7 +1135,8 @@ def add_word(word, weights):
             weights = fromstring(fin.read(binary_len), dtype=REAL)
             add_word(word, weights)
         else:
-            for line_no, line in enumerate(fin):
+            for line_no in xrange(vocab_size):
```
Why this change?
Without it, any user-specified `limit` cap on `vocab_size` won't be respected for `binary=False`-format reads. (Without a `limit`, the only way this would prevent a line from being read is if the top-of-file declared count is smaller than the following number of lines... and in such a case, reading exactly the declared number of vectors is still arguably a reasonable thing to do, and would also match the behavior of the `binary=True` branch.)
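The `limit` behavior discussed above can be sketched as a minimal text-format reader. This is a hypothetical helper, not gensim's actual code: it clamps the declared `vocab_size` by `limit` and then reads exactly that many lines, as the `xrange(vocab_size)` loop in the diff does.

```python
from io import StringIO

def read_text_vectors(fin, limit=None):
    """Read word2vec text-format vectors, honoring an optional `limit` cap.

    Hypothetical sketch: clamp the declared vocab_size by `limit`, then
    read exactly that many lines, raising EOFError if the file ends
    before the declared count is reached.
    """
    header = fin.readline()
    vocab_size, vector_size = (int(x) for x in header.split())
    if limit:
        vocab_size = min(vocab_size, limit)
    vectors = {}
    for _ in range(vocab_size):
        line = fin.readline()
        if line == '':
            raise EOFError("file ended before declared vocab_size was read")
        parts = line.rstrip().split(' ')
        word, weights = parts[0], [float(x) for x in parts[1:]]
        vectors[word] = weights
    return vectors

# usage on a tiny in-memory "file" declaring 3 vectors of dimension 2
data = StringIO("3 2\ncat 0.1 0.2\ndog 0.3 0.4\nfish 0.5 0.6\n")
vecs = read_text_vectors(data, limit=2)  # only the first 2 vectors are read
```

Note that with a plain `for line in fin` loop, the declared count is never consulted, which is why capping `vocab_size` (rather than the file iterator) is what makes `limit` effective here.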
Yes; I meant why the line split. `islice` & `enumerate` seem more idiomatic and also safer (in case `vocab_size > len(fin)`).
In the moment, the `xrange` approach was to mirror the way the binary branch worked, where changing `vocab_size` to `limit` already had the desired effect. Considering it against `islice`, it's probably better for a mismatch to generate an error than to just silently accept fewer than the declared count of vectors.

But thinking of this made me realize the code wasn't yet handling a `limit` greater than the available count (so that's now been fixed), and the `binary` path could reach a busy-hang if it hit EOF while trying to read its token char-by-char (so both it and the text path now recognize when a read returns nothing, indicating file-end-before-declared-count-reached, and raise an `EOFError`).
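The busy-hang mentioned above can occur because a char-by-char token loop that only checks for a delimiter never terminates once `read(1)` starts returning empty bytes at EOF. A minimal sketch of the fix, assuming a hypothetical `read_token` helper rather than gensim's actual code:

```python
from io import BytesIO

def read_token(fin):
    """Read a space-delimited token char-by-char from a binary stream.

    An empty read() means EOF; raising immediately avoids the busy-hang
    a delimiter-only loop would enter when the file ends before the
    declared vector count is reached. (Hypothetical sketch.)
    """
    chars = []
    while True:
        ch = fin.read(1)
        if ch == b'':
            raise EOFError("unexpected end of input; is the declared count too large?")
        if ch == b' ':
            break
        if ch != b'\n':  # skip newlines between binary records
            chars.append(ch)
    return b''.join(chars).decode('utf8')

# usage: the token ends at the first space
word = read_token(BytesIO(b"hello 0.1 0.2"))
```

Without the `ch == b''` check, a truncated file would loop forever, since `read(1)` at EOF returns `b''` rather than blocking or raising.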
The change is desirable and the code looks fine to me, thanks. Final decision and merge are up to @tmylk (I know he's planning a new release; I don't know whether there's a "merge freeze").
Resolved conflicts, squashed trivial commits and merged in 15ee57f
`load_word2vec_format`:

- `limit` to only read the 1st N vectors from the file
- `datatype` to force a smaller datatype (risky/slow but can save memory)

`intersect_word2vec_format`:

- `lockf` argument to set whether imported vectors are locked-against-changes (0.0) or not (1.0)

Also: tiny fix in `Text8Corpus`, ensuring the final chunk fragment gets the same unicode-treatment as all other created lines.
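The `lockf` semantics described above can be illustrated with a toy SGD step, where each word carries a lock factor that scales its gradient updates. This is a hypothetical sketch of the idea, not gensim's training loop:

```python
def sgd_update(vector, gradient, lock_factor, lr=0.025):
    """Apply one SGD step, scaled by a per-word lock factor.

    lock_factor 0.0 freezes the vector (imported vectors locked against
    change); 1.0 lets training move it freely. (Hypothetical sketch of
    the lockf semantics, not gensim's actual code.)
    """
    return [v - lr * g * lock_factor for v, g in zip(vector, gradient)]

# a locked vector is unchanged; an unlocked one moves with the gradient
locked = sgd_update([0.5, 0.5], [1.0, 1.0], lock_factor=0.0)
free = sgd_update([0.5, 0.5], [1.0, 1.0], lock_factor=1.0)
```

Passing `lockf=0.0` when intersecting thus keeps the imported vectors fixed during further training, while `lockf=1.0` lets them be fine-tuned alongside the rest of the vocabulary.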