Ensure 2Vec classes (KeyedVectors, Word2Vec, Doc2Vec, FastText) support desired level of mmap support #2955

gojomo · 2020-09-23T19:44:43Z

There's been some useful ability of these classes to work from mmemapped underlying numpy arrrays - but such functionality is not deeply tested in unit-tests, and I thus strongly suspect it has regressed a bit from recent refactorings. (In particular, #2944 when applied removes any pretense that the still-present memmap_path parameter has any effect.)

We should inventory what the classes should be reasonably expected to do.

For example: in what cases should they be initializable to use mmapping from the get-go, or is it sufficient that they do so only when loaded from a prior save? Will they smartly maintain mmapping when undergoing some of the newer vocab-expansion options? That's a tricky thing to do efficiently - I think existing code has just ignored that case, which could just be documented as a limitation. Model classes have multiple underlying arrays that presumably could each be memmapped or not - do we expand method signatures to specify multiple options/paths, or just make it all-or-nothing per model, and in which cases are the mmapped paths specified explicitly ot just mechanistically calculated from a model's 'root name' (as with saving .npy files).

With a reasonable set of officially-supported capabilities decided, we should ensure tests cover exactly those cases, so we catch any current or future collateral damage.

The text was updated successfully, but these errors were encountered:

gojomo · 2020-10-06T08:39:58Z

The question I'm struggling with is whether we only support mmap when loading a saved model, via the SaveLoad mechanism/option. That's pretty easy.

But, starting in Doc2Vec to support one particular usage scenario – so many doc-vectors you'd rather do initial training/allocation from an mmapped file – there's also been another init option to specify a mmap file for this purpose from the start. And then for generic KeyedVectors to support that, it got an init option for that (though under-tested). And, maintaining that generic mmap-up-front option for any caller (when in fact most won't need it) was a little awkward after the cleanup in #2944.

So that up-front mmap feature needs to be fixed (with tests) or cut.

If I could think of a clear way to hack in the same benefit in the Doc2Vec case just when rarely needed, I'd then be very happy to cut the formal parameter/feature – & just put that hack approach as a recipe somewhere. Trying to figure what that recipe could be currently.

gojomo · 2020-10-10T22:52:11Z

After some thought, I think the right balance of functionality/simplicity/continuity-with-past-options would be following:

the main way any of these models would wind up with any memmapped arrays would be when they come from a previously-saved model, with separate non-compressed arrays, and loaded with SaveLoad.load(..., mmap=True). That's the oldest method in the code, and a narrow/clear way to get that effect.
but, for expert users who want a fresh model to start (& initially train) with memmapped backing arrays will get a new optional path parameter to the new allocate_model() half of build_vocab() inside the #2975 refactoring. So, while Doc2Vecwill no longer have adv_mapfileinitialization parameter, andKeyedVectors` might lose its mmap-related initialization parameter, expert users following a few-line code recipe will be able to replace the usual all-in-RAM structures with mmap-backed ones instead. (This recipe would appear or be linked from the migration notes about the removed init parameters.)

This leaves the most common use of memmapping (deploying frozen models) unchanged, and hides the rare use in some expert optional parameters to inside steps, that don't overcomplicate the __init__ parameters/code.

piskvorky · 2020-10-11T19:02:30Z

@gojomo to be clear – does SaveLoad.load(..., mmap=True) work now or not?

We're ready to release 4.0.0 rc1 / beta, just need clarity on this.

The motivation for the other new option, "for expert users who want a fresh model to start (& initially train) with memmapped backing array" is unclear to me and hopefully not blocking. I'd strongly prefer to split these two use-cases, and release beta as soon as we get the first one (in case we're not there yet).

gojomo · 2020-10-11T20:12:45Z

@gojomo to be clear – does SaveLoad.load(..., mmap=True) work now or not?

I don't think any prior work, or the work in #2944 that FIXME-comments out a newer KeyedVectors-specific mmapping option, will have impacted the SaveLoad mmap=True functionality. So I suspect it works, but the real check should be if any unit-tests exercising it are passing.

We're ready to release 4.0.0 rc1 / beta, just need clarity on this.

I really think #2944 should be merged before a test release, but also I'm still generally in favor of packaging & pushing a test/prerelease anytime all unit-tests are passing, as long as the caveats/limitations of such release are adequately described in the accompanying announcements. To the extent there are doubts about regressions or likely further changes, just be sure to call it an 'alpha' or 'beta' rather than 'release candidate'.

The motivation for the other new option, "for expert users who want a fresh model to start (& initially train) with memmapped backing array" is unclear to me and hopefully not blocking. I'd strongly prefer to split these two use-cases, and release beta as soon as we get the first one (in case we're not there yet).

These are "more recent" but not completely "new" options. (An equivalent of dv_mapfile for Doc2Vec goes way back, including in the Doc2Vec-specific KeyedVectors of 3.x versions.) But I don't think they were well-covered by unit-tests. I think that story/testing should be cleared up before 4.0.0-final, per my proposal above, but also per reasoning above, I don't think a properly-described optional 'beta'/etc prerelease need ever be blocked by anything.

(These sorts of general "history-of-4.0.0 release-stage decisions" discussions would work way better in one "4.0.0" issue, and then be easier to understand in retrospect years later, than scattered across many issues/PRs.)

piskvorky · 2020-10-11T21:12:07Z

OK thanks, I'll double check whether mmap=True works / is tested.

Of course, #2944 goes in too.

The question was about mmap = this ticket. General discussion can go elsewhere, yes.

piskvorky · 2021-01-18T18:55:10Z

@gojomo how do you feel about this ticket?

I'm in favour of keeping just the "simple variant" (mmap='r'), which I verified already works.

Is the "expert mode" something you're interested in adding / cleaning up – whether now for 4.0.0 or later otherwise?

piskvorky · 2021-02-25T07:24:32Z

@gojomo removing this ticket from the 4.0 milestone, but let me know if you'd like to get it in.

This was referenced Sep 23, 2020

[MRG] fix save_facebook_model failure after update-vocab & other initialization streamlining #2944

Merged

Consider: dropping custom SaveLoad in favor of Pickle v.5 or joblib.dump? #2848

Open

piskvorky mentioned this issue Oct 6, 2020

Adopting a (narrow) backward-compatibility standard; implications for 4.0.0 #2967

Open

piskvorky added this to the 4.0.0 milestone Oct 6, 2020

gojomo mentioned this issue Oct 13, 2020

[WIP] Implement position-dependent weighting to fastText #2905

Draft

17 tasks

piskvorky assigned gojomo Oct 16, 2020

piskvorky removed this from the 4.0.0 milestone Feb 25, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure 2Vec classes (KeyedVectors, Word2Vec, Doc2Vec, FastText) support desired level of mmap support #2955

Ensure 2Vec classes (KeyedVectors, Word2Vec, Doc2Vec, FastText) support desired level of mmap support #2955

gojomo commented Sep 23, 2020

gojomo commented Oct 6, 2020

gojomo commented Oct 10, 2020 •

edited

Loading

piskvorky commented Oct 11, 2020 •

edited

Loading

gojomo commented Oct 11, 2020 •

edited

Loading

piskvorky commented Oct 11, 2020 •

edited

Loading

piskvorky commented Jan 18, 2021

piskvorky commented Feb 25, 2021

Ensure 2Vec classes (KeyedVectors, Word2Vec, Doc2Vec, FastText) support desired level of mmap support #2955

Ensure 2Vec classes (KeyedVectors, Word2Vec, Doc2Vec, FastText) support desired level of mmap support #2955

Comments

gojomo commented Sep 23, 2020

gojomo commented Oct 6, 2020

gojomo commented Oct 10, 2020 • edited Loading

piskvorky commented Oct 11, 2020 • edited Loading

gojomo commented Oct 11, 2020 • edited Loading

piskvorky commented Oct 11, 2020 • edited Loading

piskvorky commented Jan 18, 2021

piskvorky commented Feb 25, 2021

gojomo commented Oct 10, 2020 •

edited

Loading

piskvorky commented Oct 11, 2020 •

edited

Loading

gojomo commented Oct 11, 2020 •

edited

Loading

piskvorky commented Oct 11, 2020 •

edited

Loading