Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure 2Vec classes (KeyedVectors, Word2Vec, Doc2Vec, FastText) support desired level of mmap support #2955

Open
gojomo opened this issue Sep 23, 2020 · 7 comments
Assignees

Comments

@gojomo
Copy link
Collaborator

gojomo commented Sep 23, 2020

There's been some useful ability of these classes to work from mmemapped underlying numpy arrrays - but such functionality is not deeply tested in unit-tests, and I thus strongly suspect it has regressed a bit from recent refactorings. (In particular, #2944 when applied removes any pretense that the still-present memmap_path parameter has any effect.)

We should inventory what the classes should be reasonably expected to do.

For example: in what cases should they be initializable to use mmapping from the get-go, or is it sufficient that they do so only when loaded from a prior save? Will they smartly maintain mmapping when undergoing some of the newer vocab-expansion options? That's a tricky thing to do efficiently - I think existing code has just ignored that case, which could just be documented as a limitation. Model classes have multiple underlying arrays that presumably could each be memmapped or not - do we expand method signatures to specify multiple options/paths, or just make it all-or-nothing per model, and in which cases are the mmapped paths specified explicitly ot just mechanistically calculated from a model's 'root name' (as with saving .npy files).

With a reasonable set of officially-supported capabilities decided, we should ensure tests cover exactly those cases, so we catch any current or future collateral damage.

@gojomo
Copy link
Collaborator Author

gojomo commented Oct 6, 2020

The question I'm struggling with is whether we only support mmap when loading a saved model, via the SaveLoad mechanism/option. That's pretty easy.

But, starting in Doc2Vec to support one particular usage scenario – so many doc-vectors you'd rather do initial training/allocation from an mmapped file – there's also been another init option to specify a mmap file for this purpose from the start. And then for generic KeyedVectors to support that, it got an init option for that (though under-tested). And, maintaining that generic mmap-up-front option for any caller (when in fact most won't need it) was a little awkward after the cleanup in #2944.

So that up-front mmap feature needs to be fixed (with tests) or cut.

If I could think of a clear way to hack in the same benefit in the Doc2Vec case just when rarely needed, I'd then be very happy to cut the formal parameter/feature – & just put that hack approach as a recipe somewhere. Trying to figure what that recipe could be currently.

@gojomo
Copy link
Collaborator Author

gojomo commented Oct 10, 2020

After some thought, I think the right balance of functionality/simplicity/continuity-with-past-options would be following:

  • the main way any of these models would wind up with any memmapped arrays would be when they come from a previously-saved model, with separate non-compressed arrays, and loaded with SaveLoad.load(..., mmap=True). That's the oldest method in the code, and a narrow/clear way to get that effect.
  • but, for expert users who want a fresh model to start (& initially train) with memmapped backing arrays will get a new optional path parameter to the new allocate_model() half of build_vocab() inside the #2975 refactoring. So, while Doc2Vecwill no longer have adv_mapfileinitialization parameter, andKeyedVectors` might lose its mmap-related initialization parameter, expert users following a few-line code recipe will be able to replace the usual all-in-RAM structures with mmap-backed ones instead. (This recipe would appear or be linked from the migration notes about the removed init parameters.)

This leaves the most common use of memmapping (deploying frozen models) unchanged, and hides the rare use in some expert optional parameters to inside steps, that don't overcomplicate the __init__ parameters/code.

@piskvorky
Copy link
Owner

piskvorky commented Oct 11, 2020

@gojomo to be clear – does SaveLoad.load(..., mmap=True) work now or not?

We're ready to release 4.0.0 rc1 / beta, just need clarity on this.

The motivation for the other new option, "for expert users who want a fresh model to start (& initially train) with memmapped backing array" is unclear to me and hopefully not blocking. I'd strongly prefer to split these two use-cases, and release beta as soon as we get the first one (in case we're not there yet).

@gojomo
Copy link
Collaborator Author

gojomo commented Oct 11, 2020

@gojomo to be clear – does SaveLoad.load(..., mmap=True) work now or not?

I don't think any prior work, or the work in #2944 that FIXME-comments out a newer KeyedVectors-specific mmapping option, will have impacted the SaveLoad mmap=True functionality. So I suspect it works, but the real check should be if any unit-tests exercising it are passing.

We're ready to release 4.0.0 rc1 / beta, just need clarity on this.

I really think #2944 should be merged before a test release, but also I'm still generally in favor of packaging & pushing a test/prerelease anytime all unit-tests are passing, as long as the caveats/limitations of such release are adequately described in the accompanying announcements. To the extent there are doubts about regressions or likely further changes, just be sure to call it an 'alpha' or 'beta' rather than 'release candidate'.

The motivation for the other new option, "for expert users who want a fresh model to start (& initially train) with memmapped backing array" is unclear to me and hopefully not blocking. I'd strongly prefer to split these two use-cases, and release beta as soon as we get the first one (in case we're not there yet).

These are "more recent" but not completely "new" options. (An equivalent of dv_mapfile for Doc2Vec goes way back, including in the Doc2Vec-specific KeyedVectors of 3.x versions.) But I don't think they were well-covered by unit-tests. I think that story/testing should be cleared up before 4.0.0-final, per my proposal above, but also per reasoning above, I don't think a properly-described optional 'beta'/etc prerelease need ever be blocked by anything.

(These sorts of general "history-of-4.0.0 release-stage decisions" discussions would work way better in one "4.0.0" issue, and then be easier to understand in retrospect years later, than scattered across many issues/PRs.)

@piskvorky
Copy link
Owner

piskvorky commented Oct 11, 2020

OK thanks, I'll double check whether mmap=True works / is tested.

Of course, #2944 goes in too.

The question was about mmap = this ticket. General discussion can go elsewhere, yes.

@piskvorky
Copy link
Owner

@gojomo how do you feel about this ticket?

I'm in favour of keeping just the "simple variant" (mmap='r'), which I verified already works.

Is the "expert mode" something you're interested in adding / cleaning up – whether now for 4.0.0 or later otherwise?

@piskvorky piskvorky removed this from the 4.0.0 milestone Feb 25, 2021
@piskvorky
Copy link
Owner

@gojomo removing this ticket from the 4.0 milestone, but let me know if you'd like to get it in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants