Ensure 2Vec classes (KeyedVectors, Word2Vec, Doc2Vec, FastText) support desired level of mmap support #2955
Comments
The question I'm struggling with is whether we only support mmap when loading a saved model, via the But, starting in So that up-front mmap feature needs to be fixed (with tests) or cut. If I could think of a clear way to hack in the same benefit in the
After some thought, I think the right balance of functionality/simplicity/continuity-with-past-options would be the following:
This leaves the most common use of memmapping (deploying frozen models) unchanged, and hides the rare use in some expert optional parameters to inside steps, that don't overcomplicate the
@gojomo to be clear – does We're ready to release 4.0.0 rc1 / beta, just need clarity on this. The motivation for the other new option, "for expert users who want a fresh model to start (& initially train) with memmapped backing array" is unclear to me and hopefully not blocking. I'd strongly prefer to split these two use-cases, and release beta as soon as we get the first one (in case we're not there yet).
I don't think any prior work, or the work in #2944 that FIXME-comments out a newer
I really think #2944 should be merged before a test release, but also I'm still generally in favor of packaging & pushing a test/prerelease anytime all unit-tests are passing, as long as the caveats/limitations of such release are adequately described in the accompanying announcements. To the extent there are doubts about regressions or likely further changes, just be sure to call it an 'alpha' or 'beta' rather than 'release candidate'.
These are "more recent" but not completely "new" options. (An equivalent of (These sorts of general "history-of-4.0.0 release-stage decisions" discussions would work way better in one "4.0.0" issue, and then be easier to understand in retrospect years later, than scattered across many issues/PRs.)
OK thanks, I'll double check whether Of course, #2944 goes in too. The question was about
@gojomo how do you feel about this ticket? I'm in favour of keeping just the "simple variant" (mmap='r'), which I verified already works. Is the "expert mode" something you're interested in adding / cleaning up – whether now for 4.0.0 or later otherwise?
@gojomo removing this ticket from the 4.0 milestone, but let me know if you'd like to get it in.
There's been some useful ability of these classes to work from memmapped underlying numpy arrays - but such functionality is not deeply tested in unit-tests, and I thus strongly suspect it has regressed a bit from recent refactorings. (In particular, #2944 when applied removes any pretense that the still-present `memmap_path` parameter has any effect.)
We should inventory what the classes should be reasonably expected to do.
For example: in what cases should they be initializable to use mmapping from the get-go, or is it sufficient that they do so only when loaded from a prior save? Will they smartly maintain mmapping when undergoing some of the newer vocab-expansion options? That's a tricky thing to do efficiently - I think existing code has just ignored that case, which could just be documented as a limitation. Model classes have multiple underlying arrays that presumably could each be memmapped or not - do we expand method signatures to specify multiple options/paths, or just make it all-or-nothing per model? And in which cases are the mmapped paths specified explicitly, or just mechanistically calculated from a model's 'root name' (as with saving `.npy` files)?
With a reasonable set of officially-supported capabilities decided, we should ensure tests cover exactly those cases, so we catch any current or future collateral damage.
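As a rough illustration of the kind of check such tests might make, here is a minimal NumPy-only sketch. It does not use gensim's API; it assumes (this is an assumption, not something stated in this thread) that the load-time `mmap='r'` option ultimately hands the stored `.npy` arrays to `np.load(..., mmap_mode='r')`. The function name `check_mmap_roundtrip`, the file name `vectors.npy`, and the use of `np.vstack` as a stand-in for vocab expansion are all hypothetical.

```python
import os
import tempfile

import numpy as np


def check_mmap_roundtrip(dirpath):
    """Save an array as .npy, reload it memory-mapped, and report what holds."""
    path = os.path.join(dirpath, "vectors.npy")  # hypothetical file name
    original = np.arange(12, dtype=np.float32).reshape(4, 3)
    np.save(path, original)

    # mmap_mode="r" asks NumPy for a read-only, file-backed array.
    loaded = np.load(path, mmap_mode="r")

    # A read-only mapping should reject in-place writes.
    try:
        loaded[0, 0] = 1.0
        write_blocked = False
    except ValueError:
        write_blocked = True

    # Stand-in for vocab expansion (an assumption, not gensim code):
    # appending rows allocates a new array rather than growing the file mapping.
    extra = np.zeros((2, 3), dtype=np.float32)
    expanded = np.vstack([loaded, extra])

    return {
        "is_memmap": isinstance(loaded, np.memmap),
        "values_match": bool(np.array_equal(loaded, original)),
        "write_blocked": write_blocked,
        "expanded_shares_file_buffer": bool(np.shares_memory(expanded, loaded)),
        "expanded_shape": expanded.shape,
    }


if __name__ == "__main__":
    print(check_mmap_roundtrip(tempfile.mkdtemp()))
```

The last two fields show the limitation discussed above: once rows are appended, the result no longer shares the file-backed buffer. A real test would need to decide whether that is an accepted, documented limitation or a regression.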