Vocabulary size is much smaller than requested #3500

Closed
DavidNemeskey opened this issue Oct 9, 2023 · 2 comments

Comments

@DavidNemeskey
Contributor

Problem description

I was training a w2v model on a rather large corpus (about 35B tokens). I set min_count to 50 and max_vocab_size to 250,000, expecting to end up with a vocabulary of 250k words after training. Instead, I got one of around 70k.

The logs are telling:

PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
PROGRESS: at sentence #10000, processed 1149505 words, keeping 149469 word types
PROGRESS: at sentence #20000, processed 2287292 words, keeping 232917 word types
pruned out 0 tokens with count <=1 (before 250001, after 250001)
pruned out 140618 tokens with count <=2 (before 250007, after 109389)
PROGRESS: at sentence #30000, processed 3442707 words, keeping 179514 word types
pruned out 148589 tokens with count <=3 (before 250005, after 101416)
...
pruned out 179627 tokens with count <=16330 (before 250006, after 70379)
PROGRESS: at sentence #301310000, processed 35302183879 words, keeping 92987 word types
collected 112874 word types from a corpus of 35302368099 raw words and 301311561 sentences
Creating a fresh vocabulary
Word2Vec lifecycle event {'msg': 'effective_min_count=50 retains 70380 unique words (62.35% of original 112874, drops 42494)', 'datetime': '2023-09-26T15:19:19.866236', 'gensim': '4.3.2', 'python': '3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]', 'platform': 'Linux-5.4.0-150-generic-x86_64-with-glibc2.31', 'event': 'prepare_vocab'}
Word2Vec lifecycle event {'msg': 'effective_min_count=50 leaves 30437195857 word corpus (100.00% of original 30437248987, drops 53130)', 'datetime': '2023-09-26T15:19:19.870161', 'gensim': '4.3.2', 'python': '3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]', 'platform': 'Linux-5.4.0-150-generic-x86_64-with-glibc2.31', 'event': 'prepare_vocab'}
deleting the raw counts dictionary of 112874 items
sample=0.001 downsamples 21 most-common words
Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 24482242512.167667 word corpus (80.4%% of prior 30437195857)', 'datetime': '2023-09-26T15:19:20.211104', 'gensim': '4.3.2', 'python': '3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]', 'platform': 'Linux-5.4.0-150-generic-x86_64-with-glibc2.31', 'event': 'prepare_vocab'}
estimated required memory for 70380 words and 300 dimensions: 204102000 bytes

So it seems that min_count is only taken into consideration after the vocabulary has been pruned with a continuously increasing threshold. However, that pruning throws away a lot of words that otherwise should be in the vocabulary.

A few observations about this:

  1. I am not sure the thresholding really works well, especially in the later stages: how could a word amass 16,000 occurrences if its earlier (say, 15,998) occurrences have already been pruned? Even if it occurs 100 times in the new batch, it will just be pruned again.
  2. The log mentions effective_min_count=50, which then manages to prune 20k words at the end. I mean, if the final threshold was over 16,000, how could a threshold of 50 result in any more pruning?
  3. The documentation only says this about min_count: "Ignores all words with total frequency lower than this." That is clearly not what it does.

So the questions that naturally follow:

  1. Can we switch off the increasing thresholding?
  2. What does min_count actually do?

Steps/code/corpus to reproduce

from gensim import models  # gensim 4.3.2

# `corpus`, `processes` and `args` come from the surrounding training script
model = models.Word2Vec(
    sentences=corpus, vector_size=300, min_count=50,
    max_vocab_size=250000, workers=processes, epochs=1,
    compute_loss=True, sg=int(args.sg)
)

Versions

Linux-5.4.0-150-generic-x86_64-with-glibc2.31
Python 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]
Bits 64
NumPy 1.25.2
SciPy 1.11.2
gensim 4.3.2
FAST_VERSION 0
@gojomo
Collaborator

gojomo commented Oct 13, 2023

max_vocab_size is an awfully-named parameter with unintuitive effects during its mid-count pruning; in general it should not be used unless there's no other way to complete the vocabulary scan within your RAM. And even then, you'll want to set max_vocab_size to as large a value as your RAM allows, to minimize the count-fouling effects of mid-count pruning, rather than to anything near the final vocabulary size you want.

As you note, the escalating floor simply means that each pruning automatically discards all knowledge of tokens with fewer occurrences at that point. Still, at the very end of the vocabulary survey, there can be any number of tokens, noted since the last prune, whose tallied occurrences are below that threshold. Only those with frequency lower than the effective_min_count (here just your specified min_count) are then excluded from the final surviving vocabulary.

So many tokens with interim counts up to that rising threshold will have been pruned during the many mid-count prunes. And further, tokens with true counts far higher than that threshold, but below it at one or more of the prunings, will have artificially lower counts (because earlier interim tallies were discarded) and may wind up being ignored entirely (if their final interim count is below min_count).
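To make the mechanism concrete, here is a minimal sketch of the escalating-floor pruning described above. It is only illustrative, not Gensim's actual implementation (the real logic lives in the vocabulary scan and gensim.utils.prune_vocab, with extra details such as trim rules); the names max_vocab_size and min_reduce mirror Gensim's, everything else is made up for the example.

# Illustrative sketch only, not Gensim's actual code: how a mid-scan prune with
# an escalating floor loses the interim counts of less-frequent tokens.
from collections import defaultdict

def scan_vocab_sketch(sentences, max_vocab_size):
    vocab = defaultdict(int)
    min_reduce = 1                        # the escalating prune floor
    for sentence in sentences:
        for token in sentence:
            vocab[token] += 1
        if len(vocab) > max_vocab_size:   # RAM cap exceeded: prune now
            for token in list(vocab):
                if vocab[token] < min_reduce:
                    del vocab[token]      # its interim tally is lost for good
            min_reduce += 1               # the next prune uses a higher floor
    return vocab  # surviving counts may sit far below the true corpus totals

A word that has been pruned several times starts tallying again from zero, which is why its surviving interim count can end up below min_count even if it is genuinely frequent in the corpus.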

Why is such a confusing parameter available? It matches the name, and behavior, of the same parameter in Google's original word2vec.c code release, upon which Gensim's implementation was originally closely based.

If you'd prefer a precise vocabulary count (and that won't exhaust your RAM), look into the max_final_vocab parameter instead. It only applies at the end of the vocabulary survey, by choosing an effective_min_count that's large enough to come in just under your chosen max_final_vocab (instead of arbitrarily far lower, as is common with the max_vocab_size parameter's pruning). Still, if you specified an explicit min_count higher than the effective_min_count that max_final_vocab required, your higher explicit min_count will be applied.
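As a concrete (and only illustrative) variant of the call from the report above, swapping max_vocab_size for max_final_vocab might look like this; corpus and processes are the same placeholders as before:

from gensim import models

# Sketch: cap the *final* vocabulary once, at the end of the survey,
# instead of pruning interim counts during the scan.
model = models.Word2Vec(
    sentences=corpus,
    vector_size=300,
    min_count=50,             # still honoured if stricter than the derived floor
    max_final_vocab=250000,   # applied only after the full vocabulary survey
    workers=processes,
    epochs=1,
)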

Two other unrelated things to watch out for, given your shown parameters:

  • even if you have more CPU cores, workers values higher than about 6-12 usually give lower throughput than some value in that range, due to Python GIL contention and the limitations of Gensim's iterable-corpus mode, where a single master reader thread fans batches out to many worker threads. Finding the exact number of threads that achieves optimal training throughput is a matter of trial and error, as it is affected by other parameters as well. But as long as your corpus's token patterns are uniform throughout, the rates reported in the logs during the first few minutes of a run should reflect the rate for the full run (a minimal logging setup for watching those rates follows this list).
  • the losses tracked by compute_loss have a bunch of caveats; see Fix, improve, complete 'training loss' computation for *2Vec models #2617 for an overview of what's amiss.
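For that trial and error, a plain Python logging setup is enough to surface Gensim's PROGRESS lines and the reported words/sec rate; nothing here is Gensim-specific:

import logging

# Show INFO-level messages, which include Gensim's training progress and
# effective words/sec, so different `workers` values can be compared after
# a few minutes of training.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)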

@DavidNemeskey
Contributor Author

@gojomo Thanks for the very detailed reply. Now I feel completely stupid: I skimmed the function arguments up to max_vocab_size, was happy that I had found it, and never looked at the argument right below it. Ungh. In any case, I think it would have made sense to keep the semantic hyper-parameters (i.e. max_final_vocab, epochs, etc.) separate from implementation details (max_vocab_size and the like), but no use crying over spilt milk, I guess.

I have also since realized why min_count was applied at the end, so that part indeed works as it should.

Thank you for the additional heads up about worker count and the loss; I was aware of the first one, but not the second.
