Vocabulary size is much smaller than requested #3500
Comments
As you note, the escalating-floor simply means the next pruning will automatically discard all knowledge of tokens with fewer occurrences, at that pruning. Still, at the very end of the vocabulary-survey, there could be any number of tokens with tallied occurrences less than that threshold that have been noted since the last prune. And then only those with frequency less than the `min_count` are discarded by that final step.

So many tokens with interim counts up to that rising threshold will have been pruned during the many in-count prunes. And further, tokens with true counts far higher than that threshold, but below that threshold at one or more of the prunings, will have artificially-lower counts (because earlier interim tallies were discarded) and may wind up being ignored entirely (if their final interim count is below `min_count`).

Why is such a confusing parameter available? It matches the name, and behavior, of the same parameter in Google's original `word2vec.c` release.

If you'd prefer a precise vocabulary count (and that won't exhaust your RAM), look into the `max_final_vocab` parameter instead.

Two other unrelated things to watch out for, given your shown parameters: the worker count you're using, and the reported training loss.
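To make the mechanism concrete, here is a rough, self-contained simulation of the behaviour described above (this is not gensim's actual pruning code; the tiny cap and floor just stand in for `max_vocab_size=250000` and `min_count=50`). A token whose occurrences are spread thinly loses its interim tally at every prune and never reaches the final vocabulary, even though its true total is well above the minimum:

```python
# Simplified simulation of the escalating-floor pruning (a sketch, NOT gensim's
# real code): whenever the interim vocabulary grows past the cap, every token
# whose tally is at or below the current floor is forgotten, and the floor rises.
from collections import defaultdict

MAX_VOCAB_SIZE = 5   # toy stand-in for max_vocab_size=250_000
MIN_COUNT = 3        # toy stand-in for min_count=50

def survey(tokens):
    counts = defaultdict(int)
    floor = 1                                   # the escalating floor
    for token in tokens:
        counts[token] += 1
        if len(counts) > MAX_VOCAB_SIZE:
            for w in [w for w, c in counts.items() if c <= floor]:
                del counts[w]                   # interim tally is lost for good
            floor += 1                          # the floor only ever rises
    # final step: min_count is applied to whatever interim tallies survived
    return {w: c for w, c in counts.items() if c >= MIN_COUNT}

stream = []
for i in range(40):
    stream.extend(["common"] * 3)   # genuinely dominant token, survives every prune
    stream.append(f"rare_{i}")      # 40 one-off tokens keep forcing prunes
    if i % 5 == 0:
        stream.append("spread")     # 8 true occurrences, scattered thinly

print(survey(stream))
# -> {'common': 120}
# "spread" occurred 8 times (>= MIN_COUNT) but is absent: its interim tally was
# wiped at each prune, so the final min_count step never saw its true frequency.
```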
@gojomo Thanks for the very detailed reply. Now I feel completely stupid: I skimmed the function arguments up to `max_vocab_size` and missed `max_final_vocab` entirely. I have also since realized why the final `effective_min_count=50` step still manages to prune more words. Thank you for the additional heads up about worker count and the loss; I was aware of the first one, but not the second.
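For anyone landing here later, a minimal sketch of what the suggested fix might look like (only the 250k target and `min_count=50` come from this thread; the corpus, vector size and worker count below are placeholders):

```python
# Sketch of using max_final_vocab instead of max_vocab_size (placeholder values).
from gensim.models import Word2Vec

# Placeholder corpus: any restartable iterable of tokenized sentences works.
corpus = [["some", "tokenized", "sentence"]] * 60

model = Word2Vec(
    sentences=corpus,
    vector_size=300,           # placeholder, not from the thread
    min_count=50,              # still drops words rarer than 50
    max_final_vocab=250_000,   # cap applied once, after the full survey (keeps at most 250k words)
    # max_vocab_size stays at its default (None): no lossy interim pruning
    workers=8,                 # placeholder
)
print(len(model.wv))           # number of words that survived (here the 3 toy tokens)
```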
Problem description
I was training a w2v model on a rather large corpus (about 35B tokens). I set `min_count` to 50 and `max_vocab_size` to 250,000. I expected at the end of the training to have a vocabulary of 250k words. Instead, I got one of around 70k. The logs are telling:
So it seems as if `min_count` is only taken into consideration after the vocabulary has been pruned with a continuously increasing threshold. However, the threshold throws away a lot of words that otherwise should be in the vocabulary.

A few observations about this:

- The log ends with `effective_min_count=50`, which then manages to prune 20k words at the end. I mean, if the final threshold was over 16,000, how could a threshold of 50 result in any more pruning?
- The documentation describes `min_count` as: "Ignores all words with total frequency lower than this." Which is clearly not what it does.

So the questions that naturally follow:
- What does `min_count` actually do?

Steps/code/corpus to reproduce
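A rough sketch of the kind of call described above (only `min_count=50` and `max_vocab_size=250000` come from the report; the corpus path, vector size and worker count are stand-ins):

```python
# Rough reproduction sketch (stand-in values, not the original code).
import logging
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# INFO logging makes gensim print the vocabulary-survey and pruning messages.
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                    level=logging.INFO)

corpus = LineSentence("corpus.txt")   # stand-in path; one tokenized sentence per line

model = Word2Vec(
    sentences=corpus,
    vector_size=300,          # stand-in
    min_count=50,             # from the report
    max_vocab_size=250_000,   # from the report; triggers the escalating interim prunes
    workers=16,               # stand-in
)

print(len(model.wv))          # ends up far below 250k (around 70k in the report)
```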
Versions