-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bpe_gpt2_preprocess
#5446
Fix bpe_gpt2_preprocess
#5446
Conversation
There is a tokenization test that is failing with this change:
I.e. the string |
This comment was marked as resolved.
This comment was marked as resolved.
It's odd, but this PR seems to undo the bug fix introduced in PR #3567, which aimed to address the issue reported in Issue #3502. I'll check if reverting to the version before PR #3567 eliminates the error encountered in Edit: After rolling back to the version prior to PR #3567, the situation become even worse. The error rate surged to |
We might need to add some additional code to ensure our tokenizer aligns with the Falcon tokenizer's behavior. It seems there aren't any major issues with this pull request. Echoing @apage43's comments in this discussion, to match the Falcon tokenizer exactly, we should consider developing specific functions aimed at this requirement, rather than altering the |
This time, I will conduct some stricter tests, using a test set that is Edit: Great news. No errors were detected! |
Ok, got it. We should however fix the test - maybe just use the new set of tokens and add a short comment explaining why there is a difference (or just link to this discussion) |
This comment was marked as resolved.
This comment was marked as resolved.
Btw, #5464 claims to fix some issues in the BPE preprocess function - might be useful |
I attempted to build the PR you mentioned, but unfortunately, it didn't succeed.
|
Good, nearly all of Falcon's tokenizer tests have failed, which indicates that our implementation of |
This is a byproduct of the work on whisper.cpp. Could anyone conduct a few tests to verify if it resolves the issue detailed at #3502? @ggerganov
In the development of whisper.cpp, I've performed tests on a relatively large dataset, and the results have been quite promising. You can find more details about this testing at ggerganov/whisper.cpp#1854.