
[Tokenizer] Add Fast Tokenizer #8832

Merged
16 commits merged into PaddlePaddle:develop on Aug 19, 2024

Conversation

DrownFish19 (Collaborator)
PR types

New features

PR changes

APIs

Description

Add Fast Tokenizer.

  • Use the tokenizers library as the backend for the new fast tokenizers.
  • Compatible with both the current (slow) tokenizers and the new fast tokenizers.
  • LLaMA 3 and LLaMA 3.1 can use PretrainedTokenizerFast for better performance; LLaMA 1 and LLaMA 2 can likewise use LlamaTokenizerFast to speed up tokenization.
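The compatibility point above — slow and fast tokenizers exposed behind one interface, selectable at load time — can be sketched in plain Python. This is an illustrative mock, not the actual PaddleNLP API: the class and function names (`SlowTokenizer`, `FastTokenizer`, `load_tokenizer`, `use_fast`) are hypothetical stand-ins for the pattern, and the fast backend (a compiled Rust tokenizer in the real implementation) is stubbed out.

```python
class SlowTokenizer:
    """Pure-Python tokenizer: naive whitespace splitting over a vocab."""

    def __init__(self, vocab):
        self.vocab = vocab

    def encode(self, text):
        # Unknown tokens map to id 0 in this sketch.
        return [self.vocab.get(tok, 0) for tok in text.split()]


class FastTokenizer:
    """Same interface, but would delegate to a compiled backend.

    In the real design the heavy lifting lives in a Rust tokenizer object;
    here it is stubbed with the same pure-Python logic so the sketch runs.
    """

    def __init__(self, vocab):
        self.vocab = vocab  # a real backend would hold the compiled tokenizer

    def encode(self, text):
        return [self.vocab.get(tok, 0) for tok in text.split()]


def load_tokenizer(vocab, use_fast=True):
    """AutoTokenizer-style switch: callers pick slow or fast transparently."""
    cls = FastTokenizer if use_fast else SlowTokenizer
    return cls(vocab)


vocab = {"hello": 1, "world": 2}
slow = load_tokenizer(vocab, use_fast=False)
fast = load_tokenizer(vocab, use_fast=True)
# Both kinds produce identical ids, so existing code keeps working
# when a fast tokenizer is swapped in.
assert slow.encode("hello world") == fast.encode("hello world") == [1, 2]
```

Because both classes satisfy the same `encode` contract, downstream code never needs to know which kind it received; that is what lets a model switch to the fast path for performance without any call-site changes.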


paddle-bot bot commented Jul 30, 2024

Thanks for your contribution!

paddlenlp/utils/versions.py — review comment (outdated, resolved)
@DrownFish19 force-pushed the dev_add_tokenizer_fast branch 3 times, most recently from 5b8dc52 to 5355615 (August 2, 2024 12:06)

codecov bot commented Aug 2, 2024

Codecov Report

Attention: Patch coverage is 49.03537% with 317 lines in your changes missing coverage. Please review.

Project coverage is 54.81%. Comparing base (e0d2809) to head (e63092e).
Report is 3 commits behind head on develop.

Files                                              Patch %   Missing lines
paddlenlp/transformers/convert_slow_tokenizer.py   21.73%    126 ⚠️
paddlenlp/transformers/tokenizer_utils_fast.py     60.93%    125 ⚠️
paddlenlp/transformers/llama/tokenizer_fast.py     43.58%     44 ⚠️
paddlenlp/transformers/tokenizer_utils_base.py     58.69%     19 ⚠️
paddlenlp/transformers/tokenizer_utils.py          70.00%      3 ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8832      +/-   ##
===========================================
+ Coverage    54.79%   54.81%   +0.01%     
===========================================
  Files          636      639       +3     
  Lines        99876   100475     +599     
===========================================
+ Hits         54732    55079     +347     
- Misses       45144    45396     +252     


@ZHUI ZHUI merged commit d2d4d92 into PaddlePaddle:develop Aug 19, 2024
9 of 12 checks passed
@DrownFish19 DrownFish19 deleted the dev_add_tokenizer_fast branch August 19, 2024 03:12
Mangodadada pushed a commit to Mangodadada/PaddleNLP that referenced this pull request Sep 10, 2024
* add fast tokenizer

* add convert slow tokenizer method