Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tokenizers] Fixes memory leak when there is overflowing tokens #3317

Merged
merged 2 commits into from
Jul 10, 2024
Merged

[tokenizers] Fixes memory leak when there is overflowing tokens #3317

merged 2 commits into from
Jul 10, 2024

Conversation

baldersheim
Copy link
Contributor

@baldersheim baldersheim commented Jul 10, 2024

Fixes #3316

If you call TokenizersLibrary.LIB.getOverflowing you must also clean up all overflow encodings.

If withOverflowingTokens was false no Encodings where generated leaving jni Encoding handles that would not be properly deleted.

This introduces a new native method where you can inquire about number of overflow tokens without using any jni resources. And you will now only call TokenizersLibrary.LIB.getOverflowing(encoding) if withOverflowingTokens is true.

Description

Brief description of what this PR is about

  • If this change is a backward incompatible change, why must this change be made?
  • Interesting edge cases to note here

…up all overflow encodings.

If withOverflowingTokens was false no Encodings where generated leaving jni Encoding handles that would not be properly deleted.

This introduces a new native method where you can inquire about number of overflow tokens without using any jni resources.
And you will now only call TokenizersLibrary.LIB.getOverflowing(encoding) if withOverflowingTokens is true.
@baldersheim baldersheim requested review from zachgk, frankfliu and a team as code owners July 10, 2024 11:40
@baldersheim
Copy link
Contributor Author

@frankfliu This PR suggests a solution to issue #3316

@frankfliu frankfliu changed the title If you call TokenizersLibrary.LIB.getOverflowing you must also clean … [tokenizers] Fixes memory leak when there is overflowing tokens Jul 10, 2024
@frankfliu frankfliu merged commit f5c9a82 into deepjavalibrary:master Jul 10, 2024
5 checks passed
@baldersheim baldersheim deleted the baldersheim/prevent-leaking-memory-when-withOverflowingTokens-is-false branch July 10, 2024 18:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Memory leak in HF tokenizer when using truncation and optWithOverflowingTokens(false)
2 participants