Replace += with StringBuilder for whitespace to speed up Tokenizer #615
This is a potential fix for issue #614: whitespace is processed very slowly in the PDF Tokenizer. This became a problem when we had some files with >100MB of whitespace. Though the PDFs were not valid, JHOVE ran for days without getting to the end of the whitespace on a 160MB file. The fix uses a `StringBuilder` instead of doing `+=` on a String. It is now suspiciously fast, only taking a few seconds, which makes me wonder if I've missed something in the logic!

Testing (edited): I have created a test file that can be used to replicate/test this issue and attached it to issue #614. I confirmed that this change reduces the processing time to seconds on both the original that we had the problem with (which I can't share) and my manufactured test file. I've submitted this change without a test for now - please let me know if I should add a test using my test file or if it needs to be added to a test corpus elsewhere.
Note: The issue that this relates to is newly logged and hasn't been evaluated yet - please let me know if it would be best to withdraw this PR until the issue is reviewed. My apologies if I've done things out of order!
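
For anyone wondering why the difference is so large, here is a minimal sketch of the two patterns. It is not the actual Tokenizer code: the class name `WhitespaceSketch`, both method names, and the use of `Character.isWhitespace` as the whitespace test are assumptions made for illustration only (PDF defines its own set of whitespace characters). Each `+=` on a String copies everything accumulated so far, so collecting n characters costs roughly n²/2 character copies, while `StringBuilder.append` writes into a growable buffer and is amortized constant time per character.

```java
// Hypothetical illustration only; not taken from the JHOVE Tokenizer.
class WhitespaceSketch {

    // Slow pattern: every += allocates a new String and copies the old contents.
    static String collectWithConcat(CharSequence input) {
        String ws = "";
        int i = 0;
        while (i < input.length() && Character.isWhitespace(input.charAt(i))) {
            ws += input.charAt(i);
            i++;
        }
        return ws;
    }

    // Fast pattern: append into one growable buffer, convert to String once at the end.
    static String collectWithBuilder(CharSequence input) {
        StringBuilder ws = new StringBuilder();
        int i = 0;
        while (i < input.length() && Character.isWhitespace(input.charAt(i))) {
            ws.append(input.charAt(i));
            i++;
        }
        return ws.toString();
    }

    public static void main(String[] args) {
        // Build a run of 50,000 whitespace characters followed by a token.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50_000; i++) {
            sb.append(' ');
        }
        sb.append("obj");
        String input = sb.toString();

        long t0 = System.nanoTime();
        String slow = collectWithConcat(input);
        long t1 = System.nanoTime();
        String fast = collectWithBuilder(input);
        long t2 = System.nanoTime();

        System.out.println("same result: " + slow.equals(fast));
        System.out.println("concat ms:   " + (t1 - t0) / 1_000_000);
        System.out.println("builder ms:  " + (t2 - t1) / 1_000_000);
    }
}
```

Both methods return the same leading whitespace; only the cost differs. Because the concatenation cost grows quadratically with the length of the run, a run of 100MB or more is consistent with the multi-day run times reported above, and with the `StringBuilder` version finishing in seconds.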