Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace += with StringBuilder for whitespace to speed up Tokenizer #615

Merged

Conversation

karenhanson
Copy link
Contributor

@karenhanson karenhanson commented May 29, 2020

This is a potential fix for issue #614 - whitespace is processed very slowly in the PDF Tokenizer. This became a problem when we had some files with >100MB of whitespace. Though the PDFs were not valid, JHOVE ran for days without getting to the end of the whitespace on a 160MB file. The fix uses a StringBuilder instead of doing += on a String. It is now suspiciously fast, only taking a few seconds - which makes me wonder if I've missed something in the logic!

Testing (edited): I have created a test file that can be used to replicate/test this issue and attached it to issue #614. I confirmed that this change reduces the processing time to seconds on both the original that we had the problem with (which I can't share) and my manufactured test file. I've submitted this change without a test for now - please let me know if I should add a test using my test file or if it needs to be added to a test corpus elsewhere.

Note: The issue that this relates to is newly logged and hasn't been evaluated yet - please let me know if it would be best to withdraw this PR until the issue is reviewed. My apologies if I've done things out of order!

Previously the token buffer used StringBuilder but not the whitespace buffer - for large chunks of whitespace this resulted in very slow processing. This replaces concatenating charactres to `_wsString` with a StringBuilder
@codecov
Copy link

codecov bot commented May 29, 2020

Codecov Report

Merging #615 (9c17f89) into integration (1329b50) will increase coverage by 0.00%.
The diff coverage is 66.66%.

@@              Coverage Diff               @@
##             integration     #615   +/-   ##
==============================================
  Coverage          45.63%   45.63%           
- Complexity          1046     1047    +1     
==============================================
  Files                 57       57           
  Lines               9144     9146    +2     
  Branches            1682     1683    +1     
==============================================
+ Hits                4173     4174    +1     
  Misses              4422     4422           
- Partials             549      550    +1     
Impacted Files Coverage Δ
.../harvard/hul/ois/jhove/messages/JhoveMessages.java 47.36% <66.66%> (+0.30%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9e2cb86...9c17f89. Read the comment docs.

@carlwilson carlwilson linked an issue Jun 4, 2020 that may be closed by this pull request
@carlwilson carlwilson added feature New functionality to be developed P2 Medium priority issues to be scheduled in a future release RC1.26 labels Apr 6, 2022
@carlwilson carlwilson added this to the JHOVE 1.26 milestone Apr 6, 2022
@carlwilson carlwilson merged commit e412867 into openpreserve:integration Apr 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New functionality to be developed P2 Medium priority issues to be scheduled in a future release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PDF with a large amount of whitespace takes days to process
3 participants