PDF: Parentheses not handled properly in document information dictionary #358
Comments
Tokenizer.java handles special characters through a series of if-else statements within the getNext method. Line 248 handles open parenthesis but there's no correlating check for close parenthesis. Suggest adding close parenthesis to this statement to handle both characters:
Close, but I think you might be digging in the wrong place. Line 248, which checks for opening parentheses, is only executed when the If we look at line 250 we can see what happens when the tokenizer first encounters an opening parenthesis from whitespace: it actually updates the state of the tokenizer to be ...and I'm just now noticing that there already seems to be a pull request above to fix this issue. So for anyone who'd like to skip to the end of the exercise, the solution can be found in PR #359 :p
Thanks, @david-russo - this issue is a bit above my current expertise with Java, especially regarding the parsing of special characters in literals. For what it's worth, it may be worth investigating whether the issue is specific to the way the close parenthesis is encoded in the sample PDF. As mentioned in the issue, the file passes when the close parenthesis is removed; I can verify that it also passes if the ')' is replaced.
Sorry for not being clearer: it seems this issue has already been fixed in a pull request from the original author of the issue. You can see the fix they provided in #359. It hasn't been merged into the JHOVE codebase yet though, which is why you're still able to experience the issue. If you apply the changes in #359 locally, you should then see that JHOVE is able to parse the sample PDF. And since it's already been fixed, this issue should probably be removed from the list of eligible hackathon tasks, and good first issues.
Hi @david-russo and @deanforsmith, sorry, my weekend was too eventful to keep up with GH notifications and Slack, and I'm only now working my way through them. Sorry, this is my bad. I'd assumed that somebody starting would pick up on the PR, but never made it explicit. It was also a pretty lousy "first issue" choice. @deanforsmith thanks for looking at this. The result is what I was after: establishing that this PR does fix the issue, at least. Will review for style and merge.
FIX: Count opening parentheses in literals (#358)
Dev Effort
0.5D
Description
This PDF-file fails validation with the following error:
This happens because the file's document information dictionary contains two entries with parentheses in them:
Removing the parentheses from the Creator and Producer values (or just the closing parentheses for that matter) allows the file to be validated successfully. According to my interpretation of the PDF reference (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf) the entries in the dictionary are just regular text strings and balanced parentheses should thus be allowed.
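To illustrate the "count opening parentheses" idea named in the fix title, here is a minimal, self-contained sketch of reading a PDF literal string while tracking nesting depth. It is not the code from Tokenizer.java or from #359: the class name `LiteralStringSketch`, the method `readLiteral`, and the sample value are all hypothetical, and backslash escapes are simplified to "copy the next character verbatim".

```java
/**
 * Hypothetical sketch, not JHOVE code: reads a PDF literal string and
 * keeps balanced, unescaped parentheses as part of the string value.
 */
public final class LiteralStringSketch {

    /**
     * Reads the literal string whose opening '(' is at index start and
     * returns its contents without the outer delimiters.
     */
    public static String readLiteral(String input, int start) {
        if (start >= input.length() || input.charAt(start) != '(') {
            throw new IllegalArgumentException("expected '(' at index " + start);
        }
        StringBuilder out = new StringBuilder();
        int depth = 1; // the opening delimiter counts as nesting level one
        for (int i = start + 1; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\\') {
                // Escaped character: it never changes the nesting depth.
                // Simplification: the next character is copied verbatim;
                // octal and \n-style escapes are not decoded here.
                if (i + 1 >= input.length()) {
                    break; // dangling backslash, reported as unterminated below
                }
                out.append(input.charAt(++i));
            } else if (c == '(') {
                depth++; // a nested opening parenthesis belongs to the value
                out.append(c);
            } else if (c == ')') {
                depth--;
                if (depth == 0) {
                    return out.toString(); // matching close of the outer '('
                }
                out.append(c); // closes a nested pair, still inside the value
            } else {
                out.append(c);
            }
        }
        throw new IllegalArgumentException("unterminated literal string");
    }

    public static void main(String[] args) {
        // Hypothetical Creator-style value containing balanced parentheses.
        System.out.println(readLiteral("(Writer (Version 1.0))", 0));
        // prints: Writer (Version 1.0)
    }
}
```

Without the depth counter, the first ')' would end the string early and the remaining ')' would be seen as a stray token, which matches the symptom described above where removing only the closing parentheses lets the file validate.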