Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PDF: Parentheses not handled properly in document information dictionary #358

Closed
T3rj3 opened this issue Sep 5, 2018 · 6 comments · Fixed by #359 or #602
Closed

PDF: Parentheses not handled properly in document information dictionary #358

T3rj3 opened this issue Sep 5, 2018 · 6 comments · Fixed by #359 or #602
Assignees
Labels
bug A product defect that needs fixing good-first-issue Issue suitable for inexperienced developers P2 Medium priority issues to be scheduled in a future release
Milestone

Comments

@T3rj3
Copy link
Contributor

T3rj3 commented Sep 5, 2018

Dev Effort

0.5D

Description

This PDF-file fails validation with the following error:

Jhove (Rel. 1.20.1, 2018-03-29)
Date: 2018-09-05 10:07:25 CEST
RepresentationInformation: /home/sample.pdf
ReportingModule: PDF-hul, Rel. 1.11 (2018-03-29)
LastModified: 2018-08-30 10:10:10 CEST
Size: 2736147
Format: PDF
Status: Not well-formed
SignatureMatches:
PDF-hul
ErrorMessage: Lexical error
Offset: 2731501
MIMEtype: application/pdf

This happens because the file's document information dictionary contains two entries with parentheses in them:

/CreationDate (20180606085937+01'00')
/Creator (QuarkXPress(R) 8.16r2)
/ModDate (20180606085937+01'00')
/Producer (QuarkXPress(R) 8.16r2)
/Title <feff0063006d0079006b002d003200300030>

Removing the parentheses from the Creator and Producer values (or just the closing parentheses for that matter) allows the file to be validated successfully. According to my interpretation of the PDF reference (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf) the entries in the dictionary are just regular text strings and balanced parentheses should thus be allowed.

T3rj3 pushed a commit to T3rj3/jhove that referenced this issue Sep 5, 2018
@carlwilson carlwilson added the bug A product defect that needs fixing label Dec 11, 2018
@carlwilson carlwilson added this to the v1.22-m4 Release milestone Dec 11, 2018
@carlwilson carlwilson added the P2 Medium priority issues to be scheduled in a future release label Dec 19, 2018
@carlwilson carlwilson added the good-first-issue Issue suitable for inexperienced developers label Apr 23, 2020
@deanforsmith
Copy link
Contributor

deanforsmith commented Apr 30, 2020

Tokenizer.java handles special characters through a series of if-else statements within the getNext method. Line 248 handles open parenthesis but there's no correlating check for close parenthesis. Suggest adding close parenthesis to this statement to handle both characters:
else if (_ch == '(') {
would become
else if (_ch == '(' || _ch == ')') {

@david-russo
Copy link
Member

david-russo commented Apr 30, 2020

Close, but I think you might be digging in the wrong place. Line 248, which checks for opening parentheses, is only executed when the _state of the tokenizer is WHITESPACE (line 200), i.e. only when the tokenizer has been parsing whitespace/no other discernible object.

If we look at line 250 we can see what happens when the tokenizer first encounters an opening parenthesis from whitespace — it actually updates the state of the tokenizer to be LITERAL, meaning that it's now in the midst of parsing a Literal String of characters. So you'll want to trace the path of what happens to a byte when the state is LITERAL instead, assuming the problem has something to do with parsing parentheses in string literals as the issue suggests.

...and I'm just now noticing that there already seems to be a pull request above to fix this issue. So for anyone who'd like to skip to the end of the exercise, the solution can be found in PR #359 :p

@deanforsmith
Copy link
Contributor

Thanks, @david-russo - this issue is a bit above my current expertise with Java, especially regarding the parsing of special characters in literals. For what it's worth, I think it's worth investigating if the issue may be specific to the way the close-parenthesis is encoded in the sample pdf. As mentioned in the issue, the file passes when the close parenthesis is removed; I can verify that it also passes if the ')' is replaced.

@david-russo
Copy link
Member

Sorry for not being clearer: it seems this issue has already been fixed in a pull request from the original author of the issue. You can see the fix they provided in #359. It hasn't been merged into the JHOVE codebase yet though, which is why you're still able to experience the issue. If you apply the changes in #359 locally, you should then see that JHOVE is able to parse the sample PDF.

And since it's already been fixed, this issue should probably be removed from the list of eligible hackathon tasks, and good first issues.

@carlwilson
Copy link
Member

Hi @david-russo and @deanforsmith, sorry my weekend was too eventful to keep up on GH notifications and Slack this and working my way through. Sorry, this is my bad. I'd assumed that somebody starting would pick up on the PR, but neither made it explicit. It was also a pretty lousy "first issue" choice. @deanforsmith thanks for looking at this. The result is what I was after, establishing that this PR does fix the issue at least. Will review for style and merge.

carlwilson added a commit that referenced this issue May 14, 2020
FIX: Count opening parentheses in literals (#358)
carlwilson added a commit that referenced this issue Apr 6, 2022
- added regression test files for #148, #358. #375 and #531; and
- patched results files for fixes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A product defect that needs fixing good-first-issue Issue suitable for inexperienced developers P2 Medium priority issues to be scheduled in a future release
Projects
None yet
5 participants