PDF: Parentheses not handled properly in document information dictionary #358

T3rj3 · 2018-09-05T08:49:14Z

Dev Effort

0.5D

Description

This PDF-file fails validation with the following error:

Jhove (Rel. 1.20.1, 2018-03-29)
Date: 2018-09-05 10:07:25 CEST
RepresentationInformation: /home/sample.pdf
ReportingModule: PDF-hul, Rel. 1.11 (2018-03-29)
LastModified: 2018-08-30 10:10:10 CEST
Size: 2736147
Format: PDF
Status: Not well-formed
SignatureMatches:
PDF-hul
ErrorMessage: Lexical error
Offset: 2731501
MIMEtype: application/pdf

This happens because the file's document information dictionary contains two entries with parentheses in them:

/CreationDate (20180606085937+01'00')
/Creator (QuarkXPress(R) 8.16r2)
/ModDate (20180606085937+01'00')
/Producer (QuarkXPress(R) 8.16r2)
/Title <feff0063006d0079006b002d003200300030>

Removing the parentheses from the Creator and Producer values (or just the closing parentheses for that matter) allows the file to be validated successfully. According to my interpretation of the PDF reference (https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf) the entries in the dictionary are just regular text strings and balanced parentheses should thus be allowed.

deanforsmith · 2020-04-30T12:36:11Z

Tokenizer.java handles special characters through a series of if-else statements within the getNext method. Line 248 handles open parenthesis but there's no correlating check for close parenthesis. Suggest adding close parenthesis to this statement to handle both characters:
else if (_ch == '(') {
would become
else if (_ch == '(' || _ch == ')') {

david-russo · 2020-04-30T15:16:13Z

Close, but I think you might be digging in the wrong place. Line 248, which checks for opening parentheses, is only executed when the _state of the tokenizer is WHITESPACE (line 200), i.e. only when the tokenizer has been parsing whitespace/no other discernible object.

If we look at line 250 we can see what happens when the tokenizer first encounters an opening parenthesis from whitespace — it actually updates the state of the tokenizer to be LITERAL, meaning that it's now in the midst of parsing a Literal String of characters. So you'll want to trace the path of what happens to a byte when the state is LITERAL instead, assuming the problem has something to do with parsing parentheses in string literals as the issue suggests.

...and I'm just now noticing that there already seems to be a pull request above to fix this issue. So for anyone who'd like to skip to the end of the exercise, the solution can be found in PR #359 :p

deanforsmith · 2020-04-30T16:58:55Z

Thanks, @david-russo - this issue is a bit above my current expertise with Java, especially regarding the parsing of special characters in literals. For what it's worth, I think it's worth investigating if the issue may be specific to the way the close-parenthesis is encoded in the sample pdf. As mentioned in the issue, the file passes when the close parenthesis is removed; I can verify that it also passes if the ')' is replaced.

david-russo · 2020-05-01T09:57:45Z

Sorry for not being clearer: it seems this issue has already been fixed in a pull request from the original author of the issue. You can see the fix they provided in #359. It hasn't been merged into the JHOVE codebase yet though, which is why you're still able to experience the issue. If you apply the changes in #359 locally, you should then see that JHOVE is able to parse the sample PDF.

And since it's already been fixed, this issue should probably be removed from the list of eligible hackathon tasks, and good first issues.

carlwilson · 2020-05-05T19:29:58Z

Hi @david-russo and @deanforsmith, sorry my weekend was too eventful to keep up on GH notifications and Slack this and working my way through. Sorry, this is my bad. I'd assumed that somebody starting would pick up on the PR, but neither made it explicit. It was also a pretty lousy "first issue" choice. @deanforsmith thanks for looking at this. The result is what I was after, establishing that this PR does fix the issue at least. Will review for style and merge.

MartinSpeller · 2020-05-06T08:40:09Z

PDF: Parentheses not handled properly in document information dictionary #358 - Assigned to deanforsmith

FIX: Count opening parentheses in literals (#358)

- added regression test files for #148, #358. #375 and #531; and - patched results files for fixes.

T3rj3 pushed a commit to T3rj3/jhove that referenced this issue Sep 5, 2018

FIX - Count opening parentheses in literals (openpreserve#358)

4d19413

T3rj3 mentioned this issue Sep 5, 2018

FIX - Count opening parentheses in literals (#358) #359

Merged

carlwilson assigned T3rj3 Dec 11, 2018

carlwilson added the bug A product defect that needs fixing label Dec 11, 2018

carlwilson added this to the v1.22-m4 Release milestone Dec 11, 2018

carlwilson added the P2 Medium priority issues to be scheduled in a future release label Dec 19, 2018

carlwilson modified the milestones: v1.22-m4 Release, Hack week tasks Apr 14, 2020

carlwilson added the good-first-issue Issue suitable for inexperienced developers label Apr 23, 2020

carlwilson assigned deanforsmith Apr 29, 2020

carlwilson mentioned this issue May 14, 2020

FIX: Count opening parentheses in literals (#358) #602

Merged

carlwilson closed this as completed in #602 May 14, 2020

carlwilson added a commit that referenced this issue May 14, 2020

Merge pull request #602 from openpreserve/T3rj3-integration

73552bb

FIX: Count opening parentheses in literals (#358)

carlwilson modified the milestones: Hackathon tasks , JHOVE 1.26 Apr 6, 2022

carlwilson added a commit that referenced this issue Apr 6, 2022

FIX: Regression tests for PDF and PNG

7d0d599

- added regression test files for #148, #358. #375 and #531; and - patched results files for fixes.

carlwilson mentioned this issue Apr 6, 2022

FIX: Regression tests for PDF and PNG #717

Merged

karenhanson mentioned this issue Apr 14, 2022

Move _rawbytes.add() to end of loop in processLiteral() so does not add close paren character #734

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PDF: Parentheses not handled properly in document information dictionary #358

PDF: Parentheses not handled properly in document information dictionary #358

T3rj3 commented Sep 5, 2018 •

edited by carlwilson

Loading

deanforsmith commented Apr 30, 2020 •

edited by david-russo

Loading

david-russo commented Apr 30, 2020 •

edited

Loading

deanforsmith commented Apr 30, 2020

david-russo commented May 1, 2020

carlwilson commented May 5, 2020

MartinSpeller commented May 6, 2020

PDF: Parentheses not handled properly in document information dictionary #358

PDF: Parentheses not handled properly in document information dictionary #358

Comments

T3rj3 commented Sep 5, 2018 • edited by carlwilson Loading

Dev Effort

Description

deanforsmith commented Apr 30, 2020 • edited by david-russo Loading

david-russo commented Apr 30, 2020 • edited Loading

deanforsmith commented Apr 30, 2020

david-russo commented May 1, 2020

carlwilson commented May 5, 2020

MartinSpeller commented May 6, 2020

T3rj3 commented Sep 5, 2018 •

edited by carlwilson

Loading

deanforsmith commented Apr 30, 2020 •

edited by david-russo

Loading

david-russo commented Apr 30, 2020 •

edited

Loading