Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PDF: Problem parsing Unicode text strings #277

Open
david-russo opened this issue Oct 1, 2017 · 8 comments
Open

PDF: Problem parsing Unicode text strings #277

david-russo opened this issue Oct 1, 2017 · 8 comments
Assignees
Labels
bug A product defect that needs fixing good-first-issue Issue suitable for inexperienced developers P3 Low priority bugs

Comments

@david-russo
Copy link
Member

david-russo commented Oct 1, 2017

Dev Effort

1D

Description

The Title field of this document's information dictionary contains a byte sequence which causes the PDF parser to read beyond the end of the text string and interpret the succeeding bytes as additional Unicode characters, eventually leading to a NullPointerException. The problematic byte sequence starts at offset 146AA4, and may be part of a backslash escape sequence.

Problem found in JHOVE 1.16.7, PDF-hul 1.9.

@ghost ghost added this to the Release v1.20 milestone Mar 9, 2018
@ghost ghost added the bug A product defect that needs fixing label Mar 9, 2018
@ghost ghost removed this from the Release v1.20 milestone Mar 9, 2018
@ghost ghost added P2 Medium priority issues to be scheduled in a future release P3 Low priority bugs and removed P2 Medium priority issues to be scheduled in a future release labels Feb 28, 2019
@ghost ghost added this to the Hack week initiation milestone Feb 28, 2019
rgfeldman added a commit to rgfeldman/jhove that referenced this issue Apr 10, 2019
@carlwilson carlwilson added the good-first-issue Issue suitable for inexperienced developers label Apr 23, 2020
@samalloing
Copy link
Collaborator

Hi @carlwilson ,

Can I take this issue next?

Sam

@samalloing
Copy link
Collaborator

Hi Carl,

I want to report back on this issue. I found a solution, but I'm not happy with it yet, because I wonder if there is a more generic solution. I didn't do a pull request yet, but left the commit on my github: samalloing@e12937d.
The problem is that after the backslash at the position indicated in the issue description there are two characters that are single byte instead of double byte. This mess up the processing of the string after that. The first character is an escaped enter. I translated it to 0 and continue the state to LITERAL_UTF16_2. This makes it work.
So in code this means in the readbackslash method I add an extra case for the 0xd character and return 0 as it is an error. On the processLiteral method where the UTF-16 is processed I set the state for invalid cases to _state = State.LITERAL_UTF16_2.
So I'm not sure if this fix is good to add as a solution for the issue.

I also corrected two other problems. One is that the PDF_HUL_149 error wasn't showing up in the error messages. What happens is the error message is added to PDF_HUL_122. So the ID from PDF_HUL_149 is removed, but the message is added to PDF_HUL_122. I don't understand why this is happening. It is also not clear what the purpose is of this. Because you don't need PDF_HUL_149 then and this is the only occurrence. The solution could be to remove PDF_HUL_149 or just to add PDF_HUL_149 as an ID. I choose to add PDF_HUL_149 as an ID. This leaves the problem that because an invalid pdf exception is thrown that we also come in to a catch part of addDestination method and the message (not the ID) is added to PDF_HUL_122 as well. So I would suggest removing the invalidPdfException and leave only the PDF_HUL_149 error. I also added RepInfo to resolveIndirectDest for this to work.

The last fix is in the output of jhove in Outlines - Item - Destination the java object is printed instead of the contents of the Destination. I added the getStringValue to dest.getIndirectDest()

. This is just a bug. So this is save to add. I'm not sure about the other fixes.

Thanks for any feedback on these fixes.

Sam

@david-russo
Copy link
Member Author

david-russo commented May 5, 2020

The relevant part of the PDF (1.7) spec. seems to be section 7.3.4.2 "Literal Strings", which includes:

A conforming writer may split a literal string across multiple lines. The REVERSE SOLIDUS (5Ch) (backslash character) at the end of a line shall be used to indicate that the string continues on the following line. A conforming reader shall disregard the REVERSE SOLIDUS and the end-of-line marker following it when reading the string; the resulting string value shall be identical to that which would be read if the string were not split.

EXAMPLE 2  (These \
           two strings \
           are the same.)
           (These two strings are the same.)

Which seems to indicate that if we come across an escaped newline (5C0A or 5C0D) in a string, we should disregard it and continue on as if it hadn't been there.

The problematic newline escape sequence (5C0D) appears to have been inserted right in the middle of what should be a UTF-16-encoded "a" (0061) and part of the word "at" (0061 0074):

... 005C 0D61 0074 ...

So if we have the parser discard any escaped newlines, that should allow the recombination of the a's first byte (00) with its last byte (61) to form a valid UTF-16 character.

The key thing to recognize here is that the leading 00s don't belong to the 5C ("\") because the "\" isn't a (double-byte) UTF-16-encoded character in the string, but instead part of a PDF-specific escape syntax apparently encoded in (single-byte) ASCII. If the escape sequence were encoded in UTF-16, it would have been 005C 000D.

@carlwilson
Copy link
Member

Am just completing a sweep up the notifications and it's too late to get my head into this so leaving this one for a less sleepy head tomorrow.

@MartinSpeller
Copy link

@samalloing
Copy link
Collaborator

samalloing commented May 6, 2020

Thanks @david-russo , this helps a lot. I'll look into this!

@carlwilson, no problem. Take your time

@MartinSpeller , do you expect me to do something with this trello task? Because I don't have access to the Trello board.

Sam

@jackdos
Copy link

jackdos commented May 11, 2020

@david-russo I wouldn't expect that an escaped new-line character to be written in the middle of another character, which is what you seem to be suggesting here? That sounds like the writer treating the UTF-16 string as UTF-8 when re-formatting something? Is that potentially an error in formatting that the parser should be recognising? UTF-16 parsing should find 00 5c as a single backslash character, could we say that that character followed by an immediate single-byte LF or CR is actually an error?

@david-russo
Copy link
Member Author

At first I thought it might've been an error in the writer as well, but Adobe's own Acrobat Distiller 8 & 9 seem to write titles like this (based on Producer metadata), and Adobe Reader has no trouble correctly displaying any of those I've come across, so I suspect this may not be an invalid way of encoding escape sequences in UTF-16 string literals.

Any characters may appear in a string except unbalanced parentheses and the backslash (REVERSE SOLIDUS (5Ch)), which shall be treated specially as described in this subclause...

Within a literal string, the REVERSE SOLIDUS is used as an escape character. The character immediately following the REVERSE SOLIDUS determines its precise interpretation as shown in "Table 3 — Escape sequences in literal strings". If the character following the REVERSE SOLIDUS is not one of those shown in "Table 3 ...", the REVERSE SOLIDUS shall be ignored.

Table 3 shows the escape sequences and their meanings, eg: "\n = LINE FEED (0Ah) (LF)". From that part of the spec. and its associated table, it seems like any appearance of a 5C byte should start an escape sequence of some kind, or be ignored.

Interpreting the spec. with the assumption that these encodings were valid led me to my earlier reasoning that the escape sequence characters (and the characters they encode) are always written as 8-bits each, and should be removed from the character sequence entirely where appropriate (new lines), or replaced with their intended 8-bit character (such as an escaped left parenthesis (28h)), before rendering for display.

Of course replacing two 8-bit characters with one 8-bit character (or none) doesn't make an entire 16-bit UTF-16 character. But because all (?) valid escape sequences seem to result in characters that fit in a single byte, I'm guessing the PDF writer may automatically write the first UTF-16 byte as 00h, which would make it compatible with anything that follows in the ASCII range. This should allow the reader to just throw out all 5Cs (and any following newlines) as it comes across them, leaving the remaining 8-bit character that was escaped to complete the last byte of the UTF-16 character.

It does seem a little convoluted, but it might be because string literals can be in a number of different encodings (e.g. UTF, PDFDoc), and these escape sequences appear to apply to all types of encoding, so it might be expected that all string literals are subject to a two-part process of stripping and replacing escape sequences from what is essentially a binary stream before trying to decode that stream as characters.

Argh — too. many. words.

@carlwilson carlwilson reopened this May 12, 2020
@carlwilson carlwilson modified the milestones: Hackathon tasks , OPF Hackathon 2023 Tasks Jun 21, 2023
@carlwilson carlwilson removed this from the OPF Hackathon 2023 Tasks milestone Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A product defect that needs fixing good-first-issue Issue suitable for inexperienced developers P3 Low priority bugs
Projects
Status: No status
Development

No branches or pull requests

5 participants