PDF: Problem parsing Unicode text strings #277

david-russo · 2017-10-01T14:06:21Z

Dev Effort

1D

Description

The Title field of this document's information dictionary contains a byte sequence which causes the PDF parser to read beyond the end of the text string and interpret the succeeding bytes as additional Unicode characters, eventually leading to a NullPointerException. The problematic byte sequence starts at offset 146AA4, and may be part of a backslash escape sequence.

Problem found in JHOVE 1.16.7, PDF-hul 1.9.

The text was updated successfully, but these errors were encountered:

samalloing · 2020-04-29T09:18:59Z

Hi @carlwilson ,

Can I take this issue next?

Sam

samalloing · 2020-05-05T08:51:35Z

Hi Carl,

I want to report back on this issue. I found a solution, but I'm not happy with it yet, because I wonder if there is a more generic solution. I didn't do a pull request yet, but left the commit on my github: samalloing@e12937d.
The problem is that after the backslash at the position indicated in the issue description there are two characters that are single byte instead of double byte. This mess up the processing of the string after that. The first character is an escaped enter. I translated it to 0 and continue the state to LITERAL_UTF16_2. This makes it work.
So in code this means in the readbackslash method I add an extra case for the 0xd character and return 0 as it is an error. On the processLiteral method where the UTF-16 is processed I set the state for invalid cases to _state = State.LITERAL_UTF16_2.
So I'm not sure if this fix is good to add as a solution for the issue.

I also corrected two other problems. One is that the PDF_HUL_149 error wasn't showing up in the error messages. What happens is the error message is added to PDF_HUL_122. So the ID from PDF_HUL_149 is removed, but the message is added to PDF_HUL_122. I don't understand why this is happening. It is also not clear what the purpose is of this. Because you don't need PDF_HUL_149 then and this is the only occurrence. The solution could be to remove PDF_HUL_149 or just to add PDF_HUL_149 as an ID. I choose to add PDF_HUL_149 as an ID. This leaves the problem that because an invalid pdf exception is thrown that we also come in to a catch part of addDestination method and the message (not the ID) is added to PDF_HUL_122 as well. So I would suggest removing the invalidPdfException and leave only the PDF_HUL_149 error. I also added RepInfo to resolveIndirectDest for this to work.

The last fix is in the output of jhove in Outlines - Item - Destination the java object is printed instead of the contents of the Destination. I added the getStringValue to dest.getIndirectDest()

jhove/jhove-modules/pdf-hul/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java

Line 4041 in de1f5fd

PropertyType.STRING, dest.getIndirectDest()));

. This is just a bug. So this is save to add. I'm not sure about the other fixes.

Thanks for any feedback on these fixes.

Sam

david-russo · 2020-05-05T15:32:53Z

The relevant part of the PDF (1.7) spec. seems to be section 7.3.4.2 "Literal Strings", which includes:

A conforming writer may split a literal string across multiple lines. The REVERSE SOLIDUS (5Ch) (backslash character) at the end of a line shall be used to indicate that the string continues on the following line. A conforming reader shall disregard the REVERSE SOLIDUS and the end-of-line marker following it when reading the string; the resulting string value shall be identical to that which would be read if the string were not split.
EXAMPLE 2  (These \
           two strings \
           are the same.)
           (These two strings are the same.)

Which seems to indicate that if we come across an escaped newline (5C0A or 5C0D) in a string, we should disregard it and continue on as if it hadn't been there.

The problematic newline escape sequence (5C0D) appears to have been inserted right in the middle of what should be a UTF-16-encoded "a" (0061) and part of the word "at" (0061 0074):

... 005C 0D61 0074 ...

So if we have the parser discard any escaped newlines, that should allow the recombination of the a's first byte (00) with its last byte (61) to form a valid UTF-16 character.

The key thing to recognize here is that the leading 00s don't belong to the 5C ("\") because the "\" isn't a (double-byte) UTF-16-encoded character in the string, but instead part of a PDF-specific escape syntax apparently encoded in (single-byte) ASCII. If the escape sequence were encoded in UTF-16, it would have been 005C 000D.

carlwilson · 2020-05-05T23:20:37Z

Am just completing a sweep up the notifications and it's too late to get my head into this so leaving this one for a less sleepy head tomorrow.

MartinSpeller · 2020-05-06T08:45:41Z

PDF: Problem parsing Unicode text strings #277 - Assigned to samalloing

samalloing · 2020-05-06T12:05:59Z

Thanks @david-russo , this helps a lot. I'll look into this!

@carlwilson, no problem. Take your time

@MartinSpeller , do you expect me to do something with this trello task? Because I don't have access to the Trello board.

Sam

jackdos · 2020-05-11T13:07:38Z

@david-russo I wouldn't expect that an escaped new-line character to be written in the middle of another character, which is what you seem to be suggesting here? That sounds like the writer treating the UTF-16 string as UTF-8 when re-formatting something? Is that potentially an error in formatting that the parser should be recognising? UTF-16 parsing should find 00 5c as a single backslash character, could we say that that character followed by an immediate single-byte LF or CR is actually an error?

david-russo · 2020-05-11T17:40:07Z

At first I thought it might've been an error in the writer as well, but Adobe's own Acrobat Distiller 8 & 9 seem to write titles like this (based on Producer metadata), and Adobe Reader has no trouble correctly displaying any of those I've come across, so I suspect this may not be an invalid way of encoding escape sequences in UTF-16 string literals.

Any characters may appear in a string except unbalanced parentheses and the backslash (REVERSE SOLIDUS (5Ch)), which shall be treated specially as described in this subclause...

Within a literal string, the REVERSE SOLIDUS is used as an escape character. The character immediately following the REVERSE SOLIDUS determines its precise interpretation as shown in "Table 3 — Escape sequences in literal strings". If the character following the REVERSE SOLIDUS is not one of those shown in "Table 3 ...", the REVERSE SOLIDUS shall be ignored.

Table 3 shows the escape sequences and their meanings, eg: "\n = LINE FEED (0Ah) (LF)". From that part of the spec. and its associated table, it seems like any appearance of a 5C byte should start an escape sequence of some kind, or be ignored.

Interpreting the spec. with the assumption that these encodings were valid led me to my earlier reasoning that the escape sequence characters (and the characters they encode) are always written as 8-bits each, and should be removed from the character sequence entirely where appropriate (new lines), or replaced with their intended 8-bit character (such as an escaped left parenthesis (28h)), before rendering for display.

Of course replacing two 8-bit characters with one 8-bit character (or none) doesn't make an entire 16-bit UTF-16 character. But because all (?) valid escape sequences seem to result in characters that fit in a single byte, I'm guessing the PDF writer may automatically write the first UTF-16 byte as 00h, which would make it compatible with anything that follows in the ASCII range. This should allow the reader to just throw out all 5Cs (and any following newlines) as it comes across them, leaving the remaining 8-bit character that was escaped to complete the last byte of the UTF-16 character.

It does seem a little convoluted, but it might be because string literals can be in a number of different encodings (e.g. UTF, PDFDoc), and these escape sequences appear to apply to all types of encoding, so it might be expected that all string literals are subject to a two-part process of stripping and replacing escape sequences from what is essentially a binary stream before trying to decode that stream as characters.

Argh — too. many. words.

ghost added this to the Release v1.20 milestone Mar 9, 2018

ghost added the bug A product defect that needs fixing label Mar 9, 2018

ghost removed this from the Release v1.20 milestone Mar 9, 2018

ghost added P2 Medium priority issues to be scheduled in a future release P3 Low priority bugs and removed P2 Medium priority issues to be scheduled in a future release labels Feb 28, 2019

ghost added this to the Hack week initiation milestone Feb 28, 2019

rgfeldman added a commit to rgfeldman/jhove that referenced this issue Apr 10, 2019

openpreserve#277

0b00cac

carlwilson added the good-first-issue Issue suitable for inexperienced developers label Apr 23, 2020

carlwilson assigned samalloing Apr 29, 2020

david-russo mentioned this issue May 11, 2020

validation process with specific pdf stuck in infinite loop #473

Open

carlwilson closed this as completed in e12937d May 11, 2020

carlwilson reopened this May 12, 2020

This was referenced Jan 19, 2022

Fix issues with backslash sequence #652

Merged

Safely exit infinite loops on AProfile.outlinesOK / checkItemOutline #704

Merged

carlwilson added this to PDF Module Improvements Nov 25, 2022

karenhanson mentioned this issue Mar 20, 2023

ErrorMessage repeated many times #165

Open

carlwilson modified the milestones: Hackathon tasks , OPF Hackathon 2023 Tasks Jun 21, 2023

karenhanson mentioned this issue Aug 31, 2023

Broken XML-Result #880

Open

carlwilson removed this from the OPF Hackathon 2023 Tasks milestone Mar 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PDF: Problem parsing Unicode text strings #277

PDF: Problem parsing Unicode text strings #277

david-russo commented Oct 1, 2017 •

edited by ghost

Loading

samalloing commented Apr 29, 2020

samalloing commented May 5, 2020

david-russo commented May 5, 2020 •

edited

Loading

carlwilson commented May 5, 2020

MartinSpeller commented May 6, 2020

samalloing commented May 6, 2020 •

edited

Loading

jackdos commented May 11, 2020

david-russo commented May 11, 2020

PDF: Problem parsing Unicode text strings #277

PDF: Problem parsing Unicode text strings #277

Comments

david-russo commented Oct 1, 2017 • edited by ghost Loading

Dev Effort

Description

samalloing commented Apr 29, 2020

samalloing commented May 5, 2020

david-russo commented May 5, 2020 • edited Loading

carlwilson commented May 5, 2020

MartinSpeller commented May 6, 2020

samalloing commented May 6, 2020 • edited Loading

jackdos commented May 11, 2020

david-russo commented May 11, 2020

david-russo commented Oct 1, 2017 •

edited by ghost

Loading

david-russo commented May 5, 2020 •

edited

Loading

samalloing commented May 6, 2020 •

edited

Loading