PDF: Problem parsing Unicode text strings #277
Comments
Hi @carlwilson, can I take this issue next? Sam
Hi Carl, I want to report back on this issue. I found a solution, but I'm not happy with it yet, because I wonder whether there is a more generic one. I haven't opened a pull request yet, but the commit is on my GitHub: samalloing@e12937d.

I also corrected two other problems. The first is that the PDF_HUL_149 error wasn't showing up in the error messages. What happens is that the ID of PDF_HUL_149 is dropped and its message is added to PDF_HUL_122 instead. I don't understand why this happens, and its purpose isn't clear either, because PDF_HUL_149 is then never needed and this is its only occurrence. The solution could be to remove PDF_HUL_149 or to add PDF_HUL_149 as an ID; I chose to add it as an ID. That leaves one problem: because an invalid PDF exception is thrown, we also end up in a catch block of the addDestination method, and the message (not the ID) is added to PDF_HUL_122 as well. So I would suggest removing the invalidPdfException and leaving only the PDF_HUL_149 error. I also added RepInfo to resolveIndirectDest for this to work.

The last fix concerns the JHOVE output: under Outlines - Item - Destination, the Java object is printed instead of the contents of the Destination. I added getStringValue to dest.getIndirectDest() at jhove/jhove-modules/pdf-hul/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java, line 4041 in de1f5fd.

Thanks for any feedback on these fixes. Sam
The relevant part of the PDF (1.7) spec. seems to be section 7.3.4.2 "Literal Strings", which includes:
Which seems to indicate that if we come across an escaped newline ( The problematic newline escape sequence (
So if we have the parser discard any escaped newlines, that should allow the recombination of the a's first byte ( The key thing to recognize here is that the leading |
I'm just completing a sweep of the notifications and it's too late to get my head into this, so I'm leaving this one for a less sleepy head tomorrow.
Thanks @david-russo, this helps a lot. I'll look into this!
@carlwilson, no problem. Take your time.
@MartinSpeller, do you expect me to do something with this Trello task? I ask because I don't have access to the Trello board.
Sam
@david-russo I wouldn't expect an escaped newline character to be written in the middle of another character, which is what you seem to be suggesting here. That sounds like the writer treating the UTF-16 string as UTF-8 when re-formatting something. Is that potentially a formatting error that the parser should recognise? UTF-16 parsing should read 00 5C as a single backslash character; could we say that this character followed immediately by a single-byte LF or CR is actually an error?
At first I thought it might've been an error in the writer as well, but Adobe's own Acrobat Distiller 8 & 9 seem to write titles like this (based on Producer metadata), and Adobe Reader has no trouble correctly displaying any of those I've come across, so I suspect this may not be an invalid way of encoding escape sequences in UTF-16 string literals.
Table 3 shows the escape sequences and their meanings, e.g. "\n = LINE FEED (0Ah) (LF)". From that part of the spec and its associated table, it seems that any appearance of a 5C byte should either start an escape sequence of some kind or be ignored.

Interpreting the spec on the assumption that these encodings are valid led me to my earlier reasoning: the escape sequence characters (and the characters they encode) are always written as 8 bits each, and should either be removed from the character sequence entirely where appropriate (newlines) or be replaced with their intended 8-bit character (such as an escaped left parenthesis, 28h) before rendering for display.

Of course, replacing two 8-bit characters with one 8-bit character (or none) doesn't make an entire 16-bit UTF-16 character. But because all (?) valid escape sequences seem to result in characters that fit in a single byte, I'm guessing the PDF writer may automatically write the first UTF-16 byte as 00h, which would make it compatible with anything that follows in the ASCII range. This should allow the reader to simply throw out every 5C (and any following newline) as it comes across it, leaving the remaining 8-bit character that was escaped to complete the last byte of the UTF-16 character.

It does seem a little convoluted, but that might be because string literals can be in a number of different encodings (e.g. UTF, PDFDoc), and these escape sequences appear to apply to all of them. So it might be expected that all string literals go through a two-part process: stripping and replacing escape sequences in what is essentially a binary stream, and only then decoding that stream as characters.

Argh, too. many. words.
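The two-part process described above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the JHOVE implementation: the class `LiteralStringSketch` and its methods `unescape` and `decode` are hypothetical names, ISO-8859-1 is used only as a stand-in for PDFDocEncoding, and bare (unescaped) end-of-line markers are not normalised. The first pass resolves backslash escapes on raw bytes; only the second pass interprets the result as characters.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of JHOVE: illustrates the two-pass idea only. */
final class LiteralStringSketch {

    /** Pass 1: resolve backslash escapes on the raw bytes between the parentheses. */
    static byte[] unescape(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < raw.length) {
            int b = raw[i++] & 0xFF;
            if (b != '\\') {              // ordinary byte: copy through unchanged
                out.write(b);
                continue;
            }
            if (i >= raw.length) {
                break;                    // lone trailing backslash: ignore it
            }
            int e = raw[i++] & 0xFF;
            switch (e) {
                case 'n': out.write('\n'); break;
                case 'r': out.write('\r'); break;
                case 't': out.write('\t'); break;
                case 'b': out.write('\b'); break;
                case 'f': out.write('\f'); break;
                case '\r':                // escaped CR or CRLF: line continuation, emit nothing
                    if (i < raw.length && raw[i] == '\n') {
                        i++;
                    }
                    break;
                case '\n':                // escaped LF: line continuation, emit nothing
                    break;
                default:
                    if (e >= '0' && e <= '7') {
                        // Octal escape: one to three digits, high-order overflow ignored.
                        int v = e - '0';
                        for (int n = 0; n < 2 && i < raw.length
                                && raw[i] >= '0' && raw[i] <= '7'; n++) {
                            v = (v << 3) + (raw[i++] - '0');
                        }
                        out.write(v & 0xFF);
                    } else {
                        // \( \) \\ and unknown escapes: drop the backslash, keep the byte.
                        out.write(e);
                    }
            }
        }
        return out.toByteArray();
    }

    /** Pass 2: decode the unescaped bytes, using the FE FF marker to detect UTF-16BE. */
    static String decode(byte[] raw) {
        byte[] bytes = unescape(raw);
        if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF) {
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 stands in for PDFDocEncoding in this sketch.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

With this ordering, the byte sequence FE FF 00 41 00 5C 0A 61 loses the escaped newline (5C 0A) in the first pass, so the 00 before it pairs with the 61 after it and the string decodes as "Aa" rather than running past its end.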
Dev Effort
1D
Description
The Title field of this document's information dictionary contains a byte sequence which causes the PDF parser to read beyond the end of the text string and interpret the succeeding bytes as additional Unicode characters, eventually leading to a NullPointerException. The problematic byte sequence starts at offset 146AA4, and may be part of a backslash escape sequence.

Problem found in JHOVE 1.16.7, PDF-hul 1.9.