Enable PDFDocEncoding support for metadata #611

GreyWyvern · 2023-07-04T16:02:34Z

Regular PDF metadata (outside of XMP) can, depending on the characters it includes, be stored either as UTF-8 bytes (escaped or binary) or in PDFDocEncoding, a proprietary Adobe encoding that is similar to, but not exactly like, CP1252 (WinAnsiEncoding).

For more information on the PDFDocEncoding character set, see: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf
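To illustrate why CP1252 cannot simply stand in for PDFDocEncoding, here is a minimal Python sketch (not the PHP code from this PR). The name `decode_pdfdoc_partial` and the five-entry table are hypothetical and deliberately incomplete; the full character set is in the PDF Reference linked above.

```python
# Partial PDFDocEncoding table: only a few code points where it departs from
# CP1252. This is an illustrative subset, not the complete character set.
PDFDOC_PARTIAL = {
    0x80: "\u2022",  # bullet (CP1252 has the euro sign here)
    0x84: "\u2014",  # em dash
    0x85: "\u2013",  # en dash
    0x92: "\u2122",  # trademark sign
    0xA0: "\u20ac",  # euro sign (CP1252 has a no-break space here)
}


def decode_pdfdoc_partial(raw: bytes) -> str:
    """Decode bytes with the partial table, falling back to Latin-1."""
    return "".join(PDFDOC_PARTIAL.get(b, chr(b)) for b in raw)


raw = b"\x80 10 \xa0"
print(decode_pdfdoc_partial(raw))  # PDFDocEncoding reading of the bytes
print(raw.decode("cp1252"))        # CP1252 reading of the same bytes
```

The Latin-1 fallback is only a convenience for this sketch: it gives wrong results for bytes in the 0x80-0x9F range that are missing from the partial table.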

Another issue is that, regardless of the storage encoding used, Adobe Acrobat will attempt to add a backslash followed by a line break (\r) to metadata text to avoid long line lengths (~127 bytes) in the saved PDF data. Unfortunately, the method it uses does not seem to be binary-safe, so the saved UTF-8 bytes end up damaged and must be repaired.
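The kind of repair described above could look roughly like the following Python sketch. It is not the code from this PR. It assumes the inserted marker is a backslash followed by CR, LF, or CRLF, and that the damage consists of the marker landing between the bytes of a multi-byte UTF-8 character. `strip_line_wraps` is a hypothetical name.

```python
import re

# Assumption: the wrap marker is a backslash followed by CR, LF, or CRLF.
LINE_WRAP = re.compile(rb"\\(?:\r\n|\r|\n)")


def strip_line_wraps(raw: bytes) -> bytes:
    """Remove backslash + line-break pairs so split byte sequences rejoin."""
    return LINE_WRAP.sub(b"", raw)


# "é" is 0xC3 0xA9 in UTF-8; here a wrap has landed between its two bytes,
# so the damaged bytes are no longer valid UTF-8 until the wrap is removed.
damaged = b"caf\xc3\\\r\xa9 au lait"
repaired = strip_line_wraps(damaged)
print(repaired.decode("utf-8"))
```

A real implementation would also need to leave alone an escaped backslash that merely happens to precede a genuine line break, which this sketch does not attempt.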

This commit enables decoding PDF metadata using PDFDocEncoding, and also repairs added line-feeds in both PDFDocEncoding and UTF-8.

It also adds a sample file "Issue609.pdf" containing both UTF-8 and PDFDocEncoding-encoded metadata fields for testing. The name of the file references PDFParser issue #609.

Edit: My apologies for the failing tests. I don't think I'm running PHPStan properly since I'm using the phar instead of composer. My env doesn't have direct/full access to the internet, so I can't really use composer.

GreyWyvern · 2023-07-04T19:23:57Z

I just did a quick glance-through, but I can confirm that this change resolves issues #366 and #79.

Edit: Or I should say, my previous PR resolves #79 and this one resolves #366. :)

k00ni

@GreyWyvern Thank you again for taking the time to improve this library.

I have a few remarks, but nothing big.

src/Smalot/PdfParser/Document.php

src/Smalot/PdfParser/Encoding/PDFDocEncoding.php

k00ni

Looks good to me.

GreyWyvern added 2 commits July 4, 2023 11:50

Update PDFDocEncoding.php

e7c30c5

I hope I am not assuming too much by adding myself as the author of this file!

k00ni added the "enhancement" and "de-/encoding issue" labels Jul 5, 2023

This was referenced Jul 6, 2023

Weird characters and missing text in multiline metadata text #366

Closed

Problem with multiple authors #79

Closed

This was linked to issues Jul 6, 2023

Problem with multiple authors #79

Closed

Weird characters and missing text in multiline metadata text #366

Closed

k00ni requested changes Jul 6, 2023

k00ni removed a link to an issue Jul 6, 2023

Problem with multiple authors #79

Closed

GreyWyvern and others added 3 commits July 6, 2023 09:29

PR smalot#611 suggested changes

d660b77

Add comments in Document.php
Use plain class PDFDocEncoding, do not extend AbstractEncoding
array() => []
Break up class functions into one that returns the code table, and another that uses the table to perform the conversion
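The split described in this commit message (one function that returns the code table, another that applies it) can be sketched as below. This is a Python illustration of the shape only, not the PHP class from the PR. The class name, method names, and two-entry table are hypothetical.

```python
class PDFDocEncodingSketch:
    """Plain class (no base class), split into a table method and a converter."""

    @staticmethod
    def code_page() -> dict:
        # Hypothetical, heavily truncated table. Keys are code point ordinals,
        # which is the form str.translate expects.
        return {0x80: "\u2022", 0xA0: "\u20ac"}

    @classmethod
    def to_text(cls, raw: bytes) -> str:
        # Read the bytes as Latin-1 so each byte becomes one code point,
        # then remap the code points listed in the table.
        return raw.decode("latin-1").translate(cls.code_page())


print(PDFDocEncodingSketch.to_text(b"\x80 5 \xa0"))
```

Keeping the table behind its own method means it can be inspected or tested separately from the conversion step.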

fixed coding style issues in Document.php

a029772

fixed coding style issue in PDFDocEncoding.php

66e8d9e

k00ni approved these changes Jul 10, 2023

k00ni self-assigned this Jul 10, 2023

k00ni merged commit d03ef96 into smalot:master Jul 11, 2023
26 checks passed

k00ni mentioned this pull request Jul 11, 2023

Missing PDFDocEncoding #609

Closed
