Enable PDFDocEncoding support for metadata #611

Merged 5 commits on Jul 11, 2023

@GreyWyvern GreyWyvern commented Jul 4, 2023

Regular PDF metadata (outside of XMP) can, depending on the characters it includes, be stored as UTF-8 bytes (escaped or binary) or in PDFDocEncoding, a proprietary Adobe encoding that is similar to, but not exactly like, CP1252 (WinAnsiEncoding).

For more information on the PDFDocEncoding character set, see: https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf
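
To illustrate how PDFDocEncoding differs from CP1252, here is a minimal Python sketch (not the PHP code in this PR). The table is deliberately partial and the names `PDFDOC_SAMPLE` and `decode_pdfdoc_sample` are hypothetical; the full character set should be taken from the PDF reference linked above.

```python
# Hypothetical, partial table: only a few PDFDocEncoding bytes are listed.
PDFDOC_SAMPLE = {
    0x80: "\u2022",  # BULLET (CP1252 has EURO SIGN here)
    0x84: "\u2014",  # EM DASH
    0x85: "\u2013",  # EN DASH
    0x92: "\u2122",  # TRADE MARK SIGN (CP1252 has a right single quote here)
    0xA0: "\u20ac",  # EURO SIGN (CP1252 has NO-BREAK SPACE here)
}


def decode_pdfdoc_sample(data: bytes) -> str:
    """Decode bytes using the sample table, falling back to Latin-1.

    The fallback is only right for ASCII and most of 0xA1-0xFF; bytes in
    0x18-0x1F and 0x80-0x9F that are missing from the table need the
    full character set from the PDF reference.
    """
    return "".join(
        PDFDOC_SAMPLE.get(byte, bytes([byte]).decode("latin-1")) for byte in data
    )


raw = b"caf\xe9 \x80 5\xa0"
print(decode_pdfdoc_sample(raw))  # PDFDocEncoding reading
print(raw.decode("cp1252"))       # CP1252 reading of the same bytes
```

The same bytes give different text under the two encodings, which is why decoding PDFDocEncoding metadata as CP1252 produces wrong characters for some values.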

Another issue is that, regardless of the storage encoding used, Adobe Acrobat will attempt to insert a backslash followed by a line break (\r) into metadata text to avoid long lines (~127 bytes) in the saved PDF data. Unfortunately, the method used to do this does not seem to be binary-safe, resulting in saved UTF-8 bytes that are damaged and must be repaired.
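
The repair idea can be sketched in Python as follows. This is an illustration only, not the PHP code in this PR: it assumes the inserted sequence is a literal backslash followed by an end-of-line marker, and the names `LINE_CONTINUATION` and `strip_line_continuations` are hypothetical.

```python
import re

# Backslash followed by an end-of-line marker (CR LF, CR, or LF).
LINE_CONTINUATION = re.compile(rb"\\(?:\r\n|\r|\n)")


def strip_line_continuations(raw: bytes) -> bytes:
    """Remove backslash + EOL pairs at the byte level, before decoding."""
    return LINE_CONTINUATION.sub(b"", raw)


# Hypothetical damaged value: the two-byte UTF-8 sequence for U+00E9
# (0xC3 0xA9) has been split by an inserted backslash + carriage return.
damaged = b"caf\xc3\\\r\xa9"
repaired = strip_line_continuations(damaged)
print(repaired.decode("utf-8"))
```

Working on raw bytes before decoding matters here, because the damaged value is not valid UTF-8 until the inserted bytes are removed. This sketch does not handle an escaped backslash (`\\`) that happens to sit directly before a real line break.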

This commit enables decoding PDF metadata using PDFDocEncoding, and also repairs added line-feeds in both PDFDocEncoding and UTF-8.

It also adds a sample file "Issue609.pdf" containing both UTF-8 and PDFDocEncoding encoded metadata fields for testing. The name of the file references PDFParser issue #609.

Edit: My apologies for the failing tests. I don't think I'm running PHPStan properly since I'm using the phar instead of composer. My env doesn't have direct/full access to the internet, so I can't really use composer.

I hope I am not assuming too much by adding myself as the author of this file!

GreyWyvern commented Jul 4, 2023

I just did a quick glance-through, but I can confirm that this change resolves issues #366 and #79.

Edit: Or I should say, my previous PR resolves #79 and this one resolves #366. :)

@k00ni k00ni left a comment

@GreyWyvern Thank you again for taking the time to improve this library.

I have a few remarks, but nothing big.

src/Smalot/PdfParser/Document.php (outdated, resolved)
src/Smalot/PdfParser/Encoding/PDFDocEncoding.php (outdated, resolved)
src/Smalot/PdfParser/Encoding/PDFDocEncoding.php (outdated, resolved)
src/Smalot/PdfParser/Encoding/PDFDocEncoding.php (outdated, resolved)
@k00ni k00ni removed a link to an issue Jul 6, 2023
GreyWyvern and others added 3 commits July 6, 2023 09:29
Add comments in Document.php
Use plain class PDFDocEncoding, do not extend AbstractEncoding
array() => []
Break up class functions into one that returns the code table, and another that uses the table to perform the conversion

@k00ni k00ni left a comment

Looks good to me.

@k00ni k00ni self-assigned this Jul 10, 2023
@k00ni k00ni merged commit d03ef96 into smalot:master Jul 11, 2023
26 checks passed
@k00ni k00ni mentioned this pull request Jul 11, 2023

Successfully merging this pull request may close these issues.

Weird characters and missing text in multiline metadata text