Some characters in the text layer are wrong compared to what is displayed #13260

Closed
calixteman opened this issue Apr 18, 2021 · 6 comments · Fixed by #13277

Comments

@calixteman
Contributor

Attach (recommended) or Link to PDF file here:
https://web.archive.org/web/20091223123331/http://sci2s.ugr.es/keel/pdf/specific/articulo/alp99.pdf
(aka tests/pdfs/issue1936.pdf)

When I copy/paste the multiplication sign on the first line of the first page, I get a £.
When I do the same in evince, I get a × (the same in Chrome).
The font descriptor is:

<< /Ascent 0 /CapHeight 749 /CharSet (/multiply) /Descent 0 /Flags 262212 /FontBBox [ -27 -940 1332 825 ] /FontFile3 59 0 R /FontName /KHPEPJ+CMBSY10 /ItalicAngle -14.035 /StemV 0 /Type /FontDescriptor >>

So it should be possible to guess the correct character by using the CharSet entry:
https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=291
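
A minimal sketch of that idea, in Python rather than the viewer's own code. It assumes the CharSet value has already been read out of the font descriptor as a string; the helper names and the two-entry glyph table are hypothetical, and the table is only a tiny excerpt of the Adobe Glyph List. The guess is only attempted when the subset contains exactly one glyph name.

```python
import re
from typing import List, Optional

# Hypothetical, tiny excerpt of the Adobe Glyph List (glyph name -> code point).
GLYPH_TO_UNICODE = {
    "multiply": 0x00D7,  # multiplication sign
    "sterling": 0x00A3,  # pound sign
}


def parse_charset(charset: str) -> List[str]:
    """Split a CharSet string such as "/multiply/minus" into glyph names."""
    return re.findall(r"/([^/\s]+)", charset)


def guess_single_glyph(charset: str) -> Optional[str]:
    """Return a character only when the subset holds exactly one known glyph."""
    names = parse_charset(charset)
    if len(names) != 1:
        # CharSet only lists which glyphs exist; it carries no character codes.
        return None
    code_point = GLYPH_TO_UNICODE.get(names[0])
    return chr(code_point) if code_point is not None else None


guessed = guess_single_glyph("/multiply")
print(f"U+{ord(guessed):04X}")  # U+00D7
```

With more than one name in CharSet, the function gives up, because nothing in the string says which character code belongs to which glyph.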

@calixteman
Contributor Author

I grepped for CharSet in the poppler repo and found nothing, so the information is likely in the font itself.

@Snuffleupagus
Collaborator

Please note that Adobe Reader, i.e. the reference implementation, is "wrong" as well here. (The real bug is in the incomplete /Encoding-data of the font in question.)

> So it should be possible to guess the correct character by using the CharSet entry:
> https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=291
> [...]
> I grepped for CharSet in the poppler repo and found nothing, so the information is likely in the font itself.

Given that the specification mentions "The names may appear in any order." for CharSet, that one is probably not all that useful.

@calixteman
Contributor Author

I noticed the sentence "The names may appear in any order." too and I don't really understand how CharSet can be used.

@calixteman
Contributor Author

I extracted the font from the PDF and ran:

fc-query --format='%{charset}\n' foo.ttf

and I got d7, which is the code point of the multiplication sign in Unicode (U+00D7).
I dumped the CFFFont for it in the console, and cff.charset.charset is [".notdef", "multiply"], so the information really is in the font itself.
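
As a quick cross-check of those two observations, here is a small Python sketch. It assumes the list index in the dumped charset is the glyph ID (with glyph 0 being .notdef); the one-entry glyph table and the function name are hypothetical and only cover what is needed here.

```python
import unicodedata

# Glyph names as dumped from cff.charset.charset; the list index is assumed
# to be the glyph ID, with glyph 0 reserved for ".notdef".
charset = [".notdef", "multiply"]

# Hypothetical excerpt of a glyph-name to code-point table.
GLYPH_TO_UNICODE = {"multiply": 0x00D7}

# Code point reported by fc-query, as a hexadecimal string.
fc_query_output = "d7"


def glyph_code_points(names):
    """Map each glyph ID to a code point, skipping names with no table entry."""
    return {
        glyph_id: GLYPH_TO_UNICODE[name]
        for glyph_id, name in enumerate(names)
        if name in GLYPH_TO_UNICODE
    }


mapping = glyph_code_points(charset)
print(mapping)                                           # {1: 215}
print(unicodedata.name(chr(int(fc_query_output, 16))))   # MULTIPLICATION SIGN
print(unicodedata.name(chr(0xA3)))                       # POUND SIGN
```

Both routes agree on U+00D7 for the only real glyph in the font, while the pasted £ is U+00A3.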

I'm not sure there is an encoding issue, see https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=271
It seems that it's OK to not have a BaseEncoding when there is a Differences array.
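
For context, a rough Python sketch of how a Differences array is applied on top of a base encoding, following the rule in the specification: an integer sets the current character code, and each following name is assigned to consecutive codes. The function name and the sample array are hypothetical; the actual Differences array of this PDF is not shown in this thread.

```python
def apply_differences(base_encoding, differences):
    """Overlay a PDF Differences array on a {code: glyph_name} mapping."""
    encoding = dict(base_encoding)
    code = None
    for item in differences:
        if isinstance(item, int):
            # An integer starts a new run at that character code.
            code = item
        else:
            if code is None:
                raise ValueError("Differences must start with a character code")
            # A name is assigned to the current code, then the code advances.
            encoding[code] = item
            code += 1
    return encoding


# Hypothetical example: no BaseEncoding, two runs of names.
sample = [0, "minus", "periodcentered", 32, "space"]
encoding = apply_differences({}, sample)
print(encoding)         # {0: 'minus', 1: 'periodcentered', 32: 'space'}
print(encoding.get(2))  # None
```

Any code that no run covers, such as 2 in this made-up array, ends up with no glyph name from Differences, so a viewer has to fall back to some other source of information for it.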

@Snuffleupagus
Collaborator

> it seems that it's ok to not have a BaseEncoding when there is a Differences.

Sure, but the Differences array really ought to have had a "multiply" entry as far as I'm concerned.

@Snuffleupagus
Collaborator

I've got a potential patch for this locally, but I still need to run all tests and think through the various edge-cases involved.
