Some characters in the text layer are wrong compared to what is displayed #13260

calixteman · 2021-04-18T14:49:43Z

Attach (recommended) or Link to PDF file here:
https://web.archive.org/web/20091223123331/http://sci2s.ugr.es/keel/pdf/specific/articulo/alp99.pdf
(aka tests/pdfs/issue1936.pdf)

When I copy/paste the multiplication sign on the first line of the first page, I get a £.
When I do the same in evince (or in Chrome), I get a ×.
The font descriptor is:

<< /Ascent 0 /CapHeight 749 /CharSet (/multiply) /Descent 0 /Flags 262212 /FontBBox [ -27 -940 1332 825 ] /FontFile3 59 0 R /FontName /KHPEPJ+CMBSY10 /ItalicAngle -14.035 /StemV 0 /Type /FontDescriptor >>

So it should be possible to guess the correct character by using the CharSet entry:
https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=291
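
To illustrate the idea, here is a rough sketch (not pdf.js code): the CharSet value is a string of /-prefixed glyph names, which can be split and looked up in a glyph-name table. `GLYPH_TO_UNICODE` is a tiny hypothetical stand-in for a full glyph list, and `parse_charset` / `charset_to_unicode` are illustrative helper names.

```python
# Hypothetical stand-in for a full glyph-name table (only the names used here).
GLYPH_TO_UNICODE = {"multiply": "\u00d7", "sterling": "\u00a3"}


def parse_charset(charset):
    """Split a CharSet string such as '(/multiply)' into a list of glyph names."""
    # Drop the PDF string delimiters, then split on the '/' name prefix.
    body = charset.strip().strip("()")
    return [name for name in body.split("/") if name]


def charset_to_unicode(charset):
    """Map each glyph name in the CharSet to a character, or None if unknown."""
    return {name: GLYPH_TO_UNICODE.get(name) for name in parse_charset(charset)}


print(parse_charset("(/multiply)"))
print(charset_to_unicode("(/multiply)"))
```

With a single name in the set the guess is unambiguous. With several names, the set alone only says which glyphs are present, not which character code selects which glyph.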


calixteman · 2021-04-18T14:58:17Z

I grepped for CharSet in the poppler repo and found nothing, so the information is likely in the font itself.

Snuffleupagus · 2021-04-18T15:10:03Z

Please note that Adobe Reader, i.e. the reference implementation, is "wrong" as well here. (The real bug is in the incomplete /Encoding-data of the font in question.)

> So it should be possible to guess the correct character by using the CharSet entry:
> https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=291
> [...]
> I grepped for CharSet in the poppler repo and found nothing, so the information is likely in the font itself.

Given that the specification mentions "The names may appear in any order." for CharSet, that one is probably not all that useful.

calixteman · 2021-04-18T15:12:37Z

I noticed the sentence "The names may appear in any order." too and I don't really understand how CharSet can be used.

calixteman · 2021-04-18T16:13:13Z

I extracted the font from the PDF and ran:

fc-query --format='%{charset}\n' foo.ttf

and I got d7, which is the code point of the multiplication sign in Unicode (U+00D7).
I also dumped the CFFFont for it in the console, and cff.charset.charset is [".notdef", "multiply"], so the information really is in the font itself.
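
To make that concrete, here is a minimal sketch (an illustration only, not the pdf.js implementation) of turning a charset array like the one above into Unicode: the array index is the glyph ID, the entry is the glyph name, and a glyph-name table gives the code point. `GLYPH_TO_UNICODE` is a hypothetical one-entry stand-in for a full glyph list, and `charset_to_codepoints` is an illustrative name.

```python
# Hypothetical stand-in for a full glyph-name table.
GLYPH_TO_UNICODE = {"multiply": 0x00D7}


def charset_to_codepoints(charset):
    """Map glyph IDs (indices into the charset array) to Unicode code points."""
    result = {}
    for gid, name in enumerate(charset):
        if name == ".notdef":
            continue  # the placeholder glyph has no Unicode value
        codepoint = GLYPH_TO_UNICODE.get(name)
        if codepoint is not None:
            result[gid] = codepoint
    return result


mapping = charset_to_codepoints([".notdef", "multiply"])
print({gid: f"U+{cp:04X}" for gid, cp in mapping.items()})  # {1: 'U+00D7'}
```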

I'm not sure there is an encoding issue. According to https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=271
it seems that it's OK not to have a BaseEncoding when there is a Differences array.
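
For reference, a sketch of how a Differences array is applied, as I read the specification: each integer sets the current character code, and each name that follows is assigned to consecutive codes starting there. The base table, the sample arrays, and the name `apply_differences` are assumptions made for illustration; they are not values read from this PDF.

```python
def apply_differences(base, differences):
    """Return a copy of `base` (code -> glyph name) with `differences` applied."""
    encoding = dict(base)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item  # an integer resets the current character code
        else:
            if code is None:
                raise ValueError("Differences must start with a character code")
            encoding[code] = item  # a name overrides the glyph at the current code
            code += 1  # following names go to consecutive codes
    return encoding


# Illustrative values only: code 163 is picked for the example.
base = {163: "sterling"}
print(apply_differences(base, []))                 # no override: code keeps its base glyph
print(apply_differences(base, [163, "multiply"]))  # override: code now names 'multiply'
```

Without an entry in Differences, a code simply keeps whatever the base (or implicit) encoding assigns to it.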

Snuffleupagus · 2021-04-18T16:17:41Z

> it seems that it's ok to not have a BaseEncoding when there is a Differences.

Sure, but the Differences array really ought to have had a "multiply" entry as far as I'm concerned.

Snuffleupagus · 2021-04-20T13:05:41Z

I've got a potential patch for this locally, but I still need to run all tests and think through the various edge-cases involved.

timvandermeij added the font-conversion label Apr 18, 2021

Snuffleupagus mentioned this issue Apr 20, 2021

For CFF fonts without proper ToUnicode/Encoding data, utilize the "charset"/"Encoding"-data from the font file to improve text-selection (issue 13260) #13277

Merged

brendandahl closed this as completed in #13277 Apr 21, 2021
