Ligature issue when converting PDF to text #1351
Comments
I did a quick analysis on the first page.
The following codes are transcoded and added (out of:
When using SumatraPDF and pdfminer.six, I get the same results with '\x00'. The only tool that seems to report the text properly (using copy-paste) is Acrobat Reader, but I don't know where it gets its results from. Help analyzing this case would be welcome (@MartinThoma, can you set the labels accordingly?).
Also of note: this tool seems to be able to convert the PDF successfully without using any sort of OCR.
I resolved it like this. The 'ff' case does not work like the others, which is why I replace chr(0) with 'ff': page.extract_text().translate(str.maketrans({chr(0): 'ff', 0xFB01: 'fi', 0xFB02: 'fl', 0xFB03: 'ffi', 0xFB04: 'ffl'}))
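For readers who want to reuse this workaround, here is a minimal sketch of the same idea as a standalone function. The names `LIGATURES` and `expand_ligatures` are hypothetical and not part of pypdf. The NUL replacement is opt-in, because mapping chr(0) to one fixed string is a guess about this particular PDF and not a general rule.

```python
from typing import Optional

# Unicode Latin ligature code points and their expansions.
LIGATURES = {
    0xFB00: "ff",
    0xFB01: "fi",
    0xFB02: "fl",
    0xFB03: "ffi",
    0xFB04: "ffl",
}


def expand_ligatures(text: str, nul_replacement: Optional[str] = None) -> str:
    """Expand Unicode ligatures; optionally replace NUL with a fixed string.

    `nul_replacement` is an assumption about the document at hand: it is
    only correct if every NUL in the extracted text stands for the same
    ligature.
    """
    table = dict(LIGATURES)
    if nul_replacement is not None:
        table[0] = nul_replacement  # key 0 is the ordinal of chr(0)
    return text.translate(table)


# Stand-in for the string returned by page.extract_text().
sample = "o\ufb03ce \ufb01le o\x00er"
cleaned = expand_ligatures(sample, nul_replacement="ff")
```

With the sample string above, `cleaned` becomes "office file offer". Without `nul_replacement`, the NUL character is left untouched.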
The above method seems to replace every ligature with 'ff'. I also noticed my original PDF does not load so here it is again.
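A small sketch of why this happens, using a simulated string instead of real pypdf output: if 'fi', 'fl' and 'ff' are all extracted as the same '\x00', no single replacement can be right for all of them. The helper `count_placeholders` is hypothetical and only meant as a quick diagnostic.

```python
from collections import Counter


def count_placeholders(text):
    """Count NUL characters and Unicode Latin ligatures (U+FB00..U+FB06)."""
    return Counter(
        ch for ch in text if ch == "\x00" or "\ufb00" <= ch <= "\ufb06"
    )


# Simulated extraction result, assuming "office file flow" was typeset
# with the ff, fi and fl ligatures and each one came back as NUL.
extracted = "o\x00ice \x00le \x00ow"

# Replacing every NUL with 'ff' is right once and wrong twice.
naive = extracted.replace("\x00", "ff")
counts = count_placeholders(extracted)
```

Here `naive` is "office ffle ffow", and `counts` shows three NUL placeholders with no way to tell them apart from the text alone.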
This is a fix for a problem introduced by the change in #2882. The string length of a character was checked after conversion by the cmap, but a cmap conversion can produce a string longer than one character, so the length cannot be measured accurately that way. This matters, for example, when fixing #1351 and deciding whether to measure the distance from the ligature or from the base characters corresponding to the ligature. The change in handle_tj is needed because the function otherwise fails Ruff's check. Error: PLR0915 Too many statements (nnn > 176) The following code is only used to get the character code for a space. However, I think it would be better to split out the code for obtaining the character code. Style changes are left for another PR.
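The length problem described above can be illustrated with a toy mapping. This is only a sketch: the `to_unicode` dictionary and the character codes in it are made up for illustration and do not reproduce pypdf's internal cmap handling.

```python
# Hypothetical ToUnicode-style mapping: one character code -> Unicode string.
to_unicode = {
    0x0C: "fi",  # a single ligature glyph that maps to two characters
    0x0D: "fl",
    0x41: "A",
}


def decode(codes):
    """Convert a sequence of character codes to text via the mapping."""
    return "".join(to_unicode[code] for code in codes)


codes = [0x0C, 0x41, 0x0D]  # three glyphs in the content stream
text = decode(codes)        # "fiAfl"

glyph_count = len(codes)    # counted before conversion
char_count = len(text)      # counted after conversion
```

Here `glyph_count` is 3 while `char_count` is 5, so any check that assumes one character per glyph has to look at the codes before conversion, not at the converted string.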
I am having a ligature issue with this PDF.
'fi', 'fl' and 'ff' characters are returning NULL
#598 is similar to this issue.
MVCE: Code + PDF
PDF