Arabic script is backwards and improperly aligned in output searchable PDFs #718

Closed
Mennaruuk opened this issue Jan 18, 2021 · 19 comments
Labels
third party issue Problem with a third party dependency

Comments

@Mennaruuk

Describe the bug
Arabic text in output PDF files is not OCRed properly: it comes out backwards. For example, a word like orange is displayed as egnaro. The text is also improperly aligned in the PDF files.

To Reproduce
What command line or API call were you trying to run?

ocrmypdf -l ara --sidecar output.txt input.png output.pdf --image-dpi 300

(I had OCRmyPDF work on an input image. I reproduced the same results with an input PDF.)

Logs

C:\Users\COMPUTER\Desktop>ocrmypdf -l ara --sidecar output.txt arabic1.png output.pdf --image-dpi 300 -v1
ocrmypdf 11.5.0
Running: ['C:\\Users\\COMPUTER\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.EXE', '--list-langs']
stdout/stderr = List of available languages (4):
ara
eng
osd
script/Arabic

Running: ['C:\\Users\\COMPUTER\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.EXE', '--version']
Found tesseract 5.0.0-alpha.20201127
Running: ['C:\\Users\\COMPUTER\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.EXE', '-l', 'ara', '--print-parameters', 'pdf']
Running: ['C:\\Program Files\\gs\\gs9.53.3\\bin\\gswin64c.EXE', '--version']
Found gs 9.53.3
pikepdf mmap disabled
Input file is not a PDF, checking if it is an image...
Input file is an image
Input image has no ICC profile, assuming sRGB
Image seems valid. Try converting to PDF...
imgformat = PNG
input dpi = 96 x 96
rotation = 0°
input colorspace = RGB
width x height = 671px x 949px
read_images() embeds a PNG
Successfully converted to PDF, processing...
pikepdf mmap disabled
Scanning contents: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 50.10page/s]
Using Tesseract OpenMP thread limit 3
pikepdf mmap disabled
    1 Rasterize with png16m, rotation 0
    1 Running: ['C:\\Program Files\\gs\\gs9.53.3\\bin\\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1', '-r300.000000x300.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 'C:\\Users\\COMPUTER\\AppData\\Local\\Temp\\ocrmypdf.io.x9r_gmfd\\origin.pdf']
    1 STREAM b'IHDR' 16 13
    1 STREAM b'iCCP' 41 2354
    1 iCCP profile name b'default_rgb.icc'
    1 Compression method 0
    1 STREAM b'pHYs' 2407 9
    1 STREAM b'tEXt' 2428 31
    1 STREAM b'IDAT' 2471 8192
    1 Rotating output by 0
    1 STREAM b'IHDR' 16 13
    1 STREAM b'iCCP' 41 2350
    1 iCCP profile name b'ICC Profile'
    1 Compression method 0
    1 STREAM b'pHYs' 2403 9
    1 STREAM b'IDAT' 2424 65536
    1 resolution (300, 300)
    1 Running: ['C:\\Users\\COMPUTER\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.EXE', '-l', 'ara', '-c', 'textonly_pdf=1', WindowsPath('C:/Users/COMPUTER/AppData/Local/Temp/ocrmypdf.io.x9r_gmfd/000001_ocr.png'), 'C:\\Users\\COMPUTER\\AppData\\Local\\Temp\\ocrmypdf.io.x9r_gmfd\\000001_ocr_tess', 'pdf', 'txt']
    1 [tesseract] lots of diacritics - possibly poor OCR
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    1 Grafting
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0
OCR: 100%|█████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:02<00:00,  2.50s/page]
C:\Users\COMPUTER\AppData\Local\Temp\ocrmypdf.io.x9r_gmfd\sidecar.txt -> output.txt
Postprocessing...
Running: ['C:\\Program Files\\gs\\gs9.53.3\\bin\\gswin64c.EXE', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=RGB', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', 'C:\\Users\\COMPUTER\\AppData\\Local\\Temp\\ocrmypdf.io.x9r_gmfd\\fix_docinfo.pdf', 'C:\\Users\\COMPUTER\\AppData\\Local\\Temp\\ocrmypdf.io.x9r_gmfd\\pdfa.ps']
GPL Ghostscript 9.53.3 (2020-10-01)
Copyright (C) 2020 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
Treating 18 as an optimization candidate
PDF/A conversion: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.14page/s]
XrefExt(xref=18, ext='.png')
Optimizable images: JPEGs: 0 PNGs: 1
JPEGs: 0image [00:00, ?image/s]
Treating 18 as an optimization candidate
Optimizable images: JBIG2 groups: (0,)
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%
C:\Users\COMPUTER\AppData\Local\Temp\ocrmypdf.io.x9r_gmfd\optimize.pdf -> output.pdf
Output file is a PDF/A-2B (as expected)

Example file
image

Expected behavior

  • Text should be displayed, for example, as I like OCR. Instead, it's being displayed as RCO ekil I.

  • The OCR text layer also isn't properly aligned. For example, in this screenshot, I'm selecting one word. However, when I right-click it, not only is the text backwards, but it's a different word: it matches the word directly above the one I selected. The red boxes highlight where the OCR alignment is incorrect.
    image
    Another screenshot to showcase improper alignment: this is what happens when I select everything. You can see the blue is prominently below, not on, each word.
    image

  • It's important to mention that OCRmyPDF properly performs text recognition for the output TXT files. This seems to be happening mainly in PDF files.
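The mismatch between the sidecar text and the PDF text layer can be checked mechanically. The following is a minimal Python sketch, assuming the text copied out of the PDF viewer has been pasted into a string and that both texts have the same line breaks. `looks_reversed` and `count_reversed_lines` are hypothetical helper names, not part of OCRmyPDF or Tesseract.

```python
def looks_reversed(sidecar_line: str, pdf_line: str) -> bool:
    """Return True if pdf_line is sidecar_line with its characters in reverse order."""
    a = sidecar_line.strip()
    b = pdf_line.strip()
    # A single character or a palindrome reads the same both ways,
    # so it cannot tell us whether the text was mirrored.
    if len(a) < 2 or a == a[::-1]:
        return False
    return b == a[::-1]


def count_reversed_lines(sidecar_text: str, pdf_text: str) -> int:
    """Count line pairs where the PDF text mirrors the sidecar text.

    Assumption: both texts split into the same lines in the same order.
    """
    pairs = zip(sidecar_text.splitlines(), pdf_text.splitlines())
    return sum(looks_reversed(s, p) for s, p in pairs)
```

With the examples from this report, `looks_reversed("I like OCR", "RCO ekil I")` is true. The comparison is character by character, so text with combining diacritics or mixed scripts may need a more careful check.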

System

  • OS: Windows 10
  • OCRmyPDF Version: 11.5.0
  • Tesseract version: v5.0.0-alpha.20201127
  • How did you install ocrmypdf: pip
@jbarlow83
Collaborator

I see no difference between the output of
tesseract -l ara input.png tess_arabic pdf and your file. That is, whatever is happening here happens for all files sent to the Tesseract OCR engine. Please report the issue there.

Your issue may also be related to the program that is used to view the PDF. Some PDF viewers may not handle RTL text correctly at all.

@Mennaruuk
Author

I will post this issue to Tesseract. I opened the output PDF file in Adobe Reader DC, and while the alignment is now proper, the selection highlight is not. For example, when I double-click to select a word and then copy it, the whole word is copied correctly. Visibly, however, the blue selection box doesn't cover the entire word. I put a red underline under the part of the word that isn't getting selected. It is typically the last few letters of every word (reading from right to left).
image

The same example in English would be if I were to select the word computer. The selection only highlights compute and leaves the r unselected. However, the copying of that still retains the last letter. I'm not sure if this is an issue with OCRmyPDF or Tesseract.

image

tl;dr: the viewer highlights only part of a word, but copies the whole word.

@jbarlow83
Collaborator

ocrmypdf copies the output of Tesseract into the PDF essentially without modification.

@jbarlow83 jbarlow83 added the third party issue Problem with a third party dependency label Jan 19, 2021
@Mennaruuk
Author

Okay, I see, looks like two issues with Tesseract. I'll get in touch with them about these. Maybe it's the alpha version that isn't doing well, although when I downloaded Tesseract, one of the mirrors said, "We don't provide an installer for Tesseract 4.1.0 because we think that the latest version 5.0.0-alpha is better for most Windows users in many aspects (functionality, speed, stability)." I'll test Tesseract 4 and see if the issue can be reproduced there. Thank you for your assistance!

@Mennaruuk
Author

I tested this on Tesseract 4.1.0 and I was able to reproduce both issues. I'll create an issue at Tesseract's GitHub page, hope it could be looked into.

@jbarlow83
Collaborator

@Mennaruuk Would you mind linking to the Tesseract here?

@Mennaruuk
Author

Mennaruuk commented Feb 15, 2021

https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v4.1.0-elag2019.exe

I opened the file with Sumatra PDF, and the text direction was proper, so I was able to copy and paste just fine. So this seems to be largely a problem with the PDF readers (such as Adobe Acrobat Reader DC, Firefox, and Okular). Tesseract has had an issue about this since 2016:
tesseract-ocr/tesseract#238

@rehamashrafshouman

Is the mirroring issue solved, please?

@Mennaruuk
Author

I’m positive the problem isn’t with OCRmyPDF. It’s with whatever PDF reader you have. Some of them just suck at reading Arabic OCR. Try opening the output PDF with Sumatra PDF if you're on Windows and see if the issue is resolved.

@rehamashrafshouman

I downloaded it, the issue is the same unfortunately.

@Mennaruuk
Author

Sorry for the late response. Would you mind sharing a sample PDF? Also, what OS are you running?

@rehamashrafshouman

@Mennaruuk Praise be to God (الحمد لله).
The issue has been solved; it was a decoding and encoding problem. Thank you for the follow-up.

@0lm

0lm commented Jun 22, 2021

Hello

I have the same issue and I have the latest Tesseract 5.0 beta installed (20210506). But this issue still exists for me. In which version was this resolved?

@jbarlow83
Collaborator

jbarlow83 commented Jun 22, 2021

Please add a sample file and note what PDF viewer you are using.

@0lm

0lm commented Jun 22, 2021

Unfortunately I don't have a sample file right now, but I can tell you what I did:
First, I used scantailor-universal to prepare the images (converting them to black and white for better reading; the output file format was .tif). Then I used img2pdf, running the following command in cmd from the folder where the images are saved: img2pdf *tif --output my.pdf. After that, I ran ocrmypdf: ocrmypdf -l deu+ara my.pdf output.pdf
(img2pdf and ocrmypdf are installed via Python 3.9.5 on Windows 10 64-bit.)
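The two command-line steps above could be scripted roughly as follows. This is only a sketch: `build_pipeline` and `run_pipeline` are hypothetical helper names, the file names are the ones from this comment, and both tools are assumed to be on the PATH. The wildcard is expanded in Python because `subprocess` does not expand it when given an argument list.

```python
import glob
import subprocess


def build_pipeline(tif_pattern="*tif", merged_pdf="my.pdf",
                   output_pdf="output.pdf", languages="deu+ara"):
    """Return the img2pdf and ocrmypdf command lines as argument lists."""
    images = sorted(glob.glob(tif_pattern))  # expand the wildcard ourselves
    return [
        ["img2pdf", *images, "--output", merged_pdf],
        ["ocrmypdf", "-l", languages, merged_pdf, output_pdf],
    ]


def run_pipeline(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline(build_pipeline())` in the image folder would then perform both steps. Sorting the file names is an assumption about the desired page order.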

Tesseract was installed the normal way, from the exe installer (version: tesseract-ocr-w64-setup-v5.0.0-alpha.20210506.exe). I also had it download all language files and language scripts; otherwise I kept everything at the defaults.

The PDF viewers I tested are Okular and Adobe Reader. Both show the same issue: reversed text search (and also reversed text copy) for Arabic text.

EDIT: As an example file, you can also use the one already posted here. I tested it and got the same issue with that PNG file. (I needed to run convert s.png -background white -alpha remove -alpha off output.png first, or else img2pdf wouldn't turn it into a PDF.)

@jbarlow83
Collaborator

Could you try ocrmypdf -l ara --sidecar output.txt my.pdf output.pdf? Then look at output.txt in an RTL-capable text editor.
There's not much I can do without a test file.
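If no RTL-capable editor is at hand, the sidecar can also be inspected by listing its characters in storage order. The sketch below uses only Python's standard `unicodedata` module; `describe_storage_order` is a hypothetical helper name, and the sidecar is assumed to be UTF-8.

```python
import unicodedata


def describe_storage_order(text: str, limit: int = 10) -> list[str]:
    """Return the Unicode names of the first `limit` non-space characters.

    The names come back in storage (logical) order, independent of how
    an editor or terminal chooses to display right-to-left text.
    """
    names = []
    for ch in text:
        if len(names) >= limit:
            break
        if ch.isspace():
            continue
        # Fall back to the code point for characters without a name.
        names.append(unicodedata.name(ch, f"U+{ord(ch):04X}"))
    return names
```

For example, `describe_storage_order(open("output.txt", encoding="utf-8").read())` lists the first ten letters as stored. If the first name is the first letter of the first word as it is read (the rightmost letter on the page), the file is in logical order.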

@0lm

0lm commented Jun 22, 2021

I just took the image from this issue and turned it into a PDF. I also tested your method, and output.txt came out properly, without errors. When I copy from output.txt, the text is copied the right way around.

But when I copy from output.pdf, all the text is copied reversed. I'm a bit confused: what could be the reason, since it seems to work in output.txt?

my.pdf
output.pdf
output.txt

EDIT: I just tested Sumatra again (I thought I had already tested it), and in Sumatra it works properly: I can copy the text without the reversal issue. I wonder why it doesn't work in Okular and Adobe Reader, because when I open original Arabic files and copy from them, it works. It just doesn't work for my OCR files.

EDIT 2: I mean, is this a general issue with Adobe Reader and Okular? Or do I need to change their settings so they recognize Arabic letters better?

EDIT 3: Also, thank you very much for taking the time to help me out. So the conclusion is that ocrmypdf, Tesseract, and img2pdf seem to work properly, but my PDF viewers give me reversed copied text, while SumatraPDF seems to be able to handle it.
By any chance, do you have an idea how to fix this in Okular? I especially like to use Okular as my main PDF viewer because of its easy annotation tools. (Sumatra also has them, but they're more or less hidden since its UI is kept very minimal.)

@jbarlow83
Collaborator

I don't believe this is any different from the known and unresolved issues in tesseract-ocr/tesseract#238

It seems that issue, in turn, is contending with the fact that many PDF generators don't generate RTL properly, many PDF viewers don't handle it properly, and your operating system's clipboard may not handle it properly.

Until it is solved in tesseract, there is not much I can do.
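As a stopgap on the reader's side, text that a viewer has copied out mirrored can be flipped back after pasting. The following is a deliberately naive Python sketch, not the Unicode Bidirectional Algorithm: it assumes the input is already known to be mirrored, and it reverses every line whose strong characters are mostly right-to-left. `is_mostly_rtl` and `unmirror` are hypothetical helper names.

```python
import unicodedata

# Strong right-to-left bidi classes: Hebrew-type ("R") and Arabic-type ("AL").
RTL_CLASSES = {"R", "AL"}


def is_mostly_rtl(line: str) -> bool:
    """Return True if the line has more strong RTL than strong LTR characters."""
    classes = [unicodedata.bidirectional(ch) for ch in line]
    rtl = sum(c in RTL_CLASSES for c in classes)
    ltr = sum(c == "L" for c in classes)
    return rtl > ltr


def unmirror(text: str) -> str:
    """Reverse each mostly-RTL line; leave other lines untouched.

    Assumption: the caller knows the text was copied out mirrored. This
    function cannot detect mirroring, and it would also reverse any digits
    or Latin words embedded in an Arabic line.
    """
    return "\n".join(
        line[::-1] if is_mostly_rtl(line) else line
        for line in text.splitlines()
    )
```

Lines in Latin script pass through unchanged, so mixed documents such as the deu+ara case are only handled at line granularity.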

@0lm

0lm commented Jun 23, 2021

I see. Thank you.
So it was just luck that Sumatra has some kind of feature that handles it properly, even if the Arabic text was mirrored by Tesseract.
