Arabic script is backwards and improperly aligned in output searchable PDFs #718

Mennaruuk · 2021-01-18T06:07:25Z

Describe the bug
Arabic text in the output PDF files is backwards. For example, a word like orange is displayed as egnaro. The text layer is also misaligned with the words on the page.

To Reproduce
What command line or API call were you trying to run?

ocrmypdf -l ara --sidecar output.txt input.png output.pdf --image-dpi 300

(I had OCRmyPDF work on an input image. I reproduced the same results with an input PDF.)

Logs

C:\Users\COMPUTER\Desktop>ocrmypdf -l ara --sidecar output.txt arabic1.png output.pdf --image-dpi 300 -v1
ocrmypdf 11.5.0
Running: ['C:\\Users\\COMPUTER\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.EXE', '--list-langs']
stdout/stderr = List of available languages (4):
ara
eng
osd
script/Arabic

Running: ['C:\\Users\\COMPUTER\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.EXE', '--version']
Found tesseract 5.0.0-alpha.20201127
Running: ['C:\\Users\\COMPUTER\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.EXE', '-l', 'ara', '--print-parameters', 'pdf']
Running: ['C:\\Program Files\\gs\\gs9.53.3\\bin\\gswin64c.EXE', '--version']
Found gs 9.53.3
pikepdf mmap disabled
Input file is not a PDF, checking if it is an image...
Input file is an image
Input image has no ICC profile, assuming sRGB
Image seems valid. Try converting to PDF...
imgformat = PNG
input dpi = 96 x 96
rotation = 0°
input colorspace = RGB
width x height = 671px x 949px
read_images() embeds a PNG
Successfully converted to PDF, processing...
pikepdf mmap disabled
Scanning contents: 100%|███████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 50.10page/s]
Using Tesseract OpenMP thread limit 3
pikepdf mmap disabled
    1 Rasterize with png16m, rotation 0
    1 Running: ['C:\\Program Files\\gs\\gs9.53.3\\bin\\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1', '-r300.000000x300.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 'C:\\Users\\COMPUTER\\AppData\\Local\\Temp\\ocrmypdf.io.x9r_gmfd\\origin.pdf']
    1 STREAM b'IHDR' 16 13
    1 STREAM b'iCCP' 41 2354
    1 iCCP profile name b'default_rgb.icc'
    1 Compression method 0
    1 STREAM b'pHYs' 2407 9
    1 STREAM b'tEXt' 2428 31
    1 STREAM b'IDAT' 2471 8192
    1 Rotating output by 0
    1 STREAM b'IHDR' 16 13
    1 STREAM b'iCCP' 41 2350
    1 iCCP profile name b'ICC Profile'
    1 Compression method 0
    1 STREAM b'pHYs' 2403 9
    1 STREAM b'IDAT' 2424 65536
    1 resolution (300, 300)
    1 Running: ['C:\\Users\\COMPUTER\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.EXE', '-l', 'ara', '-c', 'textonly_pdf=1', WindowsPath('C:/Users/COMPUTER/AppData/Local/Temp/ocrmypdf.io.x9r_gmfd/000001_ocr.png'), 'C:\\Users\\COMPUTER\\AppData\\Local\\Temp\\ocrmypdf.io.x9r_gmfd\\000001_ocr_tess', 'pdf', 'txt']
    1 [tesseract] lots of diacritics - possibly poor OCR
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    1 Grafting
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0
OCR: 100%|█████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:02<00:00,  2.50s/page]
C:\Users\COMPUTER\AppData\Local\Temp\ocrmypdf.io.x9r_gmfd\sidecar.txt -> output.txt
Postprocessing...
Running: ['C:\\Program Files\\gs\\gs9.53.3\\bin\\gswin64c.EXE', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=RGB', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', 'C:\\Users\\COMPUTER\\AppData\\Local\\Temp\\ocrmypdf.io.x9r_gmfd\\fix_docinfo.pdf', 'C:\\Users\\COMPUTER\\AppData\\Local\\Temp\\ocrmypdf.io.x9r_gmfd\\pdfa.ps']
GPL Ghostscript 9.53.3 (2020-10-01)
Copyright (C) 2020 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
Treating 18 as an optimization candidate
PDF/A conversion: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.14page/s]
XrefExt(xref=18, ext='.png')
Optimizable images: JPEGs: 0 PNGs: 1
JPEGs: 0image [00:00, ?image/s]
Treating 18 as an optimization candidate
Optimizable images: JBIG2 groups: (0,)
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%
C:\Users\COMPUTER\AppData\Local\Temp\ocrmypdf.io.x9r_gmfd\optimize.pdf -> output.pdf
Output file is a PDF/A-2B (as expected)

Example file

Expected behavior

Text should be displayed, for example, as I like OCR. Instead, it's being displayed as RCO ekil I.
Another thing is OCR isn't being properly aligned. For example, in this screenshot, I'm selecting one word. However, when I right-click it, not only is the text backwards, but it's a different word. The word matches the one right above the one I selected. You can see that the red boxes are highlighting how the OCR alignment is incorrect.

Another screenshot to showcase improper alignment: this is what happens when I select everything. You can see the blue is prominently below, not on, each word.
It's important to mention that OCRmyPDF properly performs text recognition for the output TXT files. This seems to be happening mainly in PDF files.
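
One way to confirm this symptom without relying on how a viewer draws the selection is to compare text copied from the PDF against the sidecar TXT file, line by line. The following is only a minimal sketch and not part of OCRmyPDF: `classify_line` and `summarize` are hypothetical helper names, and the check assumes the copied text is a plain character-by-character reversal of the sidecar text.

```python
def classify_line(copied: str, reference: str) -> str:
    """Compare one line copied from the PDF with the same line from the sidecar."""
    copied, reference = copied.strip(), reference.strip()
    if copied == reference:
        return "same"
    if copied == reference[::-1]:
        return "reversed"
    return "different"


def summarize(copied_text: str, reference_text: str) -> dict:
    """Count how many line pairs are same, reversed, or different."""
    counts = {"same": 0, "reversed": 0, "different": 0}
    pairs = zip(copied_text.splitlines(), reference_text.splitlines())
    for copied, reference in pairs:
        if reference.strip():  # skip blank reference lines
            counts[classify_line(copied, reference)] += 1
    return counts
```

A result dominated by "reversed" would suggest the characters are right but their order in the PDF text layer (or the viewer's handling of it) is not.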

System

OS: Windows 10
OCRmyPDF Version: 11.5.0
Tesseract version: v5.0.0-alpha.20201127
How did you install ocrmypdf: pip

jbarlow83 · 2021-01-19T20:19:21Z

I see no difference between your file and the output of:

tesseract -l ara input.png tess_arabic pdf

That is, whatever is happening here happens for all files sent to the Tesseract OCR engine. Please report the issue there.

Your issue may also be related to the program that is used to view the PDF. Some PDF viewers may not handle RTL text correctly at all.
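
To repeat that comparison from a script, the same Tesseract call can be issued directly, bypassing OCRmyPDF. This is a rough sketch that assumes the `tesseract` executable is on PATH; `tesseract_pdf_command` and `run_tesseract_pdf` are hypothetical helper names, and the argument order simply mirrors the command quoted in this comment.

```python
import shutil
import subprocess


def tesseract_pdf_command(image: str, output_base: str, lang: str = "ara") -> list:
    """Build the direct Tesseract call: tesseract -l LANG IMAGE OUTPUT_BASE pdf."""
    return ["tesseract", "-l", lang, image, output_base, "pdf"]


def run_tesseract_pdf(image: str, output_base: str, lang: str = "ara") -> bool:
    """Run Tesseract if it is installed; return True when it exits successfully."""
    if shutil.which("tesseract") is None:
        return False  # Tesseract is not on PATH, so there is nothing to run
    cmd = tesseract_pdf_command(image, output_base, lang)
    return subprocess.run(cmd, check=False).returncode == 0
```

Calling `run_tesseract_pdf("input.png", "tess_arabic")` should leave a `tess_arabic.pdf` that can be opened in the same viewer as the OCRmyPDF output.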

Mennaruuk · 2021-01-19T21:40:48Z

I will post this issue to Tesseract. I opened the output PDF file in Adobe Reader DC, and while the alignment is now proper, the selection highlight is off. For example, when I double-click to select and then copy a word, the whole word is copied correctly. Visibly, however, the blue selection box doesn't cover the entire word. I put a red underline under the part of the word that isn't getting selected. It is typically the last few letters of every word (reading from right to left).

The same example in English would be if I were to select the word computer. The selection only highlights compute and leaves the r unselected. However, the copying of that still retains the last letter. I'm not sure if this is an issue with OCRmyPDF or Tesseract.

tl;dr: the viewer highlights only part of a word but copies the whole word.

jbarlow83 · 2021-01-19T21:43:06Z

ocrmypdf copies the output of Tesseract into the PDF essentially without modification.

Mennaruuk · 2021-01-19T21:45:41Z

Okay, I see, looks like two issues with Tesseract. I'll get in touch with them about these. Maybe it's the alpha version that isn't doing well, although when I downloaded Tesseract, one of the mirrors said, "We don't provide an installer for Tesseract 4.1.0 because we think that the latest version 5.0.0-alpha is better for most Windows users in many aspects (functionality, speed, stability)." I'll test Tesseract 4 and see if the issue can be reproduced there. Thank you for your assistance!

Mennaruuk · 2021-01-19T21:57:01Z

I tested this on Tesseract 4.1.0 and was able to reproduce both issues. I'll create an issue on Tesseract's GitHub page and hope it can be looked into.

jbarlow83 · 2021-02-14T09:47:57Z

@Mennaruuk Would you mind linking to the Tesseract here?

Mennaruuk · 2021-02-15T05:21:28Z

> @Mennaruuk Would you mind linking to the Tesseract here?

https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v4.1.0-elag2019.exe

I opened the file with Sumatra PDF, and the text direction was proper, so I was able to copy and paste just fine. So this seems to be largely a problem with the PDF readers (such as Adobe Acrobat Reader DC, Firefox, and Okular). Tesseract has had an issue about this since 2016:
tesseract-ocr/tesseract#238

rehamashrafshouman · 2021-02-21T13:50:21Z

Is the mirroring issue solved, please?

Mennaruuk · 2021-02-21T14:37:22Z

> Is the mirroring issue solved, please?

I’m positive the problem isn’t with OCRmyPDF. It’s with whatever PDF reader you have. Some of them just suck at reading Arabic OCR. Try opening output PDF with Sumatra PDF if you have Windows and see if the issue is resolved.

rehamashrafshouman · 2021-02-22T10:26:27Z

> > Is the mirroring issue solved, please?
>
> I’m positive the problem isn’t with OCRmyPDF. It’s with whatever PDF reader you have. Some of them just suck at reading Arabic OCR. Try opening output PDF with Sumatra PDF if you have Windows and see if the issue is resolved.

I downloaded it, the issue is the same unfortunately.

Mennaruuk · 2021-03-20T05:00:21Z

> > > Is the mirroring issue solved, please?
> >
> > I’m positive the problem isn’t with OCRmyPDF. It’s with whatever PDF reader you have. Some of them just suck at reading Arabic OCR. Try opening output PDF with Sumatra PDF if you have Windows and see if the issue is resolved.
>
> I downloaded it, the issue is the same unfortunately.

Sorry for the late response. Would you mind sharing a sample PDF? Also, what OS are you running?

rehamashrafshouman · 2021-03-21T07:34:23Z

@Mennaruuk Praise be to God (الحمد لله).
The issue has been solved; it was a decoding and encoding problem. Thank you for the follow-up.

0lm · 2021-06-22T22:47:21Z

> @Mennaruuk الحمد لله
> The issue has been solved, it was a decoding and encoding problem, thank you for the follow-up.

Hello

I have the same issue and I have the latest Tesseract 5.0 beta installed (20210506). But this issue still exists for me. In which version was this resolved?

jbarlow83 · 2021-06-22T22:50:02Z

Please add a sample file and note what PDF viewer you are using.

0lm · 2021-06-22T23:03:09Z

> Please add a sample file and note what PDF viewer you are using.

Unfortunately, I don't have a sample file right now. But I can tell you what I did:
First, I used scantailor-universal to prepare the images (converting them to black and white for better recognition; the output file format was .tif). Then I ran img2pdf from cmd in the folder where the images are saved: img2pdf *tif --output my.pdf. After that, I ran ocrmypdf: ocrmypdf -l deu+ara my.pdf output.pdf
(img2pdf and ocrmypdf are installed via Python 3.9.5 on Windows 10 64-bit.)

Tesseract was installed the normal way from the exe installer (version: tesseract-ocr-w64-setup-v5.0.0-alpha.20210506.exe). I also had it download all language files and language scripts. I basically kept everything at the defaults.

The PDF viewers I tested are Okular and Adobe Reader. Both show the same issue: searching and copying Arabic text gives reversed text.

EDIT: As an example file, you can also use the one already posted here. I tested it and got the same issue with that PNG file. (I needed to run convert s.png -background white -alpha remove -alpha off output.png first, or else img2pdf wouldn't turn it into a PDF.)

jbarlow83 · 2021-06-22T23:09:16Z

Maybe try ocrmypdf -l ara --sidecar output.txt my.pdf output.pdf, then look at output.txt in an RTL-capable text editor.
There's not much I can do without a test file.
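
If no RTL-capable editor is at hand, the order of the characters in output.txt can also be checked by dumping their code points, which sidesteps bidirectional rendering entirely. This is a small illustrative sketch using only the Python standard library; `describe_codepoints` is a hypothetical helper name, not something OCRmyPDF provides.

```python
import unicodedata


def describe_codepoints(text: str) -> list:
    """List each character in storage order as 'U+XXXX NAME'."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"
        for ch in text
    ]
```

In correctly stored (logical order) Arabic text, the first code point of a word is the letter that is read first, which is the rightmost letter on screen. Running the helper on the first word of output.txt and on the same word copied from the PDF shows whether the stored order differs.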

0lm · 2021-06-22T23:13:24Z

> Maybe try ocrmypdf -l ara --sidecar output.txt my.pdf output.pdf, then look at output.txt in an RTL-capable text editor.
> There's not much I can do without a test file.

I just took the image from this issue and turned it into a PDF. I also tested your method, and output.txt was correct, without errors. When I copy from output.txt, the text is copied in the right order.

But when I copy from output.pdf, all the text is copied reversed. I'm a bit confused. What could be the reason, since it seems to work in output.txt?

my.pdf
output.pdf
output.txt

EDIT: I just tested Sumatra again (I thought I had already tested it), and in Sumatra it works properly. I can copy the text without the reversal issue. I wonder why it doesn't work in Okular and Adobe Reader, because when I open original Arabic files and copy from them, it works. It just doesn't work for my OCR files.

EDIT 2: I mean, is this a general issue with Adobe Reader and Okular? Or do I need to change their settings so they recognize Arabic letters better?

EDIT 3: Also, thank you very much for taking the time to help me out. So the conclusion is that ocrmypdf, Tesseract, and img2pdf seem to work properly, but my PDF viewers give me reversed text when copying, while SumatraPDF seems to be able to handle it.
By any chance, do you have an idea how to fix this in Okular? I especially like to use Okular as my main PDF viewer because of its easy annotation tools. (Sumatra also has them, but they're more or less hidden, since its UI is kept very minimal.)

jbarlow83 · 2021-06-23T07:42:50Z

I don't believe this is any different from the known and unresolved issues in tesseract-ocr/tesseract#238

It seems that issue, in turn, is contending with the fact that many PDF generators don't generate RTL properly, many PDF viewers don't handle it properly, and your operating system's clipboard may not handle it properly.

Until it is solved in tesseract, there is not much I can do.
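
As a stopgap for text already copied out of an affected viewer, reversed lines can be flipped back after the fact. The following is a minimal sketch under strong assumptions, not a fix for the underlying problem: it assumes every line made up mostly of right-to-left letters was copied fully reversed, so mixed Arabic/Latin lines, digits, and combining diacritics may come out wrong. `is_rtl_char` and `unreverse_rtl_lines` are hypothetical helper names.

```python
import unicodedata


def is_rtl_char(ch: str) -> bool:
    """True for strong right-to-left characters (bidi classes 'R' and 'AL')."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


def unreverse_rtl_lines(text: str) -> str:
    """Reverse every line whose letters are mostly right-to-left."""
    fixed = []
    for line in text.splitlines():
        letters = [ch for ch in line if ch.isalpha()]
        rtl = sum(1 for ch in letters if is_rtl_char(ch))
        if letters and rtl > len(letters) / 2:
            line = line[::-1]  # flip the whole line back
        fixed.append(line)
    return "\n".join(fixed)
```

Lines that are mostly Latin are left untouched, so the helper can be run over mixed deu+ara output, though lines that mix both scripts would need proper bidirectional handling rather than a simple reversal.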

0lm · 2021-06-23T11:46:46Z

I see. Thank you.
So it was just luck that Sumatra has some kind of feature that handles it properly, even though the Arabic text was mirrored by Tesseract.

jbarlow83 added the "third party issue" label (Problem with a third party dependency) Jan 19, 2021

Mennaruuk closed this as completed Feb 13, 2021
