Arabic language (right to left in writing) stored (left to right) after create PDF Searchable #238

tbadran · 2016-02-25T12:16:31Z

I tested the latest release, 3.05, on Windows to OCR an Arabic document into a searchable PDF. When I select text in the output PDF file, it appears to be stored in the opposite order (left to right), whereas the letters should be stored right to left.

For example, the original text in Arabic is
مرحبا
but it is stored in the PDF text layer as
ابحرم
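
The two words in this report are exact code-point reversals of each other, which a short Python check makes concrete. This is only an illustrative sketch: the escapes spell out the two words quoted above, and the helper name `is_codepoint_reversal` is made up for this example.

```python
# Code points of the word as it appears in the source image (logical order).
original = "\u0645\u0631\u062d\u0628\u0627"   # the word from the original text
# Code points as copied out of the PDF text layer in the report.
from_pdf = "\u0627\u0628\u062d\u0631\u0645"   # the word as stored in the PDF


def is_codepoint_reversal(a: str, b: str) -> bool:
    """Return True when b is a, reversed code point by code point."""
    return a[::-1] == b


print(is_codepoint_reversal(original, from_pdf))
```

A check like this only compares code points; it says nothing about which component (Tesseract or the PDF viewer) is responsible for the order seen after copy and paste.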

roozgar · 2016-02-25T12:38:38Z

Please attach your sample file and the command you used for the OCR job.

tbadran · 2016-02-25T12:53:29Z

This is the command:

tesseract c:\temp\test_ara.jpg -l ara -psm 3 c:\temp\test_ara pdf

Files are attached (source JPG and output PDF)

test_ara.pdf

Please check: the original word is
أنحاء
and the output inside the PDF is
ءاحنا

tbadran · 2016-02-25T13:16:55Z

The command and samples are now attached in the previous comment.

amitdo · 2016-02-26T18:22:16Z

Which program are you using to view the PDF?

amitdo · 2016-02-26T18:27:48Z

It does not look reversed with the Chrome PDF viewer, just not very accurate...

roozgar · 2016-02-26T18:36:28Z

@amitdo
Is there any way to get better accuracy for Arabic until the switch to the new engine?
Right now with Tesseract I get about 100% accuracy in English, but for Arabic the result is only about 30-40%.
For comparison, I checked Google Drive OCR for Arabic and it gives 100% results for the same image.

Can we work on the language data to get better results?

tbadran · 2016-02-26T19:08:30Z

I am using Adobe Reader.
But please note that the words do not look reversed while viewing the PDF, because it contains the original image with a text layer on top.
I mean that when you copy the text layer and paste it into any text editor, it will be reversed. So you can't search for text inside the PDF, because it is stored reversed inside the text layer!

tbadran · 2016-02-26T19:11:34Z

This is a serious issue with the PDF output feature for Arabic and similar languages that are written from right to left.

amitdo · 2016-02-26T20:23:50Z

@roozgar

It seems that Ray is planning to release soon a new version of Tesseract, that will include a new OCR engine based on LSTM.

With LSTM, OCR for printed Arabic (not actual handwriting) can reach 95% character accuracy.

"Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks"
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.447.4577&rep=rep1&type=pdf

amitdo · 2016-02-26T21:51:51Z

> I checked google drive ocr for Arabic and i see it have 100 results for same image..

Neither you nor I know what programs they are using to do OCR there...

amitdo · 2016-02-26T22:35:02Z

@tbadran

> But please note that words are not reversed while viewing the PDF because it contains the original image with text layer.
> I mean when you copy text layer then paste it to any text editor it will be reversed, so now can't search for the text inside the PDF because it is stored reversed inside the text layer!

Yes, I know...

Here is a copy of the invisible text layer (copied & pasted):

مداها ينم همهما
اللغة العريية
لغة جهد مه
مسنره هي انحاء العالم

Using Chromium (Google browser) PDF viewer under Linux.

Your original jpg image:

jbreiden · 2016-02-27T02:22:31Z

I try hard to make sure Arabic and other right-to-left languages work correctly in Tesseract PDF. As the problem is isolated further I'm happy to look, but I'm not aware of any reason things would have broken.

jbreiden · 2016-02-27T02:49:57Z

A quick check shows Chrome gives good results (as per amitdo) and Acroread gives bad results (as per tbadran). This is surprising, I thought we were good with Acroread. I wonder if this is a regression and if so when it occurred.

jbreiden · 2016-02-27T06:41:36Z

Regarding recognition accuracy, that's a better topic for the forum. But in short: Don't compare against Google Drive. Don't expect major accuracy improvements unless/until Ray is successful with his ideas. And most importantly, don't trust any predictions about 'soon'. That last one is true for all software everywhere.

amitdo · 2016-02-27T09:23:30Z

@roozgar

You can try training Tesseract using the regular engine. Use the wiki and see #169. I really don't know how good the result will be for Arabic.

Like jbreiden said, the timeline could change...

tbadran · 2016-02-27T16:02:30Z

Please note that I am testing with the Windows binaries downloaded from:
http://domasofan.spdns.eu/tesseract/
and I am using Windows 10 with Acrobat Pro 11 to view the output PDF file.

tbadran · 2016-02-27T16:05:25Z

I have tested multiple different sample files, not only the sample uploaded above, and every time I get the same issue in the output PDF on Windows 10 + Acrobat Pro 11.

tfmorris · 2016-02-29T19:31:59Z

On OS X, I'm seeing the opposite of earlier reports:

Acrobat Reader DC 15.10.20056.167417 appears correct when cutting & pasting
Google Chrome Version 48.0.2564.116 (64-bit) appears backwards

tfmorris · 2016-02-29T19:33:44Z

Adobe Acrobat:

امهمه مني اهادم
ةييرعلا ةغللا
. هم دهج ةغل
ملاعلا ءاحنا يه هرنسم

Google Chrome

مداها ينم همهما
اللغة العريية
لغة جهد مه
مسنره هي انحاء العالم

amitdo · 2016-02-29T22:01:43Z

Tom,

Look at the original jpg.
Lines 2 and 4 in Google Chrome look quite similar to lines 2 and 3 in the original jpg. First word in line 3 in the original jpg became first word in line 3 in Google Chrome.
Clearly, that's the 'good' output...

amitdo · 2016-02-29T22:49:36Z

Again, in Google Chromium.
If I mark the first two lines in the PDF + first word in line 3,
copy the (invisible) text, paste it to a text file,
mark the second to last word in line 3 in the PDF,
copy the (invisible) text, paste it to the text file, I get:

مداها ينم همهما
اللغة العريية
لغة مسنره هي انحاء العالم

jbreiden · 2016-03-01T00:16:05Z

I find it a little easier to test with Hebrew because the letters do not connect. Tesseract version 3.03 behaves the same, so this is not a regression. Will need to think about this, because it is not obvious what exactly is going wrong. Lots of PDF files do a crazy 'write it backwards' strategy but that should not be required. Tesseract writes in reading order.

jbreiden · 2016-03-09T01:22:06Z

There are two things I can think of doing. One is to give up and write Arabic
backwards (which I really hate!). The other is to put an entry in the PDF
metadata, Catalog/ViewerPreferences/Direction. Will continue thinking about
this, slowly.
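
For readers unfamiliar with the second option, here is a rough sketch of what a catalog dictionary carrying `/ViewerPreferences /Direction` could look like. It is not taken from Tesseract's PDF renderer: the function name and the `2 0 R` page-tree reference are placeholders invented for illustration, and nothing in this thread confirms that setting the entry changes how viewers extract text. The PDF specification describes `Direction` (`/L2R` or `/R2L`) as the predominant reading order, used mainly for positioning pages shown side by side.

```python
def catalog_with_r2l(pages_ref: str = "2 0 R") -> str:
    """Build the text of a PDF catalog dictionary that declares R2L reading order.

    pages_ref is a hypothetical indirect reference to the page tree object.
    """
    return (
        "<<\n"
        "  /Type /Catalog\n"
        f"  /Pages {pages_ref}\n"
        "  /ViewerPreferences << /Direction /R2L >>\n"
        ">>"
    )


print(catalog_with_r2l())
```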

amitdo · 2016-03-09T09:43:52Z

@jbreiden
I didn't understand you. In one comment you talk about Hebrew, and in another you only refer to Arabic. Is Hebrew displayed correctly with Adobe Reader?

amitdo · 2016-03-09T10:03:42Z

Please make sure that any change you make does not cause a regression with the Chrome PDF viewer and OS X Preview. Thanks for your work!

jbreiden · 2016-03-09T22:28:45Z

@amitdo Hebrew has the exact same problem as Arabic.

amitdo · 2016-03-10T11:23:21Z

Maybe explicitly using Unicode bidi control characters could help?
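
As a rough illustration of this suggestion, the sketch below wraps a string in Unicode directional formatting characters. The helper names are hypothetical, and whether any PDF viewer honors these characters inside an invisible text layer is not established anywhere in this thread.

```python
import unicodedata

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING (closes the embedding)
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def embed_rtl(text: str) -> str:
    """Wrap text in an explicit right-to-left embedding."""
    return f"{RLE}{text}{PDF}"


def mark_rtl(text: str) -> str:
    """Prefix text with a right-to-left mark."""
    return f"{RLM}{text}"


word = "\u0645\u0631\u062d\u0628\u0627"
for ch in embed_rtl(word)[:1] + embed_rtl(word)[-1:] + mark_rtl(word)[:1]:
    # Print the official names of the control characters that were added.
    print(unicodedata.name(ch))
```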

jbreiden · 2016-03-18T18:11:17Z

That's another possibility, thanks for the suggestion.

jbreiden · 2018-09-17T18:13:35Z

Correction: ** currently inactive **. I still think it is important though.

…

On Mon, Sep 17, 2018 at 11:12 AM Jeff Breidenbach ***@***.***> wrote: Status is still unsolved, and currently inactive. I swallowed my pride and experimented with writing arabic backwards in PDF like everyone else does, and it still didn't work nicely.

stweil · 2018-09-17T19:06:06Z

This sounds as if there will not be a fix in the near future. So we should not require this bug to be fixed for 4.0.0.

MalekBadi · 2019-04-21T17:51:19Z

Is this issue resolved ??

yregaieg · 2019-05-30T13:30:47Z

@amitdo
Using the latest ABBYY FineReader 14 to create a searchable pdf:
* Both Chrome and Adobe Acrobat Reader can select/copy/paste correctly.
Conclusion:
It seems that Tesseract needs tweaking to solve this problem.

Original Image.zip
Tesseract.pdf
Abby Finereader.pdf

@jbreiden I have experimented with the files he attached and noticed something that makes sense: the two files seem to have different text-orientation setups, which shows when I start selecting mid-sentence and drag over a few lines:
Tesseract :

ABBY :

yregaieg · 2019-05-30T13:59:07Z

Not sure if this is of any value: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
Page 598 (606 / 756) has a table describing the writing mode:

It seems to be possible to specify the writing direction as RTL by setting this parameter on a structure element and all of its child elements. @jbreiden, can you have a look at it?

Section 14.8.2.3.3 could be related to this issue as well. (I see that you looked at this section 3 years ago in #238 (comment), so it is probably not what is causing our issue.)

ReactNativeFan · 2020-06-09T22:14:00Z

I would like to report that the problem still persists in Tesseract 4.1.1.
Tesseract recognizes and displays the Arabic text correctly. However, when the results are exported as PDF/A, the text stored in the PDF/A is reversed.

Yes, if you open the PDF in Acrobat, it gives you reversed words, while it works fine in the Google Chrome PDF reader. However, when I extracted the stored text from the PDF/A using pdfToText, the words were reversed too, which means the text was stored in the wrong order.
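
For anyone wanting to repeat this check from a script, a minimal sketch follows. It assumes the `pdftotext` command-line tool is installed and on the PATH; the function names are made up here. Note that a text extractor may apply its own reordering to right-to-left runs, so its output alone does not prove what order the PDF stores.

```python
import subprocess


def pdftotext_cmd(pdf_path: str) -> list:
    """Build the command line: UTF-8 output, written to stdout ("-")."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return the extracted text layer as a string."""
    result = subprocess.run(pdftotext_cmd(pdf_path), capture_output=True, check=True)
    return result.stdout.decode("utf-8")


# Example call, using the file name attached to this comment:
# print(extract_text("Recognized_PDFA_By_Tesseract.pdf"))
```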

See the following example for more details:

Here is the PDF/A generated by Tesseract
Recognized_PDFA_By_Tesseract.pdf

To summarize:

True Text
مرحبا بكم جميعا
اللغة العربية

Tesseract Text 100% correct
مرحبا بكم جميعا
اللغة العربية

Tesseract PDF/A Text
اعيمج مكب ابحرم
ةيبرعلا ةغللا

As you see in the Tesseract PDF/A text, every word is reversed although the .hOCR file is correct.

Actually, the words are not reversed (you still can read every letter) but the "entire line is mirrored". Usually, we face this problem when rendering Arabic text in HTML by setting "text-align:right"
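
The "entire line is mirrored" observation can be checked against the summary above: reversing each extracted line by code point gives back the true text. The sketch below does exactly that, with the example lines written as escapes; the function name is hypothetical. It is a diagnostic aid, not a fix, and code-point reversal would misplace combining marks such as Arabic diacritics.

```python
def unmirror_lines(text: str) -> str:
    """Reverse every line by code point while keeping the line order."""
    return "\n".join(line[::-1] for line in text.split("\n"))


# Text as extracted from the PDF/A in the example above.
pdf_text = (
    "\u0627\u0639\u064a\u0645\u062c \u0645\u0643\u0628 \u0627\u0628\u062d\u0631\u0645\n"
    "\u0629\u064a\u0628\u0631\u0639\u0644\u0627 \u0629\u063a\u0644\u0644\u0627"
)
# The true text of the same two lines.
true_text = (
    "\u0645\u0631\u062d\u0628\u0627 \u0628\u0643\u0645 \u062c\u0645\u064a\u0639\u0627\n"
    "\u0627\u0644\u0644\u063a\u0629 \u0627\u0644\u0639\u0631\u0628\u064a\u0629"
)

print(unmirror_lines(pdf_text) == true_text)
```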

I think the problem here is that the x-coordinate of each RTL letter is measured from the left rather than from the right, i.e., (x, y) should be (W - x, y), where W is the page width.
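
The coordinate transform proposed here can be written out as a small sketch. This only illustrates the hypothesis in this comment, which the thread does not confirm as the cause; the function names and the 595-point page width (roughly A4) are assumptions for the example.

```python
def mirror_x(x: float, page_width: float) -> float:
    """Map an x-coordinate measured from the left edge to the same distance from the right edge."""
    return page_width - x


def mirror_box(x0: float, x1: float, page_width: float) -> tuple:
    """Mirror a horizontal extent [x0, x1]; the edges swap so that x0 <= x1 still holds."""
    return (mirror_x(x1, page_width), mirror_x(x0, page_width))


W = 595.0  # assumed page width in points
print(mirror_x(100.0, W))
print(mirror_box(100.0, 150.0, W))
```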

Mennaruuk · 2021-01-19T22:02:26Z

This issue also persists in Tesseract 5 alpha. Another issue: when I double-click to select a word and then copy it, the whole word is copied correctly, but the blue selection box visibly does not cover the entire word. You can observe that in the screenshot below. Both of these issues occur in Adobe Reader DC and Okular.

diyajunaid · 2021-05-17T10:47:21Z

Please advise whether this issue has been resolved in any recent version of Tesseract.

saleha-DS · 2022-02-21T05:07:51Z

I'm using Tess4j 4.5.1 and have the same issue. When I process an image, create a PDF, and open it in Acrobat Reader, it displays 100% correctly. However, search doesn't work unless I reverse the letters, and when I copy the text and paste it into MS Word, it is displayed reversed.

amitdo · 2022-02-21T16:13:36Z

Unfortunately, this long standing issue was not solved.

Copy&paste or search of words in documents written in RTL scripts only works with Google Chrome's PDF reader. Even with this viewer there might be some issues.

wollmers · 2022-02-21T22:12:59Z

> Unfortunately, this long standing issue was not solved.
>
> Copy&paste or search of words in documents written in RTL scripts only works with Google Chrome's PDF reader. Even with this viewer there might be some issues.

RTL scripts are stored left to right (storage order) according to standards, but only rendered and displayed RTL.

In the above example it's stored the right way, but the github editor can not handle it correctly. But I can use a screenshot of my command line:

Also on my MacOS the program Preview displays the PDF correctly and it is searchable with copy & paste into the search field:

Thus it's a problem of PDF viewers, not of Tesseract.

amitdo · 2022-02-23T13:54:36Z

> RTL scripts are stored left to right (storage order) according to standards, but only rendered and displayed RTL.

Which standards are you referring to here?

Here is how it works in HTML:
https://www.w3.org/International/questions/qa-visual-vs-logical

Regarding the PDF standard:

https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf

Tesseract does not do what is described in:
14.8.2.3.3 Reverse-Order Show Strings

wollmers · 2022-02-23T17:50:58Z

> > RTL scripts are stored left to right (storage order) according to standards, but only rendered and displayed RTL.
>
> Which standards are you referring here?

Unicode
https://unicode.org/reports/tr9/#Introduction

> Here is how it works in html: https://www.w3.org/International/questions/qa-visual-vs-logical

HTML just implements the Unicode standard. And all editors, command line clients I know use logical storage order.
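
The logical-order model referred to here rests on per-character bidirectional classes from the Unicode Character Database, which Python exposes through the standard `unicodedata` module. A small sketch (the helper name is made up) shows that direction is a property of the characters themselves, not of the order in which they are stored:

```python
import unicodedata


def bidi_classes(text: str) -> list:
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]


arabic = "\u0645\u0631\u062d\u0628\u0627"   # Arabic letters
hebrew = "\u05e9\u05dc\u05d5\u05dd"         # Hebrew letters

print(bidi_classes(arabic))   # Arabic letters carry class "AL"
print(bidi_classes(hebrew))   # Hebrew letters carry class "R"
print(bidi_classes("abc"))    # Latin letters carry class "L"
```

A renderer applies the bidirectional algorithm to these classes at display time, which is why the same stored sequence can appear right to left on screen.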

> Regarding the PDF standard:
>
> https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>
> Tesseract does not do what is described in: 14.8.2.3.3 Reverse-Order Show Strings

That's an obscurity of the PDF specification. Yes, in the above PDF sample I didn't find the string /ReversedChars. Does this mean, Tesseract should implement this storage method?

At least Acrobat Reader should do it right for the interfaces to the "normal" world like search field. Other PDF viewers do it.
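
To make the `/ReversedChars` discussion above more concrete, here is a rough sketch of the content-stream shape that section 14.8.2.3.3 describes: a marked-content sequence tagged `ReversedChars` whose show strings hold the characters in reverse of reading order. The function names are hypothetical, and the example uses ASCII only, because real Arabic or Hebrew show strings depend on the font encoding in use. It is not what Tesseract emits.

```python
def escape_pdf_literal(s: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def reversed_chars_block(logical_text: str) -> str:
    """Emit a /ReversedChars marked-content sequence for one show string."""
    visual = logical_text[::-1]  # characters in reverse of reading order
    return f"/ReversedChars BMC\n({escape_pdf_literal(visual)}) Tj\nEMC"


print(reversed_chars_block("Hello"))
```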

amitdo · 2022-02-23T20:31:20Z

It's not just Adobe, Firefox and Evince also have the same issue.

Years ago, there was an attempt to implement the strings reversal as described in the spec, to make Adobe and other viewers happy, but it didn't work so well.

amitdo · 2022-02-23T21:38:43Z

There is a regression with Google's Chrome. It completely fails to render the Arabic pdf above. The same failure occurs with another pdf with Hebrew.

$ chromium --version
Chromium 98.0.4758.102 snap

florisre · 2023-04-24T14:50:13Z

As of now (using tesseract 5.3.1), the issue still exists. Brave (Chromium-based) handles my Arabic-script PDF correctly, Firefox and Okular do not. Haven't tested anything else.

tbadran changed the title ~~Arabic language (right to left in writing) stored (left to write) after create PDF Searchable~~ Arabic language (right to left in writing) stored (left to right) after create PDF Searchable Feb 25, 2016

amitdo mentioned this issue Mar 26, 2016

Arabic Language output is reversed #169

Closed

amitdo mentioned this issue Feb 6, 2019

Searchable PDF text is reversed in Hebrew #2219

Closed

yregaieg mentioned this issue May 30, 2019

Tesseract creates PDF with no spaces for Arabic #2446

Closed

Mennaruuk mentioned this issue Feb 15, 2021

Arabic script is backwards and improperly aligned in output searchable PDFs ocrmypdf/OCRmyPDF#718

Closed

amitdo added the RTL label Mar 18, 2021

diyajunaid mentioned this issue May 17, 2021

Arabic OCR left to right issue madmaze/pytesseract#350

Closed

stweil mentioned this issue Jun 29, 2021

Tesseract not rendering searchable PDF correctly in Arabic #3472

Closed

stweil added this to the 5.0.0 milestone Jun 29, 2021

amitdo modified the milestones: 5.0.0, 6.0.0 Aug 16, 2021

florisre mentioned this issue Apr 24, 2023

Issue with PDFs containing Arabic script/RTL script zotero/reader#101

Open

florisre mentioned this issue Jul 27, 2023

Arabic script, embedded by tesseract-ocr, not handled correctly mozilla/pdf.js#16754

Open

stweil mentioned this issue Aug 18, 2023

OCRing images written in Hebrew with diacritics is completely not working #4119

Open

jbarlow83 mentioned this issue Oct 20, 2023

OCR-Generated Text Layers Not Readable by PDF Readers for RTL Languages Like Persian ocrmypdf/OCRmyPDF#1157

Open
