OCR-Generated Text Layers Not Readable by PDF Readers for RTL Languages Like Persian #1157

Open
PSEUDO-SAPPHO opened this issue Sep 24, 2023 · 9 comments
Labels: third party issue (Problem with a third party dependency)

Comments

@PSEUDO-SAPPHO

PSEUDO-SAPPHO commented Sep 24, 2023

What were you trying to do?

I used ocrmypdf to perform OCR on a PDF document, but I'm encountering a specific issue with RTL (right-to-left) languages like Persian. Although OCR processing succeeds, the text in the resulting PDF is not selectable or searchable in PDF readers such as Foxit Reader and other popular PDF viewers.

In Foxit Reader, the OCR-generated text was not right-to-left, and in Zotero's PDF reader the words were separated. I also tested this PDF in Chrome and Edge and did not encounter these issues there: OCR works and the text output produced by ocrmypdf is available.

Where are you installing from?

Windows package manager (chocolatey, etc.)

What operating system are you working on?

Windows

Relevant log output

No response

@PSEUDO-SAPPHO changed the title from "OCR-Generated Text Layers Not Readable by PDF Readers" to "OCR-Generated Text Layers Not Readable by PDF Readers for RTL Languages Like Persian" on Sep 24, 2023
@jbarlow83
Collaborator

Please provide an example file, the command you're using, and the versions you're using.

@medmedin2014

medmedin2014 commented Oct 14, 2023

@jbarlow83 @PSEUDO-SAPPHO

ocrmypdf: 15.1.0
Operating System: Manjaro Linux 
KDE Plasma Version: 5.27.8
KDE Frameworks Version: 5.110.0
Qt Version: 5.15.11
Kernel Version: 6.5.7-2-MANJARO (64-bit)
Graphics Platform: Wayland

I can confirm the bug with Arabic: the text layer in the output PDF is reversed.

Source file:
تقديم.pdf

Command:
ocrmypdf -l ara -f تقديم.pdf out-تقديم.pdf

Output:
out-تقديم.pdf

If you copy some text from the output PDF, the Arabic letters come out in reverse order:

If you copy:
[Image: Screenshot_20231014_124640]

You get:
يساردلا لشفلا ةلأسم تتاب

Instead of:
باتت مسألة الفشل الدراسي
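
The two strings above look like exact mirror images of each other. A minimal sketch for checking that, assuming the copied text is simply the expected text with its code points in reverse order (the function name `is_full_reversal` is hypothetical and not part of ocrmypdf):

```python
def is_full_reversal(expected: str, copied: str) -> bool:
    """Return True if `copied` is `expected` with its code points reversed."""
    return copied == expected[::-1]


# Strings taken from the report above.
expected = "باتت مسألة الفشل الدراسي"
copied = "يساردلا لشفلا ةلأسم تتاب"

# Prints whether the copied text is a code-point reversal of the expected text.
print(is_full_reversal(expected, copied))
```

Note that a plain code-point reversal would also misplace combining marks (diacritics), so text with vowel marks may need a more careful comparison.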

@jbarlow83
Collaborator

Unfortunately, this is an open issue in Tesseract PDF generation.
tesseract-ocr/tesseract#238
Other RTL languages might be affected too (Hebrew).

@jbarlow83 added the "third party issue" (Problem with a third party dependency) label and removed the "need test file" label on Oct 20, 2023
jbarlow83 added a commit that referenced this issue Dec 3, 2023
@jbarlow83
Collaborator

Fixed in v16

@AhmadHakami

@jbarlow83: Fixed in v16

This problem has not been solved yet, even with the updated version.

tesseract v5.3.1
ocrmypdf 16.0.3
  • Reference:
    وبعد الاطلاع علی الترتیبات التنظیمیة للمؤسسة

  • Searchable pdf:
    دعبو عالطالا یلع تابیترتلا ةیمیظنتلا ةسسؤملل
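
In this sample the word order appears to be preserved while the letters inside each word are reversed, which is a different pattern from a whole-line reversal. A small sketch for telling the two apart when comparing a reference string with text extracted from a PDF; the names `reverse_each_word` and `classify` are hypothetical, and words are assumed to be separated by single spaces:

```python
def reverse_each_word(text: str) -> str:
    """Reverse the code points of every space-separated word, keeping word order."""
    return " ".join(word[::-1] for word in text.split(" "))


def classify(reference: str, extracted: str) -> str:
    """Describe how `extracted` relates to `reference`."""
    if extracted == reference:
        return "match"
    if extracted == reference[::-1]:
        return "line-reversed"  # whole line mirrored, word order flipped too
    if extracted == reverse_each_word(reference):
        return "word-reversed"  # word order kept, letters mirrored per word
    return "other"


# Strings taken from the comment above.
reference = "وبعد الاطلاع علی الترتیبات التنظیمیة للمؤسسة"
extracted = "دعبو عالطالا یلع تابیترتلا ةیمیظنتلا ةسسؤملل"
print(classify(reference, extracted))
```

Running the same check on output from several viewers can help separate a problem in the PDF text layer from a problem in how a particular viewer copies text.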

@jbarlow83
Collaborator

jbarlow83 commented Jan 7, 2024

To confirm I'm not insane, the English translation of the first line should be something like
"The issue of academic failure has become a matter of concern to parents, teachers, and public opinion alike over the decades..."

I did some experiments. It's difficult because many programs handle RTL poorly, so it's hard to tell what is working in the first place.

@jbarlow83 reopened this on Jan 7, 2024
@AhmadHakami

Hi @jbarlow83
any updates?

@jbarlow83
Collaborator

Both Tesseract and OCRmyPDF use the Glyphless font approach to RTL. Glyphless is a font where every glyph is mapped to a non-printing character. I've come to believe that this approach won't work for RTL languages across all PDF viewers, that it barely works for Tesseract, and that techniques that improve rendering for LTR languages over the Tesseract baseline don't work for RTL.

There are at least three ways to create RTL text and some viewers don't support some methods well.

At the very least I believe I need to add a new character to the Glyphless font, which would be the blank RTL character. That would allow RTL text to be inserted in a way that is closer to how RTL text is typically rendered, as far as I know anyway.

It would probably also help to have a blank double-width character for CJK characters, and maybe something for vertical CJK.
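
As a rough illustration of the idea above, here is a sketch of how a character could be assigned to one of the proposed blank glyphs based on its Unicode properties. This is an assumption for discussion, not how OCRmyPDF or Tesseract work today; the function name `blank_glyph_kind` and the labels "rtl", "wide", and "default" are hypothetical.

```python
import unicodedata


def blank_glyph_kind(ch: str) -> str:
    """Pick a hypothetical blank-glyph category for one character."""
    # "R" covers Hebrew-style RTL letters, "AL" covers Arabic letters.
    if unicodedata.bidirectional(ch) in ("R", "AL"):
        return "rtl"
    # "W" (wide) and "F" (fullwidth) cover most CJK characters.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return "wide"
    return "default"


for ch in ("a", "\u0628", "\u05d0", "\u4e2d"):
    print(repr(ch), blank_glyph_kind(ch))
```

Vertical CJK is not covered here, because writing direction for vertical text is a layout property of the page rather than a property of individual characters.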

Alternatively, it looks like Noto Sans has become a universal open source font, and I could look into embedding it everywhere.

@UsernamePlankalkul

Hi, this is not a problem with Tesseract, because the results of extracting RTL text from images are fine in Tesseract. It's something in ocrmypdf, maybe the encoding or the rearranging of the characters. I'm still looking for a solution.

SumatraPDF also shows the correct arrangement of characters, but we don't want to use that software because of its poor performance and lack of features.

Did anyone find a solution?
