OCR-Generated Text Layers Not Readable by PDF Readers for RTL Languages Like Persian #1157
Comments
Please provide an example file, the command you're using, and the versions you're using. |
I can confirm the bug with Arabic: the text layer in the output PDF is reversed. Source file: Command: Output: If you copy some text from the output PDF, the Arabic letters are copied in reverse order. You get: Instead of: |
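To make the symptom concrete, here is a minimal Python sketch of what "copied in reverse order" means at the code point level. It assumes the viewer returns the characters of an RTL run in exactly reversed order, which is only one possible reading of the report; the helper names `has_rtl` and `undo_reversed_copy` are hypothetical and are not part of OCRmyPDF or Tesseract.

```python
import unicodedata

# Bidi classes of strong right-to-left characters:
# "R" (e.g. Hebrew letters) and "AL" (Arabic-script letters, including Persian).
STRONG_RTL = {"R", "AL"}


def has_rtl(text: str) -> bool:
    """Return True if the text contains at least one strong RTL character."""
    return any(unicodedata.bidirectional(ch) in STRONG_RTL for ch in text)


def undo_reversed_copy(copied: str) -> str:
    """Reverse the string again if it contains RTL characters.

    Assumption: the whole string is a single RTL run that was copied in
    reversed order. Mixed LTR/RTL text and combining marks would need
    more careful handling than a plain reversal.
    """
    return copied[::-1] if has_rtl(copied) else copied


# The Persian/Arabic word "salam", written with escapes in logical order.
logical = "\u0633\u0644\u0627\u0645"
# What a viewer would hand back if it emitted the letters in reverse.
reversed_copy = logical[::-1]

print(reversed_copy == logical)                      # False
print(undo_reversed_copy(reversed_copy) == logical)  # True
```

Comparing the copied text with the expected text this way can help tell a reversed text layer apart from a merely misrendered one, since many programs display RTL text inconsistently.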
Unfortunately, this is an open issue in Tesseract PDF generation. |
Fixed in v16 |
This problem has still not been solved, even with the updated version. |
To confirm I'm not insane, the English translation of the first line should be something like I did some experiments - it's difficult since many programs handle RTL poorly, so it's hard to tell what is working in the first place. |
Hi @jbarlow83 |
Both Tesseract and OCRmyPDF use the Glyphless font approach to RTL. Glyphless is a font where every glyph is mapped to a non-printing character. I've come to believe that this approach won't work for RTL languages across all PDF viewers, that it barely works for Tesseract, and that techniques that improve rendering for LTR languages over the Tesseract baseline don't work for RTL. There are at least three ways to create RTL text, and some viewers don't support some methods well. At the very least, I believe I need to add a new character to the Glyphless font, which would be the blank RTL character. That would allow RTL fonts to be inserted in an approach that is closer to how RTL fonts are typically rendered, as far as I know anyway. It would probably also help to have a blank double-width character for CJK characters, and maybe something for vertical CJK. Alternately, it looks like Noto Sans has become a universal open source font and I could look into embedding it everywhere. |
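As a rough illustration of the idea above (separate blank glyphs for RTL and for double-width CJK characters), the following Python sketch classifies a character using only the standard `unicodedata` module. The function name `blank_glyph_kind` and the labels `"rtl"`, `"wide"`, and `"default"` are hypothetical; this is not how OCRmyPDF or the Glyphless font is implemented, and vertical CJK is not covered.

```python
import unicodedata


def blank_glyph_kind(ch: str) -> str:
    """Guess which kind of blank glyph a character might map to.

    Hypothetical categories, for illustration only:
      "rtl"     - strong right-to-left characters (bidi class R or AL)
      "wide"    - East Asian wide or fullwidth characters (W or F)
      "default" - everything else
    """
    if unicodedata.bidirectional(ch) in {"R", "AL"}:
        return "rtl"
    if unicodedata.east_asian_width(ch) in {"W", "F"}:
        return "wide"
    return "default"


# Arabic letter seen, a CJK ideograph, and a Latin letter.
for ch in ("\u0633", "\u6f22", "a"):
    print(f"U+{ord(ch):04X} -> {blank_glyph_kind(ch)}")
```

The RTL check comes first because Arabic-script letters have a neutral East Asian width, so the order of the two tests does not change their result; it only keeps the common RTL case explicit.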
Hi. This is not a problem with Tesseract, because the results of extracting RTL text from images with Tesseract are fine. It is something in ocrmypdf, maybe the encoding or a rearranging of the characters. I'm still looking for a solution. SumatraPDF also shows the correct arrangement of characters, but we don't want to use that software because of its poor performance and lack of features. Has anyone found a solution? |
What were you trying to do?
I have used ocrmypdf to perform OCR on a PDF document, but I'm encountering a specific issue with RTL (right-to-left) languages like Persian. Although OCR processing succeeds, the text in the resulting PDF is not selectable or searchable in PDF readers such as Foxit Reader and other popular PDF viewers.
In Foxit Reader, the OCR-generated text was not in RTL order. In Zotero's PDF reader, the words were separated. It's worth noting that when I tested this PDF in Chrome and Edge, I didn't encounter these issues: OCR works and the text output produced by "ocrmypdf" is available.
Where are you installing from?
Windows package manager (chocolatey, etc.)
What operating system are you working on?
Windows
Relevant log output
No response