Arabic script is backwards and improperly aligned in output searchable PDFs #718
Comments
I see no difference between the output of Your issue may also be related to the program that is used to view the PDF. Some PDF viewers may not handle RTL text correctly at all. |
ocrmypdf copies the output of Tesseract into the PDF essentially without modification. |
Okay, I see, looks like two issues with Tesseract. I'll get in touch with them about these. Maybe it's the alpha version that isn't doing well, although when I downloaded Tesseract, one of the mirrors said, "We don't provide an installer for Tesseract 4.1.0 because we think that the latest version 5.0.0-alpha is better for most Windows users in many aspects (functionality, speed, stability)." I'll test Tesseract 4 and see if the issue can be reproduced there. Thank you for your assistance! |
I tested this on Tesseract 4.1.0 and I was able to reproduce both issues. I'll create an issue at Tesseract's GitHub page, hope it could be looked into. |
@Mennaruuk Would you mind linking to the Tesseract here? |
https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v4.1.0-elag2019.exe I opened the file with Sumatra PDF, and the text direction was proper, so I was able to copy and paste just fine. So this seems to be largely a problem with the PDF readers (such as Adobe Acrobat Reader DC, Firefox, and Okular). Tesseract has had an issue about this since 2016: |
Has the mirroring issue been solved, please? |
I’m positive the problem isn’t with OCRmyPDF. It’s with whatever PDF reader you have. Some of them just suck at reading Arabic OCR. Try opening output PDF with Sumatra PDF if you have Windows and see if the issue is resolved. |
I downloaded it, the issue is the same unfortunately. |
Sorry for the late response. Would you mind sharing a sample PDF? Also, what OS are you running? |
@Mennaruuk Praise be to God. |
Hello, I have the same issue, and I have the latest Tesseract 5.0 beta installed (20210506), but the issue still exists for me. In which version was this resolved? |
Please add a sample file and note what PDF viewer you are using. |
Unfortunately, I don't have a sample file right now, but I can tell you what I did: Tesseract was installed normally via the exe file (version: tesseract-ocr-w64-setup-v5.0.0-alpha.20210506.exe). I also had it download all language files and language scripts, and otherwise kept everything at the defaults. The PDF viewers I tested are Okular and Adobe Reader. Both show the same issue: reversed text search (and reversed text copy) for Arabic text. EDIT: As an example file, you can also use the one that was already posted here. I tested it and got the same issue with that PNG file (I needed to run convert s.png -background white -alpha remove -alpha off output.png first, or else img2pdf wouldn't turn it into a PDF). |
May try |
I just took the image from this issue and turned it into a PDF. I also tested your method, and output.txt came out properly, without errors. When I copy from output.txt, the text is copied the right way, but when I copy from output.pdf, all of the text is copied reversed. I'm a bit confused: what could be the reason, since it seems to work in output.txt? EDIT: I just tested Sumatra again (I thought I had already tested it), and in Sumatra it works properly. I can copy the text without the reversal issue. I wonder why it doesn't work in Okular and Adobe Reader, because when I open original Arabic files and copy from them, it works. It just doesn't work for my OCR files. EDIT 2: I mean, is this a general issue with Adobe Reader and Okular? Or do I need to change their settings so they recognize Arabic letters better? EDIT 3: Also, thank you very much for taking the time to help me out. So the conclusion is that ocrmypdf, Tesseract, and img2pdf seem to work properly, but my PDF viewers give me reversed text when I copy, while Sumatra PDF seems to handle it. |
I don't believe this is any different from the known and unresolved issues in tesseract-ocr/tesseract#238. That issue, in turn, is contending with the fact that many PDF generators don't generate RTL text properly, many PDF viewers don't handle it properly, and your operating system's clipboard may not handle it properly either. Until it is solved in Tesseract, there is not much I can do. |
I see. Thank you. |
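Since the behavior discussed above only matters for right-to-left text, it can help to confirm that a copied string actually is RTL before blaming the viewer or the clipboard. The following is a minimal sketch using only Python's standard `unicodedata` module; the function names `rtl_ratio` and `is_mostly_rtl` and the 0.5 threshold are illustrative assumptions, not part of OCRmyPDF or Tesseract.

```python
import unicodedata

# Bidirectional classes for strongly right-to-left characters:
# "R" (e.g. Hebrew letters) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}
STRONG_CLASSES = RTL_CLASSES | {"L"}


def rtl_ratio(text: str) -> float:
    """Return the fraction of strongly directional characters that are RTL."""
    classes = [unicodedata.bidirectional(ch) for ch in text]
    # Ignore spaces, digits, and punctuation, which have no strong direction.
    strong = [c for c in classes if c in STRONG_CLASSES]
    if not strong:
        return 0.0
    return sum(c in RTL_CLASSES for c in strong) / len(strong)


def is_mostly_rtl(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical helper: True when most strong characters are RTL."""
    return rtl_ratio(text) > threshold
```

For example, `is_mostly_rtl` can be run on text pasted from a viewer and on the same passage from output.txt. This only tells you the script direction of the characters; it says nothing about whether their order is correct.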
Describe the bug
Output PDF files do not properly OCR Arabic text. It is backwards. For example, a word like orange is displayed as egnaro. Also, text is improperly aligned in PDF files.
To Reproduce
What command line or API call were you trying to run?
(I had OCRmyPDF work on an input image. I reproduced the same results with an input PDF.)
Logs
Example file
Expected behavior
Text should be displayed, for example, as I like OCR. Instead, it's being displayed as RCO ekil I.
Another thing is OCR isn't being properly aligned. For example, in this screenshot, I'm selecting one word. However, when I right-click it, not only is the text backwards, but it's a different word. The word matches the one right above the one I selected. You can see that the red boxes are highlighting how the OCR alignment is incorrect.
Another screenshot to showcase improper alignment: this is what happens when I select everything. You can see the blue is prominently below, not on, each word.
It's important to mention that OCRmyPDF properly performs text recognition for the output TXT files. This seems to be happening mainly in PDF files.
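Because the TXT output is correct while text copied from the PDF is not, the two can be compared directly to confirm that the PDF copy is a plain character reversal rather than a recognition error. Below is a minimal sketch in Python; `is_reversed_copy` is a hypothetical helper, and it assumes the reversal happens code point by code point within a single line, which may not hold exactly for text with combining marks.

```python
def is_reversed_copy(expected: str, copied: str) -> bool:
    """Return True if `copied` is `expected` with its characters reversed.

    `expected` would come from the TXT output, `copied` from the PDF viewer.
    Palindromes are excluded, since they look identical in both directions.
    """
    expected = expected.strip()
    copied = copied.strip()
    return expected != copied and copied == expected[::-1]


# The two examples given in this report:
print(is_reversed_copy("orange", "egnaro"))
print(is_reversed_copy("I like OCR", "RCO ekil I"))
```

If this returns True for text copied from one viewer and False for another (for instance, Sumatra PDF), that points at the viewer's handling of RTL text rather than at the recognized text itself.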
System