Arabic text (written right to left) is stored left to right in searchable PDF output #238
Comments
Please attach your sample file and the command you used for the OCR job. |
This is the command:
tesseract c:\temp\test_ara.jpg -l ara -psm 3 c:\temp\test_ara pdf
Files are attached (source JPG and output PDF). Please check the original word. |
The command and samples are now attached in the previous comment. |
Which program are you using to view the PDF? |
It does not look reversed with the Chrome PDF viewer, just not very accurate... |
@amitdo can we work on language data for better results? |
I am using Adobe Reader. |
This is a serious issue with the PDF output feature for Arabic and similar languages that are written from right to left. |
It seems that Ray is planning to release a new version of Tesseract soon, which will include a new OCR engine based on LSTM. With LSTM, OCR for printed Arabic (not real handwriting) can reach 95% character accuracy. "Offline Printed Urdu Nastaleeq Script Recognition |
Neither you nor I know what programs they are using to do OCR there... |
Yes, I know... Here is a copy of the invisible text layer (copied & pasted): مداها ينم همهما Using Chromium (Google browser) PDF viewer under Linux. |
I try hard to make sure Arabic and other right-to-left languages work correctly in Tesseract PDF. As the problem is isolated further I'm happy to look, but I'm not aware of any reason things would have broken. |
A quick check shows Chrome gives good results (as per amitdo) and Acroread gives bad results (as per tbadran). This is surprising, I thought we were good with Acroread. I wonder if this is a regression and if so when it occurred. |
Regarding recognition accuracy, that's a better topic for the forum. But in short: Don't compare against Google Drive. Don't expect major accuracy improvements unless/until Ray is successful with his ideas. And most importantly, don't trust any predictions about 'soon'. That last one is true for all software everywhere. |
Please note that I am testing with the Windows binaries downloaded from: |
I have tested multiple different sample files, not only the sample uploaded above, and every time I get the same issue in the output PDF on Windows 10 + Acrobat Pro 11. |
On OS X, I'm seeing the opposite of earlier reports:
Adobe Acrobat: امهمه مني اهادم
Google Chrome: مداها ينم همهما |
Tom, look at the original JPG. |
Again, in Google Chromium. مداها ينم همهما |
I find it a little easier to test with Hebrew because the letters do not connect. Tesseract version 3.03 behaves the same, so this is not a regression. Will need to think about this, because it is not obvious what exactly is going wrong. Lots of PDF files do a crazy 'write it backwards' strategy but that should not be required. Tesseract writes in reading order. |
There are two things I can think of doing. One is to give up and write Arabic |
@jbreiden |
Please make sure that any change you make does not cause a regression with the Chrome PDF viewer or OS X Preview. Thanks for your work! |
@amitdo Hebrew has the exact same problem as Arabic. |
Maybe explicitly using Unicode bidi control characters could help? |
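As a rough illustration of this suggestion, the sketch below wraps a string in explicit Unicode directional controls. The helper names `wrap_rtl` and `strip_bidi_controls` are hypothetical, and nothing in this thread confirms that any PDF viewer honors these characters when extracting text; the code only shows which code points would be involved.

```python
import unicodedata

RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING (the Unicode control, not the file format)
RLM = "\u200F"  # RIGHT-TO-LEFT MARK


def wrap_rtl(text: str) -> str:
    """Hypothetical helper: mark `text` as an explicit right-to-left run.

    The text itself stays in logical (reading) order; only two invisible
    control characters are added around it.
    """
    return f"{RLE}{text}{PDF}"


def strip_bidi_controls(text: str) -> str:
    """Remove the directional controls again, e.g. before comparing strings."""
    return "".join(ch for ch in text if ch not in (RLE, PDF, RLM))


if __name__ == "__main__":
    # The Arabic word from the original report, written as escapes.
    word = "\u0645\u0631\u062D\u0628\u0627"
    for ch in wrap_rtl(word):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Whether such controls would change what Acrobat or other viewers return on copy and paste is an open question here, not something this sketch demonstrates.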
That's another possibility, thanks for the suggestion. |
Correction: currently inactive. I still think it is important though.
On Mon, Sep 17, 2018 at 11:12 AM Jeff Breidenbach wrote:
Status is still unsolved, and currently inactive. I swallowed my pride and experimented with writing Arabic backwards in PDF like everyone else does, and it still didn't work nicely. |
This sounds as if there will not be a fix in the near future. So we should not require that this bug must be fixed for 4.0.0. |
Is this issue resolved? |
@jbreiden I have experimented with the files he attached, and I noticed something that actually makes sense: it seems the two files have different setups for text orientation when I start selecting from mid-sentence and drag over a few lines: |
Not sure if this is of any value: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf Section 14.8.2.3.3 could be related to this issue as well. (I see that you had a look at this section 3 years ago in #238 (comment), so it is probably not really what is causing our issue.) |
I would like to report that the problem still persists in Tesseract 4.1.1. Yes, if you open the PDF in Acrobat, it will give you reversed words, and it will work fine in the Google Chrome PDF reader. However, when I extracted the stored text from the PDF/A using pdfToText, the words were reversed too, which means the text was stored in the wrong order.
See the following example for more details: Here is the PDF/A generated by Tesseract
To summarize: True Text Tesseract Text 100% correct Tesseract PDF/A Text
As you can see in the Tesseract PDF/A text, every word is reversed although the .hOCR file is correct. Actually, the words are not reversed (you can still read every letter), but the "entire line is mirrored". Usually, we face this problem when rendering Arabic text in HTML by setting "text-align:right".
I think the problem here is that the x-coordinate of each RTL letter is rendered by measuring x from the left rather than from the right, i.e., (x,y) should be (W-x, y), where W is the page width. |
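The distinction drawn above (reversed words versus a mirrored line) can be checked mechanically once the text layer has been extracted. The following is a minimal sketch, assuming you already have the expected text and the extracted text as Python strings; the function name `classify_order` and its labels are hypothetical, and no PDF library is called.

```python
def classify_order(expected: str, extracted: str) -> str:
    """Compare extracted text against the expected logical-order text.

    Note: slicing with [::-1] reverses code points, so strings that
    contain combining marks (e.g. Arabic diacritics) may need extra care.
    """
    if extracted == expected:
        return "logical"
    if extracted == expected[::-1]:
        # Every code point of the line is in reverse order.
        return "line reversed"
    words = expected.split(" ")
    if extracted == " ".join(w[::-1] for w in words):
        # Word positions are kept, letters inside each word are reversed.
        return "words reversed in place"
    if extracted == " ".join(reversed(words)):
        # Letters inside each word are kept, word positions are reversed.
        return "word order reversed"
    return "other"


if __name__ == "__main__":
    print(classify_order("ab cd", "dc ba"))
    print(classify_order("ab cd", "ba dc"))
    print(classify_order("ab cd", "cd ab"))
```

Running this on the text extracted by different viewers or tools would show which of the cases each one produces, which is the kind of evidence the discussion above is missing.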
Please advise whether this issue is resolved in any recent version of Tesseract. |
I'm using Tess4j 4.5.1 and having the same issue. When I process an image and create a PDF, then open it in Acrobat Reader, it displays 100% correctly. However, search doesn't work unless I reverse the letters. When I copy the text and paste it into MS Word, it is displayed reversed. |
Unfortunately, this long-standing issue has not been solved. Copy & paste or search of words in documents written in RTL scripts only works with Google Chrome's PDF reader. Even with this viewer there might be some issues. |
RTL scripts are stored left to right (storage order) according to standards, and only rendered and displayed RTL. In the above example the text is stored the right way, but the GitHub editor cannot handle it correctly. But I can use a screenshot of my command line:
Also, on my macOS, Preview displays the PDF correctly, and it is searchable with copy & paste into the search field:
Thus it's a problem of PDF viewers, not of Tesseract. |
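To make the storage-order point concrete: in logical order, the first code point in memory is the first letter a reader reads, and the display direction is derived from character properties rather than from the order of storage. The sketch below is a simplified first-strong-character check in the spirit of the Unicode Bidirectional Algorithm; it ignores embeddings and isolates, and the function name `first_strong_direction` is hypothetical.

```python
import unicodedata


def first_strong_direction(text: str) -> str:
    """Return 'rtl', 'ltr', or 'neutral' based on the first strong character.

    Strong classes: 'L' (left-to-right), 'R' (right-to-left, e.g. Hebrew),
    'AL' (Arabic letter). Digits, spaces, and punctuation are skipped.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return "neutral"


if __name__ == "__main__":
    arabic = "\u0645\u0631\u062D\u0628\u0627"  # the word from the original report
    hebrew = "\u05E9\u05DC\u05D5\u05DD"
    for sample in (arabic, hebrew, "tesseract", "123 "):
        print(repr(sample), first_strong_direction(sample))
```

A renderer that follows this model lays the Arabic word out from the right without the stored string ever being reversed.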
Which standards are you referring to here? Here is how it works in HTML: Regarding the PDF standard: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf Tesseract does not do what is described in: |
Unicode
HTML just implements the Unicode standard. And all editors, command line clients I know use logical storage order.
That's an obscurity of the PDF specification. Yes, in the above PDF sample I didn't find the string At least Acrobat Reader should do it right for the interfaces to the "normal" world like search field. Other PDF viewers do it. |
It's not just Adobe, Firefox and Evince also have the same issue. Years ago, there was an attempt to implement the strings reversal as described in the spec, to make Adobe and other viewers happy, but it didn't work so well. |
There is a regression with Google's Chrome. It completely fails to render the Arabic pdf above. The same failure occurs with another pdf with Hebrew.
|
As of now (using tesseract 5.3.1), the issue still exists. Brave (Chromium-based) handles my Arabic-script PDF correctly, Firefox and Okular do not. Haven't tested anything else. |
I have tested the latest release, 3.05, on Windows to OCR an Arabic document to a searchable PDF. When I select text in the output PDF, it appears to be stored in the opposite direction (left to right), whereas the letters should be stored right to left.
For example, the original text in Arabic is
مرحبا
It is stored in the PDF text layer as
ابحرم
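Because viewers and editors reorder RTL text for display, the two words above are easier to compare as code points. The sketch below transcribes them as escapes, assuming the characters in the report are the plain Arabic letters they appear to be; the helper name `code_points` is hypothetical.

```python
import unicodedata

# Transcriptions of the two words in the report (an assumption, see above).
original = "\u0645\u0631\u062D\u0628\u0627"  # the word as written in the source image
reported = "\u0627\u0628\u062D\u0631\u0645"  # the word as copied from the PDF


def code_points(text: str) -> list[str]:
    """List each character as 'U+XXXX NAME' in storage order."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


if __name__ == "__main__":
    for label, word in (("original", original), ("reported", reported)):
        print(label)
        for entry in code_points(word):
            print("  ", entry)
    # True when the copied text is the original with its code points reversed.
    print(reported == original[::-1])
```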