Gujarati, Hindi and Sanskrit Language OCR not working #583

Closed
vikithakar opened this issue Jan 22, 2024 · 5 comments

Comments

vikithakar commented Jan 22, 2024

[Screenshot from 2024-01-22 18-03-00]

Description of Issue

After building Papermerge with Gujarati, Hindi and Sanskrit language support, uploading files and running OCR on them produces incorrect OCR text. I think tesseract-ocr itself produces consistent text output for the file, but it seems that Papermerge lacks the fonts or character sets needed to display the OCR text in these languages.

Build Details

Dockerfile to add tesseract-ocr to papermerge

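FROM papermerge/papermerge:3.0.2
# Refresh the package index first, then install the Hindi, Gujarati and Sanskrit language data
RUN apt-get update && apt-get install -y tesseract-ocr-hin tesseract-ocr-guj tesseract-ocr-san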

Info:

  • Papermerge Version 3.0.2
vikithakar added the bug label Jan 22, 2024
ciur added the missing-language label and removed the bug label Jan 23, 2024
ciur (Owner) commented Jan 23, 2024

Thank you for reporting the issue!

ciur commented Jan 30, 2024

@vikithakar

To make this work, I need to add the Gujarati, Hindi and Sanskrit codes here and here. For the second list, I need each language's name written in that language; for example, fra in French is "Français" and ell in Greek is "Ελληνικά".

Could you please provide the native spelling of the language names for Gujarati, Hindi and Sanskrit?

  • guj in Gujarati is "..." ?
  • hin in Hindi is "..." ?
  • san in Sanskrit is "..." ?

vikithakar (Author) commented

@ciur
Original Language Name

  • guj in Gujarati is ગુજરાતી
  • hin in Hindi is हिंदी
  • san in Sanskrit is संस्कृत
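
A minimal sketch of how the names given above could sit in a code-to-native-name lookup. The dictionary name `NATIVE_LANGUAGE_NAMES` and the helper `language_label` are hypothetical and are not taken from the Papermerge source; only the codes and spellings come from this thread.

```python
# Hypothetical lookup table: three-letter OCR language code -> native name.
# The fra/ell entries are the examples quoted earlier in this thread.
NATIVE_LANGUAGE_NAMES = {
    "fra": "Français",
    "ell": "Ελληνικά",
    "guj": "ગુજરાતી",
    "hin": "हिंदी",
    "san": "संस्कृत",
}


def language_label(code: str) -> str:
    """Return a display label such as 'हिंदी (hin)' for a language code."""
    return f"{NATIVE_LANGUAGE_NAMES[code]} ({code})"


if __name__ == "__main__":
    for code in ("guj", "hin", "san"):
        print(language_label(code))
```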

ciur commented Jan 31, 2024

@vikithakar

PR for adding the languages mentioned above.

The change will be available in the 3.0.3 release.

Note that you will still need to build your image as before. However, when you start Papermerge, don't forget to set the PAPERMERGE__OCR__DEFAULT_LANGUAGE variable so that imported documents are OCRed in the "default OCR" language.
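
As a rough illustration of the step above, the sketch below builds a `docker run` command line that passes the variable with Docker's `-e` flag. The image tag `my-papermerge:3.0.3` is a hypothetical name for the locally built image, the value `hin` assumes the variable accepts the same three-letter codes used in this thread, and any other settings Papermerge needs at startup are left out.

```python
import shlex


def docker_run_command(image: str, default_lang: str) -> list[str]:
    """Build a `docker run` argv that sets the default OCR language."""
    return [
        "docker", "run",
        # -e passes an environment variable into the container
        "-e", f"PAPERMERGE__OCR__DEFAULT_LANGUAGE={default_lang}",
        image,
    ]


# "my-papermerge:3.0.3" is a hypothetical tag for the locally built image.
cmd = docker_run_command("my-papermerge:3.0.3", "hin")

if __name__ == "__main__":
    # Print the command in a form that can be pasted into a shell.
    print(shlex.join(cmd))
```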

In the screenshot you uploaded to this ticket, you can see that the document was OCRed with the OCR language set to German (the deu code corresponds to German). That is why those strange characters appear.
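
One way to spot this kind of mismatch is to check which script the OCR output is actually written in. The sketch below is an illustration only and is not part of Papermerge: `script_share` and `EXPECTED_SCRIPT` are hypothetical names, and the check relies on Unicode character names from Python's standard `unicodedata` module. It assumes Sanskrit text is written in Devanagari, as in the native name given earlier in this thread.

```python
import unicodedata

# Hypothetical mapping: OCR language code -> prefix of the Unicode character
# names used by that language's script.
EXPECTED_SCRIPT = {
    "hin": "DEVANAGARI",
    "san": "DEVANAGARI",
    "guj": "GUJARATI",
}


def script_share(text: str, script_prefix: str) -> float:
    """Return the fraction of letters in `text` whose Unicode name starts
    with `script_prefix` (0.0 if the text contains no letters)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(
        1 for ch in letters
        if unicodedata.name(ch, "").startswith(script_prefix)
    )
    return hits / len(letters)


if __name__ == "__main__":
    # Text OCRed with the right language should be mostly in the expected
    # script; Latin-looking output suggests the wrong OCR language was used.
    print(script_share("हिंदी", EXPECTED_SCRIPT["hin"]))
    print(script_share("Deutsch", EXPECTED_SCRIPT["hin"]))
```

A low share for a document known to be in Hindi, Gujarati or Sanskrit would point to the OCR language setting rather than to missing fonts.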

[Image: lang-codes]

ciur commented Jan 31, 2024

@vikithakar

Here is a screenshot of the working app (as mentioned above, this will be part of 3.0.3):

[Image: papermerge-with-hindi-text]

ciur closed this as completed Feb 10, 2024