Can't OCR a pdf file #73

Open
bit-man opened this issue May 1, 2024 · 3 comments

Comments

@bit-man
Contributor

bit-man commented May 1, 2024

Uploading a PDF file and trying to OCR it (method: simple, format: txt) by pressing the Convert into Document button opens a new tab with the error Not Found, and no file is downloaded.

image

The Docker console shows this error:

172.17.0.1 - - [01/May/2024:20:26:37 +0000] "GET /HRProprietary/HRConvert2/DATA/856ca1146d63/7f10275ffce6/m1m2.txt HTTP/1.1" 404 489 "http://localhost:8080/HRProprietary/HRConvert2/convertCore.php?showFiles=1&gui=Default&language=en&color=blue" "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0"

Tailing the txt log in the Logs folder shows:

Op-Act, May 1, 2024, 8:36 pm, 856ca1146d63/b0806464b510: Initiating Converter.
Op-Act, May 1, 2024, 8:36 pm, 856ca1146d63/b0806464b510: User selected to perform OCR on file m1m2.pdf.
Op-Act, May 1, 2024, 8:36 pm, 856ca1146d63/b0806464b510: Copying file m1m2.pdf to /var/www/html/HRProprietary/HRConvert2/DATA/856ca1146d63/b0806464b510/m1m2.pdf.
Op-Act, May 1, 2024, 8:36 pm, 856ca1146d63/b0806464b510: Copied file m1m2.pdf.
Op-Act, May 1, 2024, 8:36 pm, 856ca1146d63/b0806464b510: Verified file /DATA/HRConvert2/856ca1146d63/b0806464b510/m1m2.txt.
ERROR!!! May 1, 2024, 8:36 pm, HRConvert2-22, 856ca1146d63/b0806464b510: OCR Operation Failed!
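
When the log grows, filtering it can be quicker than reading the tail by eye. The following is a minimal sketch, assuming the log keeps the line format shown in the excerpt above; the function names are hypothetical and the log file path is whatever file you are tailing in the Logs folder.

```shell
#!/bin/sh
# Print only the error lines from a log file.
# ASSUMPTION: error lines begin with the literal "ERROR!!!" prefix seen in the excerpt.
show_log_errors() {
    grep '^ERROR!!!' "$1"
}

# Print every line that belongs to one conversion job.
# $1 is the log file, $2 is the "<session>/<job>" ID, matched as a fixed string.
show_job_lines() {
    grep -F "$2" "$1"
}
```

For example, `show_job_lines <logfile> 856ca1146d63/b0806464b510` would list the six lines above, which makes it easier to see which step is the last one logged before the failure.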
@bit-man
Contributor Author

bit-man commented May 1, 2024

I tried to follow the code in convertCore.php, and the failing code seems to be at if (!in_array(strtolower($oldExtension), $pdf1array)). This condition evaluates to false, so no conversion is attempted. That makes no sense to me, because according to the comment on the previous line this is supposed to be the code that converts a PDF to a document.
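
To make the behaviour of that condition concrete, here is a small shell rendition of the same test. It is only a sketch: the real contents of $pdf1array are not shown in this issue, so the list below is assumed to hold just "pdf", and the function names are made up for illustration.

```shell
#!/bin/sh
# Shell rendition of the PHP test !in_array(strtolower($oldExtension), $pdf1array).
# ASSUMPTION: the list holds only "pdf"; the real $pdf1array is not shown in this issue.
PDF_LIST="pdf"

# Succeeds (returns 0) when the lowercased extension is one of the words in PDF_LIST.
in_list() {
    ext=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    for item in $PDF_LIST; do
        [ "$item" = "$ext" ] && return 0
    done
    return 1
}

# Mirrors the guarded branch: the body only runs when the extension is NOT in the list.
guarded_branch() {
    if ! in_list "$1"; then
        echo "branch runs for .$1"
    else
        echo "branch skipped for .$1"
    fi
}
```

Under that assumption, `guarded_branch PDF` reports the branch as skipped while `guarded_branch jpg` reports it as running, which matches the observation that a .pdf input never reaches the conversion code.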

I stripped out the negation and a file is now downloaded, but it is empty 😢. Still not working.
The log output follows:

Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Initiating Converter.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: User selected to perform OCR on file m1m2.pdf.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Copying file m1m2.pdf to /var/www/html/HRProprietary/HRConvert2/DATA/856ca1146d63/1029442e5485/m1m2.pdf.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Copied file m1m2.pdf.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Verified file /DATA/HRConvert2/856ca1146d63/1029442e5485/m1m2.txt.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Performing OCR intermediate operation using method 0.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Converted file /var/www/html/HRProprietary/HRConvert2/DATA/856ca1146d63/1029442e5485/m1m2.jpg to /var/www/html/HRProprietary/HRConvert2/DATA/856ca1146d63/1029442e5485/m1m2.txt.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Performing OCR final using method 0.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Renamed file /var/www/html/HRProprietary/HRConvert2/DATA/856ca1146d63/1029442e5485/m1m2.pdf to /var/www/html/HRProprietary/HRConvert2/DATA/856ca1146d63/1029442e5485/m1m2.txt.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Created a file at /DATA/HRConvert2/856ca1146d63/1029442e5485/m1m2.txt.

No time today to follow up. I will try this weekend or some other time. Happy if anyone else can continue from here.
I added this change to https://github.com/bit-man/HRConvert2 in case anyone wants to try a fix.

@zelon88
Owner

zelon88 commented May 22, 2024

Sorry for the delayed response. Can you try the following.....

sudo leafpad /etc/ImageMagick-6/policy.xml

Find and edit the following line.....

<policy domain="coder" rights="none" pattern="PDF" />

.....To.....

<policy domain="coder" rights="read|write" pattern="PDF" />

And let me know the result.
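
For a container without a GUI editor, the same edit can be made non-interactively. This is a sketch rather than a tested fix for this issue: the function name is hypothetical, and it assumes the policy line is written exactly as quoted above (same attribute order and spacing).

```shell
#!/bin/sh
# Grant read|write rights to the PDF coder in an ImageMagick policy file.
# $1 is the policy file; a backup is kept next to it as <file>.bak.
# ASSUMPTION: the line reads rights="none" pattern="PDF" exactly as quoted above.
allow_pdf_coder() {
    policy_file="$1"
    cp "$policy_file" "$policy_file.bak" || return 1
    # Only rewrite coder-domain lines, and only the one whose pattern is PDF.
    sed '/domain="coder"/ s#rights="none" pattern="PDF"#rights="read|write" pattern="PDF"#' \
        "$policy_file.bak" > "$policy_file"
}
```

Run as root inside the container it would be called as `allow_pdf_coder /etc/ImageMagick-6/policy.xml`. Afterwards, `convert -list policy` should show the PDF coder with read and write rights if the change was picked up.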

@zelon88
Owner

zelon88 commented Jun 10, 2024

I have not been satisfied with the OCR performance on PDF files lately myself.
I've known for some time that the functions for OCR need to be refactored. This is mentioned in CHANGELOG.txt several times, I'm sure of it.

Look for a refactor of the OCR related functions hopefully before v3.4 comes out. This is some of the oldest code left in the codebase today. Most of it pre-dates the v2.7 Valkyre -> Diablo re-write.

@zelon88 zelon88 self-assigned this Jun 10, 2024
@zelon88 zelon88 added the Bug Something isn't working label Jun 10, 2024