Fix use of wrong UNICHARSET #1954

stweil · 2018-10-06T11:28:54Z

Signed-off-by: Stefan Weil <sw@weilnetz.de>

stweil · 2018-10-06T11:31:09Z

This should fix issue #1205. I found it by adding an assertion higher up in the call stack. There might be other similar code using the wrong UNICHARSET, so I'll try to add more assertions to find it.
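To illustrate the idea of checking early, here is a minimal sketch of an assertion placed at the point where IDs enter a pipeline, so that a mismatch is caught before a later lookup fails. It is not the code from this PR and does not use Tesseract's classes: `ToyCharset`, `ToyResult`, `ids_match_charset` and `require_ids_match` are hypothetical names made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a character set; NOT Tesseract's UNICHARSET.
class ToyCharset {
 public:
  explicit ToyCharset(std::vector<std::string> symbols)
      : symbols_(std::move(symbols)) {}

  // True if `id` is a valid index into this charset.
  bool contains_id(int id) const {
    return id >= 0 && static_cast<std::size_t>(id) < symbols_.size();
  }

  // Checked lookup: throws instead of reading out of bounds.
  const std::string& id_to_symbol(int id) const {
    if (!contains_id(id)) throw std::out_of_range("id not in this charset");
    return symbols_[static_cast<std::size_t>(id)];
  }

 private:
  std::vector<std::string> symbols_;
};

// Hypothetical recognizer output: the ids are only meaningful
// relative to the charset that produced them.
struct ToyResult {
  const ToyCharset* charset;
  std::vector<int> ids;
};

// Returns true only if `result` was produced with `expected`
// and every id is valid for it.
bool ids_match_charset(const ToyResult& result, const ToyCharset& expected) {
  if (result.charset != &expected) return false;
  for (int id : result.ids) {
    if (!expected.contains_id(id)) return false;
  }
  return true;
}

// Early assertion, meant to sit high in the call stack so the failure
// points at the caller that mixed up the charsets.
void require_ids_match(const ToyResult& result, const ToyCharset& expected) {
  assert(ids_match_charset(result, expected));
}
```

Calling `require_ids_match` where a result is handed over moves the failure closer to the code that picked the wrong charset, rather than to a low-level lookup far from the cause.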

With tesseract v4.0.0-beta.3 we often observe crashes with:

```
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../src/ccutil/unicharset.h, line 511
```

This seems to have been fixed by tesseract-ocr/tesseract#1954.

Still, even after updating to 4.1.1, text recognition from PDF in ERP5 is too expensive. We also update Ghostscript to 9.54.0, because this version has built-in OCR, which does not need to convert the PDF to PNG then TIFF as we currently do in ERP5.

See merge request nexedi/slapos!985

Fix use of wrong UNICHARSET

8dc9e9f

Signed-off-by: Stefan Weil <sw@weilnetz.de>

ghost assigned stweil Oct 6, 2018

ghost added the review label Oct 6, 2018

stweil mentioned this pull request Oct 6, 2018

some images translated to text using Tesseract 4 throw an error regarding "contains_unichar_id" #1205

Closed

zdenop merged commit 9efedc1 into tesseract-ocr:master Oct 6, 2018

ghost removed the review label Oct 6, 2018

stweil deleted the unicharset branch October 6, 2018 13:21

stweil mentioned this pull request Mar 10, 2019

Issue 13590: tesseract-ocr/fuzzer-api: Heap-buffer-overflow in GenericVector<int>::size #2298

Closed
