Windows Build 577% Slower than Linux Build #1307

leemorton · 2018-02-05T13:33:45Z

Environment

Tesseract Version: 4.00.00alpha + UB-Mannheim/4.00.00alpha
Platform: Debian Stretch x64 + Windows 10 Professional x64

Current Behavior:

An identical machine spec with an identical workload and Tesseract configuration consistently performs 577% slower on Windows 10 x64 than on Debian Stretch x64. The job takes on average 18 seconds with the Linux build and 1 minute 44 seconds with the Windows build. This has been tested on other machines and on fresh installations.
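As a quick check of the figures quoted here, the two timings can be turned into a ratio and a percentage. This is only a sketch of the arithmetic; the variable names are made up for illustration.

```python
# Timings reported in this post, converted to seconds.
linux_s = 18
windows_s = 1 * 60 + 44  # 1 minute 44 seconds

# How many times as long the Windows run takes.
ratio = windows_s / linux_s

# Additional time on Windows, as a percentage of the Linux time.
extra_pct = (windows_s - linux_s) / linux_s * 100

print(f"Windows takes {ratio:.2f}x as long ({extra_pct:.0f}% more time)")
# prints "Windows takes 5.78x as long (478% more time)"
```

So the Windows run takes about 578% of the Linux time, which appears to be where the figure in the title comes from.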

Expected Behavior:

Significantly less than 577% difference in performance.

What could be causing the win build to experience that level of overhead...?

stweil · 2018-02-05T13:44:34Z

Could you please repeat your test with environment variable OMP_THREAD_LIMIT=1 (see #1081) and report the results?

I expect the difference will be much smaller then. Multithreading on Windows does not perform very well. For a single-threaded Tesseract there should be nearly no difference, because the code was generated by the same kind of compiler (gcc) in both cases.

Which version is UB-Mannheim/4.00.00alpha? The latest is tesseract-ocr-setup-4.0.0-alpha.20180109.exe, did you use that one?
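One way to run the suggested OMP_THREAD_LIMIT=1 test reproducibly is to set the variable only for the child process. The following is a minimal sketch; the helper names `single_thread_env` and `run_single_threaded` are hypothetical and not part of Tesseract.

```python
import os
import subprocess


def single_thread_env(base=None):
    """Return a copy of the environment with OpenMP limited to one thread."""
    env = dict(os.environ if base is None else base)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def run_single_threaded(cmd):
    """Run cmd (a list of arguments) with OMP_THREAD_LIMIT=1 set for that process only."""
    return subprocess.run(cmd, env=single_thread_env(), check=True)
```

For example, `run_single_threaded(["tesseract", "in.png", "out"])` would run one OCR job with the limit applied, where `in.png` and `out` are placeholder file names. The parent shell's environment is left unchanged, so a run without the limit can follow immediately for comparison.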

leemorton · 2018-02-05T14:59:38Z

I gave it a try with OMP_THREAD_LIMIT=1, and then also added OMP_NUM_THREADS=1.
The best it came up with was 2 minutes 48 seconds. I took the environment variables away again and got 1 minute 35 seconds. With or without those settings, Task Manager shows all the logical processors spiking; however, the spikes are much more erratic with OMP_THREAD_LIMIT=1, and the load is more consistently high, with no dips, without it.

tesseract-ocr-setup-4.0.0-alpha.20180109.exe is the version in use.

I also seem to get this error at the end of OCRing, with or without those environment variables. I doubt it's related, but...
Detected 32 diacritics
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../../../ccutil/unicharset.h, line 513

stweil · 2018-02-05T17:05:42Z

For further investigation more information is needed. Could you provide your test image somewhere? Which traineddata do you use? What does the command line look like?

leemorton · 2018-02-06T08:59:01Z

Here is the command:
tesseract -l eng -c include_page_breaks=1 --psm 1 --oem 3 "in.multipage" "out" hocr tsv

Unfortunately I can't provide the exact images from this example, as they contain personal details. They are 12 PNGs, all 2480x3508 pixels at 72 ppi with 8 bit depth, generated by ImageMagick from a PDF. The performance issue occurs with all PNGs, however (converted from any other format), not just this document.

I am using the original tessdata provided with tesseract-ocr-setup-4.0.0-alpha.20180109.exe
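To compare the two platforms on equal terms, the command above could be timed the same way on both. This is a rough sketch, assuming `tesseract` is on the PATH and the input file is in the working directory; `time_command` is a hypothetical helper, not a Tesseract tool.

```python
import subprocess
import time

# The command as posted above; "in.multipage" and "out" are the poster's file names.
CMD = ["tesseract", "-l", "eng", "-c", "include_page_breaks=1",
       "--psm", "1", "--oem", "3", "in.multipage", "out", "hocr", "tsv"]


def time_command(cmd, runs=3):
    """Run cmd several times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # raises if the command fails
        timings.append(time.perf_counter() - start)
    return timings
```

Calling `time_command(CMD)` on each machine and comparing the lists gives wall-clock figures that include process start-up and image loading, which is what the timings in this thread describe.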

stweil · 2018-02-06T10:52:57Z

Thanks for that information. "Original tessdata" means that you are using eng.traineddata from https://github.com/tesseract-ocr/tessdata/. That model supports two different OCR engines (old and LSTM), and with --oem 3 you implicitly selected the LSTM engine. The tesseract-ocr package which is part of Debian Stretch would use the old engine (which is much faster).

Meanwhile there exist better models for Tesseract 4: get eng.traineddata from https://github.com/tesseract-ocr/tessdata_best for best results or from https://github.com/tesseract-ocr/tessdata_fast for fast OCR with good results. Those new models only support LSTM, but not the old OCR engine.
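To try one of these models without touching the installed tessdata, the downloaded eng.traineddata can be placed in a separate directory and selected with the `--tessdata-dir` option. A small sketch follows; the helper name and the directory in the usage note are assumptions for illustration.

```python
def tesseract_cmd(image, outputbase, tessdata_dir, lang="eng"):
    """Build a tesseract command line that loads models from tessdata_dir."""
    return ["tesseract", image, outputbase,
            "--tessdata-dir", tessdata_dir, "-l", lang]
```

For instance, `tesseract_cmd("in.multipage", "out", "tessdata_fast")` yields a command that reads `tessdata_fast/eng.traineddata`, assuming that is where the file was saved. The resulting list can be passed to `subprocess.run`.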

egorpugin · 2018-02-06T15:24:05Z

BTW, it is worth comparing with MSVC builds.
I'm a bit sceptical about the MinGW-w64 builds from UB Mannheim, and about MinGW-w64 in general. It could add another layer of wrappers around WinAPI via Linux pseudo-syscalls.

stweil · 2018-02-06T15:31:44Z

MSVC has a good reputation regarding code quality and might have a better implementation of OpenMP than gcc for Windows. As I said before, MinGW-w64 (and therefore also the UB Mannheim executables) uses gcc, so that's the same binary code for central parts (like dot product) as the Linux code. Therefore there should be only a small difference for single threaded Tesseract.

Shreeshrii · 2018-03-28T16:08:09Z

@zdenop Please label

Performance

Shreeshrii · 2018-03-29T18:57:39Z

In order to add jp2 lib, I just built both leptonica and tesseract using cmake with default options.

I find that OCR with this build is much, much slower than with the version I had built with autotools/make.

This may be because, with autotools, I had disabled openmp, opencl and graphics when running configure.

How can I disable these three when building with cmake?

Shreeshrii · 2018-03-30T05:50:24Z

@egorpugin How can I disable openmp, opencl and graphics when building tesseract with cmake for running on Linux? Since I built leptonica with cmake, I have to use the same for tesseract (otherwise there are library name issues).

egorpugin · 2018-03-30T11:19:21Z

The best way for now is to remove those options from CMakeLists.txt.
But I'm not very sure about Linux cmake builds.
They are largely untested, sorry.

zdenop · 2018-10-01T15:34:29Z

I don't think removing the options is a good idea.

OpenCL should be activated on user request, i.e. it should be off by default.
AFAIR OpenMP can be disabled through environment settings (if somebody has a problem with it), and some users can benefit from it. If we turn it OFF by default, nobody can benefit from it.
Disabling graphics by default: who would benefit from it? Anyone who wanted to use it would need to recompile the tesseract library...

Please keep in mind end users who don't want to or can't compile tesseract themselves. I think that if an option has no huge side effect or can easily be turned off, it should be compiled in.

stweil · 2018-10-21T18:24:11Z

@leemorton, could you please repeat your performance test with the latest 64 bit installer? I assume that you used 32 bit Tesseract on Windows and 64 bit Tesseract on Linux, so that might explain some performance differences.

stweil · 2018-10-21T18:26:11Z

contains_unichar_id(unichar_id):Error:Assert failed:in file ../../../../ccutil/unicharset.h, line 513

That bug was recently fixed.

stweil · 2019-06-22T18:15:10Z

I am closing this issue because there was no recent activity and recent code does not show large performance differences between Linux and Windows.

amitdo mentioned this issue Feb 8, 2018

some images translated to text using Tesseract 4 throw an error regarding "contains_unichar_id" #1205

Closed

zdenop added the feature request and build process labels Sep 29, 2018

stweil mentioned this issue Mar 10, 2019

Issue 13590: tesseract-ocr/fuzzer-api: Heap-buffer-overflow in GenericVector<int>::size #2298

Closed

stweil closed this as completed Jun 22, 2019

amitdo added the performance label May 14, 2020
