Windows Build 577% Slower than Linux Build #1307

Closed
leemorton opened this issue Feb 5, 2018 · 15 comments

Comments

@leemorton

leemorton commented Feb 5, 2018

Environment

  • Tesseract Version: 4.00.00alpha + UB-Mannheim/4.00.00alpha
  • Platform: Debian Stretch x64 + Windows 10 Professional x64

Current Behavior:

Identical machine specs with an identical workload and Tesseract configuration result in consistently 577% slower performance on Windows 10 x64 compared with Debian Stretch x64. Essentially the job takes 18 seconds on average with the Linux build, and 1 minute 44 seconds with the Windows build. This has been tested on other machines and on fresh installations.

Expected Behavior:

Significantly less than 577% difference in performance.

What could be causing the win build to experience that level of overhead...?
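
As a quick check on the figure in the title, the two timings quoted above can be compared directly. This is only a sketch of the arithmetic, using the 18 s and 1 min 44 s averages from the report; the variable names are made up for illustration.

```python
# Timings taken from the report above (averages, in seconds).
linux_seconds = 18
windows_seconds = 1 * 60 + 44  # 1 minute 44 seconds

# Ratio of run times: how many times as long the Windows run takes.
ratio = windows_seconds / linux_seconds

# "X% of the Linux time" and "X% slower" differ by 100 percentage points.
percent_of_linux = ratio * 100
percent_slower = (ratio - 1) * 100

print(f"Windows takes {ratio:.2f}x as long as Linux")
print(f"{percent_of_linux:.0f}% of the Linux time, {percent_slower:.0f}% slower")
```

The ratio is about 5.8, so the 577% in the title corresponds to the Windows time expressed as a percentage of the Linux time; read strictly as "slower", the gap is closer to 478%.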

@stweil
Contributor

stweil commented Feb 5, 2018

Could you please repeat your test with environment variable OMP_THREAD_LIMIT=1 (see #1081) and report the results?

I expect the difference will be much smaller then. Multithreading on Windows does not perform very well. For a single-threaded Tesseract there should be nearly no difference because the code was generated by the same kind of compiler (gcc) in both cases.

Which version is UB-Mannheim/4.00.00alpha? The latest is tesseract-ocr-setup-4.0.0-alpha.20180109.exe, did you use that one?
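
The test suggested above can be scripted so that both runs use exactly the same command. The following is a minimal sketch, assuming `tesseract` is on the PATH; the helper names (`env_with_thread_limit`, `time_command`, `compare_thread_limits`) are hypothetical and not part of Tesseract.

```python
import os
import subprocess
import time


def env_with_thread_limit(limit):
    """Return a copy of the current environment with OMP_THREAD_LIMIT set."""
    env = dict(os.environ)
    env["OMP_THREAD_LIMIT"] = str(limit)
    return env


def time_command(cmd, env=None):
    """Run cmd to completion and return the wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start


def compare_thread_limits(cmd):
    """Time cmd once with the default environment and once single-threaded."""
    return {
        "default": time_command(cmd),
        "OMP_THREAD_LIMIT=1": time_command(cmd, env_with_thread_limit(1)),
    }
```

For example, `compare_thread_limits(["tesseract", "in.png", "out", "-l", "eng"])` would time the same job twice (the file names here are placeholders). A single run of each is noisy, so repeating the comparison a few times is advisable before drawing conclusions.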

@leemorton
Author

Gave it a try with OMP_THREAD_LIMIT=1, then also added OMP_NUM_THREADS=1.
The best it came up with was 2 minutes 48 seconds. I took the environment variables away again and got 1 minute 35 seconds. With or without those settings, Task Manager shows all the logical processors spiking; however, the spikes are much more erratic with OMP_THREAD_LIMIT=1, and the load is more consistently high, with no dips, without it.

tesseract-ocr-setup-4.0.0-alpha.20180109.exe is the version in use.

I also seem to get this error at the end of OCRing, with or without those environment variables. I doubt it's related, but...
Detected 32 diacritics
contains_unichar_id(unichar_id):Error:Assert failed:in file ../../../../ccutil/unicharset.h, line 513

@stweil
Contributor

stweil commented Feb 5, 2018

For further investigation more information is needed. Could you provide your test image somewhere? Which traineddata do you use? What does the command line look like?

@leemorton
Author

Here is the command:
tesseract -l eng -c include_page_breaks=1 --psm 1 --oem 3 "in.multipage" "out" hocr tsv

Unfortunately I can't provide the exact images from this example as they contain personal details. But they are 12 PNGs, all 2480x3508 pixels at 72 ppi with 8-bit depth. They were generated by ImageMagick from a PDF. However, the performance issue occurs with all PNGs (converted from any other format), not just this document.

I am using the original tessdata provided with tesseract-ocr-setup-4.0.0-alpha.20180109.exe

@stweil
Contributor

stweil commented Feb 6, 2018

Thanks for that information. "Original tessdata" means that you are using eng.traineddata from https://github.com/tesseract-ocr/tessdata/. That model supports two different OCR engines (old and LSTM), and with --oem 3 you implicitly selected the LSTM engine. The tesseract-ocr package which is part of Debian Stretch would use the old engine (which is much faster).

Meanwhile there exist better models for Tesseract 4: get eng.traineddata from https://github.com/tesseract-ocr/tessdata_best for best results or from https://github.com/tesseract-ocr/tessdata_fast for fast OCR with good results. Those new models only support LSTM, but not the old OCR engine.
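
To compare like with like, the engine and model directory can be pinned explicitly instead of relying on `--oem 3`. The sketch below only builds argument lists and runs nothing; `tesseract_command` is a hypothetical helper, the directory names assume local copies of the three model repositories, and the file names are placeholders. It relies on the `--oem` numbering in which 0 selects the legacy engine and 1 the LSTM engine.

```python
def tesseract_command(image, output_base, tessdata_dir=None, oem=None, lang="eng"):
    """Build a tesseract argument list; nothing is executed here."""
    cmd = ["tesseract", image, output_base, "-l", lang]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    return cmd


# Assumed local directories holding eng.traineddata from each repository.
legacy = tesseract_command("in.png", "out_legacy", tessdata_dir="tessdata", oem=0)
fast = tesseract_command("in.png", "out_fast", tessdata_dir="tessdata_fast", oem=1)
best = tesseract_command("in.png", "out_best", tessdata_dir="tessdata_best", oem=1)
```

Timing the `legacy` command on Windows against the Debian package would compare the same engine on both systems, while `fast` and `best` show how much of the run time is due to the choice of LSTM model.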

@egorpugin
Contributor

egorpugin commented Feb 6, 2018

BTW, it is worth comparing with MSVC builds.
I'm a bit sceptical about the MinGW-w64 builds from UB Mannheim, and about MinGW-w64 in general. It could add another layer of wrappers around WinAPI via Linux pseudo-syscalls.

@stweil
Contributor

stweil commented Feb 6, 2018

MSVC has a good reputation regarding code quality and might have a better implementation of OpenMP than gcc for Windows. As I said before, MinGW-w64 (and therefore also the UB Mannheim executables) uses gcc, so that's the same binary code for central parts (like dot product) as the Linux code. Therefore there should be only a small difference for single threaded Tesseract.

@Shreeshrii
Collaborator

@zdenop Please label

Performance

@Shreeshrii
Collaborator

In order to add jp2 lib, I just built both leptonica and tesseract using cmake with default options.

I find that OCR with this build is much, much slower than with the version I had built with autotools/make.

This may have to do with the fact that with autotools, while running configure I had disabled openmp, opencl and graphics.

How can I disable these three when building with cmake?

@Shreeshrii
Collaborator

@egorpugin How can I disable openmp, opencl and graphics while building tesseract with cmake for running on Linux? Since I built leptonica with cmake, I have to use the same for tesseract (otherwise there are library name issues).

@egorpugin
Contributor

The best way for now is to remove those options from CMakeLists.txt.
But I'm not very sure about Linux cmake builds.
They are largely untested, sorry.

@zdenop
Contributor

zdenop commented Oct 1, 2018

I don't think removing options is a good idea.

  • OpenCL should be activated by user request, i.e. it should be off by default.
  • AFAIR OpenMP can be disabled through environment settings (if somebody has a problem with it), and some users can benefit from it. If we turn it off by default, nobody benefits from it.
  • Disabling graphics by default: who will benefit from it? Anybody who would like to use it would need to recompile the Tesseract library...

Please keep in mind end users who don't want to or can't compile Tesseract themselves. I think that if an option has no huge side effects or can easily be turned off, it should be compiled in.

@stweil
Contributor

stweil commented Oct 21, 2018

@leemorton, could you please repeat your performance test with the latest 64-bit installer? I assume that you used 32-bit Tesseract on Windows and 64-bit Tesseract on Linux, so that might explain some performance differences.
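
Whether an installed Windows executable is a 32-bit or 64-bit build can be checked from its PE header rather than guessed. The following is a minimal sketch: `pe_machine` and `describe_executable` are hypothetical helper names, only the two most common machine codes are mapped, and it assumes the PE header lies within the first 4096 bytes of the file.

```python
import struct

# Machine codes from the PE/COFF file header (only the two common ones).
MACHINE_NAMES = {0x014C: "32-bit (x86)", 0x8664: "64-bit (x64)"}


def pe_machine(data):
    """Return the Machine field of a PE image given its leading bytes."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ/PE executable")
    # Offset 0x3C of the DOS header holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # The 2-byte Machine field follows the 4-byte signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return machine


def describe_executable(path):
    """Read the start of the file and describe its target architecture."""
    with open(path, "rb") as f:
        data = f.read(4096)
    machine = pe_machine(data)
    return MACHINE_NAMES.get(machine, f"unknown machine 0x{machine:04x}")
```

Calling `describe_executable` on the installed `tesseract.exe` (its location depends on the installer and the options chosen) reports which build is actually in use.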

@stweil
Contributor

stweil commented Oct 21, 2018

contains_unichar_id(unichar_id):Error:Assert failed:in file ../../../../ccutil/unicharset.h, line 513

That bug was recently fixed.

@stweil
Contributor

stweil commented Jun 22, 2019

I am closing this issue as there has been no recent activity and recent code does not show large performance differences between Linux and Windows.
