Workflow server #652

Open · wants to merge 30 commits into master

Conversation

@bertsky (Collaborator) commented Dec 1, 2020

Implementation of the workflow server.

ocrd --help
Commands:
  bashlib    Work with bash library
  log        Logging
  ocrd-tool  Work with ocrd-tool.json JSON_FILE
  validate   All the validation in one CLI
  process    Run processor CLIs in a series of tasks
  workflow   Process a series of tasks
  workspace  Working with workspace
  zip        Bag/Spill/Validate OCRD-ZIP bags
ocrd workflow --help
Usage: ocrd workflow [OPTIONS] COMMAND [ARGS]...

  Process a series of tasks

Options:
  --help  Show this message and exit.

Commands:
  client   Have the workflow server run commands
  process  Run processor CLIs in a series of tasks (alias for ``ocrd process``)
  server   Start server for a series of tasks to run processor CLIs or APIs...
ocrd workflow server --help
Usage: ocrd workflow server [OPTIONS] TASKS...

  Start server for a series of tasks to run processor CLIs or APIs on
  workspaces

  Parse the given tasks and try to instantiate all Pythonic processors among
  them with the given parameters. Open a web server that listens on the
  given host and port for GET requests named ``process`` with the following
  (URL-encoded) arguments:

      mets (string): Path name (relative to the server's CWD,
      or absolute) of the workspace to process

      page_id (string): Comma-separated list of page IDs to process

      overwrite (bool): Remove output pages/images if they already exist

  The server will handle each request by running the tasks on the given
  workspace. Pythonic processors will be run via API (on those same
  instances).  Non-Pythonic processors (or those not directly accessible in
  the current venv) will be run via CLI normally, instantiating each time.
  Also, between each contiguous chain of Pythonic tasks in the overall
  series, no METS de/serialization will be performed.

  Stop the server by sending SIGINT (e.g. via ctrl+c on the terminal), or
  sending a GET request named ``shutdown``.

Options:
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  -h, --host TEXT                 host name/IP to listen at
  -p, --port INTEGER RANGE        TCP port to listen at
  --help                          Show this message and exit.
ocrd workflow client --help
Usage: ocrd workflow client [OPTIONS] COMMAND [ARGS]...

  Have the workflow server run commands

Options:
  -h, --host TEXT           host name/IP to listen at
  -p, --port INTEGER RANGE  TCP port to listen at
  --help                    Show this message and exit.

Commands:
  list-tasks  Have the workflow server print the configured task sequence
  process     Have the workflow server process another workspace
  shutdown    Have the workflow server shutdown gracefully
ocrd workflow client process --help
Usage: ocrd workflow client process [OPTIONS]

  Have the workflow server process another workspace

Options:
  -m, --mets TEXT     METS to process
  -g, --page-id TEXT  ID(s) of the pages to process
  --overwrite         Remove output pages/images if they already exist
  --help              Show this message and exit.

Example:

ocrd workflow server -p 5000 \
    'olena-binarize -P impl sauvola-ms-split -I OCR-D-IMG -O OCR-D-IMG-BINPAGE-sauvola' \
    'anybaseocr-crop -I OCR-D-IMG-BINPAGE-sauvola -O OCR-D-IMG-BINPAGE-sauvola-CROP' \
    'cis-ocropy-denoise -P noise_maxsize 3.0 -P level-of-operation page -I OCR-D-IMG-BINPAGE-sauvola-CROP -O OCR-D-IMG-BINPAGE-sauvola-CROP-DEN' \
    'tesserocr-deskew -P operation_level page -I OCR-D-IMG-BINPAGE-sauvola-CROP-DEN -O OCR-D-IMG-BINPAGE-sauvola-CROP-DEN-DESK-tesseract' \
    'cis-ocropy-deskew -P level-of-operation page -I OCR-D-IMG-BINPAGE-sauvola-CROP-DEN-DESK-tesseract -O OCR-D-IMG-BINPAGE-sauvola-CROP-DEN-DESK-tesseract-DESK-ocropy' \
    'tesserocr-recognize -P segmentation_level region -P model deu -I OCR-D-IMG-BINPAGE-sauvola-CROP-DEN-DESK-tesseract-DESK-ocropy -O OCR-D-IMG-BINPAGE-sauvola-CROP-DEN-DESK-tesseract-DESK-ocropy-OCR-tesseract-deu'

curl -G -d mets=../kant_aufklaerung_1784/data/mets.xml -d overwrite=True -d page_id=PHYS_0017,PHYS_0020 http://127.0.0.1:5000/process
curl -G -d mets=../blumenbach_anatomie_1805/mets.xml http://127.0.0.1:5000/process
curl -G http://127.0.0.1:5000/shutdown
# equivalently:
ocrd workflow client -p 5000 process -m ../kant_aufklaerung_1784/data/mets.xml --overwrite -g PHYS_0017,PHYS_0020
ocrd workflow client process -m ../blumenbach_anatomie_1805/mets.xml
ocrd workflow client shutdown

Please note that (as already explained here) this only increases efficiency for Python processors in the current venv, so bashlib processors or sub-venv processors will still be run via CLI. You might therefore want to group them differently in ocrd_all, or cascade workflows across sub-venvs.

(EDITED to reflect ocrd workflow client addition)

- add workflow CLI group:
  - add alias `ocrd workflow process` to `ocrd process`
  - add new `ocrd workflow server`, running a web server
    for the given workflow that tries to instantiate
    all Pythonic processors once (to re-use their API
    instead of starting CLI each time)
- add `run_api` analogue to existing `run_cli` and let
  `run_processor` delegate to it in `ocrd.processor.helpers`:
  - `run_processor` only has workspace de/serialization and
    processor instantiation
  - `run_api` has core `process()`, but now also enters and
    leaves the workspace directory, and passes any exceptions
- ocrd.task_sequence: differentiate between `parse_tasks`
  (independent of workspace or fileGrps) and `run_tasks`,
  generalize `run_tasks` to use either `run_cli` or new
  `run_api` (where instances are available, avoiding
  unnecessary METS de/serialisation)
- amend `TaskSequence` by `instance` attribute
  and `instantiate` method:
  - peek into a CLI to check for Pythonic processors
  - try to compile and exec, using monkey-patching
    to disable normal argument passing, execution, and
    exiting; merely importing and fetching the class
    of the processor
  - instantiate processor without workspace or fileGrps
  - avoid unnecessary CLI call to get ocrd-tool.json
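
For illustration, here is a minimal sketch of the run_api/run_cli dispatch described in the list above. All names and signatures are simplified assumptions, not the PR's actual ocrd.task_sequence or ocrd.processor.helpers code:

import subprocess

def run_tasks_sketch(tasks, workspace, page_id=None):
    for task in tasks:
        if getattr(task, 'instance', None) is not None:
            # Pythonic processor, instantiated once at server startup:
            # call its API directly and skip METS de/serialization
            # within a contiguous chain of such tasks
            run_api_sketch(task.instance, workspace, page_id)
        else:
            # bashlib or out-of-venv processor: spawn the CLI as usual,
            # which re-reads and re-writes mets.xml each time
            run_cli_sketch(task, workspace, page_id)

def run_api_sketch(processor, workspace, page_id=None):
    # re-bind the preloaded instance to the current workspace and run it
    processor.workspace = workspace
    processor.page_id = page_id
    processor.process()

def run_cli_sketch(task, workspace, page_id=None):
    # parameters omitted for brevity; mets_path is an assumed attribute
    cmd = [task.executable,
           '-m', workspace.mets_path,
           '-I', ','.join(task.input_file_grps),
           '-O', ','.join(task.output_file_grps)]
    if page_id:
        cmd += ['-g', page_id]
    subprocess.run(cmd, check=True)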
@bertsky (Collaborator, Author) commented Dec 1, 2020

And here's a recipe for doing workspace parallelization with the new workflow server and GNU parallel:

N=4
for ((i=0;i<N;i++)); do
    ocrd workflow server -p $((5000+i)) \
        "cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN" \
        "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
        "skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li" \
        "skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
        "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
        "tesserocr-segment-region -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG" \
        "segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -P plausibilize true" \
        "tesserocr-deskew -I OCR-D-SEG-REPAIR -O OCR-D-SEG-REG-DESKEW" \
        "cis-ocropy-clip -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-REG-DESKEW-CLIP" \
        "tesserocr-segment-line -I OCR-D-SEG-REG-DESKEW-CLIP -O OCR-D-SEG-LINE" \
        "cis-ocropy-clip -I OCR-D-SEG-LINE -O OCR-D-SEG-LINE-CLIP -P level-of-operation line" \
        "cis-ocropy-dewarp -I OCR-D-SEG-LINE-CLIP -O OCR-D-SEG-LINE-RESEG-DEWARP" \
        "tesserocr-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P textequiv_level glyph -P overwrite_words true -P model GT4HistOCR_50000000}"
done
find . -name mets.xml | parallel -j$N curl -G -d mets={} http://127.0.0.1:\$((5000+{%}))/process
# equivalently:
find . -name mets.xml | parallel -j$N ocrd workflow client -p \$((5000+{%})) process -m {}

(EDITED to reflect ocrd workflow client addition)

@codecov-io commented Dec 1, 2020

Codecov Report

Merging #652 (6d15084) into master (135acb6) will decrease coverage by 7.76%.
The diff coverage is 35.29%.


@@            Coverage Diff             @@
##           master     #652      +/-   ##
==========================================
- Coverage   84.34%   76.58%   -7.77%     
==========================================
  Files          52       56       +4     
  Lines        3047     3587     +540     
  Branches      608      723     +115     
==========================================
+ Hits         2570     2747     +177     
- Misses        349      694     +345     
- Partials      128      146      +18     
Impacted Files Coverage Δ
ocrd/ocrd/cli/workflow.py 27.27% <27.27%> (ø)
ocrd/ocrd/processor/helpers.py 70.40% <56.00%> (-10.37%) ⬇️
ocrd/ocrd/cli/__init__.py 100.00% <100.00%> (ø)
ocrd/ocrd/cli/process.py 93.33% <100.00%> (+0.47%) ⬆️
ocrd/ocrd/processor/base.py 64.92% <100.00%> (-15.49%) ⬇️
ocrd_utils/ocrd_utils/os.py 70.88% <0.00%> (-24.86%) ⬇️
ocrd_utils/ocrd_utils/constants.py 90.47% <0.00%> (-9.53%) ⬇️
ocrd/ocrd/cli/workspace.py 74.74% <0.00%> (-1.91%) ⬇️
ocrd/ocrd/__init__.py 100.00% <0.00%> (ø)
... and 15 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@bertsky requested a review from kba on December 1, 2020 07:35
@bertsky (Collaborator, Author) commented Dec 1, 2020

> Please note that (as already explained here) this only increases efficiency for Python processors in the current venv, so bashlib processors or sub-venv processors will still be run via CLI. You might therefore want to group them differently in ocrd_all, or cascade workflows across sub-venvs.

So here are examples for both options:

  • grouping sub-venvs in ocrd_all to include ocrd_calamari with all other needed modules in one venv

      cd ocrd_all
      . venv/local/sub-venv/headless-tf1/bin/activate
      make MAKELEVEL=1 OCRD_MODULES="core ocrd_cis ocrd_anybaseocr ocrd_wrap ocrd_tesserocr tesserocr tesseract ocrd_calamari" all
      ocrd workflow server \
          "cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN" \
          "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
          "skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li" \
          "skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
          "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
          "cis-ocropy-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG -P level-of-operation page" \
          "tesserocr-deskew -I OCR-D-SEG-REG -O OCR-D-SEG-REG-DESKEW" \
          "cis-ocropy-dewarp -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-LINE-RESEG-DEWARP" \
          "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint /path/to/models/\*.ckpt.json"
    
  • cascade workflows across sub-venvs (combined with workspace parallelization):

      N=4
      for ((i=0;i<N;i++)); do
          . venv/bin/activate
          ocrd workflow server -p $((5000+i)) \
              "cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN"
          . venv/local/sub-venv/headless-tf1/bin/activate
          ocrd workflow server -p $((6000+i)) \
              "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP"
          . venv/bin/activate
          ocrd workflow server -p $((7000+i)) \
              "skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li" \
              "skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
              "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
              "cis-ocropy-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG-REG -P level-of-operation page" \
              "tesserocr-deskew -I OCR-D-SEG-REG -O OCR-D-SEG-REG-DESKEW" \
              "cis-ocropy-dewarp -I OCR-D-SEG-REG-DESKEW -O OCR-D-SEG-LINE-RESEG-DEWARP"
          . venv/local/sub-venv/headless-tf1/bin/activate
          ocrd workflow server -p $((8000+i)) \
              "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint /path/to/models/\*.ckpt.json"
          . venv/bin/activate
      done
      find . -name mets.xml | parallel -j$N \
          ocrd workflow client -p \$((5000+{%})) process -m {} "&&" \
          ocrd workflow client -p \$((6000+{%})) process -m {} "&&" \
          ocrd workflow client -p \$((7000+{%})) process -m {} "&&" \
          ocrd workflow client -p \$((8000+{%})) process -m {}
    

@bertsky (Collaborator, Author) commented Jan 25, 2021

b4a8bcb: Gosh, I managed to forget that one essential, distinctive change: triggering the actual instantiation!

(Must have slipped through somewhere on the way from the proof-of-concept standalone to this integrated formulation.)

@bertsky (Collaborator, Author) commented Jan 26, 2021

I wonder if Flask (especially its built-in server component, which is branded as a development server and has a debug option) is efficient enough. IIUC we actually need multi-threading in the server if we also want to spawn multiple worker processes (i.e. workspace parallelism). So maybe we should use a gevent or uwsgi server instead. (Of course, memory or GPU resources would need to be allocated to them carefully...)

But a multi-threaded server would require explicit session management in GPU-enabled processors (so startup and processing are in the same context). The latter is necessary anyway, though: Otherwise, when a workflow has multiple Tensorflow processors, they would steal each other's graphs.

Another thing that still bothers me is failover capacity. The workflow server should restart if any of its instances crash.

@bertsky (Collaborator, Author) commented Feb 11, 2021

Another performance gain might come from transferring the workspace to fast storage, such as a RAM disk, during processing: whenever a /process request arrives, clone the workspace to /tmp, run there, and then write back (overwriting). When Dockerized (simply defining a service via make server PORT=N and EXPOSE N), you would then probably run with --mount type=tmpfs,destination=/tmp.

On the other hand, that machinery could just as well be handled outside the workflow server, so the workspace would already be on fast storage. But for the Docker option that would be more complicated (data shared with the outside would still live on slow storage, so a second service would need to take care of moving workspaces back and forth).
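
For illustration, a minimal sketch of the clone-to-fast-storage idea in Python, assuming the workspace is a plain directory next to its mets.xml, /tmp is tmpfs-backed, and run_tasks stands for whatever runs the configured tasks (hypothetical names throughout):

import shutil
import tempfile
from pathlib import Path

def process_on_fast_storage(workspace_dir, run_tasks):
    src = Path(workspace_dir).resolve()
    with tempfile.TemporaryDirectory(dir='/tmp') as tmp:
        fast = Path(tmp) / src.name
        shutil.copytree(src, fast)                       # clone workspace to RAM disk
        run_tasks(fast)                                  # process there
        # write back, overwriting (dirs_exist_ok requires Python 3.8+)
        shutil.copytree(fast, src, dirs_exist_ok=True)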

@bertsky mentioned this pull request Mar 4, 2021
@codecov-commenter commented May 13, 2021

Codecov Report

Merging #652 (cac80d6) into master (db79ff6) will decrease coverage by 3.35%.
The diff coverage is 25.68%.

❗ Current head cac80d6 differs from pull request most recent head d98daa8. Consider uploading reports for the commit d98daa8 to get more accurate results

@@            Coverage Diff             @@
##           master     #652      +/-   ##
==========================================
- Coverage   79.95%   76.60%   -3.36%     
==========================================
  Files          56       58       +2     
  Lines        3488     3697     +209     
  Branches      706      734      +28     
==========================================
+ Hits         2789     2832      +43     
- Misses        565      723     +158     
- Partials      134      142       +8     
Impacted Files Coverage Δ
ocrd/ocrd/cli/server.py 0.00% <0.00%> (ø)
ocrd/ocrd/cli/workflow.py 39.36% <39.36%> (ø)
ocrd/ocrd/processor/helpers.py 70.40% <48.14%> (-10.37%) ⬇️
ocrd/ocrd/cli/__init__.py 100.00% <100.00%> (ø)
ocrd/ocrd/cli/process.py 93.33% <100.00%> (+0.47%) ⬆️
ocrd/ocrd/processor/base.py 67.16% <100.00%> (+0.24%) ⬆️
ocrd/ocrd/resolver.py 94.50% <0.00%> (-2.20%) ⬇️
ocrd/ocrd/workspace.py 71.05% <0.00%> (-0.52%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@bertsky (Collaborator, Author) commented Jun 9, 2021

@kba, note ccb369a is necessary for core in general since resmgr wants to be able to identify the startup CWD from the workspace. The other two recent commits add proper handling and passing of errors over the network.

> Another thing that still bothers me is failover capacity. The workflow server should restart if any of its instances crash.

That's already covered under normal circumstances: run_api catches everything, logs it, and returns the exception. run_tasks gets that, and re-raises. cli.workflow.process catches everything, logs it, and sends the appropriate response. So the processor instances are still alive. Actual crashes like segfaults or pipe signals would drag down the server as well. So failover would be necessary above that anyway – for example in Docker.

> But a multi-threaded server would require explicit session management in GPU-enabled processors (so startup and processing are in the same context). The latter is necessary anyway, though: Otherwise, when a workflow has multiple Tensorflow processors, they would steal each other's graphs.

I just confirmed this by testing: If I start a workflow with multiple TF processors, the last one will steal the others' session. You have to explicitly create sessions, store them into the instance, and reset prior to processing. (This means changes to all existing TF processors we have. See here and here for examples.)
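
For illustration, a minimal sketch (assuming TF 1.x; not the actual ocrd_tesserocr or ocrd_calamari code) of what such explicit session management could look like in a processor:

import tensorflow as tf

class TfProcessorSketch:  # hypothetical, standing in for an OCR-D Processor subclass
    def __init__(self):
        config = tf.ConfigProto()
        config.gpu_options.allow_growth = True   # do not grab all GPU RAM up front
        self.graph = tf.Graph()
        self.session = tf.Session(graph=self.graph, config=config)
        with self.graph.as_default(), self.session.as_default():
            # model loading would happen here; a constant stands in for it
            self.output = tf.constant('model loaded')

    def process(self):
        # reset to this instance's own graph/session before every run, so a
        # later-instantiated TF processor in the same server cannot steal it
        with self.graph.as_default(), self.session.as_default():
            return self.session.run(self.output)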

> I wonder if Flask (especially its built-in server component, which is branded as a development server and has a debug option) is efficient enough. IIUC we actually need multi-threading in the server if we also want to spawn multiple worker processes (i.e. workspace parallelism). So maybe we should use a gevent or uwsgi server instead. (Of course, memory or GPU resources would need to be allocated to them carefully...)

Since Python's GIL prevents actual thread-level parallelism on shared resources (like processor instances), we'd have to do multi-processing anyway. I think I'll incorporate uwsgi (which does preforking) to achieve workspace parallelism. The server will have to do parse_tasks and instantiate in a dedicated function with Flask's decorator @app.before_first_request IIUC.
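
For illustration, a minimal sketch of that pattern (hypothetical module and names; the PR's actual code will differ): a Flask app module that uWSGI can prefork, with processor instantiation deferred to each worker's first request.

from flask import Flask, request, jsonify

app = Flask(__name__)
TASKS = []  # would be filled from the task strings passed through via CLI options

@app.before_first_request
def instantiate_processors():
    # runs once per uWSGI worker, i.e. after forking, so every worker
    # gets its own processor instances (and its own GPU context)
    for task in TASKS:
        task.instantiate()  # hypothetical: instantiate the Pythonic processor

@app.route('/process')
def process():
    mets = request.args.get('mets')
    page_id = request.args.get('page_id')
    # ... run the configured tasks on the given workspace here ...
    return jsonify(ok=True, mets=mets, page_id=page_id)

Such a module could then be started with something like uwsgi --http :5000 --master --processes 4 --wsgi-file workflow_app.py --callable app (file name assumed).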

- replace Flask dev server with external uwsgi call
- factor out Flask app code into separate Python module
  which uWSGI can pick up
- make uWSGI run given number of workers via multi-processing
  but not multi-threading, and prefork before loading app
  (to protect GPU and non-thread-safe processors, and because of GIL)
- pass tasks and other settings via CLI options (wrapped in JSON)
- set worker Harakiri (reload after timeout) based on number of
  pages multiplied by given page timeout
- add option for number of processes and page timeout
@bertsky (Collaborator, Author) commented Jun 10, 2021

> I think I'll incorporate uwsgi (which does preforking) to achieve workspace parallelism. The server will have to do parse_tasks and instantiate in a dedicated function with Flask's decorator @app.before_first_request IIUC.

Done! (Works as expected, even with GPU-enabled processors sharing memory via growth.)

@bertsky (Collaborator, Author) commented Jun 11, 2021

Updated help text for ocrd workflow server now reads:

Usage: ocrd workflow server [OPTIONS] TASKS...

  Start server for a series of tasks to run processor CLIs or APIs on
  workspaces

  Parse the given tasks and try to instantiate all Pythonic processors among
  them with the given parameters. Open a web server that listens on the
  given ``host`` and ``port`` and queues requests into ``processes`` worker
  processes for GET requests named ``/process`` with the following (URL-
  encoded) arguments:

      mets (string): Path name (relative to the server's CWD,
      or absolute) of the workspace to process

      page_id (string): Comma-separated list of page IDs to process

      log_level (int): Override all logger levels during processing

      overwrite (bool): Remove output pages/images if they already exist

  The server will handle each request by running the tasks on the given
  workspace. Pythonic processors will be run via API (on those same
  instances).  Non-Pythonic processors (or those not directly accessible in
  the current venv) will be run via CLI normally, instantiating each time.
  Also, between each contiguous chain of Pythonic tasks in the overall
  series, no METS de/serialization will be performed.

  If processing does not finish before ``timeout`` seconds per page, then
  the request will fail and the respective worker be reloaded.

  To see the server's workflow configuration, send a GET request named
  ``/list-tasks``.

  Stop the server by sending SIGINT (e.g. via ctrl+c on the terminal), or
  sending a GET request named ``/shutdown``.

Options:
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  -t, --timeout INTEGER           maximum processing time (in sec per page)
                                  before reloading worker (0 to disable)

  -j, --processes INTEGER         number of parallel workers to spawn
  -h, --host TEXT                 host name/IP to listen at
  -p, --port INTEGER RANGE        TCP port to listen at
  --help                          Show this message and exit.

@bertsky (Collaborator, Author) commented Jun 11, 2021

> I just confirmed this by testing: If I start a workflow with multiple TF processors, the last one will steal the others' session. You have to explicitly create sessions, store them into the instance, and reset prior to processing. (This means changes to all existing TF processors we have. See here and here for examples.)

Let me recapitulate the whole issue of sharing GPUs (physically, esp. its RAM) and sharing Tensorflow sessions (logically, when run via API in the same preloading workflow server):

Normal computing resources like CPU, RAM and disk are shared naturally by the OS's task and I/O schedulers, the CPU's MMU and the disk driver's scheduler. Oversubscription can quickly become inefficient, but is easily avoided by harmonizing the number of parallel jobs with the number of physical cores and the available RAM. Still, it can be worth risking transient oversubscription in exchange for higher average resource utilization.

For GPUs, however, it is not that simple: GPU RAM is normally not paged (because that would make it slow), let alone swapped, so it is an exclusive resource whose exhaustion results in OOM errors; and when processes have to wait for shaders, the benefit of using the GPU over the CPU in the first place may vanish. Therefore, a runtime system / framework like OCR-D needs to take extra care to strictly prevent GPU oversubscription, as even transient oversubscription usually leads to OOM failures. However, it is rather hard to anticipate what GPU resources a certain workflow configuration will need (both on average and at peak).

In the OCR-D makefilization, I chose to deal with GPU-enabled workflow steps bluntly by marking them as such and synchronizing all GPU-enabled processor runs via a semaphore (discounting current runners against the number of physical GPUs). But exclusive GPU locks/semaphores stop working when the processors get preloaded into memory, and often multiple processors could actually share a single GPU – it depends on how they are programmed and what size the models (or even the input images!) are.

For Tensorflow, the normal mode of operation is to allocate all available GPURAM on startup and thus use it exclusively throughout the process' lifetime. But there are more options:

  • ConfigProto().gpu_options.allow_growth: allocates dynamically as needed (slower during the first run; "needed" can still mean more than strictly necessary; no de-allocation; can still yield OOM)
  • ConfigProto().gpu_options.per_process_gpu_memory_fraction: allocates no more than the given fraction of physical memory (much slower during the first run and slightly slower when competing for memory; can still yield OOM)
  • ConfigProto().gpu_options.experimental.use_unified_memory: enables paging of memory to CPU (much slower on the first run and slightly slower when competing for memory; no more OOM)
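
For illustration, here is how these options are set in TF 1.x (a minimal sketch; in practice you would pick whichever combination fits your processors and models rather than all three at once):

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True                      # allocate GPU RAM on demand
config.gpu_options.per_process_gpu_memory_fraction = 0.25   # cap at 25% of physical GPU RAM
config.gpu_options.experimental.use_unified_memory = True   # allow paging to CPU RAM
session = tf.Session(config=config)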

So it now depends on how cleverly the processors are programmed, how many of them we want to run (sequentially in the same workflow, or in parallel) and how large the models/images are. I believe OCR-D needs to come up with conventions for advertising the number of GPU consumers and for writing processors that share resources cooperatively (without giving up too much performance). If we have enough runtime configuration facilities, then the user/admin can at least dimension and optimise by experiment.

Of course, additionally having processing servers for the individual processors would also help better control resource allocation: each such server could be statically configured for full utilization and then dynamically distribute the work load from all processors (across pages / workflows / workspaces) in parallel (over shaders/memory) and sequentially (over a queue). Plus preloading and isolation would not fall onto the workflow server's shoulders anymore.

One of the largest advantages of using the workflow server is thus probably not reduced overhead (when you have small workspaces/documents) or latency, but the ability to scale across machines: as a network service, it can easily be deployed across a Docker swarm, for example.

bertsky added 3 commits June 11, 2021 18:33
- add `--server` option to CLI decorator
- implement via new `ocrd.server.ProcessingServer`:
  - based on gunicorn (for preforking directly from
    configured CLI in Python, but instantiating the
    processor after forking to avoid any shared GPU
    context)
  - using multiprocessing.Lock and Manager to lock
    (synchronize) workspaces among workers
  - using signal.alarm for worker timeout mechanics
  - using pre- and post-fork hooks for GPU- vs CPU-
    worker mechanics
  - doing Workspace validation within the request
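
For illustration, a minimal sketch of that preforking pattern with gunicorn's BaseApplication (all names are assumptions, not the actual ocrd.server.ProcessingServer):

import multiprocessing
from flask import Flask, request, jsonify
from gunicorn.app.base import BaseApplication

class ProcessingServerSketch(BaseApplication):
    """Prefork a single processor as a web service (sketch only)."""

    def __init__(self, processor_class, host='127.0.0.1', port=5000, workers=4):
        self.processor_class = processor_class
        self.processor = None
        self.lock = multiprocessing.Lock()   # created before fork, inherited by all workers
        self.options = {
            'bind': '%s:%d' % (host, port),
            'workers': workers,
            # instantiate only after forking, so workers never inherit
            # a GPU context from the master process
            'post_fork': lambda server, worker: self._instantiate(),
        }
        self.app = Flask(__name__)
        self.app.add_url_rule('/process', 'process', self._process)
        super().__init__()

    def _instantiate(self):
        self.processor = self.processor_class()

    def _process(self):
        mets = request.args.get('mets')
        # a single global lock stands in here for the per-workspace locking
        # (multiprocessing.Manager) described in the commit message above
        with self.lock:
            result = self.processor.process()   # placeholder: would run on the given mets
        return jsonify(ok=True, mets=mets, result=str(result))

    def load_config(self):
        for key, value in self.options.items():
            self.cfg.set(key, value)

    def load(self):
        return self.app

ProcessingServerSketch(SomeProcessor).run() would then start the gunicorn master, fork the workers and serve /process (SomeProcessor being a hypothetical processor class).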
@bertsky (Collaborator, Author) commented Jun 16, 2021

In 6263bb1 I have started implementing the processing server pattern. It is completely independent of the workflow server for now, so the latter still needs to be adapted to make proper use of the former. This could help overcome some of the problems with the current workflow server approach:

  • not require the processors to be in the same venv (or even be Pythonic) anymore,
  • not require monkey-patching Python classes anymore,
  • not stick multiple tasks in the same process anymore (risking interference as in the Tensorflow case).

But the potential merits reach even further:

  • run different parts of the workflow with different parallelism (slower tasks could be assigned more cores; GPU tasks could be assigned the exact number of GPU resources available while others get to see more CPU cores)
  • orchestrating workflows incrementally and with signalling (progress meters) and timeouts
  • encapsulating processor servers as Docker services, docker-composing (or even docker-swarming) workflows
    (which means that ocrd/all won't need to be a fat container anymore, and individual processor images can be based on the required CUDA runtime; currently we can only make exactly one CUDA/TF version work at the same time)

Some of the design decisions which will have to be made now:

  • set up processing servers in advance (externally/separately) or managed by workflow server
    • if external: task descriptions should contain the --server parameters
    • if managed: decide which ports to run them on, try setting them up, run a ring-back service and wait for startup notification, time out and fall back to CLI, tear down managed processing-server PIDs/ports on workflow-server termination
  • what to do with non-Pythonic processors
    • integrate a simple but transparent / automatic socat-based server with means of the shell into bashlib
    • ignore them and always run as CLI

@kba mentioned this pull request Jul 23, 2021
@tdoan2010 (Contributor) commented:

Hi @bertsky, after reading through the PR, this is my understanding of it. Please correct me if I'm wrong.

Understanding

Whenever we start a workflow server, we have to define a workflow for it. Since the workflow is defined at instantiation time, one server can run only one workflow. A workflow here is a series of commands that trigger OCR processors with their respective inputs and outputs. So the syntax is:

ocrd workflow server -p <port> <workflow>

Advantage

The benefit of this approach is that all necessary processors are already loaded into memory and ready to be used. So if we want to run one workflow multiple times, this approach saves us the time of repeatedly loading and unloading models. But if we want to run a different workflow, we have to start another server with ocrd workflow server.

Disadvantage

Imagine we have a use case where users have workflow descriptions (in whatever language we can parse) and want to execute them on our infrastructure. This ocrd workflow server is not an appropriate approach, since its advantage lies in the fact that the workflow is fixed (so that it can be loaded once and stay in memory "forever"): one workflow, one server.

@bertsky (Collaborator, Author) commented Mar 2, 2022

@tdoan2010 your understanding of the workflow server is correct. This PR also implements a processing server, but you did not address that part.

To your interpretation:

> Disadvantage
>
> Imagine we have a use case where users have workflow descriptions (in whatever language we can parse) and want to execute them on our infrastructure. This ocrd workflow server is not an appropriate approach, since its advantage lies in the fact that the workflow is fixed (so that it can be loaded once and stay in memory "forever"): one workflow, one server.

The current implementation also parses workflow descriptions (in the syntax of ocrd process). There is no loss of generality here – future syntaxes can of course be integrated as well.

Further, I fail to see why this should be inappropriate for dynamically defined workflows. You can always start up another workflow server at run time if necessary. In fact, such servers could be managed automatically (set up / torn down, swarmed) by a higher-level API service. And you don't lose anything by starting up all processors once and then processing; you only gain speed (from the second workspace onwards).

As argued in the original discussion memo cited at the top, overhead due to initialization is significant in OCR-D, especially with GPU processors, and the only way around this is workflow servers and/or processing servers.

@tdoan2010 mentioned this pull request May 1, 2022