Workflow server #652
base: master
Conversation
- add workflow CLI group:
  - add alias `ocrd workflow process` to `ocrd process`
  - add new `ocrd workflow server`, running a web server for the given workflow that tries to instantiate all Pythonic processors once (to re-use their API instead of starting a CLI each time)
- add `run_api` analogue to existing `run_cli` and let `run_processor` delegate to it in `ocrd.processor.helpers`:
  - `run_processor` only has workspace de/serialization and processor instantiation
  - `run_api` has the core `process()`, but now also enters and leaves the workspace directory, and passes on any exceptions
- `ocrd.task_sequence`: differentiate between `parse_tasks` (independent of workspace or fileGrps) and `run_tasks`; generalize `run_tasks` to use either `run_cli` or the new `run_api` (where instances are available, avoiding unnecessary METS de/serialisation)
- amend `TaskSequence` by an `instance` attribute and an `instantiate` method:
  - peek into a CLI to check for Pythonic processors
  - try to compile and exec it, using monkey-patching to disable normal argument passing, execution, and exiting; merely import and fetch the processor class
  - instantiate the processor without workspace or fileGrps
  - avoid an unnecessary CLI call just to get the ocrd-tool.json
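To make the intended split concrete, here is a much simplified sketch of the `run_processor`/`run_api` relation (illustrative only – the names exist in `ocrd.processor.helpers`, but the signatures and bodies here are condensed, not the actual code):

```python
# Condensed sketch of the run_processor / run_api split (illustrative, not the
# actual ocrd.processor.helpers code).
import os

def run_processor(processor_class, mets_url, resolver, **kwargs):
    # run_processor keeps only workspace de/serialization and instantiation ...
    workspace = resolver.workspace_from_url(mets_url)
    processor = processor_class(workspace, **kwargs)
    run_api(processor, workspace)  # ... and delegates the actual processing
    workspace.save_mets()

def run_api(processor, workspace):
    # run_api wraps the core process() call: it enters and leaves the workspace
    # directory, and passes any exception on to the caller – so a workflow
    # server can call it on a preloaded instance without METS de/serialization.
    old_cwd = os.getcwd()
    os.chdir(workspace.directory)
    try:
        processor.process()
    finally:
        os.chdir(old_cwd)
```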
And here's a recipe for doing workspace parallelization with the new workflow server and GNU parallel:
(EDITED to reflect `ocrd workflow client` addition.)
Codecov Report
@@ Coverage Diff @@
## master #652 +/- ##
==========================================
- Coverage 84.34% 76.58% -7.77%
==========================================
Files 52 56 +4
Lines 3047 3587 +540
Branches 608 723 +115
==========================================
+ Hits 2570 2747 +177
- Misses 349 694 +345
- Partials 128 146 +18
Continue to review full report at Codecov.
So here are examples for both options:
…lementations currently expect them in the constructor)
b4a8bcb: Gosh, I managed to forget that one essential, distinctive change: triggering the actual instantiation! (Must have slipped through somewhere on the way from the proof-of-concept standalone to this integrated formulation.)
I wonder if Flask (especially its built-in server component, which is branded as a development server only) is really up to this. But a multi-threaded server would require explicit session management in GPU-enabled processors (so startup and processing happen in the same context). The latter is necessary anyway, though: otherwise, when a workflow has multiple Tensorflow processors, they would steal each other's graphs. Another thing that still bothers me is failover capacity: the workflow server should restart if any of its instances crash.
Another edge for performance might be transferring the workspace to fast storage during processing, like a RAM disk. So whenever a workspace comes in for processing, it would first be copied there, and the results synced back afterwards. On the other hand, that machinery could just as well be handled outside the workflow server – so the workspace would already be on fast storage. But for the Docker option, that would be more complicated (data shared with the outside would still need to be on slow storage, so a second service would need to take care of moving workspaces to and fro).
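For the record, the copy-in/copy-out machinery itself would be trivial – something along these lines (purely illustrative; the helper name and the `/dev/shm` mount are my own choices, not part of this PR):

```python
# Illustrative only: stage a workspace on fast storage (e.g. a tmpfs like
# /dev/shm), let processing happen there, then sync the results back.
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def workspace_on_fast_storage(workspace_dir, fast_mount='/dev/shm'):
    src = Path(workspace_dir)
    staging = Path(tempfile.mkdtemp(dir=fast_mount))
    fast_copy = staging / src.name
    shutil.copytree(src, fast_copy)  # copy onto the RAM disk
    try:
        yield fast_copy              # process the workspace here
    finally:
        # sync results back to persistent storage and clean up (Python 3.8+)
        shutil.copytree(fast_copy, src, dirs_exist_ok=True)
        shutil.rmtree(staging)
```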
Codecov Report
@@ Coverage Diff @@
## master #652 +/- ##
==========================================
- Coverage 79.95% 76.60% -3.36%
==========================================
Files 56 58 +2
Lines 3488 3697 +209
Branches 706 734 +28
==========================================
+ Hits 2789 2832 +43
- Misses 565 723 +158
- Partials 134 142 +8
Continue to review full report at Codecov.
@kba, note ccb369a is necessary for core in general since resmgr wants to be able to identify the startup CWD from the workspace. The other two recent commits add proper handling and passing of errors over the network.
That's already covered under normal circumstances:
I just confirmed this by testing: If I start a workflow with multiple TF processors, the last one will steal the others' session. You have to explicitly create sessions, store them into the instance, and reset prior to processing. (This means changes to all existing TF processors we have. See here and here for examples.)
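For reference, the pattern looks roughly like this (a sketch using the TF1-style API; `Processor` is ocrd's base class, but the attribute names and the model loader are made up):

```python
# Sketch of explicit per-instance graph/session management in a TF processor
# (TF1-style API; attribute names and the model loader are illustrative).
import tensorflow as tf
from ocrd import Processor

class MyTFProcessor(Processor):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # create a dedicated graph and session at instantiation time and keep
        # them on the instance instead of relying on TF's global defaults
        self.graph = tf.Graph()
        self.session = tf.compat.v1.Session(graph=self.graph)
        with self.graph.as_default(), self.session.as_default():
            self.model = load_model_somehow()  # hypothetical model loader

    def process(self):
        # re-enter this instance's own context before predicting, so a TF
        # processor instantiated later in the same interpreter cannot have
        # hijacked the default graph/session in the meantime
        with self.graph.as_default(), self.session.as_default():
            pass  # run inference with self.model on the input files here
```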
Since Python's GIL prevents actual thread-level parallelism on shared resources (like processor instances), we'd have to do multi-processing anyway. I think I'll incorporate uWSGI (which does preforking) to achieve workspace parallelism. The server will have to do …
- replace Flask dev server with external uwsgi call
- factor out Flask app code into separate Python module which uWSGI can pick up
- make uWSGI run given number of workers via multi-processing but not multi-threading, and prefork before loading the app (to protect GPU and non-thread-safe processors, and because of the GIL)
- pass tasks and other settings via CLI options (wrapped in JSON)
- set worker Harakiri (reload after timeout) based on the number of pages multiplied by the given page timeout
- add options for the number of processes and the page timeout
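In effect, the external call boils down to something like this (a sketch; the WSGI module path and the use of `--pyargv` as the settings channel are my assumptions, only the uWSGI options themselves are standard):

```python
# Sketch of the external uWSGI invocation (module path and settings channel
# are assumptions; the uWSGI options themselves are standard).
import json
import subprocess

def start_workflow_server(tasks, port, processes, page_timeout, npages):
    return subprocess.Popen([
        'uwsgi',
        '--http', ':%d' % port,
        '--module', 'ocrd.cli.workflow_app:app',   # hypothetical module:callable
        '--processes', str(processes),             # multi-processing, no threads
        '--lazy-apps',                             # fork first, load app per worker
                                                   # (no shared GPU context)
        '--harakiri', str(npages * page_timeout),  # reload a worker stuck too long
        '--pyargv', json.dumps({'tasks': tasks}),  # settings wrapped in JSON
    ])
```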
Done! (Works as expected, even with GPU-enabled processors sharing memory via growth.)
Updated help text for
Let me recapitulate the whole issue of sharing GPUs (physically, esp. their RAM) and sharing Tensorflow sessions (logically, when run via API in the same preloading workflow server):

Normal computing resources like CPU, RAM and disk are shared naturally by the OS's task and I/O schedulers, the CPU's MMU and the disk driver's scheduler. Oversubscription can quickly become inefficient, but is easily avoided by harmonizing the number of parallel jobs with the number of physical cores and the available RAM. Still, it can be worth risking transient oversubscription in exchange for higher average resource utilization.

For GPUs, however, it's not that simple: GPURAM is not normally paged (because that would make it slow), let alone swapped, hence it is an exclusive resource whose exhaustion results in OOM errors; and when processes need to wait for shaders, the benefit of using the GPU over the CPU in the first place might vanish. Therefore, a runtime system / framework like OCR-D needs to take extra care to strictly prevent GPU oversubscription, as even transient oversubscription usually leads to OOM failures. However, it is rather hard to anticipate what GPU resources a certain workflow configuration will need (both on average and at peak).

In the OCR-D makefilization, I chose to deal with GPU-enabled workflow steps bluntly, by marking them as such and synchronizing all GPU-enabled processor runs via a semaphore (discounting current runners against the number of physical GPUs). But exclusive GPU locks/semaphores stop working when the processors get preloaded into memory, and often multiple processors could actually share a single GPU – it depends on how they are programmed and how large the models (or even the input images!) are.

For Tensorflow, the normal mode of operation is to allocate all available GPURAM at startup and thus use it exclusively throughout the process' lifetime. But there are more options:
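The options alluded to are, presumably, TensorFlow's GPU memory settings – roughly the following (a sketch, not necessarily the exact list meant here):

```python
# Sketch of TensorFlow's alternatives to the default grab-all-GPURAM behaviour
# (TF1-style API; the numbers are examples only).
import tensorflow as tf

config = tf.compat.v1.ConfigProto()
# option 1: grow GPURAM allocation on demand instead of claiming it all at startup
config.gpu_options.allow_growth = True
# option 2 (alternative): cap this process at a fixed fraction of GPURAM
# config.gpu_options.per_process_gpu_memory_fraction = 0.3
# option 3 (alternative): hide the GPU from this process entirely (CPU fallback)
# config.device_count['GPU'] = 0
session = tf.compat.v1.Session(config=config)
```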
So it now depends on how cleverly the processors are programmed, how many of them we want to run (sequentially in the same workflow, or in parallel) and how large the models/images are. I believe OCR-D needs to come up with conventions for advertising the number of GPU consumers, and for writing processors that share resources cooperatively (without giving up too much performance). If we have enough runtime configuration facilities, then the user/admin can at least dimension and optimise by experiment.

Of course, additionally having processing servers for the individual processors would also help better control resource allocation: each such server could be statically configured for full utilization and then dynamically distribute the incoming workload (across pages / workflows / workspaces) in parallel (over shaders/memory) and sequentially (over a queue). Plus, preloading and isolation would not fall onto the workflow server's shoulders anymore.

One of the largest advantages of using the workflow server is thus probably not reduced overhead (when you have small workspaces/documents) or latency, but the ability to scale across machines: as a network service, it can easily be deployed across a Docker swarm, for example.
- add `--server` option to CLI decorator
- implement via new `ocrd.server.ProcessingServer`:
  - based on gunicorn (for preforking directly from the configured CLI in Python, but instantiating the processor after forking, to avoid any shared GPU context)
  - using multiprocessing.Lock and Manager to lock (synchronize) workspaces among workers
  - using signal.alarm for worker timeout mechanics
  - using pre- and post-fork hooks for GPU- vs. CPU-worker mechanics
  - doing Workspace validation within the request
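In outline, the gunicorn side works roughly like this (a heavily condensed sketch, not the actual `ocrd.server.ProcessingServer`; only the gunicorn `BaseApplication`/hook API is real, and `MyProcessorClass` is a placeholder):

```python
# Heavily condensed sketch of a gunicorn-based processing server: fork the
# workers first, then let each worker instantiate (and thus preload) its own
# processor in the post_fork hook, so no GPU context is ever shared.
# Not the actual ocrd.server.ProcessingServer; MyProcessorClass is a placeholder.
from flask import Flask
from gunicorn.app.base import BaseApplication

app = Flask('processing_server')
processor = None  # filled per worker, after forking

@app.route('/process', methods=['POST'])
def process():
    # look up and lock the requested workspace, validate it, run the preloaded
    # instance, and map any exception to an HTTP error
    return 'OK'

def post_fork(server, worker):
    # runs in every freshly forked worker: GPU memory and model weights stay
    # strictly worker-local because loading only happens here
    global processor
    processor = MyProcessorClass(None)

class ProcessingServer(BaseApplication):
    def __init__(self, bind='127.0.0.1:5000', workers=4, timeout=3600):
        self.options = {'bind': bind, 'workers': workers,
                        'timeout': timeout, 'post_fork': post_fork}
        super().__init__()

    def load_config(self):
        for key, value in self.options.items():
            self.cfg.set(key, value)

    def load(self):
        return app

if __name__ == '__main__':
    ProcessingServer().run()
```

The crucial point of this arrangement is that the processor is only instantiated inside `post_fork`, i.e. after forking, which is what keeps GPU contexts and model memory worker-local.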
In 6263bb1 I have started implementing the processing server pattern. It is completely independent of the workflow server for now, so the latter still needs to be adapted to make proper use of the former. This could help overcome some of the problems with the current workflow server approach:
But the potential merits reach even further:
Some of the design decisions which will have to be made now:
Hi @bertsky, after reading through the PR, this is my understanding about it. Please correct me if I'm wrong.

**Understanding**

Whenever we start a workflow server, we have to define a workflow for it. Since the workflow is defined at instantiation time, one server can run one workflow only. A workflow here is a series of commands to trigger OCR processors with the respective input and output. So, the syntax is like this: `ocrd workflow server -p <port> <workflow>`

**Advantage**

The benefit of this approach is that all necessary processors are already loaded into memory and ready to be used. So, if we want to run one workflow multiple times, this approach saves us the time for loading and unloading models. But if we want to run another workflow, then we have to start another server for it.

**Disadvantage**

Imagine we have a use case where users have workflow descriptions (in whatever language that we can parse) and want to execute them on our infrastructure. This one-server-per-workflow approach does not fit such dynamically defined workflows well.
@tdoan2010 your understanding of the workflow server is correct. This PR also implements a processing server, but you did not address that part. To your interpretation:
The current implementation also parses workflow descriptions (in the syntax of `ocrd process`). Further, I fail to see the logic of why this is inappropriate for dynamically defined workflows. You can always start up another workflow server at run time if necessary. In fact, such servers could be automatically managed (set up / torn down, swarmed) by a more high-level API service. And you don't lose anything by starting up all processors once and then processing. You only gain speed (starting with the non-first workspace to run on). As argued in the original discussion memo cited at the top, overhead due to initialization is significant in OCR-D, especially with GPU processors, and the only way around this is workflow servers and/or processing servers.
Implementation of the workflow server.
Example:
Please note that (as already explained here) this only increases efficiency for Python processors in the current venv, so bashlib processors or sub-venv processors will still be run via CLI. So you might want to group them differently in ocrd_all, or cascade workflows across sub-venvs.
(EDITED to reflect `ocrd workflow client` addition.)