High number of parallel tests causes runtime errors #294
Comments
Hi @dariusliep, as a short-term fix, can you try running ducktape with increased values for these variables: ducktape/ducktape/tests/runner_client.py Lines 272 to 273 in dcbc33c
We might want to expose these so that they are configurable.
A quick theory for what's happening here is that there are too many test runner threads running at a time, so the Python GIL never gets back to the runner itself to receive the messages, causing the clients to retry to no avail. Because the clients use ZMQ, it might be possible to run the runner clients on different machines to get true multiprocessing, which could help with this, but that would have to be implemented.
Sorry if I'm missing something, but tests seem to be run in subprocesses, not threads - they should not be bound by the same GIL.
Yeah, you're right @vp-elitnet, sorry for the late reply. We have gotten this a lot in our own tests. Typically it is a signal that the tests haven't returned a status in the default amount of time (i.e. while tests are running, if they exceed a timeout, the test runner assumes the test clients have died, and dies itself). This could mean your test simply runs for longer than the timeout (especially if you are not running tests in parallel). A simple fix is to use this parameter and bump it:
We might want to start a discussion on a better long-term solution.
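To illustrate the timeout behavior described above, here is a minimal sketch that does not use ducktape or ZMQ at all. The names `wait_for_status` and `run_demo` are hypothetical, and a standard-library queue stands in for the real message channel. It only shows why a test that runs longer than the wait window looks, from the runner's side, the same as a dead client.

```python
import queue
import threading
import time


def wait_for_status(status_queue, timeout_s):
    """Return the next status message, or None if none arrives in timeout_s."""
    try:
        return status_queue.get(timeout=timeout_s)
    except queue.Empty:
        return None


def run_demo(test_duration_s, timeout_s):
    """Simulate one test client that reports "PASS" after test_duration_s."""
    status_queue = queue.Queue()

    def client():
        # Stand-in for a long-running test; not how ducktape runs tests.
        time.sleep(test_duration_s)
        status_queue.put("PASS")

    threading.Thread(target=client, daemon=True).start()
    status = wait_for_status(status_queue, timeout_s)
    # The runner cannot tell a slow client from a dead one: both give None.
    return status if status is not None else "TIMEOUT"


if __name__ == "__main__":
    print(run_demo(test_duration_s=0.05, timeout_s=1.0))
    print(run_demo(test_duration_s=1.0, timeout_s=0.05))
```

In this sketch, raising `timeout_s` above the longest expected test duration is what "bumping" the timeout amounts to.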
When running a high number of parallel tests (the attached sample contains 350 tests), the following runtime errors are received (repeated multiple times):
Ducktape version is 0.8.9. Python 3.9.10.
Attached are the test sources, the configuration, and a very simple shell script that starts the tests.
runtime_issue.tar.gz
The provided tests do not fail because of this runtime error, i.e. the result is pass, but the tests for our company's product fail with the error provided above. I could not reproduce that exact behavior (a failure caused by the runtime error) in a simple test.
Edit: Forgot to mention that the default file descriptor limit should be increased, because zmq errors occur otherwise. The default value of 1024 is not sufficient. I have used 10000.
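The issue does not say how the limit was raised (running `ulimit -n 10000` in the launching shell is one common way). As an alternative sketch, assuming a Unix-like system, the soft limit can also be raised from Python with the standard `resource` module before the tests start. The helper name `raise_nofile_limit` is hypothetical and not part of ducktape.

```python
import resource


def raise_nofile_limit(wanted):
    """Raise the soft RLIMIT_NOFILE toward `wanted`, capped by the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough, nothing to do
    try:
        # An unprivileged process may raise its soft limit up to the hard limit.
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        # Some platforms reject the value; keep the current limit in that case.
        pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    # 10000 is the value reported to work in this issue.
    print("soft file descriptor limit:", raise_nofile_limit(10000))
```

Processes started after this call inherit the new limit. If the hard limit itself is below the wanted value, it has to be raised by an administrator.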