
High number of parallel tests causes runtime errors #294

Open
dariusliep opened this issue Feb 22, 2022 · 6 comments

Comments

@dariusliep

dariusliep commented Feb 22, 2022

When running a high number of parallel tests (the attached sample contains 350 tests), the following runtime errors are received (repeated multiple times):

RuntimeError('Unable to receive response from driver')
Traceback (most recent call last):
  File "/data/projects/dap/dev_data/issue_test/venv/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 222, in _do_safely
    action()
  File "/data/projects/dap/dev_data/issue_test/venv/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 174, in <lambda>
    self._do_safely(lambda: self.send(self.message.finished(result=result)),
  File "/data/projects/dap/dev_data/issue_test/venv/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 66, in send
    return self.sender.send(event)
  File "/data/projects/dap/dev_data/issue_test/venv/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 320, in send
    raise RuntimeError("Unable to receive response from driver")
RuntimeError: Unable to receive response from driver

Ducktape version is 0.8.9. Python 3.9.10.
Attached are the test sources, configuration, and a very simple shell script which starts the tests.
runtime_issue.tar.gz

The provided tests do not fail due to this runtime error, i.e. the result is pass, but the tests for our company's product fail with the error above. I could not reproduce the exact behavior of our product tests (failure due to the runtime error) in a simple test.

Edit: I forgot to mention that the default file descriptor limit should be increased, because zmq errors occur otherwise. The default value of 1024 is not sufficient; I used 10000.
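The ulimit bump can also be done from inside the process that launches the tests, instead of the shell script. A minimal sketch using Python's standard resource module (the value 10000 mirrors what worked for the reporter above; it is not an official ducktape recommendation):

```python
import resource

# Current soft/hard file-descriptor limits for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"current soft limit: {soft}, hard limit: {hard}")

# Each ZMQ socket in a runner client consumes a descriptor, so the
# default soft limit of 1024 is exhausted quickly with hundreds of
# parallel tests. Raise the soft limit, capped at the hard limit.
target = 10000 if hard == resource.RLIM_INFINITY else min(10000, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
print(f"new soft limit: {resource.getrlimit(resource.RLIMIT_NOFILE)[0]}")
```

Only the soft limit can be raised by an unprivileged process, and only up to the hard limit, so a too-low hard limit still has to be fixed at the system level.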

@imcdo
Member

imcdo commented Apr 28, 2022

Hi @dariusliep, as a short-term fix, can you try running ducktape with increased values for these variables:

REQUEST_TIMEOUT_MS = 3000
NUM_RETRIES = 5

We might want to expose these so that they can be configurable.
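For context, the traceback above shows runner_client.py raising only after its retries are exhausted, so these two constants bound how long a client waits for the driver. A hedged sketch of that retry pattern (only the constant names come from the comment above; the function and its semantics are simplified for illustration, not ducktape's actual code):

```python
# Values from the suggestion above; in this sketch they are plain
# module-level constants.
REQUEST_TIMEOUT_MS = 3000
NUM_RETRIES = 5

def send_with_retries(send_once, timeout_ms=REQUEST_TIMEOUT_MS,
                      retries=NUM_RETRIES):
    """Try `send_once` up to `retries` times, waiting `timeout_ms` for a
    reply each attempt, then give up the way the traceback shows."""
    for _ in range(retries):
        reply = send_once(timeout_ms)
        if reply is not None:
            return reply
    # Mirrors the error raised at runner_client.py line 320 above.
    raise RuntimeError("Unable to receive response from driver")
```

With these values a client blocks at most timeout * retries, i.e. 3000 ms * 5 = 15 s, before the RuntimeError surfaces, which is why raising either value gives a busy driver more time to respond.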

@imcdo
Member

imcdo commented Apr 28, 2022

A quick theory for what's happening here: there are too many test runner threads running at a time, so the Python GIL never gets back to the runner itself to receive the messages, causing the clients to retry to no avail. Because the clients use ZMQ, it would be possible to run the runner clients on different machines to get true multiprocessing, which could help with this, but that would have to be implemented.

@vp-elitnet
Contributor

there are to many test runner threads running at a time that the python GIL never gets back to the runner itself to receive the messages

Sorry if I'm missing something, but the tests seem to be run in subprocesses, not threads - they should not be bound by the same GIL.
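This distinction can be demonstrated with a minimal standalone example (not ducktape code): each multiprocessing worker is a separate OS process with its own interpreter and its own GIL, so a busy worker cannot starve the parent of CPU time the way a thread in the same process could.

```python
import multiprocessing as mp
import os

def _worker(q):
    # Runs in a separate OS process: distinct PID, distinct GIL.
    q.put(os.getpid())

def run_workers(n=3):
    """Start n subprocess workers and return their PIDs."""
    q = mp.Queue()
    procs = [mp.Process(target=_worker, args=(q,)) for _ in range(n)]
    for p in procs:
        p.start()
    pids = [q.get() for _ in procs]
    for p in procs:
        p.join()
    return pids

if __name__ == "__main__":
    print("parent:", os.getpid(), "workers:", run_workers())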

@imcdo
Member

imcdo commented Oct 10, 2022

Yeah, you're right @vp-elitnet, sorry for the late reply. We have gotten this a lot in our own tests; typically it is a signal that the tests haven't returned a status within the default amount of time (i.e. while tests are running, if they exceed a timeout, the test runner assumes the test clients have died, and dies itself). This could mean your test simply runs for longer than the timeout (especially if you are running tests not in parallel). A simple fix is to bump this parameter:

parser.add_argument("--test-runner-timeout", action="store", type=int, default=1800000,

By default (--test-runner-timeout), if all test clients don't respond within 30 minutes the runner dies, so I'd bump this up to a larger value. A more long-term solution would be to implement some kind of heartbeat on the ducktape side for each runner client.
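The proposed heartbeat could replace the single global deadline with a per-client one. A hypothetical sketch (all names and intervals here are invented for illustration; this is not ducktape code): the driver records when each runner client was last heard from and only declares a client dead after it misses several consecutive heartbeat intervals, so one long-running test no longer kills the whole run.

```python
import time

# Invented tuning knobs for this sketch.
HEARTBEAT_INTERVAL_S = 10   # how often clients are expected to check in
MISSED_BEATS_ALLOWED = 3    # beats a client may miss before it is declared dead

class HeartbeatMonitor:
    """Driver-side bookkeeping of when each runner client last checked in."""

    def __init__(self, now=time.monotonic):
        self._now = now          # injectable clock, for testability
        self._last_seen = {}     # client_id -> last check-in timestamp

    def beat(self, client_id):
        # Called whenever any message (or an explicit heartbeat) arrives.
        self._last_seen[client_id] = self._now()

    def dead_clients(self):
        # A client is dead once it has been silent longer than the
        # allowed number of heartbeat intervals.
        deadline = HEARTBEAT_INTERVAL_S * MISSED_BEATS_ALLOWED
        now = self._now()
        return [c for c, t in self._last_seen.items() if now - t > deadline]
```

A fast-finishing client still reports its result normally; the monitor only decides when silence should be treated as death, which is exactly the judgment the current global timeout makes too coarsely.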

@imcdo
Member

imcdo commented Oct 10, 2022

Might want to start a discussion on a better long-term solution.

@imcdo
Member

imcdo commented Oct 10, 2022

#322
