
High number of parallel tests causes runtime errors #294

Open
dariusliep opened this issue Feb 22, 2022 · 6 comments

Comments

@dariusliep

dariusliep commented Feb 22, 2022

When running a high number of parallel tests (the attached sample contains 350 tests), the following runtime errors are received (repeated multiple times):

RuntimeError('Unable to receive response from driver')
Traceback (most recent call last):
  File "/data/projects/dap/dev_data/issue_test/venv/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 222, in _do_safely
    action()
  File "/data/projects/dap/dev_data/issue_test/venv/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 174, in <lambda>
    self._do_safely(lambda: self.send(self.message.finished(result=result)),
  File "/data/projects/dap/dev_data/issue_test/venv/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 66, in send
    return self.sender.send(event)
  File "/data/projects/dap/dev_data/issue_test/venv/lib/python3.9/site-packages/ducktape/tests/runner_client.py", line 320, in send
    raise RuntimeError("Unable to receive response from driver")
RuntimeError: Unable to receive response from driver

Ducktape version is 0.8.9. Python 3.9.10.
Attached are the test sources, configuration, and a very simple shell script which starts the tests.
runtime_issue.tar.gz

The provided tests do not fail due to this runtime error, i.e. the result is pass, but the tests for our company's product fail with the error above. I could not reproduce the exact behavior of our product tests (failure due to the runtime error) in a simple test.

Edit: I forgot to mention that the default file descriptor limit should be increased, because zmq errors occur otherwise. The default value of 1024 is not sufficient; I used 10000.
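The ulimit bump can also be done from inside the process that launches the tests, instead of the shell script. A minimal sketch using Python's standard resource module (the value 10000 mirrors what worked for the reporter above; it is not an official ducktape recommendation):

```python
import resource

# Current soft/hard file-descriptor limits for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"current soft limit: {soft}, hard limit: {hard}")

# Each ZMQ socket in a runner client consumes a descriptor, so the
# default soft limit of 1024 is exhausted quickly with hundreds of
# parallel tests. Raise the soft limit, capped at the hard limit.
target = 10000 if hard == resource.RLIM_INFINITY else min(10000, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
print(f"new soft limit: {resource.getrlimit(resource.RLIMIT_NOFILE)[0]}")
```

Only the soft limit can be raised by an unprivileged process, and only up to the hard limit, so a too-low hard limit still has to be fixed at the system level.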

@imcdo
Member

imcdo commented Apr 28, 2022

Hi @dariusliep, as a short-term fix, can you try running ducktape with increased values for these variables:

REQUEST_TIMEOUT_MS = 3000
NUM_RETRIES = 5

We might want to expose these so that they can be configurable.
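For context, the traceback above shows runner_client.py raising only after its retries are exhausted, so these two constants bound how long a client waits for the driver. A hedged sketch of that retry pattern (only the constant names come from the comment above; the function and its semantics are simplified for illustration, not ducktape's actual code):

```python
# Values from the suggestion above; in this sketch they are plain
# module-level constants.
REQUEST_TIMEOUT_MS = 3000
NUM_RETRIES = 5

def send_with_retries(send_once, timeout_ms=REQUEST_TIMEOUT_MS,
                      retries=NUM_RETRIES):
    """Try `send_once` up to `retries` times, waiting `timeout_ms` for a
    reply each attempt, then give up the way the traceback shows."""
    for _ in range(retries):
        reply = send_once(timeout_ms)
        if reply is not None:
            return reply
    # Mirrors the error raised at runner_client.py line 320 above.
    raise RuntimeError("Unable to receive response from driver")
```

With these values a client blocks at most timeout * retries, i.e. 3000 ms * 5 = 15 s, before the RuntimeError surfaces, which is why raising either value gives a busy driver more time to respond.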

@imcdo
Member

imcdo commented Apr 28, 2022

A quick theory for what's happening here: there are too many test runner threads running at a time, so the Python GIL never gets back to the runner itself to receive the messages, causing the clients to retry to no avail. Because the clients use ZMQ, it would be possible to run the runner clients on different machines to get true multiprocessing, which could help with this, but that would have to be implemented.

@vp-elitnet
Contributor

there are to many test runner threads running at a time that the python GIL never gets back to the runner itself to receive the messages

Sorry if I'm missing something, but the tests seem to be run in subprocesses, not threads - they should not be bound by the same GIL.
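This distinction can be demonstrated with a minimal standalone example (not ducktape code): each multiprocessing worker is a separate OS process with its own interpreter and its own GIL, so a busy worker cannot starve the parent of CPU time the way a thread in the same process could.

```python
import multiprocessing as mp
import os

def _worker(q):
    # Runs in a separate OS process: distinct PID, distinct GIL.
    q.put(os.getpid())

def run_workers(n=3):
    """Start n subprocess workers and return their PIDs."""
    q = mp.Queue()
    procs = [mp.Process(target=_worker, args=(q,)) for _ in range(n)]
    for p in procs:
        p.start()
    pids = [q.get() for _ in procs]
    for p in procs:
        p.join()
    return pids

if __name__ == "__main__":
    print("parent:", os.getpid(), "workers:", run_workers())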

@imcdo
Member

imcdo commented Oct 10, 2022

Yeah, you're right @vp-elitnet, sorry for the late reply. We have gotten this a lot in our own tests; typically it is a signal that the tests haven't returned a status within the default amount of time (i.e. while tests are running, if they exceed a timeout, the test runner assumes the test clients have died, and dies itself). This could mean your test simply runs for longer than the timeout (especially if you are running tests not in parallel). A simple fix is to bump this parameter:

parser.add_argument("--test-runner-timeout", action="store", type=int, default=1800000,

By default (--test-runner-timeout), if all test clients don't respond within 30 minutes the runner dies, so I'd bump this up to a larger value. A more long-term solution would be to implement some kind of heartbeat on the ducktape side for each runner client.
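The proposed heartbeat could replace the single global deadline with a per-client one. A hypothetical sketch (all names and intervals here are invented for illustration; this is not ducktape code): the driver records when each runner client was last heard from and only declares a client dead after it misses several consecutive heartbeat intervals, so one long-running test no longer kills the whole run.

```python
import time

# Invented tuning knobs for this sketch.
HEARTBEAT_INTERVAL_S = 10   # how often clients are expected to check in
MISSED_BEATS_ALLOWED = 3    # beats a client may miss before it is declared dead

class HeartbeatMonitor:
    """Driver-side bookkeeping of when each runner client last checked in."""

    def __init__(self, now=time.monotonic):
        self._now = now          # injectable clock, for testability
        self._last_seen = {}     # client_id -> last check-in timestamp

    def beat(self, client_id):
        # Called whenever any message (or an explicit heartbeat) arrives.
        self._last_seen[client_id] = self._now()

    def dead_clients(self):
        # A client is dead once it has been silent longer than the
        # allowed number of heartbeat intervals.
        deadline = HEARTBEAT_INTERVAL_S * MISSED_BEATS_ALLOWED
        now = self._now()
        return [c for c, t in self._last_seen.items() if now - t > deadline]
```

A fast-finishing client still reports its result normally; the monitor only decides when silence should be treated as death, which is exactly the judgment the current global timeout makes too coarsely.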

@imcdo
Member

imcdo commented Oct 10, 2022

Might want to start a discussion on a better long-term solution.

@imcdo
Member

imcdo commented Oct 10, 2022

#322
