
feat(store): Explicitly time the client out if the connection hung #4039

Merged
1 commit merged on Oct 13, 2023

Conversation

whisperity
Contributor

Related to #3672, but does not implement it.

This is a partial solution that attempts to gracefully kill an ongoing CodeChecker store client if the reply to the massStoreRun() API call does not arrive in a timely manner. After reaching the timeout, the client throws an exception onto itself and presents an understandable error message about the situation. This implementation is temporary; thus, the timeout is a hard-coded limit.

The client still exits with a failure in these cases, because there are no guarantees (and no deterministic ways to check) whether the timeout happened because the TCP connection failed and got stuck (which, unfortunately, is observed in a lot of cases) or whether the server indeed took more time than allotted to process the stored reports. A successful exit signal would result in CI scripts thinking the store was successful and then potentially breaking later, unreliably, on the fact that the stored run does not, in fact, exist or is in an outdated state, etc.
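The idea of the watchdog can be sketched as follows. This is not CodeChecker's actual implementation, only a minimal illustration of the signal-based approach visible in the traceback below, assuming a Unix platform (SIGALRM is not available on Windows); the names `WatchdogError`, `call_with_watchdog`, and `_watchdog_handler` are illustrative.

```python
import signal
import sys


class WatchdogError(Exception):
    """Raised when the server's reply does not arrive within the timeout."""


def _watchdog_handler(signum, frame):
    # The signal arrives asynchronously: raising here injects the exception
    # into whatever frame the main thread is currently executing, e.g. a
    # blocking socket read inside the Thrift transport.
    raise WatchdogError("Timeout watchdog hit")


def call_with_watchdog(func, timeout_seconds=3600):
    """Run func(), but abort it with WatchdogError after timeout_seconds."""
    old_handler = signal.signal(signal.SIGALRM, _watchdog_handler)
    signal.alarm(timeout_seconds)
    try:
        return func()
    finally:
        signal.alarm(0)  # Disarm the watchdog once the call returned.
        signal.signal(signal.SIGALRM, old_handler)


if __name__ == "__main__":
    try:
        print(call_with_watchdog(lambda: "stored"))
    except WatchdogError:
        # Never signal success here: the server-side state is unknown.
        sys.exit(1)
```

Exiting with a non-zero status in the `except` branch is what keeps CI scripts from treating an ambiguous timeout as a successful store.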

@whisperity whisperity added CLI 💻 Related to the command-line interface, such as the cmd, store, etc. commands usability 👍 Usability-related features labels Oct 11, 2023
@whisperity whisperity added this to the release 6.23.0 milestone Oct 11, 2023
@whisperity whisperity requested a review from bruntib as a code owner October 11, 2023 12:11
@whisperity whisperity requested a review from cservakt October 11, 2023 12:12
@whisperity
Contributor Author

Example of the error message, as rendered to the terminal:

[INFO 2023-10-11 14:10] - Compressing report zip file...
[INFO 2023-10-11 14:10] - Compressing report zip file done (386EiB / 4TiB).
[INFO 2023-10-11 14:10] - Storing results to the server...
[WARNING 2023-10-11 14:10] - Timeout watchdog hit 3600 seconds (1:00:00)
Traceback (most recent call last):
  File "/home/whisperity/Ericsson/CodeChecker/build/CodeChecker/lib/python3/codechecker_client/cmd/store.py", line 872, in main
    client.massStoreRun(args.name,
  File "/home/whisperity/Ericsson/CodeChecker/build/CodeChecker/lib/python3/codechecker_client/thrift_call.py", line 45, in wrapper
    return func(*args, **kwargs)
  File "/home/whisperity/Ericsson/CodeChecker/venv_dev/lib/python3.10/site-packages/codechecker_api/codeCheckerDBAccess_v6/codeCheckerDBAccess.py", line 2325, in massStoreRun
    self.send_massStoreRun(runName, tag, version, zipfile, force, trimPathPrefixes, description)
  File "/home/whisperity/Ericsson/CodeChecker/venv_dev/lib/python3.10/site-packages/codechecker_api/codeCheckerDBAccess_v6/codeCheckerDBAccess.py", line 2340, in send_massStoreRun
    self._oprot.trans.flush()
  File "/home/whisperity/Ericsson/CodeChecker/venv_dev/lib/python3.10/site-packages/thrift/transport/THttpClient.py", line 184, in flush
    self.__http_response = self.__http.getresponse()
  File "/usr/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/usr/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
  File "/home/whisperity/Ericsson/CodeChecker/build/CodeChecker/lib/python3/codechecker_client/cmd/store.py", line 760, in _signal_handler
    raise WatchdogError(timeout,
store.WatchdogError: Timeout watchdog hit 3600 seconds (1:00:00)
[ERROR 2023-10-11 14:10] - Timeout!
        The server's reply did not arrive after 3600 seconds (1:00:00) elapsed since the server-side processing began.

        This does *NOT* mean that there was an issue with the run you were storing!
        The server might still be processing the results...
        However, it is more likely that the server had already finished, but the client did not receive a response.
        Usually, this is caused by the underlying TCP connection failing to signal a low-level disconnect.
        Clients potentially hanging indefinitely in these scenarios is an unfortunate and known issue.
                See http://github.com/Ericsson/codechecker/issues/3672 for details!

        This error here is a temporary measure to ensure an infinite hang is replaced with a well-explained timeout.
        A more proper solution will be implemented in a subsequent version of CodeChecker.

@vodorok
Contributor

vodorok commented Oct 11, 2023

I would prefer if we did not show a stack trace to the user. It is a "planned" exceptional behaviour; the error message is enough.

@whisperity
Contributor Author

whisperity commented Oct 11, 2023

> I would prefer if we did not show a stack trace to the user. It is a "planned" exceptional behaviour; the error message is enough.

@vodorok I'm not sure about that. Without the stack trace, we do not know whether the timeout was triggered while reading from the socket (a.k.a. where we expect this) or while writing to the socket (where such a hang could, at least theoretically, also happen, but where we do NOT expect the program to stall for so long).

The cardinal problem is that raising the exception is an asynchronous operation, so depending on where normal execution was when the timeout fired, it could have killed it anywhere within the execution path of .massStoreRun().
It's a bit like shooting a cannon at a fly.

I triggered the example output message by putting an explicit and sufficiently large time.sleep() in the server-side handler (not part of the patch), just to simulate the situation. I also didn't wait for a real hour; I "photoshopped" the example message afterwards 😈

@vodorok
Contributor

vodorok commented Oct 11, 2023

I checked the implementation in the meantime, and you only guard the RPC call with the watchdog.
In the end, does the stack trace matter when the watchdog triggers and shoots down the store command?

What I'm trying to say is that it will always be a server- (or network-)related issue when the timeout triggers, and the user usually cannot do anything about it apart from restarting the process and hoping for the best.

But I might be missing some context here...

Member

@dkrupp dkrupp left a comment

The stack trace is a bit ugly IMHO, but I understand why we need it.

LGTM.

@whisperity
Contributor Author

The only difference which might be meaningful is whether the send() fails versus the recv() failing. Nevertheless, I will put the stack trace behind the debug verbosity mode and update the patch.
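The compromise of gating the trace behind debug verbosity could look roughly like this. A minimal sketch using the standard `logging` module; the logger name and the `report_timeout` helper are illustrative, not CodeChecker's actual API.

```python
import logging
import traceback

LOG = logging.getLogger("store")


def report_timeout(exc, timeout):
    """Emit the short timeout message; the full trace only in debug mode."""
    LOG.warning("Timeout watchdog hit %d seconds", timeout)
    if LOG.isEnabledFor(logging.DEBUG):
        # Only the full stack trace tells a hang in send() apart from one
        # in recv(), but most users never run with debug verbosity.
        LOG.debug("".join(traceback.format_exception(
            type(exc), exc, exc.__traceback__)))
    LOG.error("Timeout! The server's reply did not arrive in time.")
```

With the default verbosity the user sees only the warning and the explanation; rerunning with debug verbosity recovers the full trace for bug reports.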

@vodorok
Contributor

vodorok commented Oct 11, 2023

@whisperity Thanks that is a good compromise.

This is a partial solution that attempts to gracefully kill an ongoing
`CodeChecker store` client if the reply to the `massStoreRun()` API call
does not arrive in a timely manner. After reaching the timeout, the
client throws an exception onto itself and presents an understandable
error message about the situation. This implementation is temporary,
thus the timeout is a hard-coded limit.

The client still exits with a failure in these cases, because there are
no guarantees (and no deterministic ways to check) whether the timeout
happened because the TCP connection failed and got stuck (which is
observed in a lot of cases, unfortunately) or whether the server indeed
took more time than allotted to process the stored reports. A successful
exit signal would result in CI scripts thinking the store was successful
and then potentially breaking later, unreliably, on the fact that the
stored run does not in fact exist, or is in an outdated state, etc.
@whisperity whisperity force-pushed the feat/store/timeout-message branch from 3e06352 to 15af7d8 Compare October 12, 2023 15:20
@whisperity
Contributor Author

@vodorok @dkrupp Meanwhile, I came up with a better idea, where we still print just the two most relevant parts of the stack, but we do so outside of debug mode as well.

What do you think?

[WARNING 2023-10-12 17:16] - Timeout watchdog hit 5 seconds (0:00:05)
[INFO 2023-10-12 17:16] - Timeout was triggered during:
  File "/home/whisperity/Ericsson/CodeChecker/build/CodeChecker/lib/python3/codechecker_client/cmd/store.py", line 872, in main
    client.massStoreRun(args.name,

[INFO 2023-10-12 17:16] - Timeout interrupted this low-level operation:
  File "/usr/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)

[ERROR 2023-10-12 17:16] - Timeout!
[... The rest of the explanation, at the bottom, stays the same as previously ...]

This ensures that if users notify us about the problem, we can clearly identify (I'd wager not many run the very verbose debug mode...) that it is indeed what we expected to happen (unfortunately), and it gives us a way out if we see something like _sock.send(b) instead, which MIGHT happen although it has not (yet?) been observed in practice (I hope).
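The "two most relevant frames" idea can be sketched from the standard `traceback` module: take the outermost and innermost entries of the exception's traceback. A hypothetical `summarise_timeout` helper, not the patch's actual code:

```python
import traceback


def summarise_timeout(exc):
    """Return the outermost and innermost frames of the interrupted call.

    The outermost frame shows which guarded client call was in flight
    (e.g. the massStoreRun() call site); the innermost shows the
    low-level operation that was interrupted (e.g. a socket recv_into()
    or an SSL read).
    """
    frames = traceback.extract_tb(exc.__traceback__)
    fmt = '  File "{0.filename}", line {0.lineno}, in {0.name}\n    {0.line}'
    return fmt.format(frames[0]), fmt.format(frames[-1])
```

Printing just these two entries preserves the send()-versus-recv() distinction while hiding the Thrift and http.client frames in between.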

@whisperity whisperity requested review from dkrupp and removed request for bruntib, cservakt and dkrupp October 12, 2023 15:20
@dkrupp
Member

dkrupp commented Oct 13, 2023

I think it is much better like this. It is ok to land.

@dkrupp dkrupp merged commit 0a3cc8b into Ericsson:master Oct 13, 2023
8 checks passed
@whisperity
Contributor Author

whisperity commented Oct 17, 2023

@vodorok Could you please APPROVE this so it doesn't show as if you had not reviewed?


I think I have good news. After running an analysis in which the (knowingly bogus!) checker I executed produced 178753 reports, I managed to test this for real.

The store began today at 10:58 Zulu, with the client emitting the following before hanging:

[INFO 2023-10-17 10:57] - Storing analysis results for run 'contour_v0.2.0.173_BenBlaise_readability-use-builtin-literals'
[...]
[INFO 2023-10-17 10:58] - Building report zip file (/tmp/tmpicjck0yo.zip) done.
[INFO 2023-10-17 10:58] - Compressing report zip file...
[INFO 2023-10-17 10:58] - Compressing report zip file done (174.5MiB / 5.4MiB).
[INFO 2023-10-17 10:58] - Storing results to the server...

The client is now hung on read(4, ...), FD 4 being the TCP socket, as expected. netstat on the client side does NOT show any active connection to the server, and lsof on the CodeChecker processes only shows that it was a TCP socket; it does not show where it pointed, as it does for active connections...

Compare a living SSH connection to what is seen on the store side:

[Screenshot: how a legitimate TCP connection looks in lsof]

[Screenshot: how the hung TCP read looks in lsof]

(Note the /tmp/tmpicjck0yo.zip confirming that it is indeed the store we are discussing.)

In the server logs, there is no indication that the store of contour_... ever began. The only indication is that a prior analysis of qtbase killed something on the server; I have a partial stack trace from about 5 minutes earlier, but that's it:

[INFO 2023-10-17 10:52] - [qtbase_v6.2.0_BenBlaise_readability-use-builtin-literals] Unzip storage file...
[ERROR 2023-10-17 10:53] - Failed to extract received ZIP.
Traceback (most recent call last):
  File "/codechecker/lib/python3/codechecker_server/api/mass_store_run.py", line 88, in unzip

Nevertheless, and as expected, one hour after the store started executing, the timeout was triggered, explained, and the store command exited with an error status, as observed through CSA-TestBench:

[INFO 2023-10-17 10:58] - Storing results to the server...
[WARNING 2023-10-17 11:58] - Timeout watchdog hit 3600 seconds (1:00:00)
[INFO 2023-10-17 11:58] - Timeout was triggered during:
  File "[...]/codechecker/build/CodeChecker/lib/python3/codechecker_client/cmd/store.py", line 872, in main
    client.massStoreRun(args.name,

[INFO 2023-10-17 11:58] - Timeout interrupted this low-level operation:
  File "/usr/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)

[ERROR 2023-10-17 11:58] - Timeout!
	The server's reply did not arrive after 3600 seconds (1:00:00) elapsed since the server-side processing began.

	This does *NOT* mean that there was an issue with the run you were storing!
	The server might still be processing the results...
	However, it is more likely that the server had already finished, but the client did not receive a response.
	Usually, this is caused by the underlying TCP connection failing to signal a low-level disconnect.
	Clients potentially hanging indefinitely in these scenarios is an unfortunate and known issue.
		See http://github.com/Ericsson/codechecker/issues/3672 for details!

	This error here is a temporary measure to ensure an infinite hang is replaced with a well-explained timeout.
	A more proper solution will be implemented in a subsequent version of CodeChecker.
11:58:32 (INFO) <<--==--==--==--<<

11:58:32 (INFO) Execution finished, return code: 1

11:58:32 (INFO) [contour_v0.2.0.173_BenBlaise_readability-use-builtin-literals] Results stored.
11:58:35 (INFO) [contour] Postprocessed.
11:58:35 (INFO) [acid] Checking out project... 

Given that this specific CI job is configured to not hard-error on store errors, it seems that we get the best of both worlds: store kills itself appropriately, but if the overarching job chooses to ignore it, it can continue appropriately.

@whisperity whisperity deleted the feat/store/timeout-message branch October 18, 2023 11:03