Let exit_code be better aligned with C/POSIX #240

EdSchouten · 2023-02-21T09:30:45Z

I am glad that REv2 already uses int32 as its data type for exit_code. Even though most UNIX systems historically only had 8-bit exit codes, modern versions of POSIX provide interfaces for obtaining the full integer value. Using functions like waitid(), you can obtain a siginfo_t whose si_status field contains the full value.

Where we do deviate from C/POSIX is that there are no facilities for indicating that a process did not terminate by calling _Exit(), but through a signal. This leads to confusing situations where people wonder why their build actions terminate with exit code 139, while in reality it terminated through SIGSEGV (signal 11, giving exit code 128+11=139). Even worse are modern versions of Go's os/exec package, which will always return an exit code of -1 in case of signal delivery. This means that users can't even figure out why their actions terminated.

One solution to this would be to change the existing field:

  // The exit code of the command.
  int32 exit_code = 4;

To something like this:

  oneof termination_reason {
    // The command terminated by calling the operating system's equivalent
    // of `_Exit()` with the code provided.
    int32 exit_code = 4;

    // The command terminated by the delivery of a signal.
    int32 signal_number = 13;
 }

Even though virtually all operating systems use 9 for SIGKILL and 15 for SIGTERM, C/POSIX make no such requirement. The only place where that does happen, is in the definition of the kill utility, where as part of the X/Open System Interfaces (XSI) they provide a hardcoded list of numerical command line options.

My recommendation would thus be to use this instead:

  oneof termination_reason {
    // The command terminated by calling the operating system's equivalent
    // of `_Exit()` with the code provided.
    int32 exit_code = 4;

    // The command terminated by the delivery of a signal. This field contains the name of
    // the signal, with the "SIG" prefix removed (e.g., "SEGV" for segmentation violations).
    string signal_name = 13;
 }

Unfortunately it's not safe for us to add a oneof to the protocol retroactively, as it would cause exit_code to be zero when signal_name is set. Maybe we can already add this to REv2 in the form of a separate field, with the remark that exit_code is set to a non-zero value if signal_name is set...

The text was updated successfully, but these errors were encountered:

POSIX issue 7, 2018 edition, section 3.303 "Process Termination" states the following: > There are two kinds of process termination: > > 1. Normal termination occurs by a return from main(), when requested > with the exit(), _exit(), or _Exit() functions; or when the last > thread in the process terminates by returning from its start > function, by calling the pthread_exit() function, or through > cancellation. > > 2. Abnormal termination occurs when requested by the abort() function > or when some signals are received. Both these cases can be detected by parent processes separately. Unfortunately, REv2 currently doesn't allow this, as we only provide an exit code. Implementations such as Buildbarn tend to set the exit code for abnormal termination to 128+signum, meaning that if a segmentation fault occurs, an action terminates with exit code 139. This tends to be fairly confusing for non-expert users. The goal of this change is to make all of this less confusing. It adds a new message type named ProcessTerminationResult, which contains the termination result in a more structured way. It can now either be an exit code or a signal number. If it turns out that other operating systems provide even more ways, we can add additional `oneof` cases. For the exit code in case of normal termination I decided to use an int64 instead of int32. The reason being that POSIX allows you to get the full exit code back, using APIs like waitid(). On ILP64 systems, the exit code may thus be 64 bits. For the signal in case of abnormal termination I decided to use a string. The reason being that signal numbers are not covered by POSIX. Everybody always assumes that 9 is SIGKILL, but POSIX only documents this as a convention for the command line flags of the `kill` utility. Not the the actual `SIG*` constants. Fixes: bazelbuild#240

POSIX issue 7, 2018 edition, section 3.303 "Process Termination" states the following: > There are two kinds of process termination: > > 1. Normal termination occurs by a return from main(), when requested > with the exit(), _exit(), or _Exit() functions; or when the last > thread in the process terminates by returning from its start > function, by calling the pthread_exit() function, or through > cancellation. > > 2. Abnormal termination occurs when requested by the abort() function > or when some signals are received. Both these cases can be detected by parent processes separately. Unfortunately, REv2 currently doesn't allow this, as we only provide an exit code. Implementations such as Buildbarn tend to set the exit code for abnormal termination to 128+signum, meaning that if a segmentation fault occurs, an action terminates with exit code 139. This tends to be fairly confusing for non-expert users. The goal of this change is to make all of this less confusing. It adds a new message type named ProcessTerminationResult, which contains the termination result in a more structured way. It can now either be an exit code or a signal number. If it turns out that other operating systems provide even more ways, we can add additional `oneof` cases. For the exit code in case of normal termination I decided to use an int64 instead of int32. The reason being that POSIX allows you to get the full exit code back, using APIs like waitid(). On ILP64 systems, the exit code may thus be 64 bits. For the signal in case of abnormal termination I decided to use a string. The reason being that signal numbers are not covered by POSIX. Everybody always assumes that 9 is SIGKILL, but POSIX only documents this as a convention for the command line flags of the `kill` utility. Not the the actual `SIG*` constants. Adding an enumeration would also be unwise, as operating systems are free to add their own signals that are not covered by POSIX (e.g., SIGINFO on BSD). Fixes: bazelbuild#240

On POSIX-like systems, processes may either terminate normally with an integer exit code. The exit code may span the full range of int, even though all but the waitid() function retur the bottom eight bits. In addition to that, processes may terminate abnormally due to a signal (SIGABRT, SIGSEGV, etc.) POSIX shells (sh, bash, etc.) are more restrictive, in that they can only return exit codes between 0 and 126. 127 is used to denote that the executable cannot be found. Exit codes above 128 indicate that the process terminated due to a signal. Right now we let test-setup.sh terminate using the exit code obtained using $?. This means that if a program terminates due to SIGABRT, test-setup.sh terminates with exit code 128+6=134. This causes us to lose some information, as the (remote) execution environment now only sees plain exit codes. This change extends test-setup.sh to check for exit codes above 128. In that case it will send a signal to itself, so that the original signal condition is raised once again. See also: bazelbuild/remote-apis#240

On POSIX-like systems, processes may either terminate normally with an integer exit code. The exit code may span the full range of int, even though all but the waitid() function retur the bottom eight bits. In addition to that, processes may terminate abnormally due to a signal (SIGABRT, SIGSEGV, etc.) POSIX shells (sh, bash, etc.) are more restrictive, in that they can only return exit codes between 0 and 126. 127 is used to denote that the executable cannot be found. Exit codes above 128 indicate that the process terminated due to a signal. Right now we let test-setup.sh terminate using the exit code obtained using $?. This means that if a program terminates due to SIGABRT, test-setup.sh terminates with exit code 128+6=134. This causes us to lose some information, as the (remote) execution environment now only sees plain exit codes. This change extends test-setup.sh to check for exit codes above 128. In that case it will send a signal to itself, so that the original signal condition is raised once again. See also: bazelbuild/remote-apis#240 Closes #18827. PiperOrigin-RevId: 547773406 Change-Id: Ia29a6ea1eefdb8caa5624a799755b420a61478a0

On POSIX-like systems, processes may either terminate normally with an integer exit code. The exit code may span the full range of int, even though all but the waitid() function retur the bottom eight bits. In addition to that, processes may terminate abnormally due to a signal (SIGABRT, SIGSEGV, etc.) POSIX shells (sh, bash, etc.) are more restrictive, in that they can only return exit codes between 0 and 126. 127 is used to denote that the executable cannot be found. Exit codes above 128 indicate that the process terminated due to a signal. Right now we let test-setup.sh terminate using the exit code obtained using $?. This means that if a program terminates due to SIGABRT, test-setup.sh terminates with exit code 128+6=134. This causes us to lose some information, as the (remote) execution environment now only sees plain exit codes. This change extends test-setup.sh to check for exit codes above 128. In that case it will send a signal to itself, so that the original signal condition is raised once again. See also: bazelbuild/remote-apis#240 Closes bazelbuild#18827. PiperOrigin-RevId: 547773406 Change-Id: Ia29a6ea1eefdb8caa5624a799755b420a61478a0

On POSIX-like systems, processes may either terminate normally with an integer exit code. The exit code may span the full range of int, even though all but the waitid() function retur the bottom eight bits. In addition to that, processes may terminate abnormally due to a signal (SIGABRT, SIGSEGV, etc.) POSIX shells (sh, bash, etc.) are more restrictive, in that they can only return exit codes between 0 and 126. 127 is used to denote that the executable cannot be found. Exit codes above 128 indicate that the process terminated due to a signal. Right now we let test-setup.sh terminate using the exit code obtained using $?. This means that if a program terminates due to SIGABRT, test-setup.sh terminates with exit code 128+6=134. This causes us to lose some information, as the (remote) execution environment now only sees plain exit codes. This change extends test-setup.sh to check for exit codes above 128. In that case it will send a signal to itself, so that the original signal condition is raised once again. See also: bazelbuild/remote-apis#240 Closes #18827. PiperOrigin-RevId: 547773406 Change-Id: Ia29a6ea1eefdb8caa5624a799755b420a61478a0 Co-authored-by: Ed Schouten <eschouten@apple.com>

EdSchouten linked a pull request Mar 8, 2023 that will close this issue

Add ProcessTerminationResult, for better alignment with POSIX #244

Open

EdSchouten changed the title ~~REv3: Let exit_code be better aligned with C/POSIX~~ Let exit_code be better aligned with C/POSIX Mar 14, 2023

EdSchouten mentioned this issue Jul 1, 2023

test-setup.sh: Attempt to raise the original signal once more bazelbuild/bazel#18827

Closed

iancha1992 mentioned this issue Jul 13, 2023

[6.3.0] test-setup.sh: Attempt to raise the original signal once more bazelbuild/bazel#18932

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Let exit_code be better aligned with C/POSIX #240

Let exit_code be better aligned with C/POSIX #240

EdSchouten commented Feb 21, 2023

Let exit_code be better aligned with C/POSIX #240

Let exit_code be better aligned with C/POSIX #240

Comments

EdSchouten commented Feb 21, 2023