Fail in-progress jobs when the worker running them exits abnormally #277

rosa · 2024-08-12T17:48:02Z

This applies to:

Killed workers that the supervisor detects as dead.
Reaped workers without a clear exit status.
Orphaned executions that somehow lost their worker.
Workers whose heartbeat expired.
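
As a rough illustration of the behaviour listed above, here is a minimal sketch in plain Ruby of failing whatever a dead worker still had claimed. The `ClaimedExecution` struct, the `ProcessExitError` class and the `fail_claimed_by` helper are hypothetical stand-ins made up for this example, not the actual classes or methods touched by this PR.

```ruby
# Hypothetical in-memory model, only meant to illustrate the idea.
class ProcessExitError < StandardError; end

# status is :in_progress, :finished or :failed in this sketch.
ClaimedExecution = Struct.new(:job_id, :worker_name, :status, :error)

# Mark every in-progress execution claimed by `worker_name` as failed,
# recording why the worker went away. Returns the executions it failed.
def fail_claimed_by(executions, worker_name, reason)
  executions
    .select { |e| e.worker_name == worker_name && e.status == :in_progress }
    .each do |e|
      e.status = :failed
      e.error = ProcessExitError.new("Worker #{worker_name} #{reason}")
    end
end
```

The same helper could be called from each of the four situations above, with only the `reason` string changing.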

To do this easily, we need a new unique identifier that links the supervisor with its configured processes: for efficiency, the supervisor doesn't register all workers itself, and since registration happens after forking, it doesn't know the registered process IDs of its supervised processes. This unique identifier is a name that gets randomly generated when the process is instantiated.

This made me realise I was reusing the configured process objects to start new processes, which is quite prone to issues with already-created thread pools and the like 😬 Because of this, this PR also changes the approach: the Configuration object now returns configured processes that need to be instantiated before starting, so a new object is created each time.
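
A minimal sketch of that idea in plain Ruby, assuming a configured process is just a class plus its options. The `ConfiguredProcess` struct, the simplified `Worker` class and the name format are assumptions made for this example and may not match the actual implementation.

```ruby
require "securerandom"

# Hypothetical, heavily simplified stand-in for a worker.
class Worker
  attr_reader :name, :options

  def initialize(**options)
    @options = options
    # Random name generated at instantiation time, so each instance gets its own.
    @name = "worker-#{SecureRandom.hex(10)}"
  end
end

# A configured process is only a recipe: nothing is built until #instantiate.
ConfiguredProcess = Struct.new(:process_class, :attributes) do
  def instantiate
    process_class.new(**attributes)
  end
end

configured = ConfiguredProcess.new(Worker, { queues: "*", threads: 3 })
first = configured.instantiate
second = configured.instantiate # a fresh object with a fresh name
```

Because every start goes through `instantiate`, no state such as a stopped thread pool can leak from a previous run into a restarted process.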

We were reusing the Worker and Dispatcher instances from the initial configuration every time, which could cause problems with stopped pools. Now that a name needs to be generated and be unique per process instance, we really need to instantiate new processes every time they're started.

rosa force-pushed the fail-jobs-when-worker-is-killed branch 2 times, most recently from 9848dae to 80dbef5 on August 12, 2024 18:09

rosa mentioned this pull request Aug 14, 2024

Handle empty backtrace from Solid Queue failed execution rails/mission_control-jobs#149

Merged

rosa added 4 commits August 21, 2024 15:36

Add new column name to processes

84cb6e4

So we can uniquely identify processes by supervisor and name, without having to rely on the PID, which can be duplicated across processes.
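
A small sketch of why the pair works as a key where the PID does not. It assumes the PID clash comes from processes running on different hosts, which is an assumption of this example; the `ProcessRecord` struct and the sample values are hypothetical.

```ruby
# Hypothetical record shape, only for illustration.
ProcessRecord = Struct.new(:supervisor_id, :name, :pid, :hostname)

# Group records by PID alone: entries from different hosts can collide.
def group_by_pid(records)
  records.group_by(&:pid)
end

# Group records by supervisor and name: each key identifies one process.
def group_by_supervisor_and_name(records)
  records.group_by { |r| [r.supervisor_id, r.name] }
end

records = [
  ProcessRecord.new(1, "worker-a1b2", 42, "host-1"),
  ProcessRecord.new(2, "worker-c3d4", 42, "host-2") # same PID, different host
]

ambiguous = group_by_pid(records)[42]
unique = group_by_supervisor_and_name(records)[[1, "worker-a1b2"]]
```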

Remove process's name from metadata and add it to instrumentation events

cb5669f

Fail in-progress jobs when the worker running them exits abnormally

9fb89f1

This applies to:
- Killed workers that the supervisor detects as dead.
- Reaped workers without a clear exit status.
- Orphaned executions that somehow lost their worker.
- Workers whose heartbeat expired.

rosa force-pushed the fail-jobs-when-worker-is-killed branch 2 times, most recently from 69f30b4 to 3945042 on August 21, 2024 13:39

Split processes' name migration into two

76d2c0f

Otherwise it won't be possible to start new processes in the window after the column is made NOT NULL and before the code that uses that column is deployed.
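
The deploy window can be illustrated with a toy model in plain Ruby. This is not a real migration: `ToyProcessesTable` and the two sample rows are hypothetical, and the table only mimics how a NOT NULL constraint rejects rows written by code that doesn't yet set the column.

```ruby
# Toy table enforcing NOT NULL; a hypothetical stand-in for the processes table.
class ToyProcessesTable
  class NotNullViolation < StandardError; end

  attr_reader :rows

  def initialize
    @rows = []
    @not_null_columns = []
  end

  # Plays the role of the second migration: tighten the constraint.
  def make_not_null(column)
    @not_null_columns << column
  end

  def insert(row)
    missing = @not_null_columns.select { |column| row[column].nil? }
    raise NotNullViolation, "null value in: #{missing.join(', ')}" if missing.any?

    @rows << row
  end
end

# Old code registers processes without a name; new code includes one.
OLD_CODE_ROW = { kind: "Worker", pid: 101 }.freeze
NEW_CODE_ROW = { kind: "Worker", pid: 102, name: "worker-abc123" }.freeze
```

While the column is still nullable, both rows can be inserted. Once `make_not_null(:name)` has run, inserting `OLD_CODE_ROW` raises, which is why the constraint is meant to be applied only after the new code is live.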

rosa force-pushed the fail-jobs-when-worker-is-killed branch from 3945042 to 76d2c0f on August 21, 2024 13:45

rosa merged commit 89d30c7 into main Aug 21, 2024
8 checks passed

rosa deleted the fail-jobs-when-worker-is-killed branch August 21, 2024 14:21

rosa mentioned this pull request Aug 24, 2024

How would you kill a specific job? #124

Closed

Darhazer mentioned this pull request Aug 27, 2024

NoMethod error in fork_supervisor upon terminating processes #305

Closed

rosa mentioned this pull request Aug 27, 2024

Fix issue when pruning a supervisor and its supervisees via callbacks #306

Merged
