Fail in-progress jobs when the worker running them exits abnormally #277

Merged 5 commits on Aug 21, 2024

@rosa rosa commented Aug 12, 2024

This applies to:

  • Killed workers that the supervisor detects as dead.
  • Reaped workers without a clear exit status.
  • Orphaned executions that somehow lost their worker.
  • Workers whose heartbeat expired.

To do this easily, since the supervisor doesn't register all workers itself (for efficiency), we need to rely on a new unique identifier that links the supervisor with its configured processes. Registration happens after forking, so the supervisor doesn't know the registered process IDs of the processes it supervises. This unique identifier is a name that gets randomly generated when the process is instantiated.

This made me realise I was reusing the configured process objects to start new processes, which is quite prone to issues with already-created thread pools and the like 😬 Because of this, this PR also changes the approach: the Configuration object now returns configured processes that need to be instantiated before starting, so a new object is created each time.
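A minimal sketch of that approach, assuming nothing about the real classes: `SketchProcess`, `ConfiguredProcess`, `SketchConfiguration`, the option names and the name format are all hypothetical and only illustrate "configuration returns blueprints, each start builds a fresh, uniquely named instance".

```ruby
require "securerandom"

# Hypothetical stand-in for a supervised process (worker or dispatcher).
class SketchProcess
  attr_reader :kind, :name, :options

  def initialize(kind, **options)
    @kind = kind
    @options = options
    # Random name generated at instantiation time; the format is an assumption.
    @name = "#{kind}-#{SecureRandom.hex(10)}"
    @started = false
  end

  # Starting the same instance twice is an error in this sketch,
  # mirroring the risk of reusing an object with stopped thread pools.
  def start
    raise "#{name} already started" if @started
    @started = true
    self
  end
end

# Hypothetical blueprint: it records what to build, not a live instance.
class ConfiguredProcess
  attr_reader :kind, :attributes

  def initialize(kind, attributes)
    @kind = kind
    @attributes = attributes
  end

  # Every call returns a brand-new process with its own name.
  def instantiate
    SketchProcess.new(kind, **attributes)
  end
end

# Hypothetical configuration object returning blueprints only.
class SketchConfiguration
  def configured_processes
    [
      ConfiguredProcess.new(:worker, { threads: 3 }),
      ConfiguredProcess.new(:dispatcher, { batch_size: 500 })
    ]
  end
end

config = SketchConfiguration.new
blueprint = config.configured_processes.first

first = blueprint.instantiate.start
replacement = blueprint.instantiate.start # e.g. after the first one died

puts first.name
puts replacement.name
```

The two printed names differ, so a replacement process can never be mistaken for the instance it replaces, even if the operating system happens to hand out the same PID.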

So we can uniquely identify processes by supervisor and name, without
having to rely on the PID, which can be duplicated across processes.
We were reusing the same Worker and Dispatcher instances from the initial
configuration all the time, which could cause problems with stopped
pools. Now that a name needs to be generated and be unique per process
instance, we really need to instantiate new processes every time they're
started.
This applies to:
- Killed workers that the supervisor detects as dead.
- Reaped workers without a clear exit status.
- Orphaned executions that somehow lost their worker.
- Workers whose heartbeat expired.
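All four cases end the same way: the jobs claimed by the lost worker are marked as failed. The following in-memory sketch shows that lookup keyed by supervisor and name rather than PID. `SketchRegistry`, `RegisteredProcess`, `ClaimedExecution` and every method name here are hypothetical and are not the project's actual models.

```ruby
# Hypothetical in-memory stand-ins for registered processes and claimed jobs.
RegisteredProcess = Struct.new(:supervisor_id, :name, :pid)
ClaimedExecution = Struct.new(:job_id, :process, :error)

class SketchRegistry
  attr_reader :claimed

  def initialize
    @processes = {}
    @claimed = []
  end

  # Processes are keyed by (supervisor, name); the PID is stored but never
  # used as an identifier, since it can repeat.
  def register(supervisor_id:, name:, pid:)
    @processes[[supervisor_id, name]] =
      RegisteredProcess.new(supervisor_id, name, pid)
  end

  def claim(job_id, process)
    @claimed << ClaimedExecution.new(job_id, process, nil)
  end

  # Called when a supervised process is found dead or reaped abnormally:
  # deregister it and mark everything it had claimed as failed.
  def fail_claimed_by(supervisor_id:, name:, reason:)
    process = @processes.delete([supervisor_id, name])
    return [] unless process

    failed, @claimed = @claimed.partition { |e| e.process.equal?(process) }
    failed.each { |e| e.error = reason }
  end
end

registry = SketchRegistry.new
# Two workers under different supervisors that happen to share PID 42.
a = registry.register(supervisor_id: 1, name: "worker-aaa", pid: 42)
b = registry.register(supervisor_id: 2, name: "worker-bbb", pid: 42)
registry.claim(10, a)
registry.claim(11, b)

failed = registry.fail_claimed_by(
  supervisor_id: 1, name: "worker-aaa", reason: "worker exited abnormally"
)
puts failed.map(&:job_id).inspect # prints [10]
```

Only job 10 fails; job 11 stays claimed by the other worker even though both workers report the same PID.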
@rosa rosa force-pushed the fail-jobs-when-worker-is-killed branch 2 times, most recently from 69f30b4 to 3945042 August 21, 2024 13:39
As it won't be possible to start new processes after the column
is made NOT NULL and before deploying the code that uses that column.
@rosa rosa force-pushed the fail-jobs-when-worker-is-killed branch from 3945042 to 76d2c0f August 21, 2024 13:45
@rosa rosa merged commit 89d30c7 into main Aug 21, 2024
8 checks passed
@rosa rosa deleted the fail-jobs-when-worker-is-killed branch August 21, 2024 14:21