An actual case where this issue causes real problems:
In AWX, a job template was launched on an execution node; in Receptor terms, an ansible-runner worker was invoked as command work on the executor node.
While the job was running, the execution node was restarted for some reason, such as a virtualization host failure or a power outage.
In this case, the launched job in AWX stays in the running state until the job timeout expires, even though Ansible Runner is already gone and the work is orphaned. AWX has no way of knowing that the job will never complete.
Description
If a remote worker node fails and is restarted, any running work units remain in the "Running" state forever and are never marked as "Completed" or "Failed".
If the worker process that should have been running no longer exists after the node is restarted, shouldn't the work be marked as "Failed"?
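For context on why the state can stick like this, here is a minimal sketch of the general pattern involved (hypothetical code, not Receptor's actual implementation): the node records "Running" in the unit's status file when the worker starts, and the only thing that ever writes a terminal state is a goroutine waiting on the child process. If the whole node goes down first, that goroutine dies with it and nothing ever corrects the status file.

```go
package main

import (
	"encoding/json"
	"os"
	"os/exec"
	"time"
)

// Status is a hypothetical stand-in for the richer status that Receptor
// keeps in each work unit directory.
type Status struct {
	State string `json:"state"` // "Running", "Succeeded", or "Failed"
	Pid   int    `json:"pid"`
}

func writeStatus(unitDir string, s Status) {
	b, _ := json.Marshal(s)
	_ = os.WriteFile(unitDir+"/status", b, 0o644)
}

// runWork starts the worker process and leaves a goroutine behind to record
// the terminal state. The key point: only that goroutine ever writes
// "Succeeded" or "Failed". If the whole node dies first, the status file is
// left saying "Running" and nothing ever corrects it.
func runWork(unitDir, name string, args ...string) error {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		writeStatus(unitDir, Status{State: "Failed"})
		return err
	}
	writeStatus(unitDir, Status{State: "Running", Pid: cmd.Process.Pid})
	go func() {
		if err := cmd.Wait(); err != nil {
			writeStatus(unitDir, Status{State: "Failed", Pid: cmd.Process.Pid})
		} else {
			writeStatus(unitDir, Status{State: "Succeeded", Pid: cmd.Process.Pid})
		}
	}()
	return nil
}

func main() {
	unitDir, _ := os.MkdirTemp("", "unit")
	_ = runWork(unitDir, "sleep", "2")
	time.Sleep(3 * time.Second) // give the monitor goroutine time to finish
}
```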
Version
Using the upstream devel image from quay.io/ansible/receptor:devel.

$ docker compose exec foo receptorctl version
receptorctl 1.3.0+g8f8481c
receptor 1.3.0+g8f8481c
Steps to reproduce the issue
1. Prepare the files: foo.yml, bar.yml, and docker-compose.yml
2. Prepare the environment
3. Submit the work
4. Restart the executor node to simulate a node failure
5. Confirm that the work is still in the Running state and is never marked as Completed or Failed
Additional information
With the current implementation, the sender of the work will wait forever for the completion of work that will never finish.
It seems that on a node failure, the goroutine that monitors the worker process is terminated along with everything else, so the work unit is orphaned forever.
I think it would be more natural behavior for a work unit to be marked as failed if, after Receptor restarts, there is no worker process corresponding to a unit directory that still exists.
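As a rough sketch of that proposed behavior (hypothetical code, not Receptor's actual API; it assumes each unit directory contains a status file recording the worker's PID, as in the sketch in the description above): on startup, scan the unit directories, and for any unit still recorded as "Running", probe the saved PID with signal 0; if the process is gone, flip the unit to "Failed" so the sender is unblocked.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// Status mirrors the hypothetical per-unit status file from the sketch in
// the description above.
type Status struct {
	State string `json:"state"`
	Pid   int    `json:"pid"`
}

// pidAlive reports whether a process with the given PID exists (Unix only):
// signal 0 delivers nothing but fails if the PID is gone.
func pidAlive(pid int) bool {
	if pid <= 0 {
		return false
	}
	err := syscall.Kill(pid, 0)
	return err == nil || err == syscall.EPERM
}

// reconcileUnits scans unit directories on startup and marks any unit that
// is still "Running" but whose worker process no longer exists as "Failed".
func reconcileUnits(dataDir string) error {
	entries, err := os.ReadDir(dataDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		statusPath := filepath.Join(dataDir, e.Name(), "status")
		b, err := os.ReadFile(statusPath)
		if err != nil {
			continue // no readable status file; nothing to reconcile
		}
		var s Status
		if err := json.Unmarshal(b, &s); err != nil {
			continue
		}
		if s.State == "Running" && !pidAlive(s.Pid) {
			s.State = "Failed"
			out, _ := json.Marshal(s)
			if err := os.WriteFile(statusPath, out, 0o644); err != nil {
				return err
			}
			fmt.Printf("unit %s: worker pid %d is gone, marked Failed\n", e.Name(), s.Pid)
		}
	}
	return nil
}

func main() {
	// Example path only; the real location would be under Receptor's data dir.
	if err := reconcileUnits("/tmp/receptor/foo/work"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

A real implementation would also want to guard against PID reuse, for example by checking the process start time or command line before concluding that the recorded PID still belongs to the original worker.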