You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Periodically, during workflow executions the workflow will end with a "Operation cannot be fulfilled on "x": the object has been modified" (full error message below). This occurs when FlytePropeller attempts to update a FlyteWorkflow CRD using a resource version that is not the latest, implying that another entity already updated the resource.
{"json":{"exec_id":"a5rmmwzffhclr2twfbwk","ns":"flytesnacks-development","routine":"worker-7","src":"passthrough.go:95"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com \"a5rmmwzffhclr2twfbwk\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2022-05-25T13:07:47-05:00"}
This is a pretty complex issue, but I believe there are two approaches to fix it. However, it is unclear as to which adheres to the intended functionality. We present the approaches in terms of describing the issue first:
Unecessarily processing the workflow after it has been marked success:
The FlytePropeller workqueue works by tracking strings with 3 data structures in the k8s workqueue, a queue, a dirty set, and a processing set. Queued objects flow from the dirty -> queue -> processing. In the case that an object in the processing stage and it is enqueued, it is set as "dirty" and once the processing completes it is immediately transitioned back to the queue to be reprocessed.
In FlytePropeller, we call Forget on the workqueue when there is an error or when a round of processing completes. The comments on these calls imply that this will remove the workflow id from the queue and wait for another update to re-enqueue. However, this is not the case; rather, this call removes the workflow from metadata in the rate limiter. This means that if the enqueueFlyteWorkflow function is called during the round which transitions the workflow from Succeeding the Succeeded (as it should because it is called during workflow transitions), the workflow is marked as "dirty" in the queue and upon completion is immediately re-enqueued. Therefore, there is another round of unnecessary processing that shouldn't occur.
The difficulty here is that the k8s workqueue implementations are not exported in their respective packages. With the lack of an API to remove an item from the queue it is impossible to stop it from being automatically re-enqueued if it has been marked "dirty".
Retrieving stale workflow:
The workflow resourceVersionCaching stores caches for workflow resource versions. Presumably, this is so that FlytePropeller can not operate on a stale version of the workflow. The problem is that during the aforementioned event (transition from Succeeding to Succeeded and another workflow evaluation) the resourceVersionCaching structure is updated with the current resource version (in a test example 55900 and the new resource version is 55903). Then when the next processing round begins we read a FlyteWorkflow CRD with resource version 55900 (rather than 55903) and attempt to transition the workflow from Succeeding to Succeeded again. This seems to be an issue because other evaluations of the same code read the latest resource version available (ie. 55903). This would result in the Succeeded version of the workflow to be processed which would quickly identify a terminal workflow and break out. It could be related to evicting the workflow from the resource version cache when it reaches a terminal state.
Expected behavior
FlytePropeller should not attempt to update a FlyteWorkflow CRD once it has been determined successful.
Additional context to reproduce
Execute the hello world example until this error occurs in the FlytePropeller logs.
Screenshots
No response
Are you sure this issue hasn't been raised already?
Yes
Have you read the Code of Conduct?
Yes
The text was updated successfully, but these errors were encountered:
Describe the bug
Periodically, during workflow executions the workflow will end with a "Operation cannot be fulfilled on "x": the object has been modified" (full error message below). This occurs when FlytePropeller attempts to update a FlyteWorkflow CRD using a resource version that is not the latest, implying that another entity already updated the resource.
This is a pretty complex issue, but I believe there are two approaches to fix it. However, it is unclear as to which adheres to the intended functionality. We present the approaches in terms of describing the issue first:
Unecessarily processing the workflow after it has been marked success:
The FlytePropeller workqueue works by tracking strings with 3 data structures in the k8s workqueue, a queue, a dirty set, and a processing set. Queued objects flow from the dirty -> queue -> processing. In the case that an object in the processing stage and it is enqueued, it is set as "dirty" and once the processing completes it is immediately transitioned back to the queue to be reprocessed.
In FlytePropeller, we call Forget on the workqueue when there is an error or when a round of processing completes. The comments on these calls imply that this will remove the workflow id from the queue and wait for another update to re-enqueue. However, this is not the case; rather, this call removes the workflow from metadata in the rate limiter. This means that if the enqueueFlyteWorkflow function is called during the round which transitions the workflow from Succeeding the Succeeded (as it should because it is called during workflow transitions), the workflow is marked as "dirty" in the queue and upon completion is immediately re-enqueued. Therefore, there is another round of unnecessary processing that shouldn't occur.
The difficulty here is that the k8s workqueue implementations are not exported in their respective packages. With the lack of an API to remove an item from the queue it is impossible to stop it from being automatically re-enqueued if it has been marked "dirty".
Retrieving stale workflow:
The workflow resourceVersionCaching stores caches for workflow resource versions. Presumably, this is so that FlytePropeller can not operate on a stale version of the workflow. The problem is that during the aforementioned event (transition from Succeeding to Succeeded and another workflow evaluation) the resourceVersionCaching structure is updated with the current resource version (in a test example 55900 and the new resource version is 55903). Then when the next processing round begins we read a FlyteWorkflow CRD with resource version 55900 (rather than 55903) and attempt to transition the workflow from Succeeding to Succeeded again. This seems to be an issue because other evaluations of the same code read the latest resource version available (ie. 55903). This would result in the Succeeded version of the workflow to be processed which would quickly identify a terminal workflow and break out. It could be related to evicting the workflow from the resource version cache when it reaches a terminal state.
Expected behavior
FlytePropeller should not attempt to update a FlyteWorkflow CRD once it has been determined successful.
Additional context to reproduce
Execute the hello world example until this error occurs in the FlytePropeller logs.
Screenshots
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?
The text was updated successfully, but these errors were encountered: