[BUG] Calling update on modified FlyteWorkflow CRD #2546

hamersaw · 2022-05-25T21:37:52Z

Describe the bug

Periodically, during workflow executions the workflow will end with a "Operation cannot be fulfilled on "x": the object has been modified" (full error message below). This occurs when FlytePropeller attempts to update a FlyteWorkflow CRD using a resource version that is not the latest, implying that another entity already updated the resource.

{"json":{"exec_id":"a5rmmwzffhclr2twfbwk","ns":"flytesnacks-development","routine":"worker-7","src":"passthrough.go:95"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com \"a5rmmwzffhclr2twfbwk\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2022-05-25T13:07:47-05:00"}

This is a pretty complex issue, but I believe there are two approaches to fix it. However, it is unclear as to which adheres to the intended functionality. We present the approaches in terms of describing the issue first:

Unecessarily processing the workflow after it has been marked success:
The FlytePropeller workqueue works by tracking strings with 3 data structures in the k8s workqueue, a queue, a dirty set, and a processing set. Queued objects flow from the dirty -> queue -> processing. In the case that an object in the processing stage and it is enqueued, it is set as "dirty" and once the processing completes it is immediately transitioned back to the queue to be reprocessed.

In FlytePropeller, we call Forget on the workqueue when there is an error or when a round of processing completes. The comments on these calls imply that this will remove the workflow id from the queue and wait for another update to re-enqueue. However, this is not the case; rather, this call removes the workflow from metadata in the rate limiter. This means that if the enqueueFlyteWorkflow function is called during the round which transitions the workflow from Succeeding the Succeeded (as it should because it is called during workflow transitions), the workflow is marked as "dirty" in the queue and upon completion is immediately re-enqueued. Therefore, there is another round of unnecessary processing that shouldn't occur.

The difficulty here is that the k8s workqueue implementations are not exported in their respective packages. With the lack of an API to remove an item from the queue it is impossible to stop it from being automatically re-enqueued if it has been marked "dirty".

Retrieving stale workflow:
The workflow resourceVersionCaching stores caches for workflow resource versions. Presumably, this is so that FlytePropeller can not operate on a stale version of the workflow. The problem is that during the aforementioned event (transition from Succeeding to Succeeded and another workflow evaluation) the resourceVersionCaching structure is updated with the current resource version (in a test example 55900 and the new resource version is 55903). Then when the next processing round begins we read a FlyteWorkflow CRD with resource version 55900 (rather than 55903) and attempt to transition the workflow from Succeeding to Succeeded again. This seems to be an issue because other evaluations of the same code read the latest resource version available (ie. 55903). This would result in the Succeeded version of the workflow to be processed which would quickly identify a terminal workflow and break out. It could be related to evicting the workflow from the resource version cache when it reaches a terminal state.

Expected behavior

FlytePropeller should not attempt to update a FlyteWorkflow CRD once it has been determined successful.

Additional context to reproduce

Execute the hello world example until this error occurs in the FlytePropeller logs.

Screenshots

No response

Are you sure this issue hasn't been raised already?

Yes

Have you read the Code of Conduct?

Yes

The text was updated successfully, but these errors were encountered:

hamersaw · 2022-05-26T12:40:08Z

cc @EngHabu thoughts?

hamersaw added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels May 25, 2022

hamersaw self-assigned this May 25, 2022

hamersaw removed the untriaged This issues has not yet been looked at by the Maintainers label May 25, 2022

hamersaw added this to the 1.1.0 - Hawk milestone May 25, 2022

hamersaw mentioned this issue Jun 8, 2022

Tracking terminated workflows in LRU cache flyteorg/flytepropeller#450

Merged

8 tasks

wild-endeavor modified the milestones: 1.1.0 - Hawk, 1.2.0 Jun 28, 2022

hamersaw closed this as completed in flyteorg/flytepropeller#450 Sep 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Calling update on modified FlyteWorkflow CRD #2546

[BUG] Calling update on modified FlyteWorkflow CRD #2546

hamersaw commented May 25, 2022 •

edited

Loading

hamersaw commented May 26, 2022

[BUG] Calling update on modified FlyteWorkflow CRD #2546

[BUG] Calling update on modified FlyteWorkflow CRD #2546

Comments

hamersaw commented May 25, 2022 • edited Loading

Describe the bug

Expected behavior

Additional context to reproduce

Screenshots

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

hamersaw commented May 26, 2022

hamersaw commented May 25, 2022 •

edited

Loading