[BUG] Calling update on modified FlyteWorkflow CRD #2546

Closed
hamersaw opened this issue May 25, 2022 · 1 comment · Fixed by flyteorg/flytepropeller#450

hamersaw commented May 25, 2022

Describe the bug

Periodically during workflow executions, the workflow will end with an "Operation cannot be fulfilled on 'x': the object has been modified" error (full error message below). This occurs when FlytePropeller attempts to update a FlyteWorkflow CRD using a resource version that is not the latest, implying that another entity has already updated the resource.

{"json":{"exec_id":"a5rmmwzffhclr2twfbwk","ns":"flytesnacks-development","routine":"worker-7","src":"passthrough.go:95"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com \"a5rmmwzffhclr2twfbwk\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2022-05-25T13:07:47-05:00"}

This is a fairly complex issue, and I believe there are two approaches to fixing it, though it is unclear which one adheres to the intended functionality. Each approach is presented below by first describing the underlying problem it addresses:

Unnecessarily processing the workflow after it has been marked successful:
The FlytePropeller workqueue tracks strings using three data structures in the k8s workqueue: a queue, a dirty set, and a processing set. Queued objects flow from dirty -> queue -> processing. If an object is enqueued while it is still in the processing set, it is marked "dirty", and once processing completes it is immediately transitioned back to the queue to be reprocessed.
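A minimal sketch of this behavior against client-go's exported workqueue API (illustrative only, not FlytePropeller's handler code): re-adding an item while it is being processed marks it dirty, and Done() immediately puts it back on the queue.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.New()

	q.Add("wf-1")      // enters the dirty set, then the queue
	item, _ := q.Get() // moves to the processing set
	q.Add("wf-1")      // re-added while processing: marked "dirty" again
	q.Done(item)       // processing finishes: the dirty item is immediately re-queued

	fmt.Println("queue length after Done:", q.Len()) // prints 1, not 0
}
```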

In FlytePropeller, we call Forget on the workqueue when there is an error or when a round of processing completes. The comments on these calls imply that this removes the workflow id from the queue and waits for another update to re-enqueue it. However, that is not the case; the call only clears the workflow's metadata from the rate limiter. This means that if the enqueueFlyteWorkflow function is called during the round that transitions the workflow from Succeeding to Succeeded (as it should be, since it is called during workflow transitions), the workflow is marked "dirty" in the queue and, upon completion, is immediately re-enqueued. The result is another round of processing that shouldn't occur.
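The same sequence sketched against the rate-limiting queue (a simplified assumption, not the actual controller loop): Forget only clears the rate limiter's bookkeeping and does not touch the queue or the dirty set.

```go
package main

import "k8s.io/client-go/util/workqueue"

func main() {
	rlq := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	rlq.Add("wf-1")
	item, _ := rlq.Get() // round that transitions Succeeding -> Succeeded
	rlq.Add("wf-1")      // enqueueFlyteWorkflow fires mid-round: marked "dirty"
	rlq.Forget(item)     // only clears rate-limiter metadata (retry count / backoff)
	rlq.Done(item)       // the dirty item is re-queued, triggering an extra round
}
```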

The difficulty here is that the k8s workqueue implementations are not exported from their respective packages. Without an API to remove an item from the queue, it is impossible to stop it from being automatically re-enqueued once it has been marked "dirty".
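For reference, this is roughly the exported workqueue Interface at the time (from k8s.io/client-go/util/workqueue); note that nothing here removes a queued item:

```go
type Interface interface {
	Add(item interface{})
	Len() int
	Get() (item interface{}, shutdown bool)
	Done(item interface{})
	ShutDown()
	ShuttingDown() bool
}
```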

Retrieving a stale workflow:
The resourceVersionCaching workflow store caches workflow resource versions, presumably so that FlytePropeller does not operate on a stale version of the workflow. The problem is that during the aforementioned sequence (the transition from Succeeding to Succeeded followed by another workflow evaluation), the resourceVersionCaching structure is updated with the current resource version (55900 in a test example, while the new resource version is 55903). When the next processing round begins, we read a FlyteWorkflow CRD with resource version 55900 (rather than 55903) and attempt to transition the workflow from Succeeding to Succeeded again. This appears to be an issue because other evaluations of the same code read the latest resource version available (i.e. 55903), which would cause the Succeeded version of the workflow to be processed; that evaluation would quickly identify a terminal workflow and break out. A fix could involve evicting the workflow from the resource version cache once it reaches a terminal state.
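A hypothetical sketch of the eviction idea (the type and method names below are illustrative assumptions, not FlytePropeller's actual resourceVersionCaching implementation): drop the cached entry once the workflow reaches a terminal phase so a stale cached version can no longer drive another update attempt.

```go
package cache

import "sync"

// resourceVersionCache is a hypothetical cache keyed by "namespace/name".
type resourceVersionCache struct {
	mu       sync.Mutex
	versions map[string]string
}

// RecordVersion remembers the last resource version written for a workflow.
func (c *resourceVersionCache) RecordVersion(key, version string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.versions[key] = version
}

// EvictOnTerminal removes the entry when the workflow is terminal
// (Succeeded/Failed/Aborted), so a stale cached version (e.g. 55900 while
// etcd is already at 55903) cannot trigger a further update attempt.
func (c *resourceVersionCache) EvictOnTerminal(key string, terminal bool) {
	if !terminal {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.versions, key)
}
```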

Expected behavior

FlytePropeller should not attempt to update a FlyteWorkflow CRD once the workflow has been determined to be successful.

Additional context to reproduce

Execute the hello world example until this error occurs in the FlytePropeller logs.

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
hamersaw added the bug ("Something isn't working") and untriaged ("This issue has not yet been looked at by the Maintainers") labels on May 25, 2022
hamersaw self-assigned this on May 25, 2022
hamersaw removed the untriaged label on May 25, 2022
hamersaw added this to the 1.1.0 - Hawk milestone on May 25, 2022
@hamersaw (Contributor, Author):

cc @EngHabu thoughts?
