Skip to content
This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

correct propagation of launchplan start error #598

Merged
merged 2 commits into from
Aug 4, 2023

Conversation

hamersaw
Copy link
Contributor

TL;DR

Correctly fails a workflow node where the launchplan fails to start on admin.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

Launchplans are executing in FlytePropeller as WorkflowNodes. Basically, a launchplan is executed by FlytePropeller sending an execution request admin, which then starts the launchplan, and FlytePropeller stores the execution ID in the WorkflowNode state. At each iteration FlytePropeller checks the status of the FlyteWorkflow CR represented by the execution ID and updates the WorkflowNode state accordingly.

What is happening in the issue linked below is FlyteAdmin is failing to start the launchplan. FlytePropeller detects this failure and in doing so maintains the proposed execution ID in the WorkflowNode state (here) and transitions the node to a failed state. When FlytePropeller attempts to event this state to FlyteAdmin, it checks whether the execution ID exists(here). Of course since FlyteAdmin failed to start the launchplan the execution ID does not exist. This failure results in the Workflow does not exist error that we see. And ultimately, FlytePropeller proceeds with aborting the WorkflowNode, which is entirely unnecessary.

To fix this, there are two possible solutions:
(1) If a launchplan fails to start by a user error (ex.invalid type interface), we do not set the execution ID on the WorkflowNode state because the execution ID was never started. Of course, this means that we trust FlyteAdmin to report user errors only when the launchplan was not able to execute -- I think this is reasonable. This is implemented in this PR.
(2) Allow FlyteAdmin to fail checking the existence of an execution ID for events that report a failed state.

Tracking Issue

https://github.com/unionai/cloud/issues/4172

Follow-up issue

NA

Signed-off-by: Daniel Rammer <daniel@union.ai>
@hamersaw hamersaw marked this pull request as ready for review July 31, 2023 19:57
@hamersaw hamersaw changed the title Fixed correct propagation of launchplan start error correct propagation of launchplan start error Jul 31, 2023
@codecov
Copy link

codecov bot commented Jul 31, 2023

Codecov Report

Merging #598 (67fec06) into master (8446baf) will increase coverage by 0.49%.
Report is 1 commits behind head on master.
The diff coverage is 100.00%.

❗ Current head 67fec06 differs from pull request most recent head 95ab90a. Consider uploading reports for the commit 95ab90a to get more accurate results

@eapolinario
Copy link
Contributor

So, instead of seeing "workflow not found" the side effect of this change is that users will see the underlying error in flyteconsole?

@hamersaw
Copy link
Contributor Author

hamersaw commented Aug 4, 2023

So, instead of seeing "workflow not found" the side effect of this change is that users will see the underlying error in flyteconsole?

Exactly.

Signed-off-by: Daniel Rammer <daniel@union.ai>
@hamersaw hamersaw merged commit 83d95dc into master Aug 4, 2023
14 checks passed
@hamersaw hamersaw deleted the bug/launch-plan-start-failure branch August 4, 2023 18:30
gvashishtha pushed a commit to gvashishtha/flytepropeller that referenced this pull request Aug 5, 2023
Signed-off-by: Daniel Rammer <daniel@union.ai>
Signed-off-by: Gopal K. Vashishtha <gvashishtha@anduril.com>
eapolinario pushed a commit that referenced this pull request Aug 6, 2023
* go mod

Signed-off-by: Gopal K. Vashishtha <gvashishtha@anduril.com>

* updating go mod

Signed-off-by: Gopal K. Vashishtha <gvashishtha@anduril.com>

* bumping version

Signed-off-by: Gopal K. Vashishtha <gvashishtha@anduril.com>

* some commenting

Signed-off-by: Gopal K. Vashishtha <gvashishtha@anduril.com>

* make singular unions castable to their underlying type (#599)

Signed-off-by: Daniel Rammer <daniel@union.ai>
Signed-off-by: Gopal K. Vashishtha <gvashishtha@anduril.com>

* fixed correct propagation of launchplan start error (#598)

Signed-off-by: Daniel Rammer <daniel@union.ai>
Signed-off-by: Gopal K. Vashishtha <gvashishtha@anduril.com>

* bumping flytestdlib

Signed-off-by: Gopal K. Vashishtha <gvashishtha@anduril.com>

---------

Signed-off-by: Gopal K. Vashishtha <gvashishtha@anduril.com>
Signed-off-by: Daniel Rammer <daniel@union.ai>
Co-authored-by: Dan Rammer <daniel@union.ai>
eapolinario pushed a commit to eapolinario/flytepropeller that referenced this pull request Aug 9, 2023
Signed-off-by: Daniel Rammer <daniel@union.ai>
eapolinario pushed a commit to eapolinario/flytepropeller that referenced this pull request Aug 9, 2023
* go mod

Signed-off-by: Gopal K. Vashishtha <gvashishtha@anduril.com>

* updating go mod

Signed-off-by: Gopal K. Vashishtha <gvashishtha@anduril.com>

* bumping version

Signed-off-by: Gopal K. Vashishtha <gvashishtha@anduril.com>

* some commenting

Signed-off-by: Gopal K. Vashishtha <gvashishtha@anduril.com>

* make singular unions castable to their underlying type (flyteorg#599)

Signed-off-by: Daniel Rammer <daniel@union.ai>
Signed-off-by: Gopal K. Vashishtha <gvashishtha@anduril.com>

* fixed correct propagation of launchplan start error (flyteorg#598)

Signed-off-by: Daniel Rammer <daniel@union.ai>
Signed-off-by: Gopal K. Vashishtha <gvashishtha@anduril.com>

* bumping flytestdlib

Signed-off-by: Gopal K. Vashishtha <gvashishtha@anduril.com>

---------

Signed-off-by: Gopal K. Vashishtha <gvashishtha@anduril.com>
Signed-off-by: Daniel Rammer <daniel@union.ai>
Co-authored-by: Dan Rammer <daniel@union.ai>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants