Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Endpoint resource: address failed update scenarios #7

Merged
merged 2 commits into from
Mar 30, 2021

Conversation

surajkota
Copy link
Member

@surajkota surajkota commented Mar 19, 2021

Description of Changes

  • Handle scenarios related to:
    1. Failing update. The update endpoint operation is async and describing endpoint only contains information about the current endpoint config. If an update fails, the endpoint goes back to InService state with a FailureReason about the error. Currently, the controller will be stuck in a loop to keep applying the desired endpoint config and keep on failing with a little window for the customer to notice that the update failed because the failure reason only stays if the endpoint is in InService state and disappears as soon as a new update is applied.
      • This PR fixes the above problem by recording the endpointconfigName used for update under Status.LastEndpointConfigNameForUpdate and only retry an update if the FailureReason was related to an internal server error or if the user has submitted a new endpointConfigName. Additionally, it sets a ACK.Terminal condition to True and stops reconciling if either of these conditions is not met. User is expected to check the failureReason and re-submit the resource with a new endpointConfigName if they wish to update the endpoint.
      • This PR also adds Status.LatestEndpointConfigName so the user can check the current endpoint config name. This is needed in addition to Spec.endpointConfigName(desired) for them to have full visibility over the current state of the endpoint
    2. Endpoint goes to Failed state. Once the endpoint is in Failed state. The only option for the user is to delete and recreate the endpoint.
      • This PR checks if the endpoint is in Failed state as a pre-check-in sdkUpdate and blocks the update. Additionally, it sets ACK.Terminal condition to True so that the controller stops reconciling.
    • see pkg/resource/endpoint/custom_*.go for related implementation details of the above two scenarios
  • Enable retainAllVariantProperties for updates by default since we expect most customers to use this
  • Remove tags field from all the resources because these are only created but we don't have an update path. Tagging experience across ACK is being worked on separately
  • upgrade to latest code-generator and runtime - v0.0.5 and refactor code to use Delta functionality

Code-generator is now adding description to the fields in CRDs and hence the PR looks long but actual changes which need to be reviewed closely are listed above

Testing

aws-controllers-k8s/community#730

Dependent PR: aws-controllers-k8s/code-generator#34

@surajkota surajkota changed the title Endpoint failed update Endpoint resource: address failed update scenarios Mar 19, 2021
@surajkota surajkota changed the title Endpoint resource: address failed update scenarios [Do not merge] Endpoint resource: address failed update scenarios Mar 19, 2021
@surajkota surajkota self-assigned this Mar 20, 2021
@surajkota surajkota added the kind/enhancement Categorizes issue or PR as related to existing feature enhancements. label Mar 20, 2021
@surajkota surajkota force-pushed the endpoint_failed_update branch from d0b64cb to 23f4bf9 Compare March 25, 2021 03:10
@surajkota surajkota changed the title [Do not merge] Endpoint resource: address failed update scenarios Endpoint resource: address failed update scenarios Mar 25, 2021
pkg/resource/endpoint/custom_update_api.go Outdated Show resolved Hide resolved
pkg/resource/endpoint/custom_update_api.go Show resolved Hide resolved
pkg/resource/endpoint/custom_update_api.go Outdated Show resolved Hide resolved
terminalCondition = &ackv1alpha1.Condition{
Type: ackv1alpha1.ConditionTypeTerminal,
}
ko.Status.Conditions = append(ko.Status.Conditions, terminalCondition)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need an else statement to find the existing terminal condition (if it exists but is set to ConditionFalse)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question!

  1. If a terminal condition is already set we do not want to override that, see Create Failed terminal condition Feature Group Creating requeue  #51-54
  2. If an error did not occur, this part of code will never execute and the condition will be cleared by this part of code. Hence the else does not exist

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The terminal condition may be set, but set to False (that case is not covered on lines 51-54)?

pkg/resource/endpoint/custom_update_conditions.go Outdated Show resolved Hide resolved
@surajkota surajkota force-pushed the endpoint_failed_update branch from 23f4bf9 to fdaf6dc Compare March 25, 2021 18:44
 	- set RetainAllVariantProperties True
@surajkota surajkota force-pushed the endpoint_failed_update branch from fdaf6dc to 378b6b0 Compare March 25, 2021 23:39
pkg/resource/endpoint/custom_update_api.go Outdated Show resolved Hide resolved
pkg/resource/endpoint/custom_update_api.go Outdated Show resolved Hide resolved
pkg/resource/endpoint/custom_update_api.go Outdated Show resolved Hide resolved
// "Request to service failed" means update failed because of ISE and can be retried
(latestStatus != nil && failureReason != nil && lastEndpointConfigForUpdate != nil &&
!strings.HasPrefix(*failureReason, FailureReasonInternalServiceErrorPrefix) &&
*desiredEndpointConfig != *latestEndpointConfig &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*desiredEndpointConfig != *latestEndpointConfig is not necessary since you are checking for *desiredEndpointConfig == *lastEndpointConfigForUpdate

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question! both are needed

  1. FailureReason alone does mean an update failed it can appear because of other reasons(patching/scaling failed)
  2. *desiredEndpointConfig == *lastEndpointConfigForUpdate only tells us an update was tried with lastEndpointConfigForUpdate but does not tell us anything if the update was successful or not in the past because it is set if updateEndpoint returns 200 (remember aync operation).
  3. Now, sdkUpdate executes because of change in any field in Spec (like tags/deploymentConfig in future)

1&2 does not guarantee an update Failed. Hence we need to look at *latestEndpointConfigName to determine if the update was unsuccessful
*desiredEndpointConfig != *latestEndpointConfig + *desiredEndpointConfig == *lastEndpointConfigForUpdate+ FailureReason != nil indicate that an update is needed, has already been tried and failed.

That being said. I will add this information in the comments because I will forget it as well

pkg/resource/endpoint/custom_update_api.go Outdated Show resolved Hide resolved
terminalCondition = &ackv1alpha1.Condition{
Type: ackv1alpha1.ConditionTypeTerminal,
}
ko.Status.Conditions = append(ko.Status.Conditions, terminalCondition)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The terminal condition may be set, but set to False (that case is not covered on lines 51-54)?

@surajkota surajkota force-pushed the endpoint_failed_update branch from 378b6b0 to 1e07022 Compare March 26, 2021 18:30
@surajkota
Copy link
Member Author

surajkota commented Mar 26, 2021

The terminal condition may be set, but set to False (that case is not covered on lines 51-54)?

yep
https://github.com/surajkota/ack-sagemaker-controller/blob/endpoint_failed_update/pkg/resource/endpoint/custom_update_conditions.go#L65

@mbaijal
Copy link
Contributor

mbaijal commented Mar 29, 2021

 * This PR fixes the above problem by recording the endpointconfigName used for update under `Status.LastEndpointConfigNameForUpdate` and only retry an update if the `FailureReason` was related to an internal server error or if the user has submitted a new endpointConfigName. Additionally, it sets a `ACK.Terminal `condition to `True` and stops reconciling if either of these conditions is not met. User is expected to check the `failureReason` and re-submit the resource with a **new** endpointConfigName if they wish to update the endpoint.

Will this conditional check require any additional attention if we start using autogenerated names (I don't think so, but bringing it up so we don't miss it.)

@surajkota
Copy link
Member Author

Will this conditional check require any additional attention if we start using autogenerated names (I don't think so, but bringing it up so we don't miss it.)

I don't think so because those would be generated before creation

Copy link
Contributor

@mbaijal mbaijal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks ok since I have been working with this code and it works as expected.

generator.yaml Show resolved Hide resolved
pkg/resource/endpoint/custom_set_output.go Show resolved Hide resolved
pkg/resource/endpoint/custom_set_output.go Show resolved Hide resolved
@surajkota surajkota merged commit 6077275 into aws-controllers-k8s:main Mar 30, 2021
ryansteakley pushed a commit to ryansteakley/sagemaker-controller that referenced this pull request May 17, 2021
…8s#7)

* endpoint: address failed update scenarios
 	- set RetainAllVariantProperties True
        - Remove tags field from all the resources because these are only created but we don't have an update path. Tagging experience across ACK is being worked on separately
        - upgrade to latest code-generator and runtime - v0.0.5 and refactor code to use Delta functionality
ryansteakley pushed a commit to ryansteakley/sagemaker-controller that referenced this pull request May 18, 2021
…8s#7)

* endpoint: address failed update scenarios
 	- set RetainAllVariantProperties True
        - Remove tags field from all the resources because these are only created but we don't have an update path. Tagging experience across ACK is being worked on separately
        - upgrade to latest code-generator and runtime - v0.0.5 and refactor code to use Delta functionality
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement Categorizes issue or PR as related to existing feature enhancements.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants