Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patching status after removing finalizer #931

Closed
surajkota opened this issue Sep 7, 2021 · 5 comments
Closed

Patching status after removing finalizer #931

surajkota opened this issue Sep 7, 2021 · 5 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@surajkota
Copy link
Member

surajkota commented Sep 7, 2021

Describe the bug

2021-09-06T17:58:55.473Z	DEBUG	ackrt	removed resource from management	{"kind": "Model", "namespace": "default", "name": "xgboost-model", "generation": 2, "account": "", "role": "", "region": "us-west-2", "kind": "Model", "namespace": "default", "name": "xgboost-model", "generation": 3}
2021-09-06T17:58:55.473Z	DEBUG	ackrt	<< r.setResourceUnmanaged	{"kind": "Model", "namespace": "default", "name": "xgboost-model", "generation": 2, "account": "", "role": "", "region": "us-west-2", "kind": "Model", "namespace": "default", "name": "xgboost-model", "generation": 3}
2021-09-06T17:58:55.473Z	INFO	ackrt	deleted resource	{"kind": "Model", "namespace": "default", "name": "xgboost-model", "generation": 2, "account": "", "role": "", "region": "us-west-2", "kind": "Model", "namespace": "default", "name": "xgboost-model", "generation": 3}
2021-09-06T17:58:55.473Z	DEBUG	ackrt	< r.deleteResource	{"kind": "Model", "namespace": "default", "name": "xgboost-model", "generation": 2, "account": "", "role": "", "region": "us-west-2", "kind": "Model", "namespace": "default", "name": "xgboost-model", "generation": 3}
2021-09-06T17:58:55.473Z	DEBUG	ackrt	> r.patchResourceStatus	{"kind": "Model", "namespace": "default", "name": "xgboost-model", "generation": 2, "account": "", "role": "", "region": "us-west-2", "kind": "Model", "namespace": "default", "name": "xgboost-model", "generation": 3}
2021-09-06T17:58:55.473Z	DEBUG	ackrt	>> kc.Patch (status)	{"kind": "Model", "namespace": "default", "name": "xgboost-model", "generation": 2, "account": "", "role": "", "region": "us-west-2", "kind": "Model", "namespace": "default", "name": "xgboost-model", "generation": 3}
2021-09-06T17:58:55.474Z	DEBUG	ackrt	<< kc.Patch (status)	{"kind": "Model", "namespace": "default", "name": "xgboost-model", "generation": 2, "account": "", "role": "", "region": "us-west-2", "kind": "Model", "namespace": "default", "name": "xgboost-model", "generation": 3, "error": "models.sagemaker.services.k8s.aws \"xgboost-model\" not found"}
2021-09-06T17:58:55.474Z	DEBUG	ackrt	< r.patchResourceStatus	{"kind": "Model", "namespace": "default", "name": "xgboost-model", "generation": 2, "account": "", "role": "", "region": "us-west-2", "kind": "Model", "namespace": "default", "name": "xgboost-model", "generation": 3}
2021-09-06T17:58:55.474Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "model", "request": "default/xgboost-model"}
2021-09-06T17:58:55.474Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "model", "request": "default/xgboost-model"}

See the Logs above. Controller is trying to patch status after finalizer is removed and gets an error "error": "models.sagemaker.services.k8s.aws \"xgboost-model\" not found"}. This bug was introduced in aws-controllers-k8s/code-generator#175 when rm.Delete() started returning a resource passed through updateConditions instead of nil post successful delete.

https://github.com/aws-controllers-k8s/code-generator/blob/a866a846b883a772c4671061783ccc21a48f9412/templates/pkg/resource/manager.go.tpl#L157-L160

Steps to reproduce
generate controller with the latest code-gen and delete a resource

Expected solution

  1. return nil, nil instead of latest, nil after finalizer is removed - https://github.com/aws-controllers-k8s/runtime/blob/24e0b8ddeaaf7df4bf55ef1a75e4608ea5ec01fb/pkg/runtime/reconciler.go#L508
  2. Change https://github.com/aws-controllers-k8s/code-generator/blob/a866a846b883a772c4671061783ccc21a48f9412/templates/pkg/resource/manager.go.tpl#L160 to return nil, nil. When delete is successful and returns nil resource, pass the same to the caller instead of returning the resource passed to this method.
  3. Check if the resource is managed before patching in handleReconcileError - https://github.com/aws-controllers-k8s/runtime/blob/24e0b8ddeaaf7df4bf55ef1a75e4608ea5ec01fb/pkg/runtime/reconciler.go#L647

Recommended solution - Implement both 1&2. Please let me know your thoughts. I have opened PRs for solutions 1&2.

@jaypipes
Copy link
Collaborator

jaypipes commented Sep 7, 2021

@surajkota would this problem go away if we simply removed the finalizer in setResourceUnmanaged but did not call kc.Patch right afterwards?

@surajkota
Copy link
Member Author

@jaypipes yes. The PRs linked to the issue are helping do same thing. Lmk if you are proposing something different

aws-controllers-k8s/runtime#50

aws-controllers-k8s/code-generator#182

ack-bot pushed a commit to aws-controllers-k8s/code-generator that referenced this issue Sep 7, 2021
Issue #, if available:
aws-controllers-k8s/community#931

Description of changes:
Solution 2 in the issue linked above. If `rm.Delete` does not return an error, and observed is also nil, return `nil, nil` to the caller instead of the resource passed to the method.

Testing:
`make test`
Manually using sagemaker controller

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
ack-bot pushed a commit to aws-controllers-k8s/runtime that referenced this issue Sep 8, 2021
Issue #, if available:
aws-controllers-k8s/community#931

Description of changes:
Solution 1 in the issue linked above - `return nil, nil` instead of `resource, nil` after finalizer is removed 

Testing:
`make test`
Manually using sagemaker controller

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
ack-bot pushed a commit to aws-controllers-k8s/sagemaker-controller that referenced this issue Sep 9, 2021
Issue #, if available:
fixes 
- aws-controllers-k8s/community#937
- aws-controllers-k8s/community#931
- aws-controllers-k8s/community#938
- (hopefully) aws-controllers-k8s/community#935 - ran 100 iterations of the test locally using kind cluster

Description of changes:
update runtime 0.13.2 and regenerate code to fix the issues above

Testing:
`make test`
PR build

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
ack-bot pushed a commit to aws-controllers-k8s/applicationautoscaling-controller that referenced this issue Sep 9, 2021
Issue #, if available:
fixes 
- aws-controllers-k8s/community#937
- aws-controllers-k8s/community#931
- aws-controllers-k8s/community#938
- (hopefully) aws-controllers-k8s/community#935 - ran 100 iterations of sagemaker adopted endpoint test locally using kind cluster

Description of changes:
update runtime 0.13.2 and regenerate code to fix the issues above

Testing:
`make test`
PR build

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
@vijtrip2 vijtrip2 reopened this Jan 11, 2022
@vijtrip2
Copy link
Contributor

Current solution does not cover the cases where resource status needs patching before it is marked as managed. See aws-controllers-k8s/runtime#64

Reopening this issue and I will propose some more solution to this problem.

@vijtrip2 vijtrip2 self-assigned this Jan 11, 2022
@vijtrip2
Copy link
Contributor

vijtrip2 commented Jan 11, 2022

Context

ACK reconciler patches resource status only once at the end of reconciler loop using HandleReconcilerError(ctx, desired, latest, err) method.
This works great because all the error handling, requeue logic is present at one place for CUD operations. This also provides reconciler opportunity to keep updating resource Status throughout reconciler loop, and then status is efficiently updated in k8s cluster using one call at the end of reconciler loop.

However, there is one race condition in above flow for Delete operation. Since reconciler removes the finalizer before calling HandleReconcilerError method, there is a race condition between patching resource status at end of reconciler loop and k8s deleting the resource from etcd.

If the resource is deleted first, PatchStatus operation logs a NotFound error in controller logs, and we will discuss solutions to fix that.

Note: Functionality wise there is no impact caused by this race condition. If the resource is deleted, it is perfectly fine to ignore NotFound error.

Existing Solution

Check for finalizer on the resource before patching the resource status.

Con:
  • This works great in most cases but it does not allow ACK reconciler to patch status until resourceManager.Create() call is made because ACK reconciler adds the finalizer right before making the resourceManager.Create() call.
  • In ACK, Resource References are resolved before rm.Create() call, hence if there is an error in resolving Resource References, the status.Conditions is not patched because the finalizer is still not present.

Proposed Solution

Keep the existing code as is but check the error for kc.Status().Patch() operation. If the error is NotFound ignore the error, otherwise log the error in controller logs.

Pro:
  • Easy to impelement
  • Does not impact reconciler behavior in any major way

Alternate Approaches

1. Add the finalizer in ACK resource as very first step and not wait till rm.Create() call

Con:
  • This will be a major change in functionality and some new bugs may surface. This will be major effort for just getting rid of an error log statement.

2. Remove the condition that checks for finalizer and attempt to patch all resource status

Con:
  • Customers will see a false positive error string in controller logs.

3. Perform Read call before trying to patch the resource status

Con:
  • Extra API Server read call
  • For practical purpose, ignoring NotFound error from Read is identical to ignoring NotFound error from PatchStatus call(without reading first)

4. Refactor code to handle delete request status patching separately

Con:
  • introduces more branching in the code

@jaypipes
Copy link
Collaborator

@vijtrip2 thank you for the excellent and very clear description of the problem and alternative solutions above!

I agree with you that the proposed solution of keeping the existing code but checking for the NotFound in the Status.Patch() error is the simplest approach with the least risk of introducing additional problems in the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

No branches or pull requests

3 participants