
Use PATCH without optimistic locking to remove finalizer; this way will avoid conflicts #1233

Closed
csviri opened this issue May 24, 2022 · 16 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@csviri
Collaborator

csviri commented May 24, 2022

For removal this makes sense in all cases.
(For adding a finalizer this could cause some issues; it's hard to avoid "double adding" without optimistic locking atm.)

2022-05-24 14:19:26,505 i.j.o.p.e.ReconciliationDispatcher [ERROR] [flink.basic-session-job-example] Error during event processing ExecutionScope{ resource id: ResourceID{name='basic-session-job-example', namespace='flink'}, version: 854260} failed.
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://127.0.0.1:55618/apis/flink.apache.org/v1beta1/namespaces/flink/flinksessionjobs/basic-session-job-example. Message: Operation cannot be fulfilled on flinksessionjobs.flink.apache.org "basic-session-job-example": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=flink.apache.org, kind=flinksessionjobs, name=basic-session-job-example, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on flinksessionjobs.flink.apache.org "basic-session-job-example": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)
	at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)
	at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)
	at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)
	at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:43)
	at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher$CustomResourceFacade.replaceResourceWithLock(ReconciliationDispatcher.java:344)
	at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.removeFinalizer(ReconciliationDispatcher.java:317)
	at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleCleanup(ReconciliationDispatcher.java:283)
	at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:78)
	at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:55)
	at io.javaoperatorsdk.operator.processing.event.EventProcessor$ControllerExecution.run(EventProcessor.java:356)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
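To illustrate the idea, here is a minimal, hypothetical sketch (plain Java, not JOSDK or fabric8 API; the finalizer name and helper are invented for illustration) of building a JSON Patch that removes a finalizer. A JSON Patch request does not carry `metadata.resourceVersion`, so the API server applies it without optimistic locking and a 409 Conflict like the one above cannot occur:

```java
import java.util.List;

// Hypothetical sketch (not JOSDK API): build a JSON Patch document that
// removes a finalizer by its index in metadata.finalizers. Sending it as a
// PATCH omits resourceVersion, so no optimistic-locking conflict is possible.
class FinalizerPatch {
    static final String FINALIZER = "example.com/cleanup"; // hypothetical finalizer name

    // Returns the JSON Patch document, or null if the finalizer is absent.
    static String removeFinalizerPatch(List<String> finalizers) {
        int idx = finalizers.indexOf(FINALIZER);
        if (idx < 0) {
            return null;
        }
        return "[{\"op\":\"remove\",\"path\":\"/metadata/finalizers/" + idx + "\"}]";
    }
}
```

A caveat, which matches the issue title covering removal only: a remove-by-index patch can still race with another controller editing the finalizer list, and adding a finalizer safely is even harder without optimistic locking.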
@csviri csviri added the kind/feature Categorizes issue or PR as related to a new feature. label May 24, 2022
@csviri csviri self-assigned this May 24, 2022
@csviri csviri changed the title Use PATCH without optimistic locking to add/remove finalizer; this way will avoid conflicts Use PATCH without optimistic locking to remove finalizer; this way will avoid conflicts May 24, 2022
@popdaniel942

Hello.
I am using the latest version of the operator. Whenever I delete a resource and the cleanup method is called, the first time I get the error described above ("the object has been modified; please apply your changes to the latest version and try again"); then the cleanup is called again and everything works fine.
Is this related to the issue described here?

@csviri
Collaborator Author

csviri commented May 25, 2022

Hi @popdaniel942, yes, this is related to that. I will investigate; in general it should not happen (or only rarely, when the resource is updated while the reconciliation is running). This issue will mitigate, or rather eliminate, that problem in all cases. But I will investigate why it is happening anyway.

@csviri
Collaborator Author

csviri commented May 26, 2022

@popdaniel942 there is an integration test on our side that checks this, so it does not happen at least in normal flows:
https://github.com/java-operator-sdk/java-operator-sdk/blob/49ee0a03f90fdeb72a5f5cdd8d1477ef4329eeb3/operator-framework/src/test/java/io/javaoperatorsdk/operator/ControllerExecutionIT.java#L69-L71

If you could tell us more about the situation in which this happens, that would be very helpful. Isn't your code open source? thx

@popdaniel942

popdaniel942 commented May 26, 2022

@csviri unfortunately the code isn't open source... I tried the simplest way possible to use the operator and I still get the problem. I used version 3.0.0. The CRDs are generated from Java code. The reconciliation works, but the cleanup fails with the above error, then retries and works.
Controller:
@ControllerConfiguration(name = "resource", namespaces = Constants.WATCH_CURRENT_NAMESPACE)
public class ResourceController implements Reconciler<Resource>, ErrorStatusHandler<Resource>, Cleaner<Resource> {

    private static final Logger LOGGER = LoggerFactory.getLogger(ResourceController.class);

    @Override
    public DeleteControl cleanup(Resource resource, Context<Resource> context) {
        LOGGER.info("Delete resources for " + resource.getMetadata().getName());
        return DeleteControl.defaultDelete();
    }

    @Override
    public UpdateControl<Resource> reconcile(Resource resource, Context<Resource> context) {
        LOGGER.info("Create or Update resources for " + resource.getMetadata().getName());
        resource.setStatus(new ResourceStatus());
        return UpdateControl.updateStatus(resource);
    }

    @Override
    public ErrorStatusUpdateControl<Resource> updateErrorStatus(Resource resource, Context<Resource> context,
            Exception e) {
        LOGGER.error(e.getMessage());
        resource.setStatus(new ResourceStatus());
        return ErrorStatusUpdateControl.updateStatus(resource);
    }
}

Resource:
@Version("v1")
@Group("resourceGroup")
public class Resource extends CustomResource<ResourceSpec, ResourceStatus> implements Namespaced {

    @NotNull
    private ResourceSpec spec;

    @Override
    protected ResourceStatus initStatus() {
        return new ResourceStatus();
    }
}

@popdaniel942

Apparently with 3.0.1 this example works.

@csviri
Collaborator Author

csviri commented May 26, 2022

@popdaniel942 thank you! We were able to reproduce this on another project too, I'm on it! :)

@popdaniel942

In our full code it still doesn't work with 3.0.1. But we don't change anything in the resource in the cleanup. All we do is call some REST methods. And when DeleteControl.defaultDelete() is called it fails, and after the retry it works.

@csviri
Collaborator Author

csviri commented May 26, 2022

Yes, this happens because the framework tried to remove the finalizer during a default delete, and the resource it sees in the cache might have a different version than on the server. This is strange because once a resource is marked for deletion it cannot be changed; only finalizers can be removed. So if there are no other finalizers used, it's weird. But since the cleanup is triggered, it means that the resource (in memory) is already marked for deletion. Looking into it..

@csviri
Collaborator Author

csviri commented May 26, 2022

@popdaniel942 could you pls turn on debug level logs, with MDC values for resource name + namespace + version and the thread id present, something like this:

appender.console.layout.pattern = %style{%d}{yellow} %tid %style{%-30c{1.}}{cyan} %highlight{[%-5level] [%X{resource.namespace}.%X{resource.name}.%X{resource.resourceVersion}] %msg%n%throwable}

the styles are not important :D

and send us the log files pls? thx!

Also, isn't there some controller on the cluster that adds an additional finalizer?

@csviri
Collaborator Author

csviri commented May 27, 2022

@popdaniel942 are you using UpdateControl.patchStatus to update the resource?

@popdaniel942

Right now we are using UpdateControl.updateStatus, but we got the error with both. What is the difference between them? Also, I see that when the cleanup happens a foregroundDeletion finalizer is being added and the resourceVersion gets changed. Also, in some cases, if we run the code in debug mode with breakpoints on the cleanup method we get the error, but if we run normally without debug mode, we don't get the error.

@csviri
Collaborator Author

csviri commented May 27, 2022

It seems I found the problem. It is actually not a bug, but this issue (using patch for finalizer removal) will fix it; see details here:
#1245

@csviri
Collaborator Author

csviri commented May 27, 2022

Right now we are using UpdateControl.updateStatus but we got the error with both. What is the difference between them?

What can go wrong with the patch, and why, is described here: #1245 (comment)

In the case of updateStatus (since there is OL) it is easy to cache the resource after the update, and to ensure that the next reconciliation starts with the most recent one.

Also I see that when the cleanup is made there is a foregroungDeletion finalizer being added and that the resourceVersion gets changed.

In this case it probably also happened occasionally before, since the finalizer removal uses optimistic locking (OL) atm (this issue will change that).

Also in some cases, if we run the code in debug mode with breakpoints on the cleanup method we get the error, but if we run normally without debug mode, we don't get the error.

This is probably caused by the fact that while you debug (which takes some time) the other finalizer is removed, or the resource is changed in some other way.
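The caching argument above can be sketched with a toy model (hypothetical types, not JOSDK or fabric8 API): an update that carries the current resourceVersion only succeeds when that version matches the server's, so the object the server returns is known to be the newest one and can safely be cached for the next reconciliation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of why optimistic locking (OL) makes caching safe.
class Resource {
    final String name;
    final long resourceVersion;
    Resource(String name, long resourceVersion) {
        this.name = name;
        this.resourceVersion = resourceVersion;
    }
}

class FakeApiServer {
    private long serverVersion = 1;

    // Mimics PUT with optimistic locking: rejects a stale resourceVersion
    // the same way the API server answers 409 Conflict.
    synchronized Resource update(Resource submitted) {
        if (submitted.resourceVersion != serverVersion) {
            throw new IllegalStateException("409 Conflict: stale resourceVersion");
        }
        serverVersion++; // the server bumps the version on every write
        return new Resource(submitted.name, serverVersion);
    }
}

class ResourceCache {
    private final Map<String, Resource> latest = new ConcurrentHashMap<>();
    void put(Resource r) { latest.put(r.name, r); }
    Resource get(String name) { return latest.get(name); }
}
```

Because the update succeeded under OL, the cached copy is guaranteed to be at least as new as anything the informer has seen, so the next reconciliation can start from it.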

@csviri
Collaborator Author

csviri commented May 27, 2022

In the case of updateStatus (since there is OL) it is easy to cache the resource after the update, and to ensure that the next reconciliation starts with the most recent one.

If you are interested in more depth and want to see why this is possible, pls check this:
fabric8io/kubernetes-client#3943 (comment)

@csviri
Collaborator Author

csviri commented May 27, 2022

Ok so the issue should be solved by this PR:
#1249

Feel free to try it out already with a local build from that branch. thx!

In the end it won't be a patch; instead the update will be retried, see the reasons in the related issue:
#1248

Hopefully the PR will be merged soon, and a new patch version released.
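The retry approach can be sketched roughly as follows (a hypothetical sketch, not the actual JOSDK implementation from the PR): on a 409 Conflict, fetch the latest version of the resource again and re-apply the finalizer removal, up to a bounded number of attempts.

```java
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

// Hypothetical sketch of retry-on-conflict for an optimistic-locking update.
class ConflictException extends RuntimeException {}

class ConflictRetry {
    static <T> T updateWithRetry(Supplier<T> fetchLatest,   // GET the latest resource
                                 UnaryOperator<T> mutate,   // e.g. remove the finalizer
                                 UnaryOperator<T> replace,  // PUT with optimistic locking
                                 int maxRetries) {
        for (int attempt = 0; ; attempt++) {
            try {
                // Re-read and re-mutate on every attempt, so the replace
                // always carries a fresh resourceVersion.
                return replace.apply(mutate.apply(fetchLatest.get()));
            } catch (ConflictException e) { // HTTP 409: stale resourceVersion
                if (attempt >= maxRetries) {
                    throw e;
                }
            }
        }
    }
}
```

The key point is that the conflict is resolved by refreshing the resource, not by dropping optimistic locking, which keeps the update safe against concurrent modifications.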

@csviri
Collaborator Author

csviri commented May 30, 2022

The fix for the mentioned issue is now part of the mentioned PRs and released under v3.0.2; will close this issue.

@csviri csviri closed this as completed May 30, 2022