[FLINK-27551] Update status manually instead of relying on updatecontrol #199

gyfora · 2022-05-10T06:59:15Z

This PR reworks how the status of Flink resources is updated in Kubernetes. This is necessary due to operator-framework/java-operator-sdk#1198.

Based on offline discussion with the JOSDK team this seems to be the safest short term solution.

Caching the latest status is necessary to avoid a race condition between spec updates and the status patches that we make ourselves.

Tests have been updated with some init logic so that the test resources can be patched through the Kubernetes client, and they no longer rely on mutating the status.

gyfora · 2022-05-10T07:11:44Z

cc @morhidi @wangyang0918

...src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java

wangyang0918 · 2022-05-11T04:07:36Z

I want to confirm what the long-term solution is. Do we want to keep the manual status update, or leverage java-operator-sdk v3 to avoid the conflicts?

The manual status update has the downside that the CR status could be inconsistent with the real status. I think that is why we need the statusCache and have to update the CR status before every reconciliation. I just feel that it makes things more complicated. I know it also has the upside that we can update the status at any time and do not need to re-trigger reconciliation. So I am not quite sure which is the better one.

gyfora · 2022-05-11T05:28:00Z

Thanks for raising this @wangyang0918
To be very specific here, the statusCache is needed because the java-operator-sdk framework does not know about the status update for already "scheduled" reconcile operations, so it might give us a resource with a stale status. The statusCache is always consistent with the real status.
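A minimal sketch of the caching idea, not the code from this PR: `CachedResource` and `StatusCacheSketch` are hypothetical stand-ins, and the status is modelled as a plain string. It only illustrates the flow described above, where the operator records every status it patches and then overwrites the possibly stale status of a resource handed to an already scheduled reconcile.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a custom resource; not the operator's real type. */
class CachedResource {
    final String namespace;
    final String name;
    String status;

    CachedResource(String namespace, String name, String status) {
        this.namespace = namespace;
        this.name = name;
        this.status = status;
    }
}

/** Remembers the last status the operator itself wrote for each resource. */
class StatusCacheSketch {
    private final Map<String, String> lastWritten = new ConcurrentHashMap<>();

    private static String key(CachedResource r) {
        return r.namespace + "/" + r.name;
    }

    /** Call after a status patch has succeeded. */
    void recordPatched(CachedResource r) {
        lastWritten.put(key(r), r.status);
    }

    /** Call at the start of reconcile: replace a possibly stale status with the cached one. */
    void refresh(CachedResource r) {
        String cached = lastWritten.get(key(r));
        if (cached != null) {
            r.status = cached;
        }
    }

    /** Call when the resource is deleted so the cache does not grow without bound. */
    void remove(CachedResource r) {
        lastWritten.remove(key(r));
    }
}
```

In this sketch the cache is keyed by namespace and name, so a resource the operator has never patched passes through `refresh` unchanged.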

I think we need both features in the long run:

1. Robust status updates
2. Persist status mid-reconciliation

V3 of the operator sdk will definitely solve 1. and they might actually solve 2. eventually for us. There is a larger design change that we should do after the release to help with 2:

Instead of storing non-user-facing information in the status, we could use an extra ConfigMap per FlinkDeployment. This is the approach supported by the operator framework for storing persistent state for the reconciliation logic. This would already work today, so we don't need to wait for any additional feature. Then we could simply move out things like lastReconciledSpec, reconciliationStatus etc. that are only important for the operator and update them on the fly.
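To make the idea a bit more concrete, here is a rough sketch under stated assumptions. `OperatorStateSketch`, the `-operator-state` naming scheme and the string encoding of both fields are all hypothetical; only the field names lastReconciledSpec and reconciliationStatus come from the comment above. A ConfigMap's data is a string-to-string map, so it is modelled here as a plain `Map<String, String>` without any Kubernetes client.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical holder for operator-only state that would move out of the CR status. */
class OperatorStateSketch {
    final String lastReconciledSpec;
    final String reconciliationStatus;

    OperatorStateSketch(String lastReconciledSpec, String reconciliationStatus) {
        this.lastReconciledSpec = lastReconciledSpec;
        this.reconciliationStatus = reconciliationStatus;
    }

    /** Hypothetical naming scheme: one ConfigMap per FlinkDeployment. */
    static String configMapNameFor(String deploymentName) {
        return deploymentName + "-operator-state";
    }

    /** Flattens the state into the string-to-string form used by ConfigMap data. */
    Map<String, String> toConfigMapData() {
        Map<String, String> data = new LinkedHashMap<>();
        data.put("lastReconciledSpec", lastReconciledSpec);
        data.put("reconciliationStatus", reconciliationStatus);
        return data;
    }

    /** Rebuilds the state from ConfigMap data; missing keys become empty strings. */
    static OperatorStateSketch fromConfigMapData(Map<String, String> data) {
        return new OperatorStateSketch(
                data.getOrDefault("lastReconciledSpec", ""),
                data.getOrDefault("reconciliationStatus", ""));
    }
}
```

Which fields would belong in such a map and which should stay in the user-facing status is exactly the open question, so the two fields here are only the examples named above.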

So to conclude my thoughts:

Long term solution would be a combination of moving to v3 and coming up with something nicer for the on-the-fly status updates which are necessary to keep some things consistent and clean.

wangyang0918 · 2022-05-11T06:08:34Z

@gyfora Thanks for the detailed explanation.

Using the java-operator-sdk v3 for robust status updates and an extra ConfigMap per FlinkDeployment for the on-the-fly status updates makes a lot of sense to me.

Given that robust status updates could be covered by re-triggering a reconciliation, we could upgrade the java-operator-sdk version later. Why don't we introduce the extra ConfigMap now for the on-the-fly status updates?

gyfora · 2022-05-11T06:40:49Z

@wangyang0918 I would not rush the ConfigMap change for the following reasons:

- It's not completely clear to me what should go into the status and what into the ConfigMap.
- I am not even sure whether we want to do this at all; status updates should be fine in theory, and the operator framework might actually support on-the-fly status updates if we want them.

I think this requires a FLIP and some careful consideration/feedback. If we want, we can introduce this in a backward-compatible way when going from v1beta1 -> v1.

wangyang0918 · 2022-05-11T06:56:50Z

I get your point and will finish the review today. Let's keep an eye on the java-operator-sdk to see whether on-the-fly status updates could also be supported :)

wangyang0918

Very nice PR. I have no more comments.

+1 for merging.

gyfora force-pushed the FLINK-27551 branch from 0eece24 to de23b48 Compare May 10, 2022 07:05

gyfora requested a review from wangyang0918 May 10, 2022 07:11

[FLINK-27551] Update status manually instead of relying on updatecontrol

cf3284a

gyfora force-pushed the FLINK-27551 branch from de23b48 to cf3284a Compare May 10, 2022 07:35

morhidi reviewed May 10, 2022


...src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java

gyfora added 2 commits May 10, 2022 15:36

Add retry to status patch

302c4d8

Always set resourceVersion to null before patching status

30209e4
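As a rough illustration of what the two commit titles above describe, and not the code they contain: the sketch below clears resourceVersion and then retries the patch a bounded number of times. `PatchTarget`, `StatusPatcherSketch` and the `patchCall` hook are hypothetical stand-ins for the real resource type and Kubernetes client call. The assumption behind clearing resourceVersion is that a request carrying a resourceVersion is rejected by the API server when the version no longer matches, so leaving it unset avoids that conflict.

```java
import java.util.function.Consumer;

/** Hypothetical stand-in for a custom resource; not the operator's real type. */
class PatchTarget {
    String resourceVersion;
    String status;

    PatchTarget(String resourceVersion, String status) {
        this.resourceVersion = resourceVersion;
        this.status = status;
    }
}

class StatusPatcherSketch {
    /**
     * Clears resourceVersion, then invokes patchCall until it succeeds or attempts run out.
     * patchCall is a hypothetical hook standing in for the real client call.
     * Returns the number of attempts that were needed.
     */
    static int patchWithRetry(PatchTarget target, Consumer<PatchTarget> patchCall, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        // Unset the version so the patch is not tied to a specific resource version.
        target.resourceVersion = null;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                patchCall.accept(target);
                return attempt;
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }
}
```

A real implementation would likely add a delay between attempts and only retry on errors known to be transient; both are left out here to keep the sketch short.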

gyfora mentioned this pull request May 10, 2022

[FLINK-27495] Observe last savepoint status directly from cluster #200

Merged

wangyang0918 approved these changes May 11, 2022


Added some extra javadocs

7359161

gyfora merged commit c8a4310 into apache:main May 11, 2022

Aitozi mentioned this pull request May 15, 2022

[FLINK-27163] Add label for the session job to help list the session … #215

Closed

gyfora deleted the FLINK-27551 branch June 27, 2022 15:50
