Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-27551] Update status manually instead of relying on updatecontrol #199

Merged
merged 4 commits into from
May 11, 2022

Conversation

gyfora
Copy link
Contributor

@gyfora gyfora commented May 10, 2022

This PR reworks how status updates of Flink resources are updated in kubernetes. This is necessary due to operator-framework/java-operator-sdk#1198

Based on offline discussion with the JOSDK team this seems to be the safest short term solution.

The caching of the latest status is necessary to avoid race condition between spec updates and status patches that are made by us.

Tests have been updated with some init logic to enable patching them through the kubernetes client + removed the reliance of mutating status.

@gyfora
Copy link
Contributor Author

gyfora commented May 10, 2022

cc @morhidi @wangyang0918

@gyfora gyfora requested a review from wangyang0918 May 10, 2022 07:11
@wangyang0918
Copy link
Contributor

I want to confirm what is the long term solution. Do we want to keep the manual status update or leverage the java-operator-sdk v3 to avoid the conflicts?

The manual status update has a downside that the CR status could be inconsistent the real status. I think that is why we need the statusCache and have to update the CR status before every reconciliation. I just feel that it makes things more complicated. I know it also has the upside that we could update status in any time and do not need to re-trigger reconciliation. So I am not pretty sure which is the better one.

@gyfora
Copy link
Contributor Author

gyfora commented May 11, 2022

Thanks for raising this @wangyang0918
To be very specific here, the statusCache is needed because the java-operator-sdk-framework does not know about the status update for already "scheduled" reconcile operations so it might give us a stale one. The statusCache is always consistent with the real status.

I think the we need both features in the long run:

  1. Robust status updates
  2. Persist status mid-reconciliation

V3 of the operator sdk will definitely solve 1. and they might actually solve 2. eventually for us. There is a larger design change that we should do after the release to help with 2:

Instead of storing non-user-facing information into the status we could use an extra configmap per FlinkDeployment. This is the supported approach by the operator framework for storing peristent states for the reconciliation logic. This would already work today so we don't need to wait for any additional feature. Then we could simply move out things like lastReconciledSpec, reconciliationStatus etc that are only important for the operator and update them on the fly.

So to conclude my thoughts:

Long term solution would be a combination of moving to v3 and coming up with something nicer for the on-the-fly status updates which are necessary to keep some things consistent and clean.

@wangyang0918
Copy link
Contributor

wangyang0918 commented May 11, 2022

@gyfora Thanks for the detailed explanation.

Using the java-operator-sdk v3 for robust status updates and an extra ConfigMap per FlinkDeployment for the on-the-fly status updates makes a lot of sense to me.

Given that the robust status updates could be covered re-trigger a reconciliation. We could upgrade the java-operator-sdk version later. Why do we not introduce the extra ConfigMap now for the on-the-fly status updates?

@gyfora
Copy link
Contributor Author

gyfora commented May 11, 2022

@wangyang0918 I would not rush the ConfigMap change for the following reasons:

  • It's not completely clear to me what should go into status and what in the configmap
  • I am not even sure if we want to do this or not, status updates should be fine in theory and the operator framework might actually support on the fly status updates if we want

I think this requires a FLIP and some careful consideration/feedback. We can introduce this from v1beta1 -> v1 if we want in a backward compatible way.

@wangyang0918
Copy link
Contributor

I get your point and will finish the review today. Let's keep an eye on the java-operator-sdk to see whether on-the-fly status updates could also be supported :)

Copy link
Contributor

@wangyang0918 wangyang0918 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice PR. I have no more comments.

+1 for merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants