This repository has been archived by the owner on Jul 30, 2021. It is now read-only.

Prevent the bootstrapper from running kubeadm init on two nodes #99

Closed

fabriziopandini opened this issue Aug 8, 2019 · 20 comments · Fixed by #122
Assignees
ykakarap
Labels
kind/feature Categorizes issue or PR as related to a new feature.
lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor.
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Milestone
v0.1

Comments

@fabriziopandini
Contributor

/kind feature

Describe the solution you'd like
As discussed on #92

While the control plane is not ready, the controller should requeue the second bootstrap control plane and try to join it after the control plane becomes ready.

That means that if the second bootstrap control plane happens to have a JoinConfiguration, it will join (InitConfiguration and ClusterConfiguration are ignored at this point); if it has no JoinConfiguration, the controller will error out and eventually update ErrorMessage/ErrorReason.

Anything else you would like to add:
The init-locker is already in the codebase (see #92); what is missing is the plumbing into the controller.
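For illustration only, a rough sketch of the reconcile branch described above; the helper names (controlPlaneIsReady, reconcileJoin) are hypothetical and this is not the actual CABPK code:

```go
// Sketch of the behavior described in this issue, not the real controller code.
func (r *KubeadmConfigReconciler) reconcileSecondControlPlane(ctx context.Context, cluster *clusterv1.Cluster, config *bootstrapv1.KubeadmConfig) (ctrl.Result, error) {
	if !controlPlaneIsReady(cluster) {
		// The first control plane is not up yet: requeue instead of running a second kubeadm init.
		return ctrl.Result{Requeue: true}, nil
	}
	if config.Spec.JoinConfiguration == nil {
		// The control plane is already initialized and there is nothing to join with:
		// error out so that ErrorReason/ErrorMessage eventually get updated on the status.
		return ctrl.Result{}, errors.New("control plane already initialized but no JoinConfiguration provided")
	}
	// JoinConfiguration present: join; InitConfiguration/ClusterConfiguration are ignored here.
	return r.reconcileJoin(ctx, config)
}
```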

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 8, 2019
@ykakarap

ykakarap commented Aug 9, 2019

I would like to work on this issue.

@ykakarap

ykakarap commented Aug 9, 2019

To use the init-locker here I will need to create it with newControlPlaneInitLocker(log logr.Logger, configMapClient corev1.ConfigMapsGetter).
Any suggestions on where I can get an instance of corev1.ConfigMapsGetter inside the Reconcile function of kubeadmconfig_controller.go?

Thank you.

@chuckha
Contributor

chuckha commented Aug 9, 2019

@ykakarap That would come from the static client. I think what you'd want to do in this case, as we aren't using the static clients, is to change that signature to the client found on the reconciler. I'll meet up with you and talk through it.
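For reference, a hedged sketch of what that signature change could look like, assuming the locker only needs to read and create ConfigMaps and can do so through the controller-runtime client.Client the reconciler already holds (the actual code from #92 may differ):

```go
import (
	"github.com/go-logr/logr"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// controlPlaneInitLocker sketched with the reconciler's generic client instead of a
// typed corev1.ConfigMapsGetter; field names here are illustrative.
type controlPlaneInitLocker struct {
	log    logr.Logger
	client client.Client
}

func newControlPlaneInitLocker(log logr.Logger, c client.Client) *controlPlaneInitLocker {
	return &controlPlaneInitLocker{log: log, client: c}
}
```

The reconciler would then pass in the same client it already uses for everything else (e.g. the Client field on KubeadmConfigReconciler) when constructing the locker.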

@ykakarap

ykakarap commented Aug 9, 2019

@chuckha Thanks for the walkthrough. 😄
Working on it now.

@ykakarap

ykakarap commented Aug 9, 2019

@fabriziopandini @chuckha I am acquiring the lock right before we start creating the bootstrap data for the init control plane. Ideally, we should release the lock after the cluster.x-k8s.io/control-plane-ready annotation is set.
Since we won't know when that happens from inside this Reconcile function, I am planning to have the lock release call every time we verify that annotation is set. Basically, perform the lock release operation here.
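For illustration, a minimal sketch of that placement, with hypothetical Acquire/Release methods on the locker (the real method names and signatures may differ):

```go
// Sketch of the lock lifecycle described above; helper names are hypothetical.
if clusterHasControlPlaneReadyAnnotation(cluster) {
	// The first control plane is up: release the init lock (a no-op if this config never held it)
	// and continue down the join path.
	r.initLocker.Release(ctx, cluster)
	return r.reconcileJoin(ctx, config)
}

// No ready control plane yet: only the config that wins the lock may generate kubeadm init data.
if !r.initLocker.Acquire(ctx, cluster, machine) {
	return ctrl.Result{Requeue: true}, nil
}
// ... generate the kubeadm init bootstrap data here ...
```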

Thoughts?

@ykakarap

ykakarap commented Aug 9, 2019

/assign @ykakarap
/lifecycle active

@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Aug 9, 2019
@fabriziopandini
Contributor Author

@ykakarap

I am planning to have the lock release call every time we verify that annotation is set.

I think this is acceptable (there should also be a comment in the code about this a few lines below), but I leave @chuckha the final word on this.

@chuckha
Contributor

chuckha commented Aug 12, 2019

We should not be using an annotation here. We should be using the cluster.Status.InfrastructureReady field.

@chuckha
Contributor

chuckha commented Aug 12, 2019

Actually cluster.Status.InfrastructureReady doesn't seem to be right.

@vincepri @detiber I'd almost expect a field on the CAPI Cluster.Status telling us whether we're ready for joining nodes to a cluster, wdyt? nvm

@fabriziopandini
Contributor Author

AFAIK

  • cluster.Status.InfrastructureReady tells us when we can start joining nodes
  • cluster.x-k8s.io/control-plane-ready annotation tells us when the first control plane is up and running, and so we can start joining nodes

@chuckha
Contributor

chuckha commented Aug 12, 2019

Oh I see, that is defined in CAPI. OK, missed that. The TODO should be removed and we should import v1alpha2.ClusterAnnotationControlPlaneReady. Thanks!

Rest of this issue sounds good to me.
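For reference, a small sketch of checking that annotation through the imported constant rather than a hard-coded string; the import path and helper name are assumptions, not the actual CABPK code:

```go
import "sigs.k8s.io/cluster-api/api/v1alpha2"

// controlPlaneReady reports whether the cluster.x-k8s.io/control-plane-ready
// annotation has been set on the Cluster.
func controlPlaneReady(cluster *v1alpha2.Cluster) bool {
	_, found := cluster.Annotations[v1alpha2.ClusterAnnotationControlPlaneReady]
	return found
}
```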

@detiber
Contributor

detiber commented Aug 12, 2019

I'm definitely not a fan of relying on annotations rather than fields; maybe we need to add a Cluster.Status.ControlPlaneReady field, even if it is just set by the Cluster controller when the annotation is set?
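Purely to illustrate the proposal, the field would look roughly like this on the CAPI ClusterStatus (a hypothetical sketch, not the actual cluster-api code):

```go
// Hypothetical sketch of the proposed field on the Cluster API types.
type ClusterStatus struct {
	// ... existing fields such as InfrastructureReady ...

	// ControlPlaneReady would mirror the cluster.x-k8s.io/control-plane-ready
	// annotation, set by the Cluster controller once the first control plane is up.
	// +optional
	ControlPlaneReady bool `json:"controlPlaneReady,omitempty"`
}
```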

@fabriziopandini
Contributor Author

I'm +1 on adding a Cluster.Status.ControlPlaneReady field.

@chuckha
Contributor

chuckha commented Aug 12, 2019

I opened the linked issue above.

@ykakarap

ykakarap commented Aug 12, 2019

I am wondering how the following situation needs to be handled:

Let's say there are 2 control planes that are both trying to init. One of them will get the lock and will try to create the machine and do the init operation. Let's say that while creating the machine it encounters an error because of some bad value in the YAML. This machine will try to reconcile over and over again, but because of the bad YAML value it will never be created. Because of this, the annotation will never be set.

Since the first control plane was not created, should we allow the second control plane to try to init?
Since the annotation is not set, the second control plane will try to init. But since the init lock is not released, it will simply requeue.

Both control planes (and any subsequent nodes) will be stuck in the reconcile loop with no progress.

Now the question is: should we consider this behavior "working as expected", since the mistake was in a value in the YAML, or should we have a mechanism that allows the second control plane to try to init if the first one fails?

@chuckha @fabriziopandini

@fabriziopandini
Contributor Author

@ykakarap the scenario you are describing is correct, but unless @chuckha wants to block on this case, I think it is fine to postpone it to a follow-up PR.
For this iteration, let's make sure to release the lock if something fails while preparing the cloud-init for the bootstrap node.
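A possible shape for that follow-up, again with hypothetical locker methods and a hypothetical generateInitCloudInit helper:

```go
// Sketch: don't keep the init lock if generating the init cloud-init data fails.
if !r.initLocker.Acquire(ctx, cluster, machine) {
	return ctrl.Result{Requeue: true}, nil
}
cloudInitData, err := generateInitCloudInit(config) // hypothetical helper
if err != nil {
	// Release so another control plane can attempt kubeadm init instead of deadlocking.
	r.initLocker.Release(ctx, cluster)
	return ctrl.Result{}, err
}
// ... store cloudInitData on the KubeadmConfig status ...
```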

@amy
Contributor

amy commented Aug 12, 2019

/milestone 0.1
/priority important-soon

@k8s-ci-robot
Contributor

@amy: You must be a member of the kubernetes-sigs/cluster-api-bootstrap-provider-kubeadm-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Bootstrap Provider Kubeadm Maintainers and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone 0.1
/priority important-soon

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Aug 12, 2019
@chuckha
Contributor

chuckha commented Aug 13, 2019

No interest in blocking @fabriziopandini. Follow on is fine.

/milestone v0.1

@k8s-ci-robot k8s-ci-robot added this to the v0.1 milestone Aug 13, 2019
@ykakarap

No interest in blocking @fabriziopandini. Follow on is fine.

Ok then 👍 . @chuckha @fabriziopandini I will raise a PR for this issue later today or tomorrow morning.
