
[JENKINS-67679] add maintenance mode to pause provisioning nodes #1134

Open
wants to merge 1 commit into base: master

Conversation

fengxx

@fengxx fengxx commented Feb 15, 2022

fix JENKINS-67679 add maintenance mode to pause provisioning nodes

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@Vlatombe Vlatombe changed the title fix JENKINS-67679 add maintenance mode to pause provisioning nodes [JENKINS-67679] add maintenance mode to pause provisioning nodes Feb 21, 2022
@Vlatombe Vlatombe added the enhancement Improvements label Feb 21, 2022
Member

@Vlatombe Vlatombe left a comment

Some minor style items to fix, but the feature and logic look fine.

@jglick
Member

jglick commented Feb 22, 2022

Is this really necessary? Can you not get a similar effect by putting some sort of taint on the cluster’s nodes, or making an admission webhook reject new pods? Or would builds fail with an error after timing out trying to schedule a pod?

If this really needs to be done on the Jenkins side, maybe it would be better for core to have the ability to take a Cloud offline.

@Vlatombe
Member

If this really needs to be done on the Jenkins side, maybe it would be better for core to have the ability to take a Cloud offline.

This looks like a good idea actually.

@dduportal
Contributor

Is this really necessary? Can you not get a similar effect by putting some sort of taint on the cluster’s nodes, or making an admission webhook reject new pods? Or would builds fail with an error after timing out trying to schedule a pod?

The permissions that a Jenkins admin is given are not always the same as the permissions on the Kubernetes cluster (the admin may not even be permitted to access the cluster API). This PR is really legitimate in terms of usage.

If this really needs to be done on the Jenkins side, maybe it would be better for core to have the ability to take a Cloud offline.

That is really an excellent idea: this change would benefit far more users. Where should it be implemented (in a particular plugin, or in Jenkins core)?

@fengxx
Author

fengxx commented Feb 24, 2022

Is this really necessary? Can you not get a similar effect by putting some sort of taint on the cluster’s nodes, or making an admission webhook reject new pods? Or would builds fail with an error after timing out trying to schedule a pod?

If this really needs to be done on the Jenkins side, maybe it would be better for core to have the ability to take a Cloud offline.

Maintenance usually includes cordon and drain: cordon puts a taint on the node, and drain evicts pods to other nodes (drain is necessary for kernel upgrades, docker/containerd upgrades, etc.):

kubectl drain node-name --delete-local-data --ignore-daemonsets

The Jenkins job pod will be deleted from the node and the job will fail. Even if we enable retry at the job level, the plugin may launch a pod on other nodes that will also be drained. You may ask why we don't cordon all nodes: that is overkill, since Kubernetes services are not affected by cordon and drain (Kubernetes has logic to maintain the PodDisruptionBudget during a drain operation). Jenkins jobs are the only victim, and we cannot freeze the whole cluster for Jenkins if the cluster is shared.

I tried to work around the issue by using a system Groovy job to put the k8s plugin on hold:

import jenkins.model.Jenkins
import org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud

def instance = Jenkins.get()
// we only care about the first item since we know only one is defined
def clouds = instance.clouds.getAll(KubernetesCloud.class).findAll { it -> filter-clause }
clouds.each { cloud ->
  cloud.containerCap = -1
}

But if Jenkins itself is running in the k8s cluster and gets restarted, the containerCap setting will be lost due to this check (it is a valid check):

    public void setContainerCap(Integer containerCap) {
        this.containerCap = (containerCap != null && containerCap > 0) ? containerCap : null;
    }
    public String getContainerCapStr() {
        // null, serialized Integer.MAX_VALUE, or 0 means no limit
        return (containerCap == null || containerCap == Integer.MAX_VALUE || containerCap == 0) ? "" : String.valueOf(containerCap);
    }
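
To make explicit why the workaround cannot survive a restart, here is the quoted pair of methods restated as a small standalone class (the class name and main harness are mine, for illustration only). A non-positive cap is coerced to null, which the getter then serializes as an empty string, meaning "no limit":

public class ContainerCapDemo {
    private Integer containerCap;

    public void setContainerCap(Integer containerCap) {
        this.containerCap = (containerCap != null && containerCap > 0) ? containerCap : null;
    }

    public String getContainerCapStr() {
        // null, serialized Integer.MAX_VALUE, or 0 means no limit
        return (containerCap == null || containerCap == Integer.MAX_VALUE || containerCap == 0) ? "" : String.valueOf(containerCap);
    }

    public static void main(String[] args) {
        ContainerCapDemo cloud = new ContainerCapDemo();
        cloud.setContainerCap(-1);                      // the workaround value
        System.out.println(cloud.getContainerCapStr()); // prints "" -> no limit after reload
    }
}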

So I think it is better to add a self-explanatory flag.

@jglick
Member

jglick commented Feb 24, 2022

The Jenkins job pod will be deleted from the node and the job will fail.

FYI: #1083

@jonesbusy
Contributor

Hi,

I'm looking for a similar feature to put a cloud in maintenance mode.

During my search I found the agent-maintenance-plugin, but it only partially covers my use case of permanent Jenkins agents.

Regards,

@Vlatombe
Member

Taking another look at this request: I think the agent-maintenance-plugin could be extended to support clouds. There is an existing CloudProvisioningListener extension point that allows preventing cloud provisioning on a specific cloud and label (hudson.slaves.CloudProvisioningListener#canProvision(hudson.slaves.Cloud, hudson.slaves.Cloud.CloudState, int)). This would cover the use case for any kind of cloud, as well as provide a central place to configure such maintenance windows.
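
For reference, a cloud-level veto through that extension point could look roughly like the sketch below. It is a minimal illustration, assuming a hypothetical maintenance.clouds system property as the on/off switch; a real implementation (e.g. in the agent-maintenance-plugin) would persist and manage this state properly.

import hudson.Extension;
import hudson.model.queue.CauseOfBlockage;
import hudson.slaves.Cloud;
import hudson.slaves.CloudProvisioningListener;

// Sketch only: pauses provisioning for any cloud whose name is listed in
// the (hypothetical) maintenance.clouds system property, e.g.
//   -Dmaintenance.clouds=kubernetes,kubernetes-prod
@Extension
public class MaintenanceCloudListener extends CloudProvisioningListener {

    @Override
    public CauseOfBlockage canProvision(Cloud cloud, Cloud.CloudState state, int numExecutors) {
        String paused = System.getProperty("maintenance.clouds", "");
        for (String name : paused.split(",")) {
            if (name.trim().equals(cloud.name)) {
                // A non-null CauseOfBlockage vetoes provisioning on this cloud.
                return new CauseOfBlockage() {
                    @Override
                    public String getShortDescription() {
                        return "Cloud '" + cloud.name + "' is in maintenance mode";
                    }
                };
            }
        }
        return null; // null means this listener does not block provisioning
    }
}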
