-
Notifications
You must be signed in to change notification settings - Fork 332
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Misc fixes for Jenkins CI #1141
Conversation
As part of the deprecation of the "stable" Helm charts, the chart `stable/docker-registry` has been deprecated explicitly: helm/charts#24303 In particular, this chart no longer builds successfully, and the version that we had pinned to doesn't work with the latest Kubernetes: ``` stderr: 'Error: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "Ingress" in version "extensions/v1beta1"' ``` This commit swaps to using the chart pointed to at https://artifacthub.io/packages/helm/twuni/docker-registry, which appears to be the most actively maintained chart following the deprecation of the original. We may want to consider building our own at some point though...
- Update to Kubespray v2.18.1 - Add containerd_insecure_registries configuration - Update test to use `ctr` command
- Increase timeouts - Update to most recent kube-prometheus-stack - Update monitoring test to account for use of GPU operator by default
"registry.local:31500": "http://registry.local:31500" | ||
|
||
# Work-around for https://github.com/kubernetes-sigs/kubespray/issues/8529 | ||
nerdctl_extra_flags: " --insecure-registry" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to call out somewhere that we are defaulting to use this flag?
|
||
echo "Test DCGM metrics" | ||
sh ''' | ||
timeout 500 bash -x ./workloads/jenkins/scripts/test-dcgm-metrics.sh slurm-node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we removing this test?
fi | ||
|
||
# Commented out test-dcgm-metrics test, as with gpu-operator we no longer expose port 9400 directly on the nodes | ||
#bash -x ./workloads/jenkins/scripts/test-dcgm-metrics.sh slurm-node # We use slurm-node here because it is GPU only, kube-node includes the mgmt plane |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should key off the "DEEPOPS_K8S_OPERATOR" variable for this. If set to "true we are using the operator. If left blank or set to false we are using device plugin.
We still claim to support both and the nightly testing goes down both paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, need to modify the DCGM tests to still run in some cases, but I will handle that in the other PR where we are adding more K8s test scenarios and already touching the same scripts.
Summary
Following the merge of #1043 to update our Kubespray version and migrate to containerd, we're seeing some failures in Jenkins CI.
This PR is intended to let me work through these failures until we get a successful build. 😄
Changes include:
Test plan
All CI tests should pass