Misc fixes for Jenkins CI #1141

Merged: 3 commits merged into NVIDIA:master on Mar 29, 2022

Conversation

@ajdecon ajdecon commented Mar 25, 2022

Summary

Following the merge of #1043 to update our Kubespray version and migrate to containerd, we're seeing some failures in Jenkins CI.

This PR is intended to let me work through these failures until we get a successful build. 😄

Changes include:

  • a7c89bf: Update to a maintained version of the docker-registry chart
  • e271f74: Fixes for the local registry test in Kubernetes when using containerd
  • 5e7c681: Fixes to the monitoring tests: update the stack to one that works with later Kubernetes versions, and account for use of the GPU operator

Test plan

All CI tests should pass

As part of the deprecation of the "stable" Helm charts, the chart
`stable/docker-registry` has been deprecated explicitly:
helm/charts#24303

In particular, this chart no longer builds successfully, and the version
that we had pinned to doesn't work with the latest Kubernetes:

```
stderr: 'Error: unable to build kubernetes objects from release
manifest: unable to recognize "": no matches for kind "Ingress" in
version "extensions/v1beta1"'
```
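
For background, the `extensions/v1beta1` Ingress API was removed in newer Kubernetes (v1.22+) in favor of `networking.k8s.io/v1`, which is why the old chart's manifests no longer apply. A quick, illustrative way to confirm which Ingress API versions a cluster actually serves (not part of this PR):

```
# Illustrative check: on recent Kubernetes only networking.k8s.io/v1 is expected,
# and the extensions/v1beta1 group should be absent.
kubectl api-versions | grep -E 'extensions|networking'
```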

This commit swaps to using the chart pointed to at
https://artifacthub.io/packages/helm/twuni/docker-registry, which
appears to be the most actively maintained chart following the
deprecation of the original.

We may want to consider building our own at some point though...
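
For reference, a minimal sketch of installing the replacement chart. The release name, namespace, and values below are illustrative placeholders (NodePort 31500 is chosen to match the registry address used in CI), not necessarily the exact settings DeepOps applies:

```
# Add the repo that hosts the maintained docker-registry chart and install it.
# Values shown are assumptions for illustration, not the DeepOps defaults.
helm repo add twuni https://helm.twuni.dev
helm repo update
helm upgrade --install docker-registry twuni/docker-registry \
  --namespace docker-registry --create-namespace \
  --set service.type=NodePort \
  --set service.nodePort=31500
```
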
ajdecon added 2 commits March 28, 2022 23:09
- Update to Kubespray v2.18.1
- Add containerd_insecure_registries configuration
- Update test to use the `ctr` command (see the sketch after this list)
- Increase timeouts
- Update to most recent kube-prometheus-stack
- Update monitoring test to account for use of GPU operator by default
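
As a rough illustration of a `ctr`-based registry check, assuming an image has already been pushed to the local registry (the image name and tag here are placeholders, not the exact ones the Jenkins script uses):

```
# Pull from the local registry through containerd directly; --plain-http is
# needed because the registry is served over HTTP rather than TLS.
sudo ctr images pull --plain-http registry.local:31500/busybox:latest
sudo ctr images ls | grep 'registry.local:31500'
```
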
@ajdecon ajdecon requested a review from supertetelman March 28, 2022 23:12
@ajdecon ajdecon marked this pull request as ready for review March 28, 2022 23:12
"registry.local:31500": "http://registry.local:31500"

# Work-around for https://github.com/kubernetes-sigs/kubespray/issues/8529
nerdctl_extra_flags: " --insecure-registry"

Do we need to call out somewhere that we are defaulting to use this flag?
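
For context on what the workaround flag does, a manual spot-check on a node might look like the following; the image name is a placeholder and this is not part of the CI scripts:

```
# nerdctl pull with --insecure-registry skips TLS verification and falls back
# to plain HTTP, which the local registry requires.
sudo nerdctl pull --insecure-registry registry.local:31500/busybox:latest
```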


echo "Test DCGM metrics"
sh '''
timeout 500 bash -x ./workloads/jenkins/scripts/test-dcgm-metrics.sh slurm-node

Why are we removing this test?

fi

# Commented out test-dcgm-metrics test, as with gpu-operator we no longer expose port 9400 directly on the nodes
#bash -x ./workloads/jenkins/scripts/test-dcgm-metrics.sh slurm-node # We use slurm-node here because it is GPU only, kube-node includes the mgmt plane

We should key off the "DEEPOPS_K8S_OPERATOR" variable for this. If set to "true" we are using the operator; if left blank or set to "false" we are using the device plugin.

We still claim to support both and the nightly testing goes down both paths.
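
A minimal sketch of what that gating could look like in the test step, assuming DEEPOPS_K8S_OPERATOR is exported into the job environment (this reflects the reviewer's suggestion, not what this PR implements):

```
# Only hit port 9400 on the nodes when the device plugin is in use; with the
# GPU operator the DCGM exporter is not exposed directly on the nodes.
if [ "${DEEPOPS_K8S_OPERATOR:-false}" != "true" ]; then
    bash -x ./workloads/jenkins/scripts/test-dcgm-metrics.sh slurm-node
fi
```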

@supertetelman supertetelman left a comment

LGTM, need to modify the DCGM tests to still run in some cases, but I will handle that in the other PR where we are adding more K8s test scenarios and already touching the same scripts.

@supertetelman supertetelman merged commit 478fdab into NVIDIA:master Mar 29, 2022
@ajdecon ajdecon mentioned this pull request Apr 26, 2022