This repo contains a Nomad device plugin that exposes a configurable number of virtual GPUs (vGPUs) for each physical GPU present on the machine. This enables running workloads that don't need a whole GPU.
This plugin needs the following dependencies to function:
- Nomad 0.9+
- GNU/Linux x86_64 with kernel version > 3.10
- NVIDIA GPU with Architecture > Fermi (2.1)
- NVIDIA drivers >= 340.29 with the `nvidia-smi` binary
- Docker v19.03+
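The driver version requirement is easiest to verify numerically, since a lexical string comparison would order `"99"` after `"340.29"`. A small sketch of such a check (the version strings below are illustrative, not taken from any particular machine):

```python
def version_tuple(v: str) -> tuple:
    """Convert a dotted version string like "340.29" into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

MIN_DRIVER = "340.29"

def driver_ok(reported: str) -> bool:
    """True if the reported NVIDIA driver version meets the minimum."""
    return version_tuple(reported) >= version_tuple(MIN_DRIVER)

print(driver_ok("535.104.05"))  # → True  (newer driver passes)
print(driver_ok("331.20"))      # → False (older driver fails)
```

The reported version can be obtained from `nvidia-smi --query-gpu=driver_version --format=csv,noheader` on the target machine.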
Copy the plugin binary into Nomad's plugin directory and configure the plugin in the client configuration. The requirements for the official Nomad NVIDIA device plugin also apply.
```hcl
plugin "nvidia-vgpu" {
  config {
    ignored_gpu_ids    = ["uuid1", "uuid2"]
    fingerprint_period = "5s"
    vgpus              = 16
  }
}
```
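Nomad discovers external device plugins from the agent's `plugin_dir`, so the stanza above belongs in the client agent configuration next to that setting. A minimal client config sketch (the directory path is an assumption; substitute your own plugin directory):

```hcl
# Client agent configuration; the nvidia-vgpu binary lives in plugin_dir.
plugin_dir = "/opt/nomad/plugins"

client {
  enabled = true
}

plugin "nvidia-vgpu" {
  config {
    vgpus = 16
  }
}
```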
Use the `device` stanza in the task's resources to schedule with device support.
```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:11.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "letmutx/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```
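Besides soft affinities, Nomad's `device` stanza also supports hard constraints on fingerprinted device attributes. A sketch (the `memory` attribute name and unit are assumptions; check which attributes this plugin actually fingerprints with `nomad node status -verbose`):

```hcl
device "letmutx/gpu" {
  count = 2

  # Only place the task on vGPUs whose backing device reports enough memory.
  constraint {
    attribute = "${device.attr.memory}"
    operator  = ">="
    value     = "2 GiB"
  }
}
```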
- GPU memory allocation/usage is handled cooperatively. This means that a single misbehaving GPU process that uses more memory than it was assigned can starve the other processes sharing the device.
- Managing memory isolation per task is left to the user. It depends on many factors, such as MPS and GPU architecture. This doc has some information.
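Because memory limits are cooperative, operators may want to monitor per-process GPU memory usage themselves. A minimal sketch that shells out to `nvidia-smi --query-compute-apps` and parses its CSV output (the sample output below is illustrative, not captured from a real device):

```python
import csv
import io
import subprocess

def parse_compute_apps(csv_text: str):
    """Parse `nvidia-smi --query-compute-apps=pid,used_memory --format=csv`
    output into a list of (pid, used_mib) tuples."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    # The first row is the header: "pid, used_gpu_memory [MiB]"
    return [(int(pid), int(mem.strip().split()[0])) for pid, mem in rows[1:]]

def compute_apps():
    """Query the local GPU; only works on a machine with nvidia-smi installed."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory", "--format=csv"],
        text=True,
    )
    return parse_compute_apps(out)

# Illustrative sample of the CSV format:
sample = """pid, used_gpu_memory [MiB]
1234, 512 MiB
5678, 2048 MiB
"""
print(parse_compute_apps(sample))  # → [(1234, 512), (5678, 2048)]
```

Comparing these figures against each task's assigned share makes it possible to alert on processes exceeding their allocation.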
The best way to test the plugin is to run it on a target machine with an NVIDIA GPU using Nomad's plugin launcher:

```shell
make eval
```