aws-eks: neuron device plugin manifest better reference #29262
Comments
Thank you for the report. We probably need to update this file. Are you interested in submitting a PR for that?
It is reproducible.

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as eks from 'aws-cdk-lib/aws-eks';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as iam from 'aws-cdk-lib/aws-iam';

export class CdkIssueStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, 'VPC', {
      maxAzs: 3
    });

    const cluster = new eks.Cluster(this, 'EKSCluster', {
      vpc,
      version: eks.KubernetesVersion.V1_29,
      defaultCapacity: 0,
      mastersRole: iam.Role.fromRoleArn(this, 'Admin', "xxx", {
        mutable: false,
      })
    });

    // An Inferentia2 node group triggers deployment of the Neuron device plugin.
    cluster.addNodegroupCapacity('Inf2NodeGroup', {
      instanceTypes: [new ec2.InstanceType('inf2.xlarge')],
      minSize: 2,
    });
  }
}

$ kubectl describe daemonset neuron-device-plugin-daemonset -n kube-system
Name: neuron-device-plugin-daemonset
Selector: name=neuron-device-plugin-ds
Node-Selector: <none>
Labels: aws.cdk.eks/prune-xxx
Annotations: deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 2
Current Number of Nodes Scheduled: 2
Number of Nodes Scheduled with Up-to-date Pods: 2
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status: 2 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: name=neuron-device-plugin-ds
Annotations: scheduler.alpha.kubernetes.io/critical-pod:
Containers:
k8s-neuron-device-plugin-ctr:
Image: 790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0
Port: <none>
Host Port: <none>
Environment: <none>
Mounts:
/var/lib/kubelet/device-plugins from device-plugin (rw)
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
Priority Class Name: system-node-critical
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 37m daemonset-controller Created pod: neuron-device-plugin-daemonset-f578d
Normal SuccessfulCreate 37m daemonset-controller Created pod: neuron-device-plugin-daemonset-d4ksr

$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
aws-node-ghjqh 2/2 Running 0 41m
aws-node-vjq99 2/2 Running 0 42m
coredns-68bd859788-flbr4 1/1 Running 0 45m
coredns-68bd859788-wxtfv 1/1 Running 0 45m
kube-proxy-54klc 1/1 Running 0 41m
kube-proxy-kx9rm 1/1 Running 0 42m
neuron-device-plugin-daemonset-d4ksr 0/1 CrashLoopBackOff 12 (2m37s ago) 39m
neuron-device-plugin-daemonset-f578d 0/1 CrashLoopBackOff 12 (2m22s ago) 39m

$ kubectl describe pod neuron-device-plugin-daemonset-d4ksr -n kube-system
Name: neuron-device-plugin-daemonset-d4ksr
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: default
Node: ip-10-0-240-116.eu-west-1.compute.internal/10.0.240.116
Start Time: Fri, 29 Mar 2024 08:55:24 +0000
Labels: controller-revision-hash=67496f5558
name=neuron-device-plugin-ds
pod-template-generation=1
Annotations: scheduler.alpha.kubernetes.io/critical-pod:
Status: Running
IP: 10.0.201.70
IPs:
IP: 10.0.201.70
Controlled By: DaemonSet/neuron-device-plugin-daemonset
Containers:
k8s-neuron-device-plugin-ctr:
Container ID: containerd://6e5f8d1ebdc2591edd37ccfe20c79169dc1564d2e163e0d704cbef02d957dda9
Image: 790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0
Image ID: 790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin@sha256:6a0df1d6446c96b752f7abbdc9478873e2f3da05989dcaf17667076db8339728
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Fri, 29 Mar 2024 09:31:51 +0000
Finished: Fri, 29 Mar 2024 09:31:51 +0000
Ready: False
Restart Count: 12
Environment: <none>
Mounts:
/var/lib/kubelet/device-plugins from device-plugin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-65qsg (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
kube-api-access-65qsg:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: CriticalAddonsOnly op=Exists
aws.amazon.com/neuron:NoSchedule op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 39m default-scheduler Successfully assigned kube-system/neuron-device-plugin-daemonset-d4ksr to ip-10-0-240-116.eu-west-1.compute.internal
Normal Pulled 39m kubelet Successfully pulled image "790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0" in 8.068s (8.068s including waiting)
Normal Pulled 39m kubelet Successfully pulled image "790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0" in 683ms (683ms including waiting)
Normal Pulled 39m kubelet Successfully pulled image "790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0" in 672ms (672ms including waiting)
Normal Started 38m (x4 over 39m) kubelet Started container k8s-neuron-device-plugin-ctr
Normal Pulled 38m kubelet Successfully pulled image "790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0" in 680ms (680ms including waiting)
Normal Pulling 37m (x5 over 39m) kubelet Pulling image "790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0"
Normal Created 37m (x5 over 39m) kubelet Created container k8s-neuron-device-plugin-ctr
Normal Pulled 37m kubelet Successfully pulled image "790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0" in 678ms (678ms including waiting)
Warning BackOff 4m19s (x163 over 39m) kubelet Back-off restarting failed container k8s-neuron-device-plugin-ctr in pod neuron-device-plugin-daemonset-d4ksr_kube-system(5b998b0a-c411-4aa0-916a-4b08433213f6)

$ kubectl logs neuron-device-plugin-daemonset-d4ksr -n kube-system
neuron-device-plugin: 2024/03/29 09:31:51 Fetching devices.
neuron-device-plugin: 2024/03/29 09:31:51 Error to get IB device: open /run/infa-map.json: no such file or directory
neuron-device-plugin: 2024/03/29 09:31:51 No devices found.
neuron-device-plugin: 2024/03/29 09:31:51 Device list: []
neuron-device-plugin: 2024/03/29 09:31:51 Starting FS watcher.
neuron-device-plugin: 2024/03/29 09:31:51 Starting OS watcher.
neuron-device-plugin: 2024/03/29 09:31:51 Error to get devices: open /run/infa-map.json: no such file or directory
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x85bb96]
goroutine 1 [running]:
main.(*DevicePlugin).cleanup(0x0, 0x1, 0x1)
/opt/workspace/KaenaTools/build/private/build/SRC_CPY/cmd/k8s-neuron-device-plugin/server.go:203 +0x26
main.(*DevicePlugin).Start(0x0, 0xc000120048, 0x10)
/opt/workspace/KaenaTools/build/private/build/SRC_CPY/cmd/k8s-neuron-device-plugin/server.go:75 +0x2f
main.(*DevicePlugin).Serve(0x0, 0x9700e4, 0x15, 0xc0000665a0, 0x0)
/opt/workspace/KaenaTools/build/private/build/SRC_CPY/cmd/k8s-neuron-device-plugin/server.go:229 +0x35
main.main()
/opt/workspace/KaenaTools/build/private/build/SRC_CPY/cmd/k8s-neuron-device-plugin/main.go:64 +0x3a8

@wafuwafu13 @pahud please note my suggestion in "Possible Solution":
Describe the bug
The Neuron device plugin add-on in the CDK uses a custom manifest, see here:
aws-cdk/packages/aws-cdk-lib/aws-eks/lib/cluster.ts
Line 1979 in f3d74bb
That manifest does NOT point to the official Neuron image (public.ecr.aws/neuron/neuron-device-plugin), and the RBAC resources are missing. As a result, the plugin pods go into CrashLoopBackOff, which prevents metrics from being exposed.
Expected Behavior
The CDK deploys the official Neuron device plugin manifest together with its RBAC manifest.
Current Behavior
The neuron-device-plugin pods go into CrashLoopBackOff when deployed to inf2.xlarge nodes.
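One way to confirm this state from the text output of `kubectl get pods -n kube-system` is sketched below. It is only an illustration: the helper names `parsePodTable` and `findCrashLoopingPods` are hypothetical, and it assumes the default whitespace-separated table layout with a `NAME` header row.

```typescript
// One row of `kubectl get pods` output (only the columns used here).
interface PodRow {
  name: string;
  ready: string;
  status: string;
}

// Parse the default table output of `kubectl get pods`.
// Assumes whitespace-separated columns and a header row starting with NAME.
function parsePodTable(output: string): PodRow[] {
  return output
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("NAME"))
    .map((line) => {
      const [name = "", ready = "", status = ""] = line.split(/\s+/);
      return { name, ready, status };
    });
}

// Return the names of pods whose STATUS column is CrashLoopBackOff.
function findCrashLoopingPods(output: string): string[] {
  return parsePodTable(output)
    .filter((row) => row.status === "CrashLoopBackOff")
    .map((row) => row.name);
}
```

Fed the pod listing from the comment above, this would return the two `neuron-device-plugin-daemonset-*` pod names and skip the `aws-node`, `coredns`, and `kube-proxy` pods.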
Reproduction Steps
Deploy an EKS cluster with an inf2 node group (for example, inf2.xlarge).
Possible Solution
The Neuron device plugin add-on in the CDK uses a custom manifest, see here:
aws-cdk/packages/aws-cdk-lib/aws-eks/lib/cluster.ts
Line 1979 in f3d74bb
A better reference already exists from the Neuron project; see the description here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-setup.html
The YAML to use is https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin.yml
The RBAC manifest also needs to be applied, which the current implementation does not do:
const RBAC_URL = "https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin-rbac.yml";
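As a rough illustration of the two gaps described above, the sketch below checks a list of already-parsed manifest documents for them. It is not the CDK's implementation. The `K8sDoc` shape and the `findManifestProblems` helper are hypothetical, and the list of RBAC kinds is an assumption about what the upstream RBAC manifest defines. Documents that pass such a check could then be handed to `cluster.addManifest(...)`.

```typescript
// Hypothetical minimal shape of a parsed Kubernetes manifest document.
interface K8sDoc {
  kind: string;
  spec?: { template?: { spec?: { containers?: { image: string }[] } } };
}

// Official image repository named in this issue.
const OFFICIAL_REPO = "public.ecr.aws/neuron/neuron-device-plugin";

// Assumption: the resource kinds the upstream RBAC manifest is expected to define.
const RBAC_KINDS = ["ClusterRole", "ServiceAccount", "ClusterRoleBinding"];

// Report missing RBAC kinds and DaemonSet containers that do not use
// the official repository. An empty result means neither gap was found.
function findManifestProblems(docs: K8sDoc[]): string[] {
  const problems: string[] = [];
  const kinds = new Set(docs.map((doc) => doc.kind));

  for (const kind of RBAC_KINDS) {
    if (!kinds.has(kind)) {
      problems.push(`missing RBAC resource: ${kind}`);
    }
  }

  for (const doc of docs) {
    if (doc.kind !== "DaemonSet") continue;
    for (const container of doc.spec?.template?.spec?.containers ?? []) {
      const image = container.image;
      // Accept the bare repository, a tagged image, or a digest reference.
      const official =
        image === OFFICIAL_REPO ||
        image.startsWith(OFFICIAL_REPO + ":") ||
        image.startsWith(OFFICIAL_REPO + "@");
      if (!official) {
        problems.push(`non-official image: ${image}`);
      }
    }
  }
  return problems;
}
```

Run against a single DaemonSet document carrying the `790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0` image seen in the pod description, this reports both the missing RBAC kinds and the non-official image.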
Additional Information/Context
No response
CDK CLI Version
2.130.0
Framework Version
No response
Node.js Version
v20.4.0
OS
macOS Sonoma 14.3
Language
TypeScript
Language Version
No response
Other information
No response