feat(eks): support INF2 instance types #27373

Merged
merged 7 commits into from
Oct 4, 2023
Changes from 4 commits
@@ -25,6 +25,11 @@ class EksClusterInferenceStack extends Stack {
instanceType: new ec2.InstanceType('inf1.2xlarge'),
minCapacity: 1,
});

cluster.addAutoScalingGroupCapacity('InferenceInstances', {
instanceType: new ec2.InstanceType('inf2.xlarge'),
minCapacity: 1,
});
Contributor

you will have to run the integ test to update the snapshots. do you have capacity to do that?

Contributor Author

I made further changes: I duplicated the integ test, one with an inf1 ASG and the other with inf2.
It is currently in a failed state, and I'm not sure whether I need to do something manually somewhere:
aws-cdk-eks-cluster-inf1-test: destroy failed Error: The stack named aws-cdk-eks-cluster-inf1-test is in a failed state. You may need to delete it from the AWS console : DELETE_FAILED (The following resource(s) failed to delete: [ClusterNodegroupDefaultCapacityNodeGroupRole55953B04, ClusterInf1InstancesInstanceRole67C931E4]. )

Contributor

does it say why it failed? the integ test should be able to be successfully deployed and deleted.

Contributor

actually i can try to run this for you. our eks integ tests take forever and are wonky :(

}
}

20 changes: 20 additions & 0 deletions packages/@aws-cdk/aws-sagemaker-alpha/lib/instance-type.ts
@@ -224,6 +224,26 @@ export class InstanceType {
*/
public static readonly INF1_XLARGE = InstanceType.of('ml.inf1.xlarge');

/**
* ml.inf2.xlarge
*/
public static readonly INF2_XLARGE = InstanceType.of('ml.inf2.xlarge');

/**
* ml.inf2.8xlarge
*/
public static readonly INF2_8XLARGE = InstanceType.of('ml.inf2.8xlarge');

/**
* ml.inf2.24xlarge
*/
public static readonly INF2_24XLARGE = InstanceType.of('ml.inf2.24xlarge');

/**
* ml.inf2.48xlarge
*/
public static readonly INF2_48XLARGE = InstanceType.of('ml.inf2.48xlarge');

Contributor

do we have a place to unit test at least one of these in sagemaker, just for sanity?

/**
* ml.m4.10xlarge
*/
@@ -40,7 +40,11 @@ spec:
- inf1.xlarge
- inf1.2xlarge
- inf1.6xlarge
- inf1.4xlarge
- inf1.24xlarge
- inf2.xlarge
- inf2.8xlarge
- inf2.24xlarge
- inf2.48xlarge
- matchExpressions:
- key: "node.kubernetes.io/instance-type"
operator: In
@@ -49,6 +53,10 @@ spec:
- inf1.2xlarge
- inf1.6xlarge
- inf1.24xlarge
- inf2.xlarge
- inf2.8xlarge
- inf2.24xlarge
- inf2.48xlarge
containers:
- image: 790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0
imagePullPolicy: Always
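
The hunks above extend the `values` lists in the plugin's node affinity terms. As a rough illustration of why each inf2 size has to be listed, here is a minimal sketch of how an `In` match expression is evaluated against a node's labels. The `MatchExpression` type and the `matchesTerm` helper are hypothetical stand-ins written for this illustration, not Kubernetes or CDK APIs; only the label key and the instance type values come from the diff.

```typescript
// Hypothetical model of one node-affinity match expression (operator "In" only).
interface MatchExpression {
  key: string;
  operator: "In";
  values: string[];
}

// A term matches when every one of its expressions matches the node's labels.
function matchesTerm(labels: Record<string, string>, term: MatchExpression[]): boolean {
  return term.every((expr) => {
    const value = labels[expr.key];
    return value !== undefined && expr.values.includes(value);
  });
}

// Values copied from the updated list in the diff.
const term: MatchExpression[] = [
  {
    key: "node.kubernetes.io/instance-type",
    operator: "In",
    values: [
      "inf1.xlarge", "inf1.2xlarge", "inf1.6xlarge", "inf1.24xlarge",
      "inf2.xlarge", "inf2.8xlarge", "inf2.24xlarge", "inf2.48xlarge",
    ],
  },
];

const inf2Node: Record<string, string> = { "node.kubernetes.io/instance-type": "inf2.xlarge" };
const otherNode: Record<string, string> = { "node.kubernetes.io/instance-type": "m5.large" };

console.log(matchesTerm(inf2Node, term));  // true
console.log(matchesTerm(otherNode, term)); // false
```

Under this model, an inf2 node would not match a term that listed only inf1 sizes, which is presumably why the list is extended in both terms.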
2 changes: 1 addition & 1 deletion packages/aws-cdk-lib/aws-eks/lib/instance-types.ts
@@ -1,6 +1,6 @@
export const INSTANCE_TYPES = {
gpu: ['p2', 'p3', 'g2', 'g3', 'g4'],
- inferentia: ['inf1'],
+ inferentia: ['inf1', 'inf2'],
graviton: ['a1'],
graviton2: ['c6g', 'm6g', 'r6g', 't4g'],
graviton3: ['c7g'],
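
The one-line change above adds `inf2` to the `inferentia` family list. A minimal sketch of how such a family table could be used to classify an instance type string follows. The `nodeTypeFor` helper, the `NodeKind` type, and the prefix-matching rule are assumptions made for illustration and are not necessarily how aws-eks consumes `INSTANCE_TYPES`; only the `gpu` and `inferentia` entries shown in the diff are reproduced.

```typescript
// Subset of the table from the diff (after the change).
const INSTANCE_TYPES = {
  gpu: ["p2", "p3", "g2", "g3", "g4"],
  inferentia: ["inf1", "inf2"],
};

// Hypothetical result type for this sketch.
type NodeKind = "gpu" | "inferentia" | "standard";

// Assumption: the family is the part before the first dot, and a table entry
// matches when the family starts with it (so "g4dn" is covered by "g4").
function nodeTypeFor(instanceType: string): NodeKind {
  const [family = ""] = instanceType.split(".");
  if (INSTANCE_TYPES.gpu.some((prefix) => family.startsWith(prefix))) return "gpu";
  if (INSTANCE_TYPES.inferentia.some((prefix) => family.startsWith(prefix))) return "inferentia";
  return "standard";
}

console.log(nodeTypeFor("inf2.xlarge"));  // "inferentia"
console.log(nodeTypeFor("inf1.2xlarge")); // "inferentia"
console.log(nodeTypeFor("g4dn.xlarge"));  // "gpu"
console.log(nodeTypeFor("m5.large"));     // "standard"
```

With the pre-change table (`inferentia: ['inf1']`), the same lookup would classify `inf2.xlarge` as `standard` in this sketch.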
2 changes: 1 addition & 1 deletion packages/aws-cdk-lib/aws-eks/lib/managed-nodegroup.ts
@@ -514,7 +514,7 @@ function isGpuInstanceType(instanceType: InstanceType): boolean {
//compare instanceType to known GPU InstanceTypes
const knownGpuInstanceTypes = [InstanceClass.P2, InstanceClass.P3, InstanceClass.P3DN, InstanceClass.P4DE, InstanceClass.P4D,
InstanceClass.G3S, InstanceClass.G3, InstanceClass.G4DN, InstanceClass.G4AD, InstanceClass.G5, InstanceClass.G5G,
- InstanceClass.INF1];
+ InstanceClass.INF1, InstanceClass.INF2];
return knownGpuInstanceTypes.some((c) => instanceType.sameInstanceClassAs(InstanceType.of(c, InstanceSize.LARGE)));
}

20 changes: 19 additions & 1 deletion packages/aws-cdk-lib/aws-eks/test/cluster.test.ts
@@ -2173,7 +2173,7 @@ describe('cluster', () => {
},
});
});
- test('inference instances are supported', () => {
+ test('inf1 instances are supported', () => {
// GIVEN
const { stack } = testFixtureNoVpc();
const cluster = new eks.Cluster(stack, 'Cluster', { defaultCapacity: 0, version: CLUSTER_VERSION, prune: false });
@@ -2191,6 +2191,24 @@ describe('cluster', () => {
Manifest: JSON.stringify([sanitized]),
});
});
test('inf2 instances are supported', () => {
// GIVEN
const { stack } = testFixtureNoVpc();
const cluster = new eks.Cluster(stack, 'Cluster', { defaultCapacity: 0, version: CLUSTER_VERSION, prune: false });

// WHEN
cluster.addAutoScalingGroupCapacity('InferenceInstances', {
instanceType: new ec2.InstanceType('inf2.xlarge'),
minCapacity: 1,
});
const fileContents = fs.readFileSync(path.join(__dirname, '../lib', 'addons/neuron-device-plugin.yaml'), 'utf8');
const sanitized = YAML.parse(fileContents);

// THEN
Template.fromStack(stack).hasResourceProperties(eks.KubernetesManifest.RESOURCE_TYPE, {
Manifest: JSON.stringify([sanitized]),
});
});

test('kubectl resources are always created after all fargate profiles', () => {
// GIVEN