feat(eks): enable ipv6 for eks cluster #25819

Merged
merged 10 commits on Jun 8, 2023

Conversation

kishiel
Contributor

@kishiel kishiel commented Jun 1, 2023

Description

This change enables IPv6 for EKS clusters

Reasoning

  • IPv6-based EKS clusters will enable service owners to minimize or even eliminate the perils of IPv4 CIDR micromanagement
  • IPv6 will enable very-large-scale EKS clusters
  • My working group (Amazon SDO/ECST) recently attempted to enable IPv6 using the L1 Cfn EKS constructs, but failed after discovering a CDKv2 issue that results in a master-less EKS cluster. Rather than investing in fixing this interaction, we agreed to contribute to aws-eks (this PR)

Design

  • This change treats IPv4 as the default networking configuration
  • A new enum IpFamily is introduced to direct users to specify IP_V4 or IP_V6 (a minimal usage sketch follows this list)
  • This change originally added a new SAM layer dependency; the dependency was removed after validating it was no longer necessary
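A minimal usage sketch of the new property, assuming the IpFamily enum and ipFamily prop introduced by this PR; the Kubernetes version is illustrative, and the VPC's subnets still need IPv6 CIDRs wired up as in the full sample under Testing:

```ts
import { App, Stack, aws_ec2 as ec2, aws_eks as eks } from 'aws-cdk-lib';

const app = new App();
const stack = new Stack(app, 'Ipv6Sketch');
const vpc = new ec2.Vpc(stack, 'Vpc');

// IPv4 remains the default; IPv6 is opt-in via the new enum.
new eks.Cluster(stack, 'Cluster', {
  version: eks.KubernetesVersion.V1_25,
  vpc,
  ipFamily: eks.IpFamily.IP_V6, // omit (or pass IP_V4) for the default behavior
});
```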

Testing

I consulted with some team members about how best to approach testing this change, and concluded that I should duplicate the eks-cluster test definition. I decided this was a better approach than redefining the existing cluster test to use IPv6 for a few reasons:

  1. EKS still requires IPv4 under the hood
  2. IPv6 CIDR and subnet association isn't exactly straightforward. My example in eks-cluster-ipv6 is the simplest one I could come up with
  3. There are additional permissions and routing configuration necessary to get the cluster tests to succeed. The differences were sufficient, in my opinion, to motivate splitting out the test.

I ran into several issues running the test suite, primarily out-of-memory conditions that no amount of RAM appeared to help; NODE_OPTIONS=--max-old-space-size=8192 did not improve this issue, nor did increasing it to 12GB. Edit: This ended up being a simple fix, but annoying to dig out. The fix is export NODE_OPTIONS=--max-old-space-size=8192; setting this in my .rc file did not stick, either. macOS Ventura, for those keeping score at home.

The bulk of my testing was performed using a sample stack definition (below), but I was unable to run the manual testing described in aws-eks/test/MANUAL_TEST.md because I have no access to the underlying node instances. Edit: I can run the MANUAL_TEST.md steps now if that's deemed necessary.

Updated: this sample stack creates an IPv6-enabled cluster with an example nginx service running.

Sample:

import {
  App, Duration, Fn, Stack,
  aws_ec2 as ec2,
  aws_eks as eks,
  aws_iam as iam,
} from 'aws-cdk-lib';
import { getClusterVersionConfig } from './integ-tests-kubernetes-version';

const app = new App();
const env = { region: 'us-east-1', account: '' };
const stack = new Stack(app, 'my-v6-test-stack-1', { env });

const vpc = new ec2.Vpc(stack, 'Vpc', { maxAzs: 3, natGateways: 1, restrictDefaultSecurityGroup: false });
const ipv6cidr = new ec2.CfnVPCCidrBlock(stack, 'CIDR6', {
  vpcId: vpc.vpcId,
  amazonProvidedIpv6CidrBlock: true,
});

let subnetcount = 0;
let subnets = [...vpc.publicSubnets, ...vpc.privateSubnets];
for ( let subnet of subnets) {
  // Wait for the ipv6 cidr to complete
  subnet.node.addDependency(ipv6cidr);
  _associate_subnet_with_v6_cidr(subnetcount, subnet);
  subnetcount++;
}

const roles = _create_roles();

const cluster = new eks.Cluster(stack, 'Cluster', {
  ...getClusterVersionConfig(stack),
  vpc: vpc,
  clusterName: 'some-eks-cluster',
  defaultCapacity: 0,
  endpointAccess: eks.EndpointAccess.PUBLIC_AND_PRIVATE,
  ipFamily: eks.IpFamily.IP_V6,
  mastersRole: roles.masters,
  securityGroup: _create_eks_security_group(),
  vpcSubnets: [{ subnets: subnets }],
});

// add an extra nodegroup
cluster.addNodegroupCapacity('some-node-group', {
  instanceTypes: [new ec2.InstanceType('m5.large')],
  minSize: 1,
  nodeRole: roles.nodes,
});

cluster.kubectlSecurityGroup?.addEgressRule(
  ec2.Peer.anyIpv6(), ec2.Port.allTraffic(),
);

// deploy an nginx ingress in a namespace
const nginxNamespace = cluster.addManifest('nginx-namespace', {
  apiVersion: 'v1',
  kind: 'Namespace',
  metadata: {
    name: 'nginx',
  },
});

const nginxIngress = cluster.addHelmChart('nginx-ingress', {
  chart: 'nginx-ingress',
  repository: 'https://helm.nginx.com/stable',
  namespace: 'nginx',
  wait: true,
  createNamespace: false,
  timeout: Duration.minutes(5),
});

// make sure namespace is deployed before the chart
nginxIngress.node.addDependency(nginxNamespace);

function _associate_subnet_with_v6_cidr(count: number, subnet: ec2.ISubnet) {
  const cfnSubnet = subnet.node.defaultChild as ec2.CfnSubnet;
  // Carve /64 blocks (128 - 64 subnet bits) out of the VPC's IPv6 CIDR and
  // assign one to each subnet by index.
  cfnSubnet.ipv6CidrBlock = Fn.select(count, Fn.cidr(Fn.select(0, vpc.vpcIpv6CidrBlocks), 256, (128 - 64).toString()));
  cfnSubnet.assignIpv6AddressOnCreation = true;
}

export function _create_eks_security_group(): ec2.SecurityGroup {
  let sg = new ec2.SecurityGroup(stack, 'eks-sg', {
    allowAllIpv6Outbound: true,
    allowAllOutbound: true,
    vpc,
  });
  sg.addIngressRule(
    ec2.Peer.ipv4('10.0.0.0/8'), ec2.Port.allTraffic(),
  );
  sg.addIngressRule(
    ec2.Peer.ipv6(Fn.select(0, vpc.vpcIpv6CidrBlocks)), ec2.Port.allTraffic(),
  );
  return sg;
}

export namespace Kubernetes {
  export interface RoleDescriptors {
    masters: iam.Role,
    nodes: iam.Role,
  }
}

function _create_roles(): Kubernetes.RoleDescriptors {
  const clusterAdminStatement = new iam.PolicyDocument({
    statements: [new iam.PolicyStatement({
      actions: [
        'eks:*',
        'iam:ListRoles',
      ],
      resources: ['*'],
    })],
  });

  const eksClusterAdminRole = new iam.Role(stack, 'AdminRole', {
    roleName: 'some-eks-master-admin',
    assumedBy: new iam.AccountRootPrincipal(),
    inlinePolicies: { clusterAdminStatement },
  });

  const assumeAnyRolePolicy = new iam.PolicyDocument({
    statements: [new iam.PolicyStatement({
      actions: [
        'sts:AssumeRole',
      ],
      resources: ['*'],
    })],
  });

  const ipv6Management = new iam.PolicyDocument({
    statements: [new iam.PolicyStatement({
      resources: ['arn:aws:ec2:*:*:network-interface/*'],
      actions: [
        'ec2:AssignIpv6Addresses',
        'ec2:UnassignIpv6Addresses',
      ],
    })],
  });

  const eksClusterNodeGroupRole = new iam.Role(stack, 'NodeGroupRole', {
    roleName: 'some-node-group-role',
    assumedBy: new iam.ServicePrincipal('ec2.amazonaws.com'),
    managedPolicies: [
      iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEKSWorkerNodePolicy'),
      iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEC2ContainerRegistryReadOnly'),
      iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEKS_CNI_Policy'),
      iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonSSMManagedInstanceCore'),
      iam.ManagedPolicy.fromAwsManagedPolicyName('CloudWatchAgentServerPolicy'),
    ],
    inlinePolicies: {
      assumeAnyRolePolicy,
      ipv6Management,
    },
  });

  return { masters: eksClusterAdminRole, nodes: eksClusterNodeGroupRole };
}

Issues

Edit: Fixed

Integration tests, specifically the new one I contributed, failed with an issue in describing a Fargate profile:

2023-06-01T16:24:30.127Z    6f9b8583-8440-4f13-a48f-28e09a261d40    INFO    {
    "describeFargateProfile": {
        "clusterName": "Cluster9EE0221C-f458e6dc5f544e9b9db928f6686c14d5",
        "fargateProfileName": "ClusterfargateprofiledefaultEF-1628f1c3e6ea41ebb3b0c224de5698b4"
    }
}
---------------------------
2023-06-01T16:24:30.138Z    6f9b8583-8440-4f13-a48f-28e09a261d40    INFO    {
    "describeFargateProfileError": {}
}
---------------------------
2023-06-01T16:24:30.139Z    6f9b8583-8440-4f13-a48f-28e09a261d40    ERROR    Invoke Error     {
    "errorType": "TypeError",
    "errorMessage": "getEksClient(...).describeFargateProfile is not a function",
    "stack": [
        "TypeError: getEksClient(...).describeFargateProfile is not a function",
        "    at Object.describeFargateProfile (/var/task/index.js:27:51)",
        "    at FargateProfileResourceHandler.queryStatus (/var/task/fargate.js:83:67)",
        "    at FargateProfileResourceHandler.isUpdateComplete (/var/task/fargate.js:49:35)",
        "    at FargateProfileResourceHandler.isCreateComplete (/var/task/fargate.js:46:21)",
        "    at FargateProfileResourceHandler.isComplete (/var/task/common.js:31:40)",
        "    at Runtime.isComplete [as handler] (/var/task/index.js:50:21)",
        "    at Runtime.handleOnceNonStreaming (/var/runtime/Runtime.js:74:25)"
    ]
}

I am uncertain whether this is an existing issue, one introduced by this change, or something related to my local build. Again, I had abundant issues related to building aws-cdk and the test suites depending on Jupiter's position in the sky.

Collaborators

Most of the work in this change was performed by @wlami and @jagu-sayan (thank you!)

Fixes #18423


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@github-actions github-actions bot added the effort/small (small work item, less than a day of effort), feature-request (a feature should be added or improved), p2, and beginning-contributor ([Pilot] contributed between 0-2 PRs to the CDK) labels on Jun 1, 2023
@aws-cdk-automation aws-cdk-automation requested a review from a team June 1, 2023 18:26
Collaborator

@aws-cdk-automation aws-cdk-automation left a comment

The pull request linter has failed. See the aws-cdk-automation comment below for failure reasons. If you believe this pull request should receive an exemption, please comment and provide a justification.

A comment requesting an exemption should contain the text Exemption Request. Additionally, if clarification is needed add Clarification Request to a comment.

@kishiel kishiel changed the title from feat(aws-eks): enable ipv6 for eks cluster to feat(eks): enable ipv6 for eks cluster on Jun 1, 2023
@kishiel
Contributor Author

kishiel commented Jun 2, 2023

Working on ironing out the build failures. I neglected to enable the IPv6 CIDR in the inbound rules for the cluster security group, and the subnets weren't quite wired up right. There's an outstanding issue with describeFargateProfile which is occurring on this line (shown in the OP) that I'm trying to figure out.

This sample successfully maps a user to the masters RBAC and a node group is eligible to join.

import {
  App, Fn, Stack,
  aws_ec2 as ec2,
  aws_eks as eks,
  aws_iam as iam,
} from 'aws-cdk-lib';
import { getClusterVersionConfig } from './integ-tests-kubernetes-version';

const myAccount = '';
const app = new App();
const env = { region: 'us-east-1', account: myAccount };
const stack = new Stack(app, 'my-v6-test-stack-1', { env });

const vpc = new ec2.Vpc(stack, 'Vpc', { maxAzs: 3, natGateways: 1, restrictDefaultSecurityGroup: false });
const ipv6cidr = new ec2.CfnVPCCidrBlock(stack, 'CIDR6', {
  vpcId: vpc.vpcId,
  amazonProvidedIpv6CidrBlock: true,
});

let subnetcount = 0;
let subnets = [...vpc.publicSubnets, ...vpc.privateSubnets];
for ( let subnet of subnets) {
  // Wait for the ipv6 cidr to complete
  subnet.node.addDependency(ipv6cidr);
  _associate_subnet_with_v6_cidr(subnetcount, subnet);
  subnetcount++;
}

const cluster = new eks.Cluster(stack, 'Cluster', {
  ...getClusterVersionConfig(stack),
  vpc: vpc,
  clusterName: 'some-eks-cluster',
  defaultCapacity: 1,
  endpointAccess: eks.EndpointAccess.PUBLIC_AND_PRIVATE,
  ipFamily: eks.IpFamily.IP_V6,
  securityGroup: _create_eks_security_group(),
  vpcSubnets: [{ subnets: subnets }],
});

const importedRole = iam.Role.fromRoleName(stack, 'imported-role', 'some-role');
cluster.awsAuth.addMastersRole(importedRole);

function _associate_subnet_with_v6_cidr(count: number, subnet: ec2.ISubnet) {
  const cfnSubnet = subnet.node.defaultChild as ec2.CfnSubnet;
  cfnSubnet.ipv6CidrBlock = Fn.select(count, Fn.cidr(Fn.select(0, vpc.vpcIpv6CidrBlocks), 256, (128 - 64).toString()));
  cfnSubnet.assignIpv6AddressOnCreation = true;
}

export function _create_eks_security_group(): ec2.SecurityGroup {
  let sg = new ec2.SecurityGroup(stack, 'eks-sg', {
    allowAllIpv6Outbound: true,
    allowAllOutbound: true,
    vpc,
  });
  sg.addIngressRule(
    ec2.Peer.ipv4('10.0.0.0/8'), ec2.Port.allTraffic(),
  );
  sg.addIngressRule(
    ec2.Peer.ipv6(Fn.select(0, vpc.vpcIpv6CidrBlocks)), ec2.Port.allTraffic(),
  );
  return sg;
}

@aws-cdk-automation aws-cdk-automation dismissed their stale review June 5, 2023 23:01

✅ Updated pull request passes all PRLinter validations. Dismissing previous PRLinter review.

@kishiel
Contributor Author

kishiel commented Jun 6, 2023

All tests are passing on my machine. With any luck the build will succeed here. There are a few deficiencies / quality-of-life changes that this PR highlights:

  1. The supplemental permission changes for the aws-cni driver described in this doc should be codified and handled automatically by the EKS constructs. Without this, a customer of EKS will create an IPv6 cluster whose pods cannot obtain an IP (a hedged workaround sketch follows this list)
  2. Alternatively, the base role used for aws-cni should be amended to include IPv6 assign/unassign permissions
  3. Stemming from 1, the default node role is difficult to manipulate (to add these permissions) because it's an optional role. There doesn't appear to be a way to define a default node role for a cluster.
  4. EKS cluster security group egress definitions, I believe, automatically grant all traffic out on 0.0.0.0/0. This should be amended to include ::/0 for IPv6 clusters.
  5. Similar to 4, the egress definition for the cluster control plane security group (shows up in additional security groups in the Cluster UI) should amend its egress definition for IPv6. I couldn't find a way to do this manually, but admittedly I didn't dig too hard into it.
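A hedged sketch of the manual workaround for points 1-3, reusing the same permissions as the ipv6Management policy in the sample stack above; it assumes the cluster was created with a default capacity node group:

```ts
import { aws_eks as eks, aws_iam as iam } from 'aws-cdk-lib';

declare const cluster: eks.Cluster;

// Give the default node group's role the IPv6 ENI permissions the VPC CNI
// needs; without them, pods on an IPv6 cluster cannot obtain addresses.
cluster.defaultNodegroup?.role.addToPrincipalPolicy(new iam.PolicyStatement({
  actions: ['ec2:AssignIpv6Addresses', 'ec2:UnassignIpv6Addresses'],
  resources: ['arn:aws:ec2:*:*:network-interface/*'],
}));
```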

Outstanding work for this PR:

The testing surface area for this change is potentially very large given we're completely redefining the networking. I'm happy to add additional tests, but I'd like to have some direction on those tests before I go make them.

@aws-cdk-automation aws-cdk-automation added the pr/needs-community-review label (This PR needs a review from a Trusted Community Member or Core Team Member) on Jun 6, 2023
@@ -365,6 +367,23 @@ removing these ingress/egress rules in order to restrict access to the default s



### @aws-cdk/aws-kms:aliasNameRef
Contributor

Is this supposed to be here?

Contributor Author

Good question. I added everything in my workspace following this change. I had a merge from aws-cdk:main mid-change that I suspect this came from. Let me confirm that this is just duplicative from that merge.

Contributor Author

It doesn't look like this was something from that merge as far as I can tell. I don't know where this came from.

Contributor Author

Removing this change and rebuilding just regenerates it.

/**
 * EKS cluster IP family.
 */
export enum IpFamily {
Contributor

There is already an AddressFamily in aws-ec2.

Contributor Author

export enum AddressFamily {
  IP_V4 = 'IPv4',
  IP_V6 = 'IPv6',
}

While similar, it's not the same as the definition this was modeled after (ipFamily expects lowercase values of these enums). Although largely redundant, I think keeping the enum name consistent with the property name is more intuitive.
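For context, a side-by-side sketch of the two enums; the lowercase string values for the new IpFamily are an assumption based on the comment above about what ipFamily expects:

```ts
// aws-ec2 (existing definition quoted above):
export enum AddressFamily {
  IP_V4 = 'IPv4',
  IP_V6 = 'IPv6',
}

// aws-eks (this PR); presumed lowercase values matching the EKS API:
export enum IpFamily {
  IP_V4 = 'ipv4',
  IP_V6 = 'ipv6',
}
```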

> For more details visit [Configuring the Amazon VPC CNI plugin for Kubernetes to use IAM roles for service accounts](https://docs.aws.amazon.com/eks/latest/userguide/cni-iam-role.html#cni-iam-role-create-role)

```ts
const ipv6Management = new iam.PolicyDocument({
```
Contributor

Could we do this automatically for the user if the cluster was created with ipFamily === 'ipv6'?

Contributor Author

Possibly. I think the real fix here is just getting the default CNI managed policy to add these permissions. I'm struggling to think of why such a change would be rejected.

Contributor Author

I'm taking a look at making a change to do this. It will increase the scope of the PR, but unless we modify the default node group's role permissions to include IPv6, this will be a headache for everyone, because default node groups will never succeed out of the box with this change.

A cheap(er) way to do it would be to just directly modify the nodeGroup role to always include ipv6, but I think having the ability to measure the cluster ipFamily through CDK has value, too.
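A hedged sketch of that second idea, i.e. branching on the cluster's address family from CDK code; it assumes this PR exposes an ipFamily property on the Cluster construct:

```ts
import { Annotations, aws_eks as eks } from 'aws-cdk-lib';

declare const cluster: eks.Cluster;

// Surface a reminder at synth time when the cluster is IPv6, so additional
// node roles get the supplemental ENI permissions shown earlier.
if (cluster.ipFamily === eks.IpFamily.IP_V6) {
  Annotations.of(cluster).addInfo('IPv6 cluster: node roles need ec2:AssignIpv6Addresses / ec2:UnassignIpv6Addresses.');
}
```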

@mergify mergify bot dismissed otaviomacedo’s stale review June 7, 2023 19:37

Pull request has been modified.

@kishiel
Contributor Author

kishiel commented Jun 7, 2023

This update adds the supplemental IPv6 networking permissions to the default node group role if the Cluster ipFamily is IP_V6. This required plumbing the IpFamily throughout the cluster construct.

Edit: This change enabled me to revert some of the customization in the cluster-ipv6 test and restore it to look like the ipv4 cluster test it was cloned from.

@mergify
Contributor

mergify bot commented Jun 8, 2023

Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

@aws-cdk-automation
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: 7f9b1de
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@mergify mergify bot merged commit 75d1853 into aws:main Jun 8, 2023
@mergify
Contributor

mergify bot commented Jun 8, 2023

Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).
