
Evaluate kops vs EKS #28

Closed · yuvipanda opened this issue Feb 14, 2021 · 9 comments

@yuvipanda (Member)

I'm playing with kops for managing AWS Kubernetes clusters instead of EKS. I've been frustrated with EKS for a while:

  1. It uses the AWS VPC CNI for networking. This is AWS native, but has one major problem for us: pod density. Each pod needs an IP address allocated from an Elastic Network Interface (ENI), and there are fairly strong limits on how many ENIs (and addresses per ENI) a node can have. For example, an m5.large node can only run about 28 pods (rough math below). That's very low, and wastes a lot of money.
  2. Managing worker nodes is actually quite painful, at least compared to GKE. Managed Node Groups don't scale down to 0, so we can't really use them. You end up managing nodes via eksctl instead - the aws CLI is too hard to use directly, and Terraform is unnecessarily complex for this.
  3. The EKS control plane costs $72 a month. I want a lower base cost.

(1) was the motivating case. (2) makes me feel like I'm already managing a lot of infrastructure intimately - might as well embrace the extra control provided by kops, no?
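For reference, the back-of-the-envelope math behind that limit (assuming default VPC CNI settings, no prefix delegation) is roughly:

```
max pods ~= ENIs x (IPv4 addresses per ENI - 1) + 2

m5.large:   3 x (10 - 1) + 2 = 29
m5.2xlarge: 4 x (15 - 1) + 2 = 58
```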

Going to test it out, and keep this issue updated.

@yuvipanda (Member Author)

AWS is aware of problem (1) and has pod density fixes on its roadmap. Realistically, I don't expect that for another 5-6 months at the earliest.

@yuvipanda (Member Author)

My current setup is:

  1. A t3.medium master, so the base cost is about $30 a month. I also want the hub pods to run there.
  2. EFS for home directories, so you only pay for what you use (see the sketch after this list).
  3. Autoscale to 0, so you only pay for what you use.
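To illustrate item (2), here is a minimal sketch of how EFS home directories could be wired up as a plain NFS-backed PersistentVolume - the filesystem endpoint and sizes are placeholders, not our actual manifests:

```yaml
# Hypothetical sketch: EFS exposed to pods as an NFS-backed PersistentVolume for home
# directories. The EFS DNS name is a placeholder; the real setup isn't shown in this issue.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: home-directories
spec:
  capacity:
    storage: 1Mi                 # EFS is elastic, so this value is mostly symbolic
  accessModes:
    - ReadWriteMany
  nfs:
    server: fs-0123456789abcdef0.efs.us-east-2.amazonaws.com   # placeholder EFS endpoint
    path: /
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: home-directories
spec:
  storageClassName: ""
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Mi
```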

@yuvipanda (Member Author) commented Feb 14, 2021

Things I've had to patch so far:

@yuvipanda (Member Author)

So kubernetes/enhancements#1144 ended up needing some work - it needed a couple of feature gates enabled. Thankfully, kops makes this easy.

I now have a cluster working almost the way I want with the following kops config:

```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2021-02-13T23:25:11Z"
  name: farallon-2i2c.k8s.local
spec:
  clusterAutoscaler:
    enabled: true
  api:
    loadBalancer:
      class: Classic
      type: Public
  authorization:
    rbac: {}

  dns:
    kubeDNS:
      provider: CoreDNS
  channel: stable
  cloudProvider: aws
  configBase: s3://2i2c-farallon-pangeo-kops/farallon-2i2c.k8s.local
  containerRuntime: docker
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-us-east-2a
      name: a
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-us-east-2a
      name: a
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
    featureGates:
      LegacyNodeRoleBehavior: "false"
      ServiceNodeExclusion: "false"
  kubeControllerManager:
    featureGates:
      LegacyNodeRoleBehavior: "false"
      ServiceNodeExclusion: "false"
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.19.7
  masterPublicName: api.farallon-2i2c.k8s.local
  networkCIDR: 172.20.0.0/16
  networking:
    calico:
      majorVersion: v3
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.32.0/19
    name: us-east-2a
    type: Public
    zone: us-east-2a
  topology:
    dns:
      type: Public
    masters: public
    nodes: public

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-02-13T23:25:12Z"
  labels:
    kops.k8s.io/cluster: farallon-2i2c.k8s.local
  name: master-us-east-2a
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1
  machineType: t3.medium
  maxSize: 1
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: master-us-east-2a
  role: Master
  subnets:
  - us-east-2a

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: farallon-2i2c.k8s.local
    hub.jupyter.org/pool-name: notebook-m5-xlarge
  name: notebook-m5-xlarge-2021-02-15
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1
  cloudLabels:
      k8s.io/cluster-autoscaler/node-template/label/hub.jupyter.org/pool-name: notebook-m5-xlarge  # must match the nodeLabels value below for scale-from-zero
      k8s.io/cluster-autoscaler/node-template/taint/hub.jupyter.org_dedicated: user:NoSchedule
      k8s.io/cluster-autoscaler/node-template/taint/hub.jupyter.org/dedicated: user:NoSchedule
  taints:
   - hub.jupyter.org_dedicated=user:NoSchedule
   - hub.jupyter.org/dedicated=user:NoSchedule
  nodeLabels:
    hub.jupyter.org/pool-name: notebook-m5-xlarge
  machineType: m5.xlarge
  maxSize: 20
  minSize: 0
  role: Node
  subnets:
  - us-east-2a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: farallon-2i2c.k8s.local
    hub.jupyter.org/pool-name: notebook-m5-2xlarge
  name: notebook-m5-2xlarge-2021-02-15
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1
  cloudLabels:
      k8s.io/cluster-autoscaler/node-template/label/hub.jupyter.org/pool-name: notebook-m5-2xlarge  # must match the nodeLabels value below for scale-from-zero
      k8s.io/cluster-autoscaler/node-template/taint/hub.jupyter.org_dedicated: user:NoSchedule
      k8s.io/cluster-autoscaler/node-template/taint/hub.jupyter.org/dedicated: user:NoSchedule
  taints:
   - hub.jupyter.org_dedicated=user:NoSchedule
   - hub.jupyter.org/dedicated=user:NoSchedule
  nodeLabels:
    hub.jupyter.org/pool-name: notebook-m5-2xlarge
  machineType: m5.2xlarge
  maxSize: 20
  minSize: 0
  role: Node
  subnets:
  - us-east-2a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: farallon-2i2c.k8s.local
    hub.jupyter.org/pool-name: dask-worker
  name: dask-worker-2021-02-15
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210119.1
  cloudLabels:
      k8s.io/cluster-autoscaler/node-template/label/hub.jupyter.org/pool-name: dask-worker
      k8s.io/cluster-autoscaler/node-template/taint/k8s.dask.org_dedicated: worker:NoSchedule
      k8s.io/cluster-autoscaler/node-template/taint/k8s.dask.org/dedicated: worker:NoSchedule
  taints:
   - k8s.dask.org_dedicated=worker:NoSchedule
   - k8s.dask.org/dedicated=worker:NoSchedule
  nodeLabels:
    hub.jupyter.org/pool-name: dask-worker
  machineType: m5.2xlarge
  maxSize: 50
  minSize: 0
  role: Node
  subnets:
  - us-east-2a
```

I have so far liked this experience much, much more than EKS! There is more control than with EKS, which is actually a good thing - since EKS is only 'semi-managed', you are often stuck in places where they have made a constraining decision but haven't provided enough support to make it work easily. Maybe that will change once managed node groups reach feature parity with other cloud providers' offerings.

Something else I really like is that there's a single t3.medium master node, and it runs our JupyterHubs too! So the total base cost comes to a little over $30 a month. With EKS, you have to pay the $72 a month control plane fee, plus enough for a node to run the hub infrastructure - and that extra node needs to be much bigger too, due to the low pod density.

I still need to set up spot instances, and to test Dask.

@yuvipanda (Member Author)

With Calico as the CNI by default, you get NetworkPolicy enforcement - so I need to make sure our Dask setup works with that.
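For illustration, the kind of allow rule Dask would need once Calico starts enforcing policies might look like this - the labels are placeholders, not the ones our charts actually use:

```yaml
# Hypothetical sketch: explicitly allow traffic between Dask pods under NetworkPolicy
# enforcement. The `app: dask` label is a stand-in for whatever labels the
# scheduler/worker pods actually carry.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dask-internal
spec:
  podSelector:
    matchLabels:
      app: dask
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: dask
```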

@rothgar commented Feb 18, 2021

Thanks for the example config and write-up here. Based on my understanding of what you're trying to do, here are some suggestions I would investigate if I were in your situation.

You may already have done some of these things - this is just the research I've done and some ideas.

Reducing cost

kops is great and removes a lot of the operational burden from you. If you want to stick with it, I would look into using spot worker nodes with mixed ASG instance types. You can get a list of comparable instance types using the EC2 instance selector, and you'll be able to reduce your cost by a lot.

Two things to consider with spot: first, I'm not sure whether the pods students connect to are stateful - if an instance gets shut down in the middle of a class it could be very disruptive, especially at the pod density you're running. Second, you'll need to run a termination handler; you can use the official AWS node termination handler.
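As a rough sketch (not tested here), a spot-backed worker instance group in kops could look something like this, using its mixedInstancesPolicy support - instance types and the group name are just examples, and the exact field names depend on the kops version:

```yaml
# Illustrative sketch of a spot worker InstanceGroup using kops' mixedInstancesPolicy.
# Instance types, sizes, and the group name are placeholders.
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: farallon-2i2c.k8s.local
  name: spot-workers-example
spec:
  machineType: m5.xlarge           # base type; mixedInstancesPolicy lists the alternatives
  mixedInstancesPolicy:
    instances:
      - m5.xlarge
      - m5a.xlarge
      - m4.xlarge
    onDemandBase: 0                # everything above the base capacity is spot
    onDemandAboveBase: 0
    spotAllocationStrategy: lowest-price
  maxSize: 20
  minSize: 0
  role: Node
  subnets:
    - us-east-2a
```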

The second thing I would look at is k3s. If you're already running a single-node master, k3s could reduce the overhead quite a bit by using SQLite instead of etcd. If it were me, I would run k3s via Terraform, mount an EBS volume, and store the SQLite database there. If you make the API server node an ASG with min 1 / max 1 and store your state on EBS, it'll be easy to replace the instance if/when things go bad.

There's a slightly older Terraform project that you can use as a starting point. Even with k3s you can still create spot-based ASG worker nodes. You can also look at what kops is doing with --target terraform and inspect the tf files it exports.

I suspect you'd be able to reduce your control plane node size with k3s, and reduce your worker node cost by more than 50% with spot.

Faster scaling

This problem was a bit more interesting to look at, and I have three main recommendations.

Build a custom AMI with images pre-pulled

Since your container images are so large, you can solve this problem a couple of different ways. You can build a custom AMI with the images already pulled, and snapshot that instance. You'd want to rebuild this AMI often so nodes don't have to pull too much when they join the cluster. If you don't want to build a custom AMI, you can also mount an EBS volume at /var/lib/docker and use snapshots, cloning the disk for each worker node. I forget the exact syntax, but you can do it in the launch template for the ASG.
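If you want something lighter-weight than a full custom AMI, a related trick (just a sketch, with a placeholder image name) is a DaemonSet that pre-pulls the big user image onto every node as soon as it joins, so later pods start from a warm cache:

```yaml
# Hypothetical sketch: pre-pull the user image on every node via a DaemonSet.
# The image name is a placeholder; tolerations are wide open so it also lands on
# tainted notebook/dask nodes.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepuller
spec:
  selector:
    matchLabels:
      app: image-prepuller
  template:
    metadata:
      labels:
        app: image-prepuller
    spec:
      tolerations:
        - operator: Exists
      initContainers:
        - name: pull-user-image
          image: example.org/user-notebook-image:latest   # placeholder
          command: ["/bin/sh", "-c", "true"]               # pulling the image is the whole point
      containers:
        - name: pause
          image: k8s.gcr.io/pause:3.2
```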

It would be great if you had a breakdown of how long it currently takes to start a pod, with each stage broken down:

  - How long does it take to add an instance until it's healthy in the cluster?
  - How long does it take to pull the image?
  - How long does it take to start the pod and have it become healthy?

Control ASG scale up outside of Kubernetes

The cluster autoscaler works for gradual increases, but it's not great for sudden bursts and is naturally reactive rather than proactive. I think you mentioned that a teacher has a web portal to create the student notebooks. I'm assuming that behind the web app is something that scales up your deployment, and then you let the autoscaler add instances.

If you can change the web app to scale up the ASG before it scales up the k8s deployment, you can probably get ahead of the scaling needs by 30-60 seconds. You could have the deployment button talk directly to the AWS API and ASG, but that means you need to hard-code region/ASG information. It would probably be easier to use EventBridge, which can send a small amount of data such as the region and how many pods are about to be added. You can then do some basic math in a Lambda to work out how many instances to add to the ASG based on your desired pod density.
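The math itself can be very simple - purely illustrative numbers:

```
instances_to_add = ceil(new_pods / pods_per_node)
e.g. 60 new student pods / 20 pods per node = 3 extra instances
```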

You can still use the cluster autoscaler to scale the ASG down (eventually to 0), but scaling up will always be slower if you're reacting to scheduled pods or metrics rather than pre-scaling the instances.

Optimize container image

This option is a bit more experimental, but would be interesting to look at as a possibility. It would also require you to switch to containerd as your container runtime in Kubernetes, but you should probably do that within the next year anyway, since the Docker runtime will be removed at some point.
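(For reference, the kops side of that switch is a one-line change in the Cluster spec; the snapshotter configuration itself lives in containerd's own config and isn't shown here.)

```yaml
# Sketch: switch the cluster's container runtime in the kops Cluster spec -
# a prerequisite for containerd snapshotter plugins such as stargz.
spec:
  containerRuntime: containerd
```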

Looking at containerd plugins like stargz, you can have large images that pull data on demand. This means you could probably have most of the initial UI files pulled up front, and then pull the rest of the libraries once a student tries to run their notebook. This option may not work for you, but if the other two options don't reduce your startup time enough, it would probably be the next thing I would look at.

I would love to hear back if you try any of these suggestions and would be curious to know how well they work. Feel free to reach out if you have other questions.

@yuvipanda (Member Author)

Thanks for the suggestions, @rothgar!

Unfortunately, student pods depend on in-memory state, so we can't use spot instances for them. That's also why we can't aggressively scale down. We do use spot instances with Dask though, since those workloads are much more resilient to terminations.

Building AMIs with container images pre-pulled is one of the things I'm most excited about with the move to kops - and I just realized this is possible with eksctl as well. Right now, we run student workflows mostly on Google Cloud; only research workflows are on AWS. Will definitely put effort towards this when that changes.

I've never really considered k3s for anything more than single-node workflows. We try very hard to be as uniform across cloud providers as possible, so k3s on AWS doesn't seem worth the trouble for the differential. I also don't think the etcd resource cost makes a big difference in our case, but I'll consider it! kops just seems a lot better supported...

stargz is definitely on our radar! Will report back when we start working on it.

Thanks for responding here, @rothgar - we appreciate it. I think pod density is really our biggest blocker with EKS for many use cases, so I'll keep an eye out for that getting better so we can re-evaluate it against kops.

@rothgar commented Feb 18, 2021

I totally understand about spot and in-memory state. Custom AMI support for managed node groups came out late last year, so that should work for you once you start building an AMI with containers pre-pulled.

We're working on VPC CNI pod density, but it's not quite ready yet. I'm a big fan of Cilium too. They have a walkthrough on using it with EKS that might be helpful if you want to try EKS again: https://docs.cilium.io/en/v1.9/gettingstarted/k8s-install-eks/

@yuvipanda (Member Author)

@rothgar ah, thanks! I'm just wary of running a custom CNI when I have no control over the master. Hope that makes sense.
