
Manual Autoscaling Kubernetes Setup


We have ambitions to give Toil the ability to launch autoscaling Kubernetes clusters. Here's how to create one manually. We plan to automate this process in future versions of Toil, and also to apply some of these techniques to our shared-tenancy Kubernetes cluster here at the GI.

This setup uses numerous pieces of software, because Kubernetes does not include batteries. We use (roughly in order):

  • kubeadm, to install Kubernetes.
  • Kubernetes itself, to be Kubernetes.
  • The Kubernetes AWS Cloud Provider, to let Kubernetes understand the correspondence between Kubernetes Nodes and AWS Instances. (This should also teach it to use AWS load balancers and persistent storage, but those features haven't been tested in this setup.)
  • Flannel, to provide the internal inter-pod VPN that Kubernetes requires but does not include.
  • Kubelet Rubber Stamp, a signing bot for the internal Kubernetes certificate authority that can automatically issue certificates for new nodes that join the cluster. Note that this could allow a rogue worker to intercept secure traffic from the cluster to any domain, by pretending that it has that hostname and getting a certificate issued to it by the leader. The Kubernetes documentation now covers this risk.
  • The Kubernetes Cluster Autoscaler, to scale the cluster up and down.
  • The Kubernetes Metrics Server, which collects RAM and CPU usage information from all the nodes to present when you run kubectl top nodes.
  • A custom service to delete Kubernetes nodes that have been scaled away, since Kubernetes is supposed to do this itself but doesn't.

We do not install the Kubernetes dashboard.

To make an autoscaling Kubernetes cluster on AWS:

  • (Optional) Make sure the VPCs in your AWS account have been configured to assign IPv6 addresses, so your nodes will have persistent addresses if they restart.
  • Decide on a cluster name (clustername).
    • Substitute the cluster name into the Leader Script and Worker Script (below).
  • Decide on an AWS Region (us-west-2) and Availability Zone (us-west-2c).
  • Decide on a maximum cluster size in nodes.
  • Decide on an Owner tag value (anovak@soe.ucsc.edu)
  • Visit https://stable.release.flatcar-linux.net/amd64-usr/current/flatcar_production_ami_hvm_us-west-2.txt (modified for your Region) to obtain the AMI name for the current version of Flatcar (ami-0dcea309142a5bb25).
  • Create (or already have) a Key Pair.
    • Note that only old-style RSA keys are supported.
  • Create a Security Group.
    • When creating it, give it a rule to allow SSH from anywhere over IPv4 and IPv6.
    • Name it clustername-sg.
    • After creating it, add a rule allowing all traffic (TCP and UDP) from the security group itself.
    • Give it an Owner tag with the appropriate value.
  • Create an IAM Role for the leader.
    • Name it clustername-leader-role.
    • Add an Owner tag with the appropriate value.
    • After creating the role, "Add inline policy".
    • Add the Leader Policy (below).
  • Create an IAM Role for the workers.
    • Name it clustername-worker-role.
    • Add an Owner tag with the appropriate value.
    • After creating the role, "Add inline policy".
    • Add the Worker Policy (below).
  • Create a leader Instance.
    • Use the Flatcar AMI previously determined.
    • Choose a t2.large instance type.
    • (Optional) Choose the subnet associated with your chosen Availability Zone.
    • Choose the clustername-leader-role IAM role.
    • (Optional?) Upgrade its root volume from the default 8 GB size to 32 GB
    • Give it an Owner tag with the appropriate value.
    • Give it a Name tag of clustername-leader.
    • Give it a tag named kubernetes.io/cluster/clustername, with an empty value.
    • Put it in the security group previously created.
    • When launching, use the Key Pair previously created or identified.
  • When the leader starts up, SSH to core@<leader IP>.
    • Run the Leader Script (below).
    • Collect the environment variable values printed at the end, or saved to /etc/worker.env by the script, and paste them into the Worker Script (below).
  • Create a Launch Template for the workers.
    • Name it clustername-worker-template
    • Add a Template Tag of Owner, with the appropriate value.
    • Select the Flatcar AMI previously determined.
    • Select the t2.large instance type (or other instance type if desired for your nodes).
    • Select the Key Pair previously created or identified.
    • (Optional?) Under "Storage (volumes)", select the root volume and expand it to 32 GB (or however much EBS storage you want for your nodes).
    • Under "Resource tags", add the following tags:
      • An Owner tag, with the appropriate value, set to propagate to instances and volumes.
      • A Name tag, with the value clustername-worker, set to propagate to instances and volumes.
    • Expand "Advanced details".
      • Set "IAM instance profile" to the Instance Profile corresponding to the worker IAM Role that you created earlier.
      • Paste the modified Worker Script (below), with the variable values from the leader filled in and including the #! line, into the "User data" box.
  • Create an Auto Scaling Group for the workers.
    • Name it clustername-asg
    • Choose the Launch Template previously created.
    • Select one or more Subnets (ideally the one in your selected Availability Zone where your leader is running).
    • Set "Desired capacity" to 0
    • Set "Minimum capacity" to 0
    • Set "Maximum capacity" to your desired maximum cluster size.
    • Add the following tags:
      • An Owner tag, with the appropriate value, with "tag new instances" checked.
      • A tag named kubernetes.io/cluster/clustername, with no value, with "tag new instances" checked.
      • A tag named k8s.io/cluster-autoscaler/clustername, with no value, without "tag new instances".
      • A tag named k8s.io/cluster-autoscaler/enabled, with no value, without "tag new instances".
      • A tag k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage, with value "24G" (or another lower bound on how much ephemeral storage a new node will provide), without "tag new instances".
  • Submit work using kubectl as core on the leader.
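
For example, a minimal smoke-test Job can be submitted from the leader (the Job name, image, and resource requests here are illustrative placeholders; the requests are sized so that the autoscaler has to bring up a worker to run it):

kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: hello-autoscaler
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: hello
        image: ubuntu:20.04
        command: ["echo", "Hello from the autoscaling cluster"]
        resources:
          requests:
            cpu: "1"
            memory: 1Gi
            ephemeral-storage: 1Gi
EOF
kubectl get pods --watch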

The cluster should be working at this point. It will automatically scale up as work is available. Note that it will not scale away the last worker: as soon as the cluster scales up for real work, the metrics server pod finds a worker node to run on (on a best-effort basis) and stays there.
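
To watch the scale-up and scale-down decisions as they happen, the autoscaler's logs and its status ConfigMap are the places to look (assuming the example manifest's default names in kube-system):

kubectl -n kube-system logs -f deployment/cluster-autoscaler
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml
kubectl get nodes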

Future Work

Metrics Server Improvements

We could run the metrics server on the leader, but we'd definitely need to impose a resource limit then, because we don't want it to take down the whole leader if the cluster turns out to have too many nodes for the available leader memory. We could set it up with a "pod nanny" to automatically resize it according to the number of nodes in the cluster, but the metrics server pod actually has to start running before the pod nanny can scale it, and it's not clear what would happen if the metrics server pod hit its resource limits before the nanny could scale it up.
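
If we do run it on the leader, imposing a limit is straightforward; something like this (the numbers are illustrative, not tuned) against the deployment that components.yaml creates would do it:

kubectl -n kube-system set resources deployment metrics-server \
    --requests=cpu=50m,memory=100Mi --limits=cpu=200m,memory=300Mi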

Using Instance Ephemeral Storage

When we deploy workers on instance types with ephemeral local storage, we will need the sort of scripting that Toil already has to RAID, format, and mount the ephemeral storage. We will also need to configure Kubernetes to store its temporary data (Docker images, pod ephemeral storage) there instead of on /.
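
A rough sketch of what that per-worker scripting might look like, to run before kubeadm join (the device names, mount points, and bind-mount targets are assumptions; Toil's provisioner does something similar):

# Assumption: instance-store volumes show up as /dev/nvme1n1 and up,
# with the EBS root on /dev/nvme0n1.
DEVICES=(/dev/nvme[1-9]n1)
# RAID the devices together, format, and mount (add --force for a single device).
sudo mdadm --create /dev/md0 --run --level=0 --raid-devices=${#DEVICES[@]} "${DEVICES[@]}"
sudo mkfs.ext4 /dev/md0
sudo mkdir -p /mnt/ephemeral
sudo mount /dev/md0 /mnt/ephemeral
sudo mkdir -p /mnt/ephemeral/docker /mnt/ephemeral/kubelet /var/lib/docker /var/lib/kubelet
# Bind-mount Docker's and the kubelet's data directories onto the array,
# before either service starts writing to them.
sudo mount --bind /mnt/ephemeral/docker /var/lib/docker
sudo mount --bind /mnt/ephemeral/kubelet /var/lib/kubelet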

Handling Reboots

Because EC2 instances don't have persistent hostnames or addresses across reboots by default (when no IPv6 is available), the workers will be unable to find the leader if the leader ever reboots or is otherwise replaced. Also, if a leader or worker reboots, its Kubernetes node name isn't going to match its hostname anymore, unless Kubernetes is smart enough to change it.

Leader Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeLaunchConfigurations",
                "autoscaling:DescribeTags",
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup",
                "ec2:AttachVolume",
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:CreateRoute",
                "ec2:CreateSecurityGroup",
                "ec2:CreateTags",
                "ec2:CreateVolume",
                "ec2:DeleteRoute",
                "ec2:DeleteSecurityGroup",
                "ec2:DeleteVolume",
                "ec2:DescribeInstances",
                "ec2:DescribeLaunchTemplateVersions",
                "ec2:DescribeRegions",
                "ec2:DescribeRouteTables",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSubnets",
                "ec2:DescribeVolumes",
                "ec2:DescribeVpcs",
                "ec2:DetachVolume",
                "ec2:ModifyInstanceAttribute",
                "ec2:ModifyVolume",
                "ec2:RevokeSecurityGroupIngress",
                "elasticloadbalancing:AddTags",
                "elasticloadbalancing:ApplySecurityGroupsToLoadBalancer",
                "elasticloadbalancing:AttachLoadBalancerToSubnets",
                "elasticloadbalancing:ConfigureHealthCheck",
                "elasticloadbalancing:CreateListener",
                "elasticloadbalancing:CreateLoadBalancer",
                "elasticloadbalancing:CreateLoadBalancerListeners",
                "elasticloadbalancing:CreateLoadBalancerPolicy",
                "elasticloadbalancing:CreateTargetGroup",
                "elasticloadbalancing:DeleteListener",
                "elasticloadbalancing:DeleteLoadBalancer",
                "elasticloadbalancing:DeleteLoadBalancerListeners",
                "elasticloadbalancing:DeleteTargetGroup",
                "elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
                "elasticloadbalancing:DeregisterTargets",
                "elasticloadbalancing:DescribeListeners",
                "elasticloadbalancing:DescribeLoadBalancerAttributes",
                "elasticloadbalancing:DescribeLoadBalancerPolicies",
                "elasticloadbalancing:DescribeLoadBalancers",
                "elasticloadbalancing:DescribeTargetGroups",
                "elasticloadbalancing:DescribeTargetHealth",
                "elasticloadbalancing:DetachLoadBalancerFromSubnets",
                "elasticloadbalancing:ModifyListener",
                "elasticloadbalancing:ModifyLoadBalancerAttributes",
                "elasticloadbalancing:ModifyTargetGroup",
                "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
                "elasticloadbalancing:RegisterTargets",
                "elasticloadbalancing:SetLoadBalancerPoliciesForBackendServer",
                "elasticloadbalancing:SetLoadBalancerPoliciesOfListener",
                "iam:CreateServiceLinkedRole",
                "kms:DescribeKey"
            ],
            "Resource": "*"
        }
    ]
}
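
If you prefer the AWS CLI to the console for the IAM steps above, the equivalent is roughly the following (the leader-policy.json filename and the Owner value are placeholders; note that the CLI, unlike the console, does not create an instance profile for you). The same pattern applies to the worker role with the Worker Policy below.

aws iam create-role --role-name clustername-leader-role \
    --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam tag-role --role-name clustername-leader-role --tags Key=Owner,Value=you@example.com
aws iam put-role-policy --role-name clustername-leader-role \
    --policy-name clustername-leader-policy --policy-document file://leader-policy.json
aws iam create-instance-profile --instance-profile-name clustername-leader-role
aws iam add-role-to-instance-profile --instance-profile-name clustername-leader-role \
    --role-name clustername-leader-role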

Worker Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances",
                "ec2:DescribeRegions",
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:GetRepositoryPolicy",
                "ecr:DescribeRepositories",
                "ecr:ListImages",
                "ecr:BatchGetImage"
            ],
            "Resource": "*"
        }
    ]
}

Leader Script

#!/usr/bin/env bash

# Set cluster name here
CLUSTER_NAME="clustername"

### SHARED SECTION ###

# Set deployment versions here
CNI_VERSION="v0.8.2"
CRICTL_VERSION="v1.17.0"
CNI_DIR=/opt/cni/bin
DOWNLOAD_DIR=/opt/bin
# This is the version of Kubernetes to use
# Get current from: curl -sSL https://dl.k8s.io/release/stable.txt
KUBERNETES_VERSION="v1.19.3"
# This is the version of the Kubernetes "release" tooling to get the service and config files from
# It doesn't auto-update.
RELEASE_VERSION="v0.4.0"
# Now we need the basic cluster services
# Version of Flannel networking to get the YAML from
FLANNEL_VERSION="v0.13.0"
# Version of node CSR signing bot to run
RUBBER_STAMP_VERSION=v0.3.1
# Version of the autoscaler to run
AUTOSCALER_VERSION="1.19.0"
# Version of metrics service to install for `kubectl top nodes`
METRICS_API_VERSION=v0.3.7

sudo mkdir -p ${CNI_DIR}
curl -L "https://github.com/containernetworking/plugins/releases/download/${CNI_VERSION}/cni-plugins-linux-amd64-${CNI_VERSION}.tgz" | sudo tar -C $CNI_DIR -xz
sudo mkdir -p ${DOWNLOAD_DIR}
curl -L "https://github.com/kubernetes-sigs/cri-tools/releases/download/${CRICTL_VERSION}/crictl-${CRICTL_VERSION}-linux-amd64.tar.gz" | sudo tar -C $DOWNLOAD_DIR -xz

cd $DOWNLOAD_DIR
sudo curl -L --remote-name-all https://storage.googleapis.com/kubernetes-release/release/${KUBERNETES_VERSION}/bin/linux/amd64/{kubeadm,kubelet,kubectl}
sudo chmod +x {kubeadm,kubelet,kubectl}

curl -sSL "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/kubepkg/templates/latest/deb/kubelet/lib/systemd/system/kubelet.service" | sed "s:/usr/bin:${DOWNLOAD_DIR}:g" | sudo tee /etc/systemd/system/kubelet.service
sudo mkdir -p /etc/systemd/system/kubelet.service.d
curl -sSL "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/kubepkg/templates/latest/deb/kubeadm/10-kubeadm.conf" | sed "s:/usr/bin:${DOWNLOAD_DIR}:g" | sudo tee /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

sudo systemctl enable --now kubelet

### LEADER SECTION ###

# Set up as a leader
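# The config below: moves the flex volume plugin directory off of Flatcar's
# read-only /usr, turns on the AWS cloud provider, uses the 10.244.0.0/16 pod
# subnet that Flannel's manifest expects, and sets serverTLSBootstrap so that
# kubelet serving certificates are requested from the cluster CA (and then
# approved by the rubber stamp bot deployed further down).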

sudo kubeadm init --config /dev/stdin <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
    cloud-provider: aws
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    flex-volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
networking:
  serviceSubnet: "10.96.0.0/12"
  podSubnet: "10.244.0.0/16"
  dnsDomain: "cluster.local"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serverTLSBootstrap: true
rotateCertificates: true
cgroupDriver: systemd
EOF

mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Install network
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/${FLANNEL_VERSION}/Documentation/kube-flannel.yml

# Deploy rubber stamp CSR signing bot
kubectl apply -f https://raw.githubusercontent.com/kontena/kubelet-rubber-stamp/release/${RUBBER_STAMP_VERSION}/deploy/service_account.yaml
kubectl apply -f https://raw.githubusercontent.com/kontena/kubelet-rubber-stamp/release/${RUBBER_STAMP_VERSION}/deploy/role.yaml
kubectl apply -f https://raw.githubusercontent.com/kontena/kubelet-rubber-stamp/release/${RUBBER_STAMP_VERSION}/deploy/role_binding.yaml
kubectl apply -f https://raw.githubusercontent.com/kontena/kubelet-rubber-stamp/release/${RUBBER_STAMP_VERSION}/deploy/operator.yaml

# Set up autoscaler
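# The sed calls below adapt the upstream example manifest: switch from an
# explicit --nodes=min:max:name range to ASG auto-discovery by the
# k8s.io/cluster-autoscaler tags set on the Auto Scaling Group, retarget the
# node selector and toleration at the master node label kubeadm actually
# applies, and point the CA certificate hostPath at Flatcar's bundle location.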
curl -sSL https://raw.githubusercontent.com/kubernetes/autoscaler/cluster-autoscaler-${AUTOSCALER_VERSION}/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-run-on-master.yaml | sed "s|--nodes={{ node_asg_min }}:{{ node_asg_max }}:{{ name }}|--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/${CLUSTER_NAME}|" | sed 's|kubernetes.io/role: master|node-role.kubernetes.io/master: ""|' | sed 's|operator: "Equal"|operator: "Exists"|' | sed '/value: "true"/d' | sed 's|path: "/etc/ssl/certs/ca-bundle.crt"|path: "/usr/share/ca-certificates/ca-certificates.crt"|' | kubectl apply -f -

# Set up metrics server, which needs serverTLSBootstrap and rubber stamp, and insists on running on a worker
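# The sed adds --kubelet-preferred-address-types=Hostname so the metrics
# server reaches kubelets by the hostnames their serving certificates are
# issued for, rather than by IP.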
curl -sSL https://github.com/kubernetes-sigs/metrics-server/releases/download/${METRICS_API_VERSION}/components.yaml | sed 's/          - --secure-port=4443/          - --secure-port=4443\n          - --kubelet-preferred-address-types=Hostname/' | kubectl apply -f -

# Set up a node cleaner-upper, because Kubernetes isn't promptly deleting gone nodes.
# See https://github.com/kubernetes/autoscaler/issues/3639
sudo tee ${DOWNLOAD_DIR}/cleanup-nodes.sh <<'EOF'
#!/usr/bin/env bash
# cleanup-nodes.sh: constantly clean up NotReady nodes that are tainted as having been deleted
set -e

export PATH="${PATH}:$(dirname "${BASH_SOURCE[0]}")"

while true ; do
    echo "$(date | tr -d '\n'): Checking for scaled-in nodes..."
    for NODE_NAME in $(kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes -o json | jq -r '.items[] | select(.spec.taints) | select(.spec.taints[] | select(.key == "ToBeDeletedByClusterAutoscaler")) | select(.spec.taints[] | select(.key == "node.kubernetes.io/unreachable")) | select(.status.conditions[] | select(.type == "Ready" and .status == "Unknown")) | .metadata.name' | tr '\n' ' ') ; do
        # For every node that's tainted as ToBeDeletedByClusterAutoscaler, and
        # as node.kubernetes.io/unreachable, and hasn't dialed in recently (and
        # is thus in readiness state Unknown)
        echo "Node ${NODE_NAME} is supposed to be scaled away and also gone. Removing from cluster..."
        # Drop it
        kubectl --kubeconfig /etc/kubernetes/admin.conf delete node ${NODE_NAME}
    done
    sleep 30
done
EOF
sudo chmod 755 ${DOWNLOAD_DIR}/cleanup-nodes.sh

sudo tee /etc/systemd/system/cleanup-nodes.service <<EOF
[Unit]
Description=Remove scaled-in nodes
After=kubelet.service
[Service]
ExecStart=${DOWNLOAD_DIR}/cleanup-nodes.sh
Restart=always
StartLimitInterval=0
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable --now cleanup-nodes

# Grab some joining info
export JOIN_TOKEN=$(kubeadm token create --ttl 0)
export JOIN_CERT_HASH=sha256:$(openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //')
export JOIN_ENDPOINT="$(hostname):6443"

# Save the joining info
env | grep JOIN_ | sudo tee /etc/worker.env
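
The saved /etc/worker.env ends up looking something like this (the values shown are illustrative placeholders, not a real token or hash); these three lines are what gets pasted into the top of the Worker Script:

JOIN_TOKEN=abcdef.0123456789abcdef
JOIN_CERT_HASH=sha256:<64 hex characters>
JOIN_ENDPOINT=ip-172-31-0-10.us-west-2.compute.internal:6443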

Worker Script

#!/usr/bin/env bash

# Set cluster name here
CLUSTER_NAME="clustername"

# Set join info here
JOIN_TOKEN=
JOIN_CERT_HASH=
JOIN_ENDPOINT=

### SHARED SECTION ###

# Set deployment versions here
CNI_VERSION="v0.8.2"
CRICTL_VERSION="v1.17.0"
CNI_DIR=/opt/cni/bin
DOWNLOAD_DIR=/opt/bin
# This is the version of Kubernetes to use
# Get current from: curl -sSL https://dl.k8s.io/release/stable.txt
KUBERNETES_VERSION="v1.19.3"
# This is the version of the Kubernetes "release" tooling to get the service and config files from
# It doesn't auto-update.
RELEASE_VERSION="v0.4.0"
# Now we need the basic cluster services
# Version of Flannel networking to get the YAML from
FLANNEL_VERSION="v0.13.0"
# Version of node CSR signing bot to run
RUBBER_STAMP_VERSION=v0.3.1
# Version of the autoscaler to run
AUTOSCALER_VERSION="1.19.0"
# Version of metrics service to install for `kubectl top nodes`
METRICS_API_VERSION=v0.3.7

sudo mkdir -p ${CNI_DIR}
curl -L "https://github.com/containernetworking/plugins/releases/download/${CNI_VERSION}/cni-plugins-linux-amd64-${CNI_VERSION}.tgz" | sudo tar -C $CNI_DIR -xz
sudo mkdir -p ${DOWNLOAD_DIR}
curl -L "https://github.com/kubernetes-sigs/cri-tools/releases/download/${CRICTL_VERSION}/crictl-${CRICTL_VERSION}-linux-amd64.tar.gz" | sudo tar -C $DOWNLOAD_DIR -xz

cd $DOWNLOAD_DIR
sudo curl -L --remote-name-all https://storage.googleapis.com/kubernetes-release/release/${KUBERNETES_VERSION}/bin/linux/amd64/{kubeadm,kubelet,kubectl}
sudo chmod +x {kubeadm,kubelet,kubectl}

curl -sSL "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/kubepkg/templates/latest/deb/kubelet/lib/systemd/system/kubelet.service" | sed "s:/usr/bin:${DOWNLOAD_DIR}:g" | sudo tee /etc/systemd/system/kubelet.service
sudo mkdir -p /etc/systemd/system/kubelet.service.d
curl -sSL "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/kubepkg/templates/latest/deb/kubeadm/10-kubeadm.conf" | sed "s:/usr/bin:${DOWNLOAD_DIR}:g" | sudo tee /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

sudo systemctl enable --now kubelet

### WORKER SECTION ###

# Be a worker

sudo kubeadm join ${JOIN_ENDPOINT} --config /dev/stdin <<EOF
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
nodeRegistration:
  kubeletExtraArgs:
    volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
    cloud-provider: aws
discovery:
  bootstrapToken:
    apiServerEndpoint: ${JOIN_ENDPOINT}
    token: ${JOIN_TOKEN}
    caCertHashes:
    - "${JOIN_CERT_HASH}"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
EOF
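
Once a worker comes up with this script, a quick way to check that it joined and that the supporting services are doing their jobs is, from the leader:

# The new node should appear and go Ready once Flannel is running on it.
kubectl get nodes -o wide
# Its kubelet serving CSR should show up as Approved,Issued (by the rubber stamp bot).
kubectl get csr
# Once the metrics server has scraped it, it should appear here too.
kubectl top nodes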