This Go application manages the scaling out of CircleCI runners. It is composed of two types of background workers:
- Discovery Worker: discovers new resource classes that should be scaled and spawns new scaling workers
- Scaling Worker: checks for unclaimed tasks on CircleCI for the resource class it manages, scales the ASG associated with that resource class, and waits for instances to be up before continuing.
It supports both EC2 and Kubernetes-based runners.
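Roughly, the relationship between the two worker types can be pictured like this; a minimal sketch, where the function names, polling interval, and wiring are illustrative, not the exact ones in this codebase:

```go
package main

import (
	"context"
	"time"
)

// discoveryWorker periodically discovers resource classes and spawns one
// scaling worker per class it has not seen before.
func discoveryWorker(ctx context.Context, discover func() []string) {
	seen := map[string]bool{}
	ticker := time.NewTicker(time.Minute) // illustrative interval
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, rc := range discover() {
				if !seen[rc] {
					seen[rc] = true
					go scalingWorker(ctx, rc)
				}
			}
		}
	}
}

// scalingWorker polls CircleCI for unclaimed tasks for a single resource
// class and scales the backing ASG (or Kubernetes Jobs) to match.
func scalingWorker(ctx context.Context, resourceClass string) {
	// check unclaimed tasks -> scale out -> wait for instances to be up
}
```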
There was no open-source client for the CircleCI Runner API, so the one we use is automatically generated by oapi-codegen from the OpenAPI definition available here.
If you make any changes to the YAML file, run the following to regenerate the code:
```sh
$ make generate-client
```
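For reference, a client generated by oapi-codegen exposes a `NewClientWithResponses` constructor and a `WithRequestEditorFn` option, which is a natural place to inject the `Circle-Token` authentication header. This is a hedged sketch: the import path is hypothetical and the endpoint is the assumed public Runner API host.

```go
package main

import (
	"context"
	"net/http"

	runner "github.com/vela-games/circleci-runner-autoscaler/client" // hypothetical import path
)

func newRunnerClient(token string) (*runner.ClientWithResponses, error) {
	return runner.NewClientWithResponses(
		"https://runner.circleci.com", // assumed Runner API endpoint
		// Attach the Circle-Token header to every outgoing request.
		runner.WithRequestEditorFn(func(ctx context.Context, req *http.Request) error {
			req.Header.Set("Circle-Token", token)
			return nil
		}),
	)
}
```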
We provide a Helm chart for deploying the service.
For the AWS EC2 scaler, you need to create an IAM role with permissions to perform autoscaling actions on your Auto Scaling groups. There's an example Terraform module here.
If you are running the service on Kubernetes, you'll need to make sure the service account deployed by the chart uses this IAM role.
We currently don't have any public repositories for the Docker image or the Helm chart, but this is something we are looking into.
All configuration is loaded from environment variables using envconfig.
| Configuration Name | Environment Variable | Default Value | Description |
|---|---|---|---|
| KubernetesScalerEnabled | APP_KUBERNETES_SCALER_ENABLED | true | Enable the Kubernetes discovery and autoscaler |
| KubernetesNamespace | APP_KUBERNETES_NAMESPACE | circleci-runners | Kubernetes namespace to use for runner discovery and scaling |
| CircleToken | APP_CIRCLE_TOKEN | | CircleCI API token to use for the Runner API |
| CircleResourceNamespace | APP_CIRCLE_RESOURCE_NAMESPACE | vela-games | CircleCI resource namespace to use for runner discovery |
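To illustrate how these map to code, here's a minimal sketch of what the configuration struct could look like with envconfig; the field and tag names follow the table above, but the exact struct in this repo may differ.

```go
package main

import "github.com/kelseyhightower/envconfig"

type Config struct {
	KubernetesScalerEnabled bool   `envconfig:"KUBERNETES_SCALER_ENABLED" default:"true"`
	KubernetesNamespace     string `envconfig:"KUBERNETES_NAMESPACE" default:"circleci-runners"`
	CircleToken             string `envconfig:"CIRCLE_TOKEN" required:"true"`
	CircleResourceNamespace string `envconfig:"CIRCLE_RESOURCE_NAMESPACE" default:"vela-games"`
}

func loadConfig() (*Config, error) {
	var cfg Config
	// The "app" prefix yields APP_KUBERNETES_SCALER_ENABLED, APP_CIRCLE_TOKEN, etc.
	if err := envconfig.Process("app", &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}
```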
As part of a previous project, we open-sourced a Terraform module to manage the runners' autoscaling groups. We recommend using that same module, as it already includes the code needed to support this service.
The service discovers all resource classes it has to scale by fetching every autoscaling group tagged with `resource-class`; it then manages the desired capacity of each ASG based on the unclaimed tasks for its resource class.
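Sketched with the AWS SDK for Go v2, that loop could look roughly like the following; the `unclaimed` callback and function names are illustrative, and real code would also paginate results and clamp to the ASG's MaxSize.

```go
package main

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling/types"
)

// scaleOut finds every ASG tagged with resource-class and raises its desired
// capacity to cover the unclaimed tasks for that resource class.
func scaleOut(ctx context.Context, unclaimed func(resourceClass string) (int32, error)) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	client := autoscaling.NewFromConfig(cfg)

	out, err := client.DescribeAutoScalingGroups(ctx, &autoscaling.DescribeAutoScalingGroupsInput{
		Filters: []types.Filter{{Name: aws.String("tag-key"), Values: []string{"resource-class"}}},
	})
	if err != nil {
		return err
	}

	for _, asg := range out.AutoScalingGroups {
		for _, tag := range asg.Tags {
			if aws.ToString(tag.Key) != "resource-class" {
				continue
			}
			tasks, err := unclaimed(aws.ToString(tag.Value))
			if err != nil {
				return err
			}
			// Only scale out; scale-in is handled by the runners themselves.
			if tasks > aws.ToInt32(asg.DesiredCapacity) {
				if _, err := client.SetDesiredCapacity(ctx, &autoscaling.SetDesiredCapacityInput{
					AutoScalingGroupName: asg.AutoScalingGroupName,
					DesiredCapacity:      aws.Int32(tasks),
				}); err != nil {
					return err
				}
			}
		}
	}
	return nil
}
```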
This only handles scaling out runners. To scale in, we rely on the self-hosted runner configuration to kill its own process after a certain idle timeout is reached; once the process exits, a script on the instance detaches it from the ASG and shuts it down.
⚠️ DEPRECATED: We created this feature as a POC for scaling runners on Kubernetes before CircleCI released their Container Runner. We still use this feature internally at Vela but it never reached GA status. We recommend using CircleCI's official operator.
For this feature, we used a suspended CronJob as a pod template for the runners. The autoscaler discovers the resource classes managed by Kubernetes by looking for all CronJobs in the configured namespace that carry the labels `resource-class-org` and `resource-class-name`. Here's an example of a CronJob you can use:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: small-runner
  namespace: circleci-runners
  labels:
    resource-class-org: "vela-games"
    resource-class-name: "k8s-small"
spec:
  suspend: true
  schedule: "* * * * *"
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 21600
      template:
        metadata:
          labels:
            resource-class-org: "vela-games"
            resource-class-name: "k8s-small"
        spec:
          containers:
            - name: ci-small
              image: circleci-image:latest
              command: ["/opt/circleci/start.sh"]
              env:
                - name: LAUNCH_AGENT_RUNNER_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name
                - name: LAUNCH_AGENT_API_AUTH_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: api-token
                      key: LAUNCH_AGENT_API_AUTH_TOKEN
                      optional: false
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  cpu: 2000m
                  memory: 1Gi
                  ephemeral-storage: "15Gi"
                requests:
                  cpu: 2000m
                  memory: 1Gi
                  ephemeral-storage: "15Gi"
          restartPolicy: OnFailure
```
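For reference, the discovery and launch side of this could be sketched with client-go roughly as follows; function names are illustrative, and real code would decide how many Jobs to create per resource class based on unclaimed tasks.

```go
package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// launchRunners lists suspended CronJobs carrying both resource-class labels
// and spawns one runner Job from each CronJob's pod template.
func launchRunners(ctx context.Context, cs kubernetes.Interface, namespace string) error {
	cronJobs, err := cs.BatchV1().CronJobs(namespace).List(ctx, metav1.ListOptions{
		// Existence selector: both labels must be present.
		LabelSelector: "resource-class-org,resource-class-name",
	})
	if err != nil {
		return err
	}
	for _, cj := range cronJobs.Items {
		job := &batchv1.Job{
			ObjectMeta: metav1.ObjectMeta{
				GenerateName: cj.Name + "-",
				Namespace:    namespace,
				Labels:       cj.Spec.JobTemplate.Spec.Template.Labels,
			},
			// Reuse the suspended CronJob's job template as-is.
			Spec: cj.Spec.JobTemplate.Spec,
		}
		if _, err := cs.BatchV1().Jobs(namespace).Create(ctx, job, metav1.CreateOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```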
At the moment we are not providing any publicly accessible base images for the k8s runners, but we do provide an example here.