This example shows how to use Serverless VPC Access and Cloud Functions to build a high-performance TCP healthcheck for unmanaged GCE instances. The healthchecker Cloud Function uses goroutines to check multiple instances in parallel, and can check up to 1,000 VMs in under a second of execution time.
NOTE: Managed Instance Groups have autohealing functionality out of the box; this example is more applicable to standalone VMs or VMs in an unmanaged instance group.
The example contains the following components:
- Cloud Scheduler to initiate a healthcheck on a schedule.
- Serverless VPC Connector to allow Cloud Functions TCP level access to private GCE instances.
- Healthchecker Cloud Function to perform TCP checks against GCE instances.
- Restarter PubSub topic to keep track of instances which are to be restarted.
- Restarter Cloud Function to perform GCE instance reset for instances which are failing TCP healthcheck.
The resources created in this example are shown in the high level diagram below:
The healthchecker Cloud Function has the following configuration options:
- FILTER: filter for the list of GCE instances the health check targets, for instance (name = nginx-*) AND (labels.env = dev).
- GRACE_PERIOD: time period during which a newly created instance is not checked, allowing services to start on the instance.
- MAX_PARALLELISM: maximum number of healthchecks performed in parallel. Be aware that every check requires an open TCP connection, and the number of open connections is limited.
- PUBSUB_TOPIC: topic to publish the message with instance metadata to.
- RECHECK_INTERVAL: time to wait before a recheck; a failed check is repeated before the instance is marked as unhealthy.
- TCP_PORT: port used for health checking.
- TIMEOUT: timeout of a TCP probe.
NOTE: In the current example the healthchecker is used along with the restarter cloud function, but the restarter can be replaced with another function, like Pubsub2Inbox for email notifications.
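As a rough illustration of what a replacement subscriber might do, the Go sketch below decodes a message payload and turns it into a notification line. The payload shape is an assumption: the `RestartTask` fields and the JSON keys are hypothetical and are not taken from the healthchecker source, so check the actual message format before relying on them.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// RestartTask is a HYPOTHETICAL payload shape for the message published by
// the healthchecker; the real field names may differ.
type RestartTask struct {
	Project  string `json:"project"`
	Zone     string `json:"zone"`
	Instance string `json:"instance"`
}

// decodeTask parses the raw data of a Pub/Sub message into a RestartTask.
func decodeTask(data []byte) (RestartTask, error) {
	var t RestartTask
	if err := json.Unmarshal(data, &t); err != nil {
		return RestartTask{}, fmt.Errorf("decode restart task: %w", err)
	}
	if t.Instance == "" {
		return RestartTask{}, errors.New("restart task has no instance name")
	}
	return t, nil
}

// notify stands in for whatever replaces the restarter, for example an
// email notification. Here it only formats a message.
func notify(t RestartTask) string {
	return fmt.Sprintf("instance %s in zone %s (project %s) failed its TCP healthcheck",
		t.Instance, t.Zone, t.Project)
}

func main() {
	data := []byte(`{"project":"my-project-id","zone":"europe-west1-b","instance":"nginx-test"}`)
	task, err := decodeTask(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(notify(task))
}
```

Keeping the decoding separate from the action makes it straightforward to swap a reset for a notification without touching the healthchecker.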
Clone this repository or open it in cloud shell, then go through the following steps to create resources:
terraform init
terraform apply -var project_id=my-project-id
Once done testing, you can clean up resources by running terraform destroy. To persist state, check out the backend.tf.sample file.
Configure gcloud with the project used for the deployment
gcloud config set project <MY-PROJECT-ID>
Wait until cloud scheduler executes the healthchecker function
gcloud scheduler jobs describe healthchecker-schedule
Check the healthchecker function logs to ensure instance is checked and healthy
gcloud functions logs read cf-healthchecker --region=europe-west1
#cf-healthchecker ywn0mojbmgnw 2022-03-15 21:40:01.446 Function execution took 419 ms, finished with status code: 200
#cf-healthchecker ywn0mojbmgnw 2022-03-15 21:40:01.442 1 instances found to be health checked.
#cf-healthchecker ywn0mojbmgnw 2022-03-15 21:40:01.028 Function execution started
Stop the nginx service on the test instance
gcloud compute ssh --zone europe-west1-b nginx-test -- 'sudo systemctl stop nginx'
Wait a few minutes to allow scheduler to execute another healthcheck and examine the function logs
gcloud functions logs read cf-healthchecker --region=europe-west1
#cf-healthchecker ywn0bmojtrji 2022-03-15 21:59:21.202 Instance restart task has been sent for instance nginx-test
#cf-healthchecker ywn0bmojtrji 2022-03-15 21:59:21.201 Restart message published with id=4211063168407327
#cf-healthchecker ywn0bmojtrji 2022-03-15 21:59:20.919 Healthcheck failed for instance nginx-test
#cf-healthchecker ywn0bmojtrji 2022-03-15 21:59:10.914 Instance nginx-test is not responding, will recheck.
#cf-healthchecker ywn0bmojtrji 2022-03-15 21:59:10.910 1 instances found to be health checked.
#cf-healthchecker ywn0bmojtrji 2022-03-15 21:59:10.522 Function execution started
Examine the cf-restarter function logs
gcloud functions logs read cf-restarter --region=europe-west1
#cf-restarter yj6qiott5c4p 2022-03-15 21:59:24.625 Function execution took 975 ms, finished with status: 'ok'
#cf-restarter yj6qiott5c4p 2022-03-15 21:59:24.623 Instance nginx-test has been reset.
#cf-restarter yj6qiott5c4p 2022-03-15 21:59:23.653 Function execution started
Verify that the nginx service is running again and that uptime shows the instance has been reset
gcloud compute ssh --zone europe-west1-b nginx-test -- 'sudo systemctl status nginx'
gcloud compute ssh --zone europe-west1-b nginx-test -- 'uptime'
name | description | type | required | default |
---|---|---|---|---|
billing_account | Billing account id used as default for new projects. | string | ✓ | |
project_id | Project id to create a project when project_create is true, or to be used when false. | string | ✓ | |
grace_period | Grace period for an instance startup. | string | | "180s" |
location | App Engine location used in the example (required for CloudFunctions). | string | | "europe-west" |
project_create | Create project instead of using an existing one. | bool | | false |
region | Compute region used in the example. | string | | "europe-west1" |
root_node | The resource name of the parent folder or organization for project creation, in 'folders/folder_id' or 'organizations/org_id' format. | string | | null |
schedule | Cron schedule for executing compute instances healthcheck. | string | | "*/5 * * * *" (every five minutes) |
tcp_port | TCP port to run healthcheck against. | string | | "80" (HTTP) |
timeout | TCP probe timeout. | string | | "1000ms" |


name | description | sensitive |
---|---|---|
cloud-function-healthchecker | Cloud Function Healthchecker instance details. | |
cloud-function-restarter | Cloud Function Restarter instance details. | |
pubsub-topic | Restarter PubSub topic. | |