This GitHub repo demonstrates the usage of a couple of Terraform features (you can skip directly to
them [here](#trick1) and [here](#trick2)) that can reduce the downtime or unavailability of Terraform-managed
resources. For this demonstration, we've built a simple Go HTTP server. We package this server
into a custom machine image using [Packer](https://www.packer.io) and the [UpCloud](https://upcloud.com)
Packer [plugin](https://github.com/UpCloudLtd/packer-plugin-upcloud). This custom image is deployed onto
an UpCloud server using Terraform.

## Repo Structure

[…] of the demo application after every commit.

#### GitHub Action

GitHub Actions is a CI/CD tool baked into GitHub, allowing us to define workflows for our code in the same
repository as the code itself.

The GitHub Action workflow doesn't change throughout the demo. It consists of 4 jobs:

* […] [`hashicorp-contrib/setup-packer`](https://github.com/hashicorp-contrib/setup-packer) action as we
also install a Packer plugin manually.
* `terraform` - This is a fairly vanilla Terraform build job, but we do use the
[`upctl`](https://github.com/UpCloudLtd/upcloud-cli) CLI tool to determine the latest custom machine image name.
* `image_tidy` - This uses the `upctl` CLI tool to delete all but the last 5 images.

To replicate this workflow, you will need to do the following in a fork of this repo:

* Edit `.github/workflows/build.yaml` and change the `UPCLOUD_USERNAME` environment variable to a user that
exists in your UpCloud account. I recommend creating a new user.
* […]

#### Terraform Cloud

Terraform Cloud is a managed service from HashiCorp that will run our Terraform applies and manage and secure our
infrastructure state. It eliminates the need for custom tooling to manage Terraform in production. When running Terraform
in a CI/CD pipeline, we need to store our remote state. This is one of the reasons we are using Terraform Cloud.

To replicate this workflow, you will need to do the following (assuming you've signed up to Terraform Cloud):

* Optionally, create a new organisation. You will need this to configure GitHub Actions.
* Create a new workspace. You will need this to configure GitHub Actions.
* Create a new Terraform Cloud API token for your user. You will also need this to configure GitHub Actions.
* In the workspace variable tab, set these environment variables:
  * `UPCLOUD_USERNAME` - The name of a user within your account. I recommend creating a new user.
  * `UPCLOUD_PASSWORD` - The password of the above user. Remember to tick the 'sensitive' checkbox.
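These two environment variables are how Terraform authenticates: the UpCloud provider reads `UPCLOUD_USERNAME`
and `UPCLOUD_PASSWORD` from the environment at plan/apply time, so no credentials need to appear in the
configuration itself. A minimal sketch of the wiring (the `required_providers` block here is illustrative,
not copied from this repo):

```terraform
terraform {
  required_providers {
    upcloud = {
      source = "UpCloudLtd/upcloud"
    }
  }
}

# No credentials appear here: the provider picks up the UPCLOUD_USERNAME and
# UPCLOUD_PASSWORD environment variables set in the Terraform Cloud workspace.
provider "upcloud" {}
```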
#### Packer

If you are following along, the Packer configuration installs an SSH public key `packer/artefact/id_rsa_upcloud.pub`
which is useful for debugging, but you should replace this with the public part of a private SSH key you generate.

#### CI/CD Conclusion

With the above changes, you can use the GitHub Action workflow to build and run your own experiments.

## Initial State (tag: [1.0](https://github.com/opencredo/upcloudzerodowntime/tree/1.0))

[…]

### Packer

In the initial state, we are using the `file` provisioner to copy across the following pertinent files:

* `demoapp` - Our HTTP server.
* `demoapp.service` - A systemd unit that starts the HTTP server when the machine comes up.

[…] We are simply deploying an UpCloud server in the London region with a public IP.

### Experiment

If we were to commit our code and config at this point, it should all deploy nicely. We can determine the IP address of our
server by using the `upctl` CLI tool (this snippet assumes you only have 1 server in your account):

```bash
[…]
$ curl http://94.237.121.69:8080
(1.0) Hello, "/"
```

At this point, it is probably worth starting something to monitor this endpoint. For this experiment, I've used
[pinghttp](https://git.sr.ht/~yoink00/pinghttp) but anything that allows you to see the up/down time of the endpoint
will work:

[…] we will see our endpoint go down and never return (though it might, see below).

## Floating IP (tag: [1.1](https://github.com/opencredo/upcloudzerodowntime/tree/1.1))

The IP addresses assigned to an UpCloud server are fairly random. While you may get the same IP address between
server destroy/create cycles, it isn't guaranteed. To overcome this, we will need to use UpCloud's floating IP
facility.

There are no changes to the Go HTTP server or our CI/CD pipeline.

[…]

```terraform
resource "upcloud_floating_ip_address" "app_ip" {
  # […]
}
```

This resource will assign a public IPv4 address to our server's network interface as identified by the `mac_address`
value. As of today, this isn't quite enough as the metadata server (an UpCloud-provided web service that tells a
server information about itself) doesn't automatically refresh. To work around this, we can turn it off and on again
with a Terraform `null_resource`:

```terraform
resource "null_resource" "metadata_update" {
  # […]
}
```

This `null_resource` is triggered by any change to the `upcloud_floating_ip_address.mac_address` value (so when
assigning the IP to a new server). It uses a `local-exec` provisioner to run a small bash script on the machine
performing the deployment (a HashiCorp Terraform Cloud server in our case).

This script uses the UpCloud JSON API to turn off the metadata service and turn it back on again.
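Since the bodies of those two resources are elided above, here is a hedged sketch of how they might fit together.
The `upcloud_server.server` reference and the `restart_metadata.sh` helper are assumptions for illustration, not
the repo's exact code:

```terraform
resource "upcloud_floating_ip_address" "app_ip" {
  # Attach the floating IP to the first network interface of the (assumed)
  # upcloud_server.server resource, identified by its MAC address.
  mac_address = upcloud_server.server.network_interface[0].mac_address
}

resource "null_resource" "metadata_update" {
  # Re-run whenever the floating IP moves to a different interface,
  # i.e. whenever a replacement server takes over.
  triggers = {
    mac_address = upcloud_floating_ip_address.app_ip.mac_address
  }

  # restart_metadata.sh is a hypothetical stand-in for the repo's bash
  # snippet that calls the UpCloud JSON API to turn the metadata service
  # off and back on so the server notices its new floating IP.
  provisioner "local-exec" {
    command = "./restart_metadata.sh"
  }
}
```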
### Packer

To use the floating IP address, we need to make the Linux operating system aware of it. To do this,
the following new files are copied to the custom machine image:

* `floating_ip.sh` - A shell script that pings the metadata service looking for a floating IP. If it already has […]
* `floating_ip.service` - […]

The `shell` provisioner is updated to perform the following actions:

* Copies the above 2 files to the correct locations and makes sure their ownership and permissions are correct.
* Enables the `floating_ip.service` systemd unit.
* Updates the `/etc/network/interfaces` configuration to source configuration from `/etc/network/interfaces.d`
(where `floating_ip.sh` will write the floating IP configuration).

### Experiment

If we commit this code, we should get a new server with a floating IP.

```bash
$ upctl server show (upctl server list -o json | jq '.[0] | .uuid' -r)
[…]
```

We can see from the output that there is a new IP address with `(f)` next to it. This is the floating IP.

We can start a new endpoint monitoring tool (`pinghttp` for example) using that new IP address, and we should
get a response as before.

If we bump the version number (change `const Version = "1.1"` in `go/main.go`) and push that code to GitHub,
we will see our endpoint go down and approximately a minute later come back up again.

## Reducing Downtime (tag [1.2](https://github.com/opencredo/upcloudzerodowntime/tree/1.2))

In the Terraform job logs in GitHub Actions, we can see the following:

![Destroy Before Create Logs](docs/destroy_before_create.png)

This shows that the _old_ server is being destroyed __before__ the _new_ server is created, which causes
our downtime.

We can mitigate this by using Terraform's `lifecycle` [block](https://www.terraform.io/docs/language/meta-arguments/lifecycle.html).
With this configuration, we can ask Terraform to create the _new_ server __before__ it destroys the _old_ server.
This configuration is available on all Terraform resources, but you should consider whether it is needed as it can affect
your costs (as you'll have parallel resources), and some resources may need their old version to free up a dependency.

In our case, though, this will be useful, so we can add this block to the `upcloud_server` resource:

```terraform
  lifecycle {
    create_before_destroy = true
  }
```

[…] We can use the `local-exec` provisioner to do this again.
We add the following to the `terraform/main.tf` file:

```terraform
  provisioner "local-exec" {
    interpreter = ["/bin/bash", "-c"]

    command = <<-EOF
      fail_count=0

      while true; do
        response=$(curl --write-out %%{http_code} --silent --output /dev/null http://${self.network_interface[0].ip_address}:8080)
        echo "Response: $response"

        if [[ "$response" == "200" ]]; then
          echo "Application is available"
          exit 0
        fi

        fail_count=$((fail_count + 1))

        if (( fail_count > 30 )); then
          echo "Application is still unavailable"
          exit 2
        fi

        echo "Sleeping"
        sleep 10
      done
    EOF
  }
```

This `local-exec` provisioner calls a short bash script that uses `curl` to get the status code of a request to the
new `upcloud_server` resource's IP address (`self.network_interface[0].ip_address`). If the response is `200` then
the new resource is considered available and the Terraform apply can continue. It will attempt to probe the endpoint for
5 minutes (30 fails * 10s sleep per loop) and, if it is still not up after this period, it will fail.

This failure will leave the old infrastructure untouched, but the new, failed `upcloud_server` resource will not be
cleared up. A future Terraform apply with a fix will tidy it up.
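Putting the two tricks together, the final `upcloud_server` resource has roughly the following shape. This is a
condensed sketch: the server arguments are placeholders, and the health-check `command` abbreviates the full
polling script shown above:

```terraform
resource "upcloud_server" "server" {
  # […] hostname, zone, template and other server arguments elided

  network_interface {
    type = "public"
  }

  # Trick 1: build the replacement server before destroying the old one.
  lifecycle {
    create_before_destroy = true
  }

  # Trick 2: block the apply until the new server actually responds;
  # in the real configuration the command is the polling loop above.
  provisioner "local-exec" {
    interpreter = ["/bin/bash", "-c"]
    command     = "curl --silent --fail http://${self.network_interface[0].ip_address}:8080"
  }
}
```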
### Experiment

To test this, we can change our Go HTTP server to add some downtime. The `pinghttp` tool we've used before considers
a `500` response to be downtime, so we can make a small change to return that status code for 2 minutes:

```go
	startedTime := time.Now()

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if time.Since(startedTime) < 2*time.Minute {
			w.WriteHeader(500)
			w.Write([]byte("Server Error"))
			return
		}
		fmt.Fprintf(w, "(%s) Hello, %q", Version, html.EscapeString(r.URL.Path))
	})
```

With this code in place, we can push our code to GitHub again and monitor for downtime. We should see the deployment take
longer, but there should still only be around 5 seconds of downtime.

We can examine the Terraform logs to see this in action:

![Server Provisioner Log 1](docs/server_provisioner_1.png)

We can see the new server resource being created, and we can also see our `local-exec` provisioner starting and getting
error responses from the server. Later in the log (after our 2 minutes), we can see:

![Server Provisioner Log 2](docs/server_provisioner_2.png)

We can see the `local-exec` provisioner receiving a `200` response and the creation of the new server resource completing.
The rest of the Terraform apply can now continue, and we see our new version come up.
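As a small quality-of-life addition when repeating these experiments, the floating IP could be exposed as a
Terraform output instead of being looked up with `upctl` after each apply. A sketch, assuming the resource name
used earlier and that the resource exports the assigned address as `ip_address`:

```terraform
# Prints the URL to point the monitoring tool at after every apply,
# instead of fishing the address out with upctl each time.
output "app_url" {
  value = "http://${upcloud_floating_ip_address.app_ip.ip_address}:8080"
}
```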