Some grammar/spelling fixes
yoink00 committed May 26, 2021
1 parent 693ff8e commit 86dd1bd
Showing 1 changed file with 94 additions and 27 deletions.

This GitHub repo demonstrates the usage of a couple of Terraform features (you can skip directly to them
[here](#trick1) and [here](#trick2)) that can reduce the downtime or unavailability of Terraform managed
resources. For this demonstration, we've built a simple Go HTTP server. We package this server
into a custom machine image using [Packer](https://www.packer.io) and the [UpCloud](https://upcloud.com)
Packer [plugin](https://github.com/UpCloudLtd/packer-plugin-upcloud). This custom image is deployed onto
an UpCloud server using Terraform.

## Repo Structure

#### GitHub Action

GitHub Actions is a CI/CD tool baked into GitHub, allowing us to define workflows for our code in the same
repository as the code itself.

The GitHub Action workflow doesn't change throughout the demo. It consists of 4 jobs:
[`hashicorp-contrib/setup-packer`](https://github.com/hashicorp-contrib/setup-packer) action as we also install a
Packer plugin manually.
* `terraform` - This is a fairly vanilla Terraform build job, but we do use the [`upctl`](https://github.com/UpCloudLtd/upcloud-cli) CLI tool to determine the latest custom machine image name.
* `image_tidy` - This uses the `upctl` CLI tool to delete all but the last 5 images.

To replicate this workflow, you will need to do the following in a fork of this repo:

* Edit `.github/workflows/build.yaml` and change the `UPCLOUD_USERNAME` environment variable to a user that exists
in your UpCloud account. I recommend creating a new user.

#### Terraform Cloud

Terraform Cloud is a managed service from HashiCorp that will run our Terraform applies and manage and secure our
infrastructure state. It eliminates the need for custom tooling to manage Terraform in production. When running Terraform
in a CI/CD pipeline, we need to store our remote state. This is one of the reasons we are using Terraform Cloud.

To replicate this workflow, you will need to do the following (assuming you've signed up to Terraform Cloud):

* Optionally, create a new organisation. You will need this to configure GitHub Actions.
* Create a new workspace. You will need this to configure GitHub Actions.
* Create a new Terraform Cloud API token for your user. You will also need this to configure GitHub Actions.
* In the workspace variable tab, set these environment variables:
* `UPCLOUD_USERNAME` - The name of a user within your account. I recommend creating a new user.
* `UPCLOUD_PASSWORD` - The password of the above user. Remember to tick the 'sensitive' checkbox.
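One common way to wire a repository to the workspace created above is a `remote` backend block in the Terraform configuration. This is a sketch only; the organisation and workspace names below are placeholders, not this repo's actual values:

```terraform
terraform {
  backend "remote" {
    hostname     = "app.terraform.io"
    organization = "your-org" # placeholder

    workspaces {
      name = "your-workspace" # placeholder
    }
  }
}
```

With this in place, `terraform init` picks up the Terraform Cloud API token, typically supplied via `terraform login`.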

#### Packer

If you are following along, the Packer configuration installs an SSH public key `packer/artefact/id_rsa_upcloud.pub`,
which is useful for debugging, but you should replace this with the public half of an SSH key pair you generate.

#### CI/CD Conclusion

With the above changes, you can use the GitHub Action workflow to build and run your own experiments.

## Initial State (tag: [1.0](https://github.com/opencredo/upcloudzerodowntime/tree/1.0))


### Packer

In the initial state, we are using the `file` provisioner to copy across the following pertinent files:
* `demoapp` - Our HTTP server.
* `demoapp.service` - A systemd unit that starts the HTTP server when the machine comes up.
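As a sketch, the `file` provisioner stanzas look roughly like the following (the source and destination paths are assumptions based on the repo layout, not verified):

```hcl
provisioner "file" {
  source      = "artefact/demoapp" # assumed path
  destination = "/tmp/demoapp"
}

provisioner "file" {
  source      = "artefact/demoapp.service" # assumed path
  destination = "/tmp/demoapp.service"
}
```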

We are simply deploying an UpCloud server in the London region with a public IP.

### Experiment

If we were to commit our code and config at this point, it should all deploy nicely. We can determine the IP address of our
server by using the `upctl` CLI tool (this snippet assumes you only have 1 server in your account):

```bash
$ curl http://94.237.121.69:8080
(1.0) Hello, "/"
```

At this point, it is probably worth starting something to monitor this endpoint. For this experiment, I've used
[pinghttp](https://git.sr.ht/~yoink00/pinghttp) but anything that allows you to see the up/down time of the
endpoint will work:

If we push a new version, we will see our endpoint go down and never return (though it might, see below).
## Floating IP (tag: [1.1](https://github.com/opencredo/upcloudzerodowntime/tree/1.1))

The IP addresses assigned to an UpCloud server are fairly random. While you may get the same IP address between
server destroy/create cycles, it isn't guaranteed. To overcome this, we will need to use UpCloud's floating IP
facility.

There are no changes to the Go HTTP server or our CI/CD pipeline.
```terraform
resource "upcloud_floating_ip_address" "app_ip" {
  # (attributes elided in this diff view)
}
```

This resource will assign a public IPv4 address to our server's network interface, as identified by the `mac_address`
value. As of today, this isn't quite enough as the metadata server (an UpCloud-provided web service that tells a
server information about itself) doesn't automatically refresh. To work around this, we can turn it off and on again
with a Terraform `null_resource`:

```terraform
resource "null_resource" "metadata_update" {
  # (body elided in this diff view)
}
```

This `null_resource` is triggered by any change to the `upcloud_floating_ip_address.mac_address` value (so when
assigning the IP to a new server). It uses a `local-exec` provisioner to run a small bash script on the machine performing
the deployment (a HashiCorp Terraform Cloud server in our case).

This script uses the UpCloud JSON API to turn off the metadata service and turn it back on again.
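The body of the resource is collapsed in this diff view; a minimal sketch of its shape, with the trigger wiring assumed from the description above and a hypothetical script name standing in for the real command:

```terraform
resource "null_resource" "metadata_update" {
  # Re-run whenever the floating IP moves to a new MAC address
  triggers = {
    mac_address = upcloud_floating_ip_address.app_ip.mac_address
  }

  # Runs on the machine performing the deployment (a Terraform Cloud
  # server in our case); the script toggles the metadata service via
  # the UpCloud JSON API
  provisioner "local-exec" {
    interpreter = ["/bin/bash", "-c"]
    command     = "./metadata_restart.sh" # hypothetical script name
  }
}
```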

### Packer

To use the floating IP address, we need to make the Linux operating system aware of it. To do this,
the following new files are copied to the custom machine image:

* `floating_ip.sh` - A shell script that pings the metadata service looking for a floating IP. If the server does not
  already have it, the script writes the network configuration for it.
* `floating_ip.service` - A systemd unit that runs `floating_ip.sh` when the machine comes up.

The `shell` provisioner is updated to perform the following actions:

* Copies the above 2 files to the correct locations and makes sure their ownership and permissions are correct.
* Enables the `floating_ip.service` systemd unit
* Updates the `/etc/network/interfaces` configuration to source configuration from `/etc/network/interfaces.d` (where
`floating_ip.sh` will write the floating IP configuration)
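A sketch of what the updated `shell` provisioner could look like; the file paths and exact commands below are illustrative assumptions, not taken from the repo:

```hcl
provisioner "shell" {
  inline = [
    # Install the script and unit with correct ownership/permissions (assumed paths)
    "sudo mv /tmp/floating_ip.sh /usr/local/bin/floating_ip.sh",
    "sudo chown root:root /usr/local/bin/floating_ip.sh",
    "sudo chmod 0755 /usr/local/bin/floating_ip.sh",
    "sudo mv /tmp/floating_ip.service /etc/systemd/system/floating_ip.service",

    # Enable the unit so it runs when the machine comes up
    "sudo systemctl enable floating_ip.service",

    # Source the drop-in network configuration written by floating_ip.sh
    "echo 'source /etc/network/interfaces.d/*' | sudo tee -a /etc/network/interfaces",
  ]
}
```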

### Experiment

If we commit this code, we should get a new server with a floating IP.

```bash
$ upctl server show $(upctl server list -o json | jq '.[0] | .uuid' -r)
# (output elided)
```

We can see from the output that there is a new IP address with `(f)` next to it. This is the floating IP.

We can start a new endpoint monitoring tool (`pinghttp` for example) using that new IP address, and we should
get a response as before.

If we bump the version number (change `const Version = "1.1"` in `go/main.go`) and push that code to GitHub
we will see our endpoint go down and approximately a minute later come back up again.

## <a name="trick1"></a>Reducing Downtime (tag [1.2](https://github.com/opencredo/upcloudzerodowntime/tree/1.2))

In the Terraform job logs in GitHub Actions, we can see the following:

![Destroy Before Create Logs](docs/destroy_before_create.png)

This shows that the _old_ server is being destroyed __before__ the _new_ server is created, and this period is the bulk of
our downtime. We can mitigate this by using Terraform's `lifecycle` [block](https://www.terraform.io/docs/language/meta-arguments/lifecycle.html). With this configuration, we can ask Terraform to create the _new_ server __before__ it
destroys the _old_ server.

This configuration is available on all Terraform resources, but you should consider whether it is needed, as it can affect
your costs (you'll have parallel resources) and some resources may need their old version removed first to free up a dependency.

In our case, though, this will be useful, so we can add this block to the `upcloud_server` resource:
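The block itself is collapsed in this diff view; assuming it does exactly what the text above describes, it is the standard meta-argument:

```terraform
lifecycle {
  # Bring the new server up before tearing the old one down
  create_before_destroy = true
}
```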

We can use the `local-exec` provisioner to do this again.
We add the following to the `terraform/main.tf` file:

```terraform
provisioner "local-exec" {
  interpreter = ["/bin/bash", "-c"]
  command     = <<-EOF
    fail_count=0
    while true; do
      response=$(curl --write-out %%{http_code} --silent --output /dev/null http://${self.network_interface[0].ip_address}:8080)
      echo "Response: $response"
      if [[ "$response" == "200" ]]; then
        echo "Application is available"
        exit 0
      fi
      fail_count=$((fail_count + 1))
      if (( fail_count > 30 )); then
        echo "Application is still unavailable"
        exit 2
      fi
      echo "Sleeping"
      sleep 10
    done
  EOF
}
```

This `local-exec` provisioner calls a short bash script that uses `curl` to get the status code of a request to the
new `upcloud_server` resource's IP address (`self.network_interface[0].ip_address`). If the response is `200`, then
the new resource is considered available and the Terraform apply can continue. It will attempt to probe the endpoint for
5 minutes (30 fails * 10s sleep per loop) and, if it is still not up in this period, it will fail.

This failure will leave the old infrastructure untouched, but the new, failed `upcloud_server` resource will not be
cleaned up. A future Terraform apply with a fix will tidy it up.

### Experiment

To test this, we can change our Go HTTP server to add some downtime. The `pinghttp` tool we've used before considers
a `500` response to be downtime, so we can make a small change to return that status code for 2 minutes:

```go
startedTime := time.Now()

http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
	if time.Since(startedTime) < 2*time.Minute {
		w.WriteHeader(500)
		w.Write([]byte("Server Error"))
		return
	}
	fmt.Fprintf(w, "(%s) Hello, %q", Version, html.EscapeString(r.URL.Path))
})
```

With this code in place, we can push our code to GitHub again and monitor for downtime. We should see the deployment take
longer, but there should still only be around 5 seconds of downtime.

We can examine the Terraform logs to see this in action:

![Server Provisioner Log 1](docs/server_provisioner_1.png)

We can see the new server resource being created, and we can also see our `local-exec` provisioner starting and getting
error responses from the server. Later in the log (after our 2 minutes), we can see:

![Server Provisioner Log 2](docs/server_provisioner_2.png)

We can see the `local-exec` provisioner receiving a `200` response and the creation of the new server resource completing.
The rest of the Terraform apply can now continue, and we see our new version come up.
