Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for critical GKE/GCP Terraform Bugs #1373

Merged
merged 2 commits into from
Mar 2, 2020
Merged

Conversation

markmandel
Copy link
Member

@markmandel markmandel commented Feb 28, 2020

This fixes several critical issues, and also cleans up other aspect along the way. Primarily that the firewall rule would never allow traffic to the cluster, since they were erroneously configured and on a
completely different network to the cluster.

Bug Fixes:

  1. Remove the creation of a separate network for each cluster
  2. Make sure the firewall rules don't filter on tags, but target on network tags.
  3. Be able to pass in a network argument (defaulting to "default"), such that if you want to run a cluster on a different network it is easy to configure - this means that the firewall rule is always on the same network as the cluster.
  4. Controller deployment was set to "Always" image pull policy on both the controller and the sidecar. This should never be configured this way for production systems, and is therefore an unsafe default.
  5. Switched provider away from using google-beta since it is not the stable provider.

Cleanup:

  1. Remove duplicate helm.tf and cluster.tf in root ./install/terraform path, as they were diverging, and served no value. Both ./gke and ./gke-local installation scripts have also been updated, as well as the documentation.
  2. Needed to upgrade Terraform in the build shell, since all modules used a later version.
  3. Cleaned up deprecation comments
  4. Removed the broken symlinks in ./build

I would like to also make it easier to have only one GCP firewall rule for n number of clusters, but we can tackle that in a later PR.

Closes #1372

@markmandel markmandel added kind/bug These are bugs. kind/breaking Breaking change area/operations Installation, updating, metrics etc labels Feb 28, 2020
@markmandel markmandel added this to the 1.4.0 milestone Feb 28, 2020
@markmandel markmandel requested a review from aLekSer February 28, 2020 00:54
@markmandel
Copy link
Member Author

/cc @chrisst who provided much of these Terraform cleanup / review previously. If you would like to take a look that would be great.

@@ -23,24 +23,31 @@ data "google_client_config" "default" {}
# Run `terraform taint null_resource.test-setting-variables` before second execution
resource "null_resource" "test-setting-variables" {
provisioner "local-exec" {
command = "${format("echo Current variables set as following - name: %s, project: %s, machineType: %s, initialNodeCount: %s, zone: %s",
command = <<EOT
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main reason for this cleanup was because the previous formatting totally broke the Terraform plugin for IntelliJ.

Also, I think it reads nicer?

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 9a22ad46-4c02-42ee-bbeb-25dd481dac9b

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1373/head:pr_1373 && git checkout pr_1373
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.4.0-af9b89a

@google-oss-robot
Copy link

@drichardson: changing LGTM is restricted to collaborators

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

- network - the name of the network you want your cluster and firewall rules to reside (default is "default")

{{% alert title="Warning" color="warning"%}}
On the lines that read `source = "git::https://github.com/googleforgames/agones.git//install/terraform/modules/gke/?ref=master"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @aLekSer had left this pointed at the master branch (rather than a release branch) on purpose.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working on that, will provide more details soon.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a brittle approach to me? How do we ensure that we don't accidentally break production systems with development changes?

Copy link
Collaborator

@aLekSer aLekSer Feb 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we can use both options. The master branch was used so that anyone can play and test terraform apply with these modules before the first release of these terraform configs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we keep optimising for our development, rather than our users -- we should really be optimising for our users.

We can build extra tools for ourselves to handle our development scenarios, but we shouldn't be foisting that onto our users, in my humble opinion.

releases.

For example, if you are targeting release {{< release-branch >}}, then you will want to have
`source = "git::https://github.com/googleforgames/agones.git//install/terraform/modules/gke/?ref=release-{{< release-branch >}}"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the right approach, can we can do this automatically instead of asking users to do it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So stack overflow says this can't be dynamic by design (since terraform has to know the value is the same at init and apply time), but I didn't actually try it.

I'll try it and find out.

@@ -14,7 +14,7 @@


// Run:
// terraform apply [-var agones_version="1.3.0"]
// terraform apply -var project="<YOUR_GCP_ProjectID>" [-var agones_version="1.1.0"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change the agones version to something older? If anything, what about using 1.4.0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copy/pasted the previous command, but you are right, should update it to latest. Will do 👍

@@ -87,6 +87,17 @@ Configurable parameters:
- agones_version - the version of agones to install (default is the latest version from the [Helm repository](https://agones.dev/chart/stable))
- machine_type - machine type for hosting game servers (default is "n1-standard-4")
- node_count - count of game server nodes for the default node pool (default is "4")
- network - the name of the network you want your cluster and firewall rules to reside (default is "default")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't customizable in the gke-local or gke-module. They both hard code the network to default.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's a good point! I was thinking of the terraform/gke module. Will remove.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From @chrisst 's comments, actually added them to the module definitions - so leaving this in, and adding zone.

Copy link
Collaborator

@aLekSer aLekSer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First want to say that overall I agree with these changes.
But after fetching them, I got an issue with previously working make target:

make gcloud-terraform-cluster
Error: No configuration files

Apply requires configuration to be present.

#1372 (comment)

@markmandel
Copy link
Member Author

markmandel commented Feb 28, 2020

make gcloud-terraform-cluster

Ah good catch. I will fix this.

I figure long term, this make target will go away, as it will get merged into our make gcloud-test-cluster command, and there will be no difference between the two.

// Update ?ref= to the agones release you are installing. For example, ?ref=release-1.3.0 corresponds
// to Agones version 1.3.0
// ***************************************************************************************************
source = "git::https://github.com/googleforgames/agones.git//install/terraform/modules/gke/?ref=master"

cluster = {
"zone" = "us-west1-c"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should make a var out of zone also.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 done.

}
}

module "helm_agones" {

source = "../../../install/terraform/modules/helm"

agones_version = var.agones_version
Copy link
Collaborator

@aLekSer aLekSer Feb 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next line:

values_file            = ""

Should be changed to

values_file            = var.values_file

and please create a new variable values_file
It should contain
-var values_file="../../../helm/agones/values.yaml"
when we install from current branch, not release master.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand why we want to do this?

Do we want to be sending users to look into our helm files? Seems like something they shouldn't need or want to do?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed for developer tooling, we can fix this after 1.4, but what this is optional:

  1. there could be values from the helm chart;
  2. For development we should have an option to change something to values file and it would be reflected in the deployment

repository = "${data.helm_repository.agones.metadata.0.name}"
chart = "${var.chart}"
repository = data.helm_repository.agones.metadata.0.name
chart = var.chart
timeout = 420

# Use terraform of the latest >=0.12 version
Copy link
Collaborator

@aLekSer aLekSer Feb 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"${length(var.values_file) == 0 ? "" : file("${var.values_file}")}",
This is where values_file would be used (line 94)
https://github.com/googleforgames/agones/pull/1373/files#diff-56b861bb69f02c3b94e1de0aef454864R94

}
}
resource "google_container_cluster" "primary" {
name = var.cluster["name"]
location = var.cluster["zone"]
project = var.cluster["project"]
provider = google-beta
network = var.cluster["network"]
provider = google
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to specify this. The provider is effectively determined by string parsing the resource name so this resource will default to google if not explicitly overridden.

Also FWIW we don't consider the beta provider as an unstable provider. Instead we just provide it as a way for people to call the beta GCP APIs if they are interested in using features/fields that are only available in the `beta version of the product. There are a few more details about the provider versions here. That being said if you aren't using beta features then there is no need to use the beta provider.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the cleanup and explanation 👍 removing the reference.

My concern with the google beta api surface isn't stable, was that if we happen to inadvertently use a beta feature. And since we aren't, there doesn't seem to be a good reason to take the risk.


cluster = {
"zone" = "us-west1-c"
"name" = var.name
"machineType" = var.machine_type
"initialNodeCount" = var.node_count
"project" = var.project
"network" = "default"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were you planning on changing this to a var and setting the default in the variable instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Done.

@markmandel
Copy link
Member Author

@aLekSer looks like you are handling all the fixes for terraform.mk n the developer tooling?

I figure it's fine to leave that broken for this PR and for stable release, since that's developer tooling, and it doesn't matter if we fix it post 1.4.0 - then post stable release we can look at fixing that after stable release. How do you feel about that?

This fixes several critical issues, and also cleans up other aspect
along the way. Primarily that the firewall rule would never allow
traffic to the cluster, since they were erroneously configured and on a
completely different network to the cluster.

Bug Fixes:
1. Remove the creation of a separate network for each cluster
1. Make sure the firewall rules don't filter on tags, but target on
   network tags.
1. Be able to pass in a `network` argument (defaulting to "default"),
such that if you want to run a cluster on a different network it is easy
to configure - this means that the firewall rule is always on the same
network as the cluster.
1. Controller deployment was set to "Always" image pull policy on both
the controller and the sidecar. This should never be configured this way
for production systems, and is therefore an unsafe default.

Cleanup:
1. Remove duplicate helm.tf and cluster.tf in root ./install/terraform
path, as they were diverging, and served no value. Both ./gke and
./gke-local installation scripts have also been updated, as well as the
documentation.
1. Needed to upgrade Terraform in the build shell, since all modules
used a later version.
1. Cleaned up deprecation comments
1. Removed the broken symlinks in ./build

I would like to also make it easier to have only one GCP firewall rule
for n number of clusters, but we can tackle that in a later PR.

Closes #1372
@markmandel markmandel force-pushed the cleanup-terraform-gke branch from af9b89a to ce97e66 Compare February 29, 2020 23:38
@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 26ae2521-d4da-4658-86db-2605827175b1

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: be84611c-4f1f-42fe-b4d0-2441051336e5

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1373/head:pr_1373 && git checkout pr_1373
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.4.0-ce97e66

Copy link
Collaborator

@aLekSer aLekSer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with that approach, I assume that terraform.mk could be fixed with this terraform configs structure.
Doing the fix in a separate PR

@google-oss-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aLekSer, drichardson, markmandel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-robot
Copy link

New changes are detected. LGTM label has been removed.

@aLekSer
Copy link
Collaborator

aLekSer commented Mar 2, 2020

LGTM

@aLekSer aLekSer added the lgtm label Mar 2, 2020
@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: d89274bf-e741-45b7-bbac-5a3588364fd6

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1373/head:pr_1373 && git checkout pr_1373
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.4.0-9a21475

@markmandel markmandel merged commit 5413b28 into master Mar 2, 2020
@markmandel markmandel deleted the cleanup-terraform-gke branch March 2, 2020 18:05
ilkercelikyilmaz pushed a commit to ilkercelikyilmaz/agones that referenced this pull request Oct 23, 2020
This fixes several critical issues, and also cleans up other aspect
along the way. Primarily that the firewall rule would never allow
traffic to the cluster, since they were erroneously configured and on a
completely different network to the cluster.

Bug Fixes:
1. Remove the creation of a separate network for each cluster
1. Make sure the firewall rules don't filter on tags, but target on
   network tags.
1. Be able to pass in a `network` argument (defaulting to "default"),
such that if you want to run a cluster on a different network it is easy
to configure - this means that the firewall rule is always on the same
network as the cluster.
1. Controller deployment was set to "Always" image pull policy on both
the controller and the sidecar. This should never be configured this way
for production systems, and is therefore an unsafe default.

Cleanup:
1. Remove duplicate helm.tf and cluster.tf in root ./install/terraform
path, as they were diverging, and served no value. Both ./gke and
./gke-local installation scripts have also been updated, as well as the
documentation.
1. Needed to upgrade Terraform in the build shell, since all modules
used a later version.
1. Cleaned up deprecation comments
1. Removed the broken symlinks in ./build

I would like to also make it easier to have only one GCP firewall rule
for n number of clusters, but we can tackle that in a later PR.

Closes googleforgames#1372
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved area/operations Installation, updating, metrics etc kind/breaking Breaking change kind/bug These are bugs. lgtm size/XL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Terraform: clean up duplicate tf config files.
7 participants