add terraform for utility cluster. Add name override to gke #30847
Conversation
```
under the License.
-->

# Overview
```
Could you please add more details about what this cluster is intended for, instead of just "datastores"?
Done
LGTM, thanks!
@damondouglas would you mind taking a look at this one when you have a chance?
Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`:
R: @shunping added as fallback since no labels match configuration
Available commands:
The PR bot will only process comments in the main thread (not review comments).
Thank you for doing this.
```
  value = google_container_cluster.default.endpoint
}

output cluster_ca_certificate {
```
Thank you for adding outputs :-). Could you tell me what this output is needed for?
This output is used in the upper module for Helm to authenticate.
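For context, the wiring is presumably along these lines (a sketch: the `module.gke` name and output names are assumed, not taken verbatim from this PR):
```
data "google_client_config" "default" {}

provider "helm" {
  kubernetes {
    # GKE returns the CA certificate base64-encoded, hence the decode.
    host                   = "https://${module.gke.endpoint}"
    token                  = data.google_client_config.default.access_token
    cluster_ca_certificate = base64decode(module.gke.cluster_ca_certificate)
  }
}
```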
I think the provisioning of the Kubernetes cluster and any workloads that depend on it should be in separate terraform modules. Then one would just follow the typical gcloud workflow to connect.
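The typical connection step would be something like the following (cluster name and region are hypothetical, taken from the values discussed in this thread):
```
gcloud container clusters get-credentials beam-utility --region us-central1 --project apache-beam-testing
```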
.test-infra/terraform/google-cloud-platform/google-kubernetes-engine/cluster.tf (resolved)
```
source                = "../google-kubernetes-engine"
project               = "apache-beam-testing"
network               = "default"
subnetwork            = "default-f91f013bcf8bd369"
region                = "us-central1"
cluster_name_prefix   = "beam-utility"
service_account_id    = "beam-github-actions@apache-beam-testing.iam.gserviceaccount.com"
cluster_name_override = "beam-utility"
```
Maybe one can just create a new tfvars file storing these values, and have the workflow provision the Kubernetes cluster first, separate from the strimzi workload.
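A sketch of what such a file might look like, reusing the values quoted above (the file name is hypothetical):
```
# beam-utility.tfvars (hypothetical name)
project             = "apache-beam-testing"
network             = "default"
subnetwork          = "default-f91f013bcf8bd369"
region              = "us-central1"
cluster_name_prefix = "beam-utility"
service_account_id  = "beam-github-actions@apache-beam-testing.iam.gserviceaccount.com"
```
It would then be applied with `terraform apply -var-file=beam-utility.tfvars` from the GKE module folder.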
resource "helm_release" "strimzi-helm-release" { | ||
name = "strimzi" | ||
namespace = "strimzi" | ||
create_namespace = true | ||
repository = "https://strimzi.io/charts/" | ||
chart = "strimzi-kafka-operator" | ||
version = "0.40.0" | ||
|
||
atomic = "true" | ||
timeout = 500 | ||
|
||
set { | ||
name = "watchAnyNamespace" | ||
value = "true" | ||
} | ||
depends_on = [ module.gke.google_container_cluster ] | ||
} | ||
|
This could be in its own module separate from the GKE cluster provisioning.
Yes, it is possible to put it in its own module, but the idea behind the utility-cluster folder is to use the GKE module and install everything that is needed for that exact purpose via terraform, in one step. It does not make sense to me to separate out a module, as there is no intention to reuse this due to its specific purpose. Other clusters can create different folders for different purposes.
Let me know if this is fine; if not, I'll try to come up with a different structure.
In my experience, I find co-mingling GKE provisioning with Kubernetes workload provisioning in the same terraform module to lead to problems in the future. I personally would like to see it in a separate module. I'm more than willing to defer to another Apache Beam committer's opinion, if they think the co-mingling design is ok and have a logical well articulated reason. Otherwise, I'm not comfortable approving this PR with the current design.
In summary, my design preference is:
- a separate GKE provisioning module; a version-controlled tfvars file in the existing .test-infra/terraform/google-cloud-platform/google-kubernetes-engine folder could work
- a separate folder responsible for provisioning the strimzi cluster
@damondouglas I have added a number of changes that implement most of what has been discussed. Please take a look when you have time. Thanks
See akvelon#487. It was easier to create akvelon#487 instead of commenting throughout this PR.
Thank you for making the changes. Additional questions/comments:
- Is .test-infra/kafka/strimzi/02-kafka-persistent/overlays/gke-internal-load-balanced/kustomization.yaml still needed?
- When I tested the strimzi helm chart, only the strimzi operator deployment started, but nothing else related to Kafka.
- Could you tell me the outcome of testing these changes in a new GCP project (not apache-beam-testing)?
```
region                = "us-central1"
router                = "default-us-central1-router"
router_nat            = "default-us-central1-router-nat"
cluster_name_override = "beam-utility"
```
Could we name this something more specific?
I think we should keep it as is, since we should add more to this cluster instead of creating multiple clusters.
Because Autopilot scales to the workload, we can have multiple clusters, each focused on a specific resource need. That's the reason for having this re-usable GKE Autopilot provisioning solution. I'd argue that beam-utility will not make sense to someone trying to fix or add to the infrastructure later.
Would kafka-workflows be precise enough?
@volatilemolotov Thank you for listening. That would be great.
.test-infra/kafka/strimzi/README.md (outdated)
```
KafkaIO.write().withBootstrapServers("10.128.0.14:9094")
```
TODO: DEFINE HOW TO CONNECT TO CLUSTER; see .test-infra/kafka/bitnami/README.md
Will you be finishing this?
Yes, I have added lines to the README that explain how it's done.
```
*/

bucket = "b507e468-52e9-4e72-83e5-ecbf563eda12"
prefix = ".test-infra/terraform/google-cloud-platform/google-kubernetes-engine/beam-utility"
```
After changing the name of the cluster, could you also change this prefix to match?
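Assuming the kafka-workflows name suggested above is adopted, the matching backend stanza would presumably become (the exact prefix value is an assumption):
```
terraform {
  backend "gcs" {
    bucket = "b507e468-52e9-4e72-83e5-ecbf563eda12"
    # prefix renamed to match the cluster name
    prefix = ".test-infra/terraform/google-cloud-platform/google-kubernetes-engine/kafka-workflows"
  }
}
```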
variable "cluster_name_override" { | ||
type = string | ||
description = "Use this to override naming and omit the postfix. Leave empty to use prefix-suffix format" | ||
default = "" | ||
} | ||
|
Could we remove this variable and just have the prefix to keep it simple?
We need a predictable name so we don't have to change x number of workflows each time we redeploy for any reason. I would like to keep it this way.
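For context, one plausible wiring of such an override (a sketch, not necessarily this PR's exact logic; the random_id resource name is assumed):
```
resource "random_id" "suffix" {
  byte_length = 2
}

locals {
  # Use the override verbatim when set; otherwise fall back to prefix plus a random suffix.
  cluster_name = var.cluster_name_override != "" ? var.cluster_name_override : "${var.cluster_name_prefix}-${random_id.suffix.hex}"
}
```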
Why not just keep the kafka cluster running continually and delete the topics after the workflows execute?
We've had flaky tests in this repository due to waiting on spinning up new clusters.
This way we ensure it's fresh each time, which is easier than maintaining a Kafka instance and making sure it does not break between different tests. We could delete topics, but there could still be issues.
The kustomization is used in workflows that use these clusters to bring up Kafka for their testing.
Tested it out in a project that only had APIs enabled and a default VPC. It works once I provided the subnet, router, and NAT.
Could you explain your testing approach? The following in .test-infra/terraform/google-cloud-platform/google-kubernetes-engine/prerequisites.tf should have given you an error when you tested, because https://github.com/apache/beam/pull/30847/files#diff-e53f48e6ee35cb4d93d7b0750674c071edb78e05e90cbadda94492ef2be95cc1R27 in
Almost there. Thank you so much for your patience.
```
router_nat            = "default-us-central1-router-nat"
cluster_name_override = "beam-utility"
cluster_name_prefix   = "beam-utility"
service_account_id    = "beam-github-actions@apache-beam-testing.iam.gserviceaccount.com"
```
In the google_service_account datasource, email is allowed: https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/service_account#argument-reference
Thank you for confirming and testing this. I recommend either changing the variable name to service_account_email and providing the email, or keeping service_account_id and changing the tfvars to be an ID only. Personally, I prefer an ID since it means less data in the configuration but still works in the same project.
What would be the ID? According to the datasource argument spec:
The following arguments are supported:
- [account_id](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/service_account#account_id) - (Required) The Google service account ID. This can be one of:
  - The name of the service account within the project (e.g. my-service)
  - The fully-qualified path to a service account resource (e.g. projects/my-project/serviceAccounts/...)
  - The email address of the service account (e.g. my-service@my-project.iam.gserviceaccount.com)

I would think that the fully-qualified path would be the ID, but that just gives out more info. I will default to just the name here, as it gives out the least info. Let me know if that is OK.
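In other words, the tfvars entry would presumably shrink to the bare name, and the datasource would resolve it within the project, e.g. (a sketch; the datasource label and account name are assumed):
```
data "google_service_account" "github_actions" {
  # Name-only ID; the project is inferred from the provider configuration.
  # The resolved email is then available as data.google_service_account.github_actions.email.
  account_id = "beam-github-actions"
}
```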
.test-infra/kafka/strimzi/README.md (outdated)
```
kubectl get svc beam-testing-cluster-kafka-external-bootstrap --namespace strimzi
DIR=.test-infra/kafka/strimzi
```
Could we:
- Move the terraform module into the 01-strimzi-operator folder?
- Keeping .test-infra/kafka/strimzi/README.md where it is, change DIR=.test-infra/kafka/01-strimzi-operator
Moved.
Also left the README.md in the strimzi folder and updated the DIR instruction.
Simply deploy the cluster by using the kustomize plugin of kubectl:
```
kubectl apply -k .test-infra/kafka/strimzi/02-kafka-persistent
```
I have two points:
- When I tried this, I got the error:
  error: unable to find one of 'kustomization.yaml', 'kustomization.yml' or 'Kustomization' in directory '.test-infra/kafka/strimzi/02-kafka-persistent'
  This worked:
  kubectl apply -k .test-infra/kafka/strimzi/02-kafka-persistent/overlays/gke-internal-load-balanced
- The solution deployed into the default namespace. Was this intended? The original solution was in the default namespace; I don't mind either way. The following specifies the namespace:
  kubectl apply -k .test-infra/kafka/strimzi/02-kafka-persistent/overlays/gke-internal-load-balanced --namespace=strimzi
Fixed the path.
Yeah, it was supposed to be able to deploy to any namespace. I decided to put the strimzi namespace into the instruction for the sake of completeness.
and wait until the cluster is deployed:
```
kubectl wait kafka beam-testing-cluster --for=condition=Ready
```
I kept getting a timeout. I didn't have time to investigate this. Either investigate this or recommend using https://k9scli.io/
Added a timeout. A value of 1200 seems long, but there are cases when deployment takes longer due to how Autopilot scales.
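Presumably the documented command now reads something like (the flag value is inferred from the comment above):
```
kubectl wait kafka beam-testing-cluster --for=condition=Ready --timeout=1200s
```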
Thank you for all this work.
Adds terraform for a utility cluster, which is to be used for test infra.