
add terraform for utility cluster. Add name override to gke #30847

Merged: 18 commits, Apr 26, 2024

Conversation

volatilemolotov (Contributor)

Adds Terraform for a utility cluster to be used for test infra.

  • Uses the existing GKE module from .test-infra/terraform
  • Adds a name override to the GKE module for when a predictable name is needed. This is mostly so we can have a stable name in all the workflows that need to access this cluster (a sketch of how this can work follows below)
  • Installs Kafka using Helm instead of versioned manifests
  • Removes the namespace from the Kafka kustomization to allow installing into a workflow's temporary namespace
  • Bumps the Kafka version to support the newer operator
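As a rough sketch of how such an override can be wired up in the module (the `random_string` suffix resource and the local below are illustrative assumptions, not necessarily the PR's exact implementation):

```
# Sketch only: when cluster_name_override is set, use it verbatim;
# otherwise fall back to the prefix plus a generated postfix.
resource "random_string" "postfix" {
  length  = 6
  special = false
  upper   = false
}

locals {
  cluster_name = var.cluster_name_override != "" ? var.cluster_name_override : "${var.cluster_name_prefix}-${random_string.postfix.result}"
}
```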


```
under the License.
-->

# Overview
```
Collaborator

Could you please add more details about the intent to use this cluster instead of "datastores"?

Contributor Author

Done

@andreydevyatkin (Collaborator) left a comment

LGTM, thanks!

@volatilemolotov volatilemolotov marked this pull request as ready for review April 4, 2024 13:48
@damccorm (Contributor) commented Apr 4, 2024

@damondouglas would you mind taking a look at this one when you have a chance?

github-actions bot commented Apr 4, 2024

Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`:

R: @shunping added as fallback since no labels match configuration

Available commands:

  • `stop reviewer notifications` - opt out of the automated review tooling
  • `remind me after tests pass` - tag the comment author after tests pass
  • `waiting on author` - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@damondouglas (Contributor) left a comment

Thank you for doing this.

```
value = google_container_cluster.default.endpoint
}

output cluster_ca_certificate {
```
Contributor

Thank you for adding outputs :-). Could you tell me what this output is needed for?


Contributor

I think the provisioning of the Kubernetes cluster and any workloads that depend on it should be in separate Terraform modules. Then one would just follow the typical gcloud command to connect.
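For reference, the typical connect step would look like this (the cluster name, region, and project values are taken from elsewhere in this PR, so treat them as assumptions):

```
# Fetch kubeconfig credentials for the provisioned cluster.
gcloud container clusters get-credentials beam-utility \
  --region us-central1 \
  --project apache-beam-testing
```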


Comment on lines 20 to 27
```
source                = "../google-kubernetes-engine"
project               = "apache-beam-testing"
network               = "default"
subnetwork            = "default-f91f013bcf8bd369"
region                = "us-central1"
cluster_name_prefix   = "beam-utility"
service_account_id    = "beam-github-actions@apache-beam-testing.iam.gserviceaccount.com"
cluster_name_override = "beam-utility"
```
Contributor

Maybe one could just create a new tfvars file storing these values, and have the workflow provision the Kubernetes cluster first, separate from the strimzi workload.
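A version-controlled tfvars file along these lines would capture the same values (a sketch mirroring the excerpt above):

```
# Sketch of a tfvars file; values copied from the module block above.
project               = "apache-beam-testing"
network               = "default"
subnetwork            = "default-f91f013bcf8bd369"
region                = "us-central1"
cluster_name_prefix   = "beam-utility"
cluster_name_override = "beam-utility"
service_account_id    = "beam-github-actions@apache-beam-testing.iam.gserviceaccount.com"
```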

Comment on lines 19 to 36
resource "helm_release" "strimzi-helm-release" {
name = "strimzi"
namespace = "strimzi"
create_namespace = true
repository = "https://strimzi.io/charts/"
chart = "strimzi-kafka-operator"
version = "0.40.0"

atomic = "true"
timeout = 500

set {
name = "watchAnyNamespace"
value = "true"
}
depends_on = [ module.gke.google_container_cluster ]
}

Contributor

This could be in its own module separate from the GKE cluster provisioning.
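As an illustration of that separation, a standalone workload module could consume the GKE module's outputs to configure the helm provider (a sketch; the output names `endpoint` and `cluster_ca_certificate` are assumed from the excerpt above):

```
# Sketch: wire the helm provider to the cluster via module outputs so the
# strimzi release can live in its own module.
data "google_client_config" "default" {}

provider "helm" {
  kubernetes {
    host                   = "https://${module.gke.endpoint}"
    token                  = data.google_client_config.default.access_token
    cluster_ca_certificate = base64decode(module.gke.cluster_ca_certificate)
  }
}
```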

Contributor Author

Yes, it is possible to put it in its own module, but the idea behind the utility-cluster folder is to use the GKE module and install everything that is needed for that exact purpose via Terraform, in one step. It does not make sense to me to separate out a module, as there is no intention to reuse this due to its specific purpose. Other clusters can create different folders for different purposes.
Let me know if this is fine; if not, I'll try to come up with a different structure.

@damondouglas (Contributor) commented Apr 15, 2024

In my experience, I find co-mingling GKE provisioning with Kubernetes workload provisioning in the same terraform module to lead to problems in the future. I personally would like to see it in a separate module. I'm more than willing to defer to another Apache Beam committer's opinion, if they think the co-mingling design is ok and have a logical well articulated reason. Otherwise, I'm not comfortable approving this PR with the current design.

In summary, my design preference is:

  1. separate GKE provisioning module - a version controlled tfvars file in the existing .test-infra/terraform/google-cloud-platform/google-kubernetes-engine folder could work
  2. separate folder responsible for provisioning the strimzi cluster

@volatilemolotov (Contributor Author)

@damondouglas I have added a number of changes that implement most of what has been discussed. Please take a look when you have time. Thanks

@damondouglas (Contributor) left a comment

See akvelon#487. It was easier to create akvelon#487 instead of commenting throughout this PR.

@damondouglas (Contributor) left a comment

Thank you for making the changes. Additional questions/comments:

  1. Is .test-infra/kafka/strimzi/02-kafka-persistent/overlays/gke-internal-load-balanced/kustomization.yaml still needed?
  2. When I tested the strimzi Helm chart, only the strimzi operator deployment started, but nothing else related to Kafka.
  3. Could you tell me the outcome of your testing these changes in a new GCP project, not apache-beam-testing?

```
region                = "us-central1"
router                = "default-us-central1-router"
router_nat            = "default-us-central1-router-nat"
cluster_name_override = "beam-utility"
```
Contributor

Could we name this something more specific?

Contributor Author

I think we should keep it as is, since we should add more to this cluster instead of creating multiple clusters.

Contributor

Because Autopilot scales to the workload, we can have multiple clusters, each focused on a specific resource need. That's the reason for having this re-usable GKE Autopilot provisioning solution. I'd argue that beam-utility will not make sense to someone trying to fix or add to the infrastructure later.

Contributor Author

Would kafka-workflows be precise enough?

Contributor

@volatilemolotov Thank you for listening. That would be great.

```
KafkaIO.write().withBootstrapServers("10.128.0.14:9094")
```
TODO: DEFINE HOW TO CONNECT TO CLUSTER; see .test-infra/kafka/bitnami/README.md
Contributor

Will you be finishing this?

Contributor Author

Yes, I have added lines to the README that explain how it's done.

```
*/

bucket = "b507e468-52e9-4e72-83e5-ecbf563eda12"
prefix = ".test-infra/terraform/google-cloud-platform/google-kubernetes-engine/beam-utility"
```
Contributor

After changing the name of the cluster, could you also change this prefix to match?

Comment on lines +34 to +39
variable "cluster_name_override" {
type = string
description = "Use this to override naming and omit the postfix. Leave empty to use prefix-suffix format"
default = ""
}

Contributor

Could we remove this variable and just have the prefix to keep it simple?

Contributor Author

We need a predictable name so we don't have to change x number of workflows each time we redeploy for any reason. I would like to keep it this way.

Contributor

Why not just keep the kafka cluster running continually and delete the topics after the workflows execute?

Contributor

We've had flaky tests in this repository due to waiting on spinning up new clusters.

Contributor Author

This way we ensure it's fresh each time, which is easier than maintaining a Kafka instance and making sure it does not break between different tests. We could delete topics, but there could still be issues.

@volatilemolotov (Contributor Author)

  • Is .test-infra/kafka/strimzi/02-kafka-persistent/overlays/gke-internal-load-balanced/kustomization.yaml still needed?
  • When I tested the strimzi Helm chart, only the strimzi operator deployment started, but nothing else related to Kafka.

The kustomization is used in workflows that use these clusters to bring up Kafka for their testing.

@volatilemolotov (Contributor Author) commented Apr 22, 2024

Could you tell me the outcome of your testing these changes in a new GCP project, not apache-beam-testing?

Tested it out in a project that only had APIs enabled and a default VPC. It works once I provided the subnet, router, and NAT.


@damondouglas (Contributor) commented Apr 23, 2024

Could you tell me the outcome of your testing these changes in a new GCP project, not apache-beam-testing?

Tested it out in a project that only had APIs enabled and a default VPC. It works once I provided the subnet, router, and NAT.

Could you explain your testing approach? The following in .test-infra/terraform/google-cloud-platform/google-kubernetes-engine/prerequisites.tf:

```
// Query the Service Account.
data "google_service_account" "default" {
  depends_on = [google_project_service.required]
  account_id = var.service_account_id
}
```

should have given you an error when you tested, because https://github.com/apache/beam/pull/30847/files#diff-e53f48e6ee35cb4d93d7b0750674c071edb78e05e90cbadda94492ef2be95cc1R27 in .test-infra/terraform/google-cloud-platform/google-kubernetes-engine/beam-utility.apache-beam-testing.tfvars (service_account_id = "beam-github-actions@apache-beam-testing.iam.gserviceaccount.com") is an email and not the service account ID.

@volatilemolotov (Contributor Author)


In the google_service_account data source, the email is allowed: https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/service_account#argument-reference

@damondouglas (Contributor) left a comment

Almost there. Thank you so much for your patience.


```
router_nat            = "default-us-central1-router-nat"
cluster_name_override = "beam-utility"
cluster_name_prefix   = "beam-utility"
service_account_id    = "beam-github-actions@apache-beam-testing.iam.gserviceaccount.com"
```
Contributor

In the google_service_account data source, the email is allowed: https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/service_account#argument-reference

Thank you for confirming and testing this. I recommend either changing the variable name to service_account_email and providing the email, or keeping service_account_id and changing the tfvars to be an ID only. Personally, I prefer an ID, since it means less data in the configuration but it still works in the same project.

Contributor Author

What would be the ID? According to the data source argument spec:

The following arguments are supported:

[account_id](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/service_account#account_id) - (Required) The Google service account ID. This can be one of:

  • The name of the service account within the project (e.g. my-service)
  • The fully-qualified path to a service account resource (e.g. projects/my-project/serviceAccounts/...)
  • The email address of the service account (e.g. my-service@my-project.iam.gserviceaccount.com)

I would think the fully-qualified path would be the ID, but that just gives out more info. I will default to just the name here, as it gives out the least info. Let me know if that is ok.
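For reference, all three documented forms would resolve the same account (a sketch; the short name is inferred from the email used in this PR):

```
data "google_service_account" "default" {
  # Any one of these works for account_id per the provider docs; the short
  # name leaks the least information in version control.
  # account_id = "projects/apache-beam-testing/serviceAccounts/beam-github-actions@apache-beam-testing.iam.gserviceaccount.com"
  # account_id = "beam-github-actions@apache-beam-testing.iam.gserviceaccount.com"
  account_id = "beam-github-actions"
}
```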


```
kubectl get svc beam-testing-cluster-kafka-external-bootstrap --namespace strimzi
DIR=.test-infra/kafka/strimzi
```
Contributor

Could we:

  1. Move the terraform module into the 01-strimzi-operator folder?
  2. Keep .test-infra/kafka/strimzi/README.md where it is, changing DIR=.test-infra/kafka/01-strimzi-operator

Contributor Author

Moved.

Contributor Author

Also left the README.md in the strimzi folder and updated the DIR instruction.


Simply deploy the cluster using the kustomize plugin of kubectl:
```
kubectl apply -k .test-infra/kafka/strimzi/02-kafka-persistent
```
Contributor

I have two points:

  1. When I tried this, I got the error:

```
error: unable to find one of 'kustomization.yaml', 'kustomization.yml' or 'Kustomization' in directory '.test-infra/kafka/strimzi/02-kafka-persistent'
```

This worked:

```
kubectl apply -k .test-infra/kafka/strimzi/02-kafka-persistent/overlays/gke-internal-load-balanced
```

  2. The solution deployed into the default namespace. Was this intended? The original solution was in the default namespace. I don't mind either way. The following specifies the namespace:

```
kubectl apply -k .test-infra/kafka/strimzi/02-kafka-persistent/overlays/gke-internal-load-balanced --namespace=strimzi
```

Contributor Author

Fixed the path.
Yeah, it was supposed to be able to deploy to any namespace. I decided to put the strimzi namespace into the instructions for the sake of completeness.

and wait until the cluster is deployed:
```
kubectl wait kafka beam-testing-cluster --for=condition=Ready
```
Contributor

I kept getting a timeout. I didn't have time to investigate this. Either investigate this or recommend using https://k9scli.io/

Contributor Author

Added a timeout. A value of 1200 seems long, but there are cases when deployment takes longer due to how Autopilot scales.
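A sketch of the command with an explicit timeout (the namespace flag is an assumption based on the instructions earlier in this thread):

```
# Wait up to 20 minutes for the Kafka resource to become Ready; the first
# deployment can be slow while Autopilot scales up nodes.
kubectl wait kafka beam-testing-cluster --for=condition=Ready \
  --timeout=1200s --namespace strimzi
```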

@volatilemolotov mentioned this pull request Apr 26, 2024
@damondouglas (Contributor) left a comment

Thank you for all this work.

@damondouglas merged commit 28a2682 into apache:master on Apr 26, 2024
4 checks passed