Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Feat: ELBv2/TGs - Add health check customization #4849

Merged
merged 3 commits into from
Apr 19, 2024

Conversation

mtulio
Copy link
Contributor

@mtulio mtulio commented Mar 6, 2024

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

This change exposes the health check configuration for listeners of both load balancers, primary and secondary. When the target group is created for the API, the user will be able to customize only the health check probes, according to the limit ranges allowed by AWS.

It will allow providers to customize the API and additional listeners' target health checks to ensure existing implementations.

A significant improvement will be in the additional listeners which is currently set with basic health checks following the same protocol of the listener (TCP). Exposing this value will allow customized health check.

The example below shows how we are using to customize the API target group (from the default provided by CAPA), alongside setting custom health check parameters for additional listeners, like overriding the protocol to HTTPS, check path /healthz, setting custom probe timers:

awsCluster := &capa.AWSCluster{
  ObjectMeta: metav1.ObjectMeta{
	Name:      clusterID.InfraID,
	Namespace: capiutils.Namespace,
  },
  Spec: capa.AWSClusterSpec{
	ControlPlaneLoadBalancer: &capa.AWSLoadBalancerSpec{
		Name:                ptr.To(clusterID.InfraID + "-int"),
		LoadBalancerType:    capa.LoadBalancerTypeNLB,
		Scheme:              &capa.ELBSchemeInternal,
                HealthCheckProtocol: &capa.ELBProtocolHTTPS,
		HealthCheck: &capa.TargetGroupHealthCheck{
			IntervalSeconds:         ptr.To(int64(10)),
			TimeoutSeconds:          ptr.To(int64(10)),
			ThresholdCount:          ptr.To(int64(2)),
			UnhealthyThresholdCount: ptr.To(int64(2)),
		},
		AdditionalListeners: []capa.AdditionalListenerSpec{
			{
				Port:     22623,
				Protocol: capa.ELBProtocolTCP,
				HealthCheck: &capa.TargetGroupHealthCheck{
					Protocol:                ptr.To("HTTPS"),
					Path:                    ptr.To("/healthz"),
					IntervalSeconds:         ptr.To(int64(10)),
					TimeoutSeconds:          ptr.To(int64(10)),
					ThresholdCount:          ptr.To(int64(2)),
					UnhealthyThresholdCount: ptr.To(int64(2)),
				},
			},
		},
	},
	SecondaryControlPlaneLoadBalancer: &capa.AWSLoadBalancerSpec{
		Name:                ptr.To(clusterID.InfraID + "-ext"),
		LoadBalancerType:    capa.LoadBalancerTypeNLB,
		Scheme:              &capa.ELBSchemeInternetFacing,
                HealthCheckProtocol: &capa.ELBProtocolHTTPS,
		HealthCheck: &capa.TargetGroupHealthCheck{
			IntervalSeconds:         ptr.To(int64(10)),
			TimeoutSeconds:          ptr.To(int64(10)),
			ThresholdCount:          ptr.To(int64(2)),
			UnhealthyThresholdCount: ptr.To(int64(2)),
		},
        },
  },
}

This PR is creating dedicated the health check structures, a more restricted one to API and field-free for additional listeners, protecting to expose the API target group for both load balancers when changing standardized fields like protocol, port and path..

Which issue(s) this PR fixes:

Fixes #4884

Special notes for your reviewer:

Checklist:

  • squashed commits
  • includes documentation
  • includes emojis
  • adds unit tests
  • adds or updates e2e tests

Release note:

Exposing the health check attributes for the target group for the control plane load balancers, allowing customized health checks for API or additional listeners.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 6, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @mtulio. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mtulio
Copy link
Contributor Author

mtulio commented Mar 6, 2024

fwiw @vincepri @patrickdillon @r4f4

@mtulio mtulio force-pushed the CORS-3294-elbv2-healthcheck branch from cf16eea to 6063882 Compare March 6, 2024 02:33
@mtulio
Copy link
Contributor Author

mtulio commented Mar 6, 2024

The provision of API and additional listeners are working as expected, I am investigating unit failures for fuzzy test TestFuzzyConversion[2]

[1] e2e deployment w/ custom target health check

Screenshot from 2024-03-05 23-34-49
Screenshot from 2024-03-05 23-35-46

[2] fuzzy test failures

--- FAIL: TestFuzzyConversion (16.75s)
    --- FAIL: TestFuzzyConversion/for_AWSCluster (3.69s)
        --- FAIL: TestFuzzyConversion/for_AWSCluster/hub-spoke-hub (0.00s)
            conversion.go:241: 
                  &v1beta2.AWSCluster{
...
                - 		Annotations:                nil,
                + 		Annotations:                map[string]string{},
...
                - 			AdditionalControlPlaneIngressRules: []v1beta2.IngressRule{},
                + 			AdditionalControlPlaneIngressRules: nil,
...
                - 			HealthCheck:              &v1beta2.TargetGroupHealthCheck{Path: &"ĆĠ成", Port: &"缆zp%!ʛ(MISSING)(", TimeoutSeconds: &5940725222272119504},
                + 			HealthCheck:              nil,
...
                - 				AvailabilityZones: []string{},
                + 				AvailabilityZones: nil,
...
FAIL
FAIL	sigs.k8s.io/cluster-api-provider-aws/v2/api/v1beta1	16.761s

api/v1beta2/network_types.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 18, 2024
@mtulio mtulio force-pushed the CORS-3294-elbv2-healthcheck branch from 7f36dc3 to 75c2074 Compare March 19, 2024 14:25
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 19, 2024
@mtulio mtulio force-pushed the CORS-3294-elbv2-healthcheck branch from 75c2074 to 36a6ef8 Compare March 19, 2024 14:59
@mtulio
Copy link
Contributor Author

mtulio commented Mar 19, 2024

PR comments addressed.
The issue was already created to formalize the "feature request" flow :)

#4884

cc @nrb @vincepri @faiq

@mtulio mtulio marked this pull request as ready for review March 19, 2024 15:16
@mtulio mtulio changed the title WIP | ✨ Feat/Add ELBv2/TGs health check configuration/customization ✨ Feat/Add ELBv2/TGs health check configuration/customization Mar 19, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 16, 2024
@nrb
Copy link
Contributor

nrb commented Apr 16, 2024

/lgtm as well

@mtulio
Copy link
Contributor Author

mtulio commented Apr 17, 2024

@richardcase would you mind taking a look in this proposal, please?

@richardcase
Copy link
Member

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

@richardcase
Copy link
Member

Thanks for this @mtulio .

Once both the e2e passes we can unhold

/hold

From my side this looks good, so:

/approve

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 19, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: richardcase

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 19, 2024
@richardcase
Copy link
Member

Timed out, retrying in case its a flake:

/test pull-cluster-api-provider-aws-e2e

@nrb
Copy link
Contributor

nrb commented Apr 19, 2024

Yet another time out with GPU instances...

/test pull-cluster-api-provider-aws-e2e

@faiq
Copy link
Contributor

faiq commented Apr 19, 2024

@nrb is there a bug/issue for the GPU tests?

@nrb
Copy link
Contributor

nrb commented Apr 19, 2024

@nrb is there a bug/issue for the GPU tests?

Not specifically yet, just #4937 and #4783 (comment)

@mtulio
Copy link
Contributor Author

mtulio commented Apr 19, 2024

Once both the e2e passes we can unhold

/hold

From my side this looks good, so:

/approve

@richardcase @nrb e2e* are now passing. 🎉

@nrb
Copy link
Contributor

nrb commented Apr 19, 2024

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 19, 2024
mtulio added 3 commits April 19, 2024 15:45
Add tests to validate optional health check override for
all ELBv2 instances and it's listeners (api and additionals).
Some deployments requires customizations in the
health check configurations, such as protocol,
probe periods and checks.

This change introduce health check override for
all Load Balancers and listeners (API and additional).

The override for the API target group are limited to the probe
configuration, and customizing the Path, Port and Protocol for the
Target Group for the Kube API Server is not allowed.
Expose the target health check attributes/API
allowing customizations for both API (default from LB),
and additional listeners, for each Load Balancer.

Considering the risk of wrong configurations, the The customization
for the target group attributes of API listener is limited to the
health check probe configurations (interval, timeout, threshold, etc).

The health check for the additional listeners can be customized,
including health check protocol, port, path, etc
@r4f4
Copy link
Contributor

r4f4 commented Apr 19, 2024

@mtulio seems like the same problem here: needs a rebase on top of master.

@mtulio mtulio force-pushed the CORS-3294-elbv2-healthcheck branch from 97489d2 to 5b2dabf Compare April 19, 2024 18:46
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 19, 2024
@mtulio
Copy link
Contributor Author

mtulio commented Apr 19, 2024

@mtulio: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-aws-verify 97489d2 link unknown /test pull-cluster-api-provider-aws-verify

Rebased to fix codegen issue

@mtulio
Copy link
Contributor Author

mtulio commented Apr 19, 2024

@nrb passing now. Would you mind to res-stamp lgtm cleaned after rebase? Thanks :)

@nrb
Copy link
Contributor

nrb commented Apr 19, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 19, 2024
@k8s-ci-robot k8s-ci-robot merged commit a7673e1 into kubernetes-sigs:main Apr 19, 2024
18 checks passed
@mtulio mtulio deleted the CORS-3294-elbv2-healthcheck branch April 19, 2024 19:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support customization of health check attributes for the LB's target groups for both Load Balancers
9 participants