Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite L4 healthchecks creation and deletion #1705

Merged

Conversation

cezarygerard
Copy link
Contributor

  • create a common singleton like struct fot l4 health checks
  • new struct holds mutex for common resources
  • delete shared healtcheck firewall rules safely

without proper guarding of shared healtcheck firewall rules, the NetLB controller may delete the ILB healthcheck firewall or vice versa

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 5, 2022
@k8s-ci-robot
Copy link
Contributor

Hi @cezarygerard. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 5, 2022
@k8s-ci-robot k8s-ci-robot requested review from rramkumar1 and thockin May 5, 2022 15:41
@cezarygerard
Copy link
Contributor Author

/assign @bowei

pkg/healthchecks/healthchecks_l4.go Outdated Show resolved Hide resolved
pkg/healthchecks/healthchecks_l4.go Outdated Show resolved Hide resolved
pkg/healthchecks/healthchecks_l4.go Outdated Show resolved Hide resolved
pkg/healthchecks/healthchecks_l4.go Outdated Show resolved Hide resolved
pkg/healthchecks/healthchecks_l4.go Show resolved Hide resolved
pkg/healthchecks/healthchecks_l4.go Outdated Show resolved Hide resolved
pkg/healthchecks/healthchecks_l4.go Show resolved Hide resolved
pkg/healthchecks/healthchecks_l4.go Show resolved Hide resolved
pkg/healthchecks/healthchecks_l4.go Outdated Show resolved Hide resolved
pkg/healthchecks/healthchecks_l4.go Outdated Show resolved Hide resolved
pkg/healthchecks/healthchecks_l4.go Outdated Show resolved Hide resolved
pkg/loadbalancers/interfaces.go Outdated Show resolved Hide resolved
pkg/loadbalancers/interfaces.go Outdated Show resolved Hide resolved
pkg/loadbalancers/l4netlb.go Outdated Show resolved Hide resolved
pkg/loadbalancers/l4netlb_test.go Outdated Show resolved Hide resolved
@bowei
Copy link
Member

bowei commented May 5, 2022

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 5, 2022
@bowei
Copy link
Member

bowei commented May 5, 2022

/assign

@cezarygerard
Copy link
Contributor Author

@bowei I have just answered all questions from Katarzyna.
Please review this code as we need this before we go GA with RBS feature

Copy link
Member

@bowei bowei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Biggest question is why introduce a singleton? Seems like it's mostly to reduce verbosity?

cmd/glbc/main.go Outdated
@@ -19,6 +19,7 @@ package main
import (
"context"
"fmt"
"k8s.io/ingress-gce/pkg/healthchecks"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put non-golang libs on a separate block for imports

@@ -43,5 +43,7 @@ ${GLBC} \
--running-in-cluster=false \
--logtostderr --v=${V} \
--config-file-path=${GCECONF} \
--run-l4-controller \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put this in a separate commit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

undone the change

@@ -104,8 +104,8 @@ func EnsureL4FirewallRuleDeleted(cloud *gce.Cloud, fwName string) error {
}

func firewallRuleEqual(a, b *compute.Firewall) bool {
return a.Description == b.Description &&
len(a.Allowed) == 1 && len(a.Allowed) == len(b.Allowed) &&
// let's skip description not to trigger flood of updates
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Don't compare the "description" field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

changed the logic a little bit for non-shared health checks and firewalls according to #1705 (comment)

return a.Description == b.Description &&
len(a.Allowed) == 1 && len(a.Allowed) == len(b.Allowed) &&
// let's skip description not to trigger flood of updates
return len(a.Allowed) == 1 && len(a.Allowed) == len(b.Allowed) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put one predicate per line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

changed the logic a little bit for non-shared health checks and firewalls according to #1705 (comment)

pkg/healthchecks/healthchecks_l4.go Outdated Show resolved Hide resolved
}

func Initialize(cloud *gce.Cloud, recorderFactory events.RecorderProducer) {
instance = &l4HealthChecks{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this need to take the mutex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a mutex guard,
btw see my comment on singleton pattern

@@ -42,44 +47,111 @@ const (
gceHcUnhealthyThreshold = int64(3)
)

// EnsureL4HealthCheck creates a new HTTP health check for an L4 LoadBalancer service, based on the parameters provided.
var instance *l4HealthChecks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the rationale behind adding a singleton. Seems like it mostly just removes the Cloud and recorder argument? If we find that a problem because it's too verbose -- maybe just put those in a single struct and pass that as an argument instead of having to deal with the complexities of having a Singleton?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not about cloud and recorder args.

We have a real bug in healthcheck deletion.

Deletion of shared health check tries to remove firewall rule for health check
and the healthcheck from ILB and NetLB use the same firewall rule

Checking if healthcheck-under-deletion is in use is not enough (deleting NetLB Healthcheck will not fail if ILB Healthcheck is in use)
cross-checking firewalls is still subject to race conditions (one healthcheck is deleted, the other is added ad the same time)

so we need a way to safely healthchecks using the same mutex object across the NetLB controller and ILB controller.

Creating a common l4HealthChecks struct makes a good place for cross-checking healthchecks and simplifies code in NetLB controller and ILB controller.

Copy link
Contributor Author

@cezarygerard cezarygerard May 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now the open question whether it is worth to create the singleton

it make initialisation in l4.go and l4netlb.go much easier
but on the other hand the tests setups are more complicated

maybe it is enough to create an instance in main and pass it via context, just like any other single instance struct in this code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried to sketch the old way for dependency injection: from context down to controllers na handlers
cezarygerard#3

it does not seem much simpler

}
}

func Fake(cloud *gce.Cloud, recorderFactory events.RecorderProducer) *l4HealthChecks {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it doesn't actually create a fake?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right,
My rationale was to name it Fake so it is clearly only for test purposes.

@@ -104,8 +102,8 @@ func EnsureL4FirewallRuleDeleted(cloud *gce.Cloud, fwName string) error {
}

func firewallRuleEqual(a, b *compute.Firewall) bool {
return a.Description == b.Description &&
len(a.Allowed) == 1 && len(a.Allowed) == len(b.Allowed) &&
// let's skip description not to trigger flood of updates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we wants to do this. For non-shared HC description contains ServiceIP: ip, so if IP will change then the health check description will be obsolete, it probably is needed only for debugging purposes but still it might be confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed this logic, for non shared health checks

pkg/healthchecks/healthchecks_l4.go Show resolved Hide resolved
pkg/healthchecks/healthchecks_l4.go Outdated Show resolved Hide resolved
pkg/healthchecks/healthchecks_l4.go Outdated Show resolved Hide resolved
pkg/healthchecks/healthchecks_l4.go Show resolved Hide resolved
pkg/healthchecks/healthchecks_l4.go Show resolved Hide resolved
pkg/loadbalancers/l4netlb_test.go Show resolved Hide resolved
@bowei
Copy link
Member

bowei commented May 9, 2022

Can you make sure the Git commits follow good conventions (e.g. have one line + description describing the bug being fixed etc.)

pkg/loadbalancers/l4netlb_test.go Outdated Show resolved Hide resolved
len(a.Allowed) == 1 && len(a.Allowed) == len(b.Allowed) &&
func firewallRuleEqual(a, b *compute.Firewall, skipDescription bool) bool {
fwrEqual := len(a.Allowed) == 1 &&
len(a.Allowed) == len(b.Allowed) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe instaed of creating new variable fwrEqual you can do :

 .. previous check..
 && (skipDescription || a.Description == b.Description)
 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored this whole func

}

// GetL4 returns singleton instance, must be run after InitializeL4
func GetL4() *l4HealthChecks {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't you put Init func here? (and make Init function not exported).
Checking instance == nil is cheap and then you don't have to worry that someone will call GetL4 before Init function.
I also think that it would be better to create this object inside context and keep it there (sinc all controllers have context and context is explicitly initialise before controllers are run so it will be safe) instead of this singleton pattern but this is just my opinion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] It's probably more "Go"ish to call this L4()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kl52752

  • The l4HealthChecks needs recorder factory which is not easily accessible from l4.go and l4netlb.go
  • I have tried to create L4HealthChecks in context but I get import cycle error due to pkg/firewalls using the context

@bowei
Changed.


// EnsureL4HealthCheck creates a new HTTP health check for an L4 LoadBalancer service, and the firewall rule required
// for the healthcheck. If healthcheck is shared (external traffic policy 'cluster') then firewall rules will be shared
// regardless of scope param.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this comment is not complete. I don't understand (without knowing the context) firewall rules will be shared regardless of scope param what is scope param? Just write that ILB creates Global HC and NetLB Regional and firewall rule is shared for both.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 we should be very explicit in our comments, otherwise it ends up with more questions than answers:

Suggest (if accurate)

EnsureL4HealthCheck and firewall rules exist for the L4
LoadBalancer Service.

The healthcheck and firewall will be shared between different K8s
Services for ExternalTrafficPolicy = Cluster, as the same
configuration is used across all Services of this type.

Firewall rules are always created at in the Global scope (vs
Regional). This means that one Firewall rule is created for
Services of different scope (Global vs Regional).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use this comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, used your comment

pkg/loadbalancers/l4netlb_test.go Show resolved Hide resolved
@cezarygerard cezarygerard force-pushed the awesome-firewall-healthchecks branch from 570dc6e to a10c879 Compare May 9, 2022 23:01
@cezarygerard
Copy link
Contributor Author

@bowei done

var (
// instance is a sinngleton instance, created by InitializeL4
instance *l4HealthChecks
// mutex for preventing multiple initialization
Copy link
Member

@bowei bowei May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call this instanceLock

// instanceLock to prevent duplicate initialization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

@bowei bowei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure we have sufficient logging for debugging.

)

type l4HealthChecks struct {
mutex sync.Mutex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we document what this mutex is for? what is it used to protect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

added comment
renamed to sharedResourcesLock as it was in previous implementation


// InitializeL4 creates singleton instance, must be run before GetL4() func
func InitializeL4(cloud *gce.Cloud, recorderFactory events.RecorderProducer) {
if instance == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if instance != nil {
  log.Error  double init
  return
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error, not warning?
ok, done error

pkg/healthchecks/healthchecks_l4.go Outdated Show resolved Hide resolved
// If the healthcheck already exists, it is updated as needed.
func EnsureL4HealthCheck(cloud *gce.Cloud, svc *corev1.Service, namer namer.L4ResourcesNamer, sharedHC bool, scope meta.KeyType, l4Type utils.L4LBType) (string, string, int32, string, error) {
func (l4hc *l4HealthChecks) EnsureL4HealthCheck(svc *corev1.Service, namer namer.L4ResourcesNamer, sharedHC bool, scope meta.KeyType, l4Type utils.L4LBType, nodeNames []string) (string, string, string, string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not a huge change, we should change the return value to a struct with named fields.

type EnsureL4HealthCheckResult struct {
  HCLink string
  ...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

but to declare the return struct I must have moved the interfaces to the to the provider package instead of client package to avoid circular dependencies

hcPath, hcPort = helpers.GetServiceHealthCheckPathPort(svc)
hcPath, hcPort := helpers.GetServiceHealthCheckPathPort(svc)
if sharedHC {
hcPath, hcPort = gce.GetNodesHealthCheckPath(), gce.GetNodesHealthCheckPort()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should log.V(3).Infof or higher

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return "", nil
}

func (l4hc *l4HealthChecks) healthCheckFirewallSafeToDelete(hcName string, sharedHC bool, l4Type utils.L4LBType) (bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should name this func closer to what it indicates rather than the interpreted meaning.

isn't this just healthcheckExists()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is actually inversed

if healthcheck exists then the function will return false
also it always returns true for shared healthchecks

maybe: healthCheckFirewallInUse ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it still inverst entire logic


func (l4hc *l4HealthChecks) deleteFirewall(name string, svc *corev1.Service) error {
err := firewalls.EnsureL4FirewallRuleDeleted(l4hc.cloud, name)
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if err == nil { return nil }

log.Error the error

Also, we should have a comment saying why we ignore the FIrewallXPNError

should we use err.As() instead of the type case?

Copy link
Contributor Author

@cezarygerard cezarygerard May 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added comment

Log is written inside firewalls.EnsureL4FirewallRuleDeleted for XPN firewall rules errors

The type cast code has been moved from l4.go/l4netlb.go without changing the logic, therefore I am hesitant to change to errors.As(...)

// L4HealthChecks defines methods for creating and deleting health checks (and their firewall rules) for l4 services
type L4HealthChecks interface {
// EnsureL4HealthCheck creates health check (and firewall rule) for l4 service
EnsureL4HealthCheck(svc *corev1.Service, namer namer.L4ResourcesNamer, sharedHC bool, scope meta.KeyType, l4Type utils.L4LBType, nodeNames []string) (string, string, string, string, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment about this return type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

but to declare the struct I must have moved the interfaces to the to the provider package instead of client package to avoid circular dependencies

err := firewalls.EnsureL4FirewallRuleDeleted(l.cloud, name)
if err != nil {
if fwErr, ok := err.(*firewalls.FirewallXPNError); ok {
l.recorder.Eventf(l.Service, corev1.EventTypeNormal, "XPN", fwErr.Message)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably log info here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the info is logged in the firewalls.EnsureL4FirewallRuleDeleted if XPN error occurs

result.Error = err
}
result.GCEResourceInError = resourceInError
result.Error = err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to break if we get an error from this loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, customer might have updated the service from ExternalTrafficPolicy=Local to ExternalTrafficPolicy=Cluster, and now they are deleting the service.

We do not try to cleanup firewall rules during ExternalTrafficPolicy change, so we must cleanup here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even if shared HC deletion failed there stil may be non shared HC

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a comment describing why

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added more comment line here
there is also full rationale in the comments just before the for loop

@cezarygerard cezarygerard force-pushed the awesome-firewall-healthchecks branch 2 times, most recently from eccf309 to abf0c05 Compare May 17, 2022 17:58
@cezarygerard
Copy link
Contributor Author

@bowei
done and answered all comments

}
}

srcAndTargetEqual := utils.EqualStringSets(a.SourceRanges, b.SourceRanges) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's easier to maintain if we make this straight line code:

if ! utils.EqualStringSets() { return false }
if ! utils.EqualStringSets() { return false }
if ! skipDescription && a.Description != b.Description {
  return false
}
return true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// instance is a singleton instance, created by InitializeL4
instance *l4HealthChecks
// instanceLock to prevent duplicate initialization.
instanceLock = &sync.Mutex{}
Copy link
Member

@bowei bowei May 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

go style puts the mutex in the same block as the protected var at the top:

struct {
  lock
  var
  var

  nonProtectedVar
  ...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

)

type l4HealthChecks struct {
// Lock access to shared resources - node healthcecks and their firewall rules
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be more explicit:

// sharedResourceLock serializes operations on the healthcheck and firewall
// resources shared across multiple Services.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return
}

instanceLock.Lock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to put this first, no? the instance != nil needs to be guarded.

Copy link
Contributor Author

@cezarygerard cezarygerard May 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually the singleton has 2 checks,

  1. to only lock the mutex if the instance is not initialised
    and
  2. to make sure that instance was not initialised between first check and lock acquisition.

The code used to look like this:

	if instance == nil {
		initLock.Lock()
		defer initLock.Unlock()

		if instance == nil {
			instance = &l4HealthChecks{
				cloud:           cloud,
				recorderFactory: recorderFactory,
			}
		}
	}

in the comnent #1705 (comment)
you asked me to log double init attempts

so I changed the above code into

	if instance != nil {
		klog.Error("Multiple L4 Healthchecks initialization attempts")
		return
	}

	instanceLock.Lock()
	defer instanceLock.Unlock()

	if instance == nil {
		instance = &l4HealthChecks{
			cloud:           cloud,
			recorderFactory: recorderFactory,
		}
		klog.V(3).Infof("Initialized L4 Healthchecks")
	}

which is equivalent of double checked locking.

I can just lock out entire function,

	instanceLock.Lock()
	defer instanceLock.Unlock()
	
	if instance != nil {
		klog.Error("Multiple L4 Healthchecks initialization attempts")
		return
	}

	instance = &l4HealthChecks{
		cloud:           cloud,
		recorderFactory: recorderFactory,
	}
	klog.V(3).Infof("Initialized L4 Healthchecks")

as InitializeL4 is only called once (not in every healthchecks.GetL4() call )

are you ok with the last option?

instanceLock.Lock()
defer instanceLock.Unlock()

if instance == nil {
Copy link
Member

@bowei bowei May 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need this if any more

Copy link
Contributor Author

@cezarygerard cezarygerard May 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please see the previous comment

if err != nil {
return false, fmt.Errorf("Failed to create composite key for healthcheck %s - %w", hcName, err)
}
_, err = composite.GetHealthCheck(l4hc.cloud, key, meta.VersionGA)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need to take the lock here to avoid something creating in parallel to the deletion?
This has a lock read unlock; A ; lock write unlock issue -- there might be some operation occurring at A that is a race.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, I don't get this race issue here
Could you please elaborate?

please note that EnsureL4HealthCheck and DeleteHealthCheck execute
l4hc.sharedResourcesLock.Lock()
very early before any API calls

so
between healthCheckFirewallSafeToDelete call and actual deletion, the sharedResourcesLock is still possessed by the same thread

@cezarygerard cezarygerard force-pushed the awesome-firewall-healthchecks branch from abf0c05 to d65a8ca Compare May 26, 2022 15:54
@cezarygerard
Copy link
Contributor Author

@bowei @kl52752
I have answered all your comments

- create a common singleton-like struct fot l4 health checks
- new struct holds mutex for common resources (healthchecks and their firewall rules used for
- [bugfix] delete shared healtcheck firewall rules safely -  cross-check between ILB and NLB healthchecks is firewall rules are in use

Logging
New log line idicating firewall rule not deleted due to cross-check: "Failed to delete health check firewall rule %s: health check in use."

Testing
- healthcheck management is mostly covered (by existing tests), they required little update.
- added test cases for sahred firewall rule deletion(lack of), named TestHealthCheckFirewallDeletionWithILB and TestHealthCheckFirewallDeletionWithNetLB
- run test manual tests
@cezarygerard cezarygerard force-pushed the awesome-firewall-healthchecks branch from d65a8ca to 217b108 Compare May 27, 2022 10:54
@cezarygerard
Copy link
Contributor Author

/test pull-ingress-gce-test

EnsureL4HealthCheck: replace lengthy return value list in with named struct
Improve firewall rule comparison
Added debug logs
Improved go fmt
Renamed function names
And many more small ones
@cezarygerard cezarygerard force-pushed the awesome-firewall-healthchecks branch from 91c10b1 to f8353dc Compare May 31, 2022 09:54
HCName string
HCLink string
HCFirewallRuleName string
GceResourceInError string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename this to GCEResourceInError

@bowei
Copy link
Member

bowei commented Jun 1, 2022

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 1, 2022
@kl52752
Copy link
Contributor

kl52752 commented Jun 2, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 2, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bowei, cezarygerard, kl52752

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 5382feb into kubernetes:master Jun 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants