BackendConfig security policy not enforced #616
Comments
There has been discussion on making load balancer status more extensible via CRDs. The easiest thing to do is to expose it via an annotation.
/lifecycle frozen
Hi. Thank you @bowei for your quick answer, but I don't think I understand it. The BackendConfig object is already a CRD, right? Actually, to better explain my problem, I followed this document: My main issues are:
Anyway, I wasn't able to find reproducible steps yet because the bug is slippery. For example, I kept a running cluster with the edge case from above on purpose for days, but between yesterday and today, without my intervention, it just fixed itself. I guess something happened in the Google-managed part of my cluster.
This sounds like a bug in the controller. Is your mitigation step just "triggering an update" on the BackendConfig / Ingress?
If the BackendConfig object is deleted entirely, the SecurityPolicy will not be detached from the corresponding Load Balancing resource. Deleting a BackendConfig basically means leaving the Load Balancer as is (without resetting the configuration previously provided by that BackendConfig).
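For reference, a minimal sketch of a teardown sequence that avoids that situation, assuming a BackendConfig named helloworld-backendconfig (the name is a placeholder): clear the policy reference first so the controller detaches it, then delete the object.

```sh
# Hypothetical object name; clear the securityPolicy reference so the
# controller detaches the policy from the backend service first.
kubectl patch backendconfig helloworld-backendconfig --type merge \
  -p '{"spec":{"securityPolicy":{"name":""}}}'

# Once the change has propagated to the load balancer, deleting the
# BackendConfig no longer leaves a stale policy behind.
kubectl delete backendconfig helloworld-backendconfig
```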
What I did most of the time is:
So yeah, I trigger an update in the BackendConfig. I do not touch the ingress myself. However, my helloworld app's Continuous Deployment pipeline recreates the ingress on every deploy. I imagine that triggers a Load Balancer recreation, right? This may be a factor which obfuscated the problem, since it can trigger at any time; I will disable it during my tests. I think the policy takes a few minutes to be enforced, which does not help when testing.
Looks like it. But I want to be sure. I am trying to isolate the problem again with the info you gave me. I will keep you updated. Thanks for your help.
@jpigree Thanks for the info. It might be helpful to also check the ingress controller logs if you have access to the master (I know many don't).
@MrHohn I don't have access to the ingress controller logs, and "kubectl describe backendconfig" does not print any information about the policy enforcement status. I have a question though: looking at the load balancer created by the gce-ingress on the GCP console, I wonder why I have two backend services for the same instance group, one with the security policy activated and the other without. Could this be an issue? I checked, and I still have those two backend services even when the policy is enforced successfully. I am still running my reproduction scripts, but the security policy can take a pretty long time to be enforced (> 10 minutes), so I added latency, which really slows down my tests.
Hi again. I have a few results. I couldn't reproduce the bug consistently, but I can show logs proving that the BackendConfig does not work very well. @MrHohn Can I send them to you by email? I will have to anonymise everything otherwise.
@jpigree Sure thing, please send them to zihongz@google.com, thanks!
This is expected. The ingress controller creates one backendService (a Google Cloud resource) for each linked NodePort service (a k8s resource), and they all share the same unmanaged instance group. In your case, one of the backendServices may be created for the default backend, which is deployed upon cluster creation. Will post more updates after reading through the logs.
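For anyone following along, the split is visible by listing the backend services; a sketch (actual resource names are cluster-specific, typically of the form k8s-be-&lt;nodePort&gt;--&lt;cluster-hash&gt;):

```sh
# One entry per linked NodePort service, plus one for the default backend.
gcloud compute backend-services list
```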
@jpigree Thanks for the detailed info in the email. I followed similar procedures to your test but wasn't able to reproduce the issue: I confirmed the security policy is attached after various combinations of actions (detaching or not detaching the backendConfig before Ingress recreation, etc.). One observation: in the test log you sent over, I don't see any "Security Policy is not enforced" log printed out. Instead, I only saw "Security Policy was not set" logs, which indicates the test timed out waiting for a 503 code to be returned. This might be caused by the corresponding load balancing resources taking too long to become ready, rather than the security policy not being enforced. Can you check if that is the case (e.g. check the backendService resource directly to see whether the security policy is attached when the test times out)? Though I saw the timeout in the test is set to 20*40=800 seconds. Not sure why the LB resource took so long to be provisioned.
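A minimal way to run that check from the CLI, assuming a backend service named k8s-be-30000--example (a placeholder):

```sh
# Print the Cloud Armor policy attached to a backend service, if any.
# Empty output means no policy is attached.
gcloud compute backend-services describe k8s-be-30000--example \
  --global --format='value(securityPolicy)'
```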
@MrHohn. Thanks for the review. Yes, I couldn't reproduce with my scripts because I kept hitting timeouts. I did kind of reproduce once: I had HTTP 200 for 10 minutes or so before the security policy kicked in. I don't know if this is considered acceptable. I didn't send you those logs because my scripts weren't finished yet (the code wouldn't match the logs). But after that, I mostly kept hitting timeouts on successive runs despite them being huge. This is strange because with the nginx-ingress-controller, which spawns a TCP Load Balancer, I never waited more than 5 minutes. So you tried on another cluster with the same version as mine ("1.11.5-gke.5")? And you didn't have any timeout issue and your policy was successfully attached? Did you test by verifying in the console or by accessing the URL? Another possibility is that my problem came from Cloud Armor. Because there are many managed parts and very few accessible logs (my company won't activate Stackdriver for this project), it is hard for me to identify what went wrong. I will try to reproduce again tomorrow, and if I can't I will just close the ticket. However, even if I do reproduce once, I don't think it will help you much to identify the issue, as it doesn't happen consistently. I will add a check to my tests using gcloud to know whether the API confirms the policy is attached after each curl. Thanks for your time.
@MrHohn. So I did other tests and I still couldn't reproduce. Actually, I think it either works better than before (perhaps due to gcloud upgrades/fixes) or I just know better what to expect. Indeed, the inertia (sometimes more than 15 minutes) between the k8s object creation and the policy enforcement could have misled me. Moreover, I also added the policy attachment status to the logs, and it clearly shows that the gce-ingress-controller does its job. My main pain point now is the random HTTP 200, HTTP 502, or network errors I get during the "LB init time". But this is due to the LB or Cloud Armor, so it isn't related to this ticket. I will need to check their SLA. I attached a few logs to this comment to illustrate all I said: logs.TXT. The best way to view them is to download them and do "cat FILES | less". Just in case, I implemented a cron validation CI job which checks whether my services are firewalled or not, so if the issue happens again, I will have logs. I will close the ticket. Thanks for your time.
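A sketch of what such a periodic check might look like, assuming the endpoint URL and backend service name below (both placeholders) and a Cloud Armor deny rule that returns 403 to blocked clients:

```sh
#!/usr/bin/env bash
# Hypothetical check: fail loudly if the service is reachable from an IP
# the Cloud Armor policy should block, or if no policy is attached.
set -euo pipefail

ENDPOINT="https://helloworld.example.com/"   # placeholder URL
BACKEND_SERVICE="k8s-be-30000--example"      # placeholder backend service

# Running from a non-whitelisted IP, a deny rule should return 403.
code=$(curl -s -o /dev/null -w '%{http_code}' "$ENDPOINT")
if [ "$code" != "403" ]; then
  echo "ALERT: expected 403 from a blocked IP, got $code" >&2
  exit 1
fi

# Cross-check via the API that the policy is still attached.
policy=$(gcloud compute backend-services describe "$BACKEND_SERVICE" \
  --global --format='value(securityPolicy)')
if [ -z "$policy" ]; then
  echo "ALERT: no security policy attached to $BACKEND_SERVICE" >&2
  exit 1
fi

echo "OK: policy enforced (HTTP $code, policy $policy)"
```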
@jpigree Thanks for the updates. Indeed, from your logs I can now see the 200 happens without the security policy enforced. As you mentioned, this likely happened during the initialization of the load balancing resources, and we have had users report before that random codes may be returned before load balancing is ready. IMHO the fact that the Google Cloud LB resource doesn't report status also makes adding status to the k8s Ingress harder.
Hi. I created a GKE cluster in version "1.11.5-gke.5" (with autoscaling activated) and I use the pre-installed GCE ingress controller to expose my applications on the WAN. However, I need to firewall them (filtering on source IP), so I use the "security policy" field in the BackendConfig object to enforce my Cloud Armor policy on the load balancer. However, I have a hard time making it work consistently.
Indeed, I often end up in a state where the Cloud Armor policy is not enforced, without any change to the BackendConfig object. The only steps that make it work again are to empty the "security policy" field or recreate the BackendConfig object until it works.
Here is my current configuration for a simple helloworld application:
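(The manifest itself was not preserved in this thread; the sketch below shows the representative shape of such a setup on GKE 1.11, with placeholder names, helloworld and my-cloud-armor-policy, rather than the reporter's actual values.)

```yaml
# Hypothetical BackendConfig + Service wiring for a Cloud Armor policy.
apiVersion: cloud.google.com/v1beta1
kind: BackendConfig
metadata:
  name: helloworld-backendconfig
spec:
  securityPolicy:
    name: my-cloud-armor-policy        # existing Cloud Armor policy
---
apiVersion: v1
kind: Service
metadata:
  name: helloworld
  annotations:
    # Attach the BackendConfig to the service port used by the Ingress.
    beta.cloud.google.com/backend-config: '{"ports": {"80": "helloworld-backendconfig"}}'
spec:
  type: NodePort
  selector:
    app: helloworld
  ports:
  - port: 80
    targetPort: 8080
```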
This configuration stopped working when I recreated my cluster and reapplied my manifests. I think this is due to the recreation, because I did it with Terraform, which didn't empty the "securityPolicy" field of the BackendConfig object before deletion. Is this the expected behaviour, though? What can I do to recover when this happens?
When debugging, I saw that describing the BackendConfig object does not tell the state of the load balancer. Is there another way of getting this information?
Finally, I am a bit scared to use the BackendConfig to firewall my services right now, because it can potentially expose my services to the WAN without throwing an error, even when my desired state explicitly says otherwise.
I will gladly take advice here. Thanks for your help.