
BackendConfig security policy not enforced #616

Closed
jpigree opened this issue Jan 23, 2019 · 14 comments

Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

jpigree commented Jan 23, 2019

Hi. I created a GKE cluster in version "1.11.5-gke.5" (with autoscaling activated) and I use the pre-installed gce ingress controller to expose my applications on the WAN. However, I need to firewall them (filtering on source IP), so I use the "security policy" field in the BackendConfig object to enforce my Cloud Armor policy on the load balancer. Unfortunately, I have a hard time making it work consistently.

Indeed, I often end up in a state where the Cloud Armor policy is not enforced without any change on the BackendConfig object, and the only way to make it work again is to empty the "security policy" field or recreate the BackendConfig object until it works.

Here is my current configuration for a simple helloworld application:

apiVersion: cloud.google.com/v1beta1
kind: BackendConfig
metadata:
  name: internal-http
  labels:
    app.kubernetes.io/name: "helloworld"
spec:
  securityPolicy:
    name: "<cloud armor policy name>"
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: helloworld
  labels:
    app.kubernetes.io/name: "helloworld"
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "helloworld"
    spec:
      containers:
      - name: helloworld
        image: <a simple flask application answering with helloworld when receiving HTTP GET on /hello>
        ports:
          - name: http
            containerPort: 5000
---
apiVersion: v1
kind: Service
metadata:
  name: helloworld
  labels:
    app.kubernetes.io/name: "helloworld"
  annotations:
    beta.cloud.google.com/backend-config: '{"ports": {"http":"internal-http"}}'
spec:
  type: "NodePort"
  ports:
  - port: 5000
    name: http
  selector:
    app.kubernetes.io/name: "helloworld"
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: helloworld
  labels:
    app.kubernetes.io/name: "helloworld"
  annotations:
    kubernetes.io/ingress.class: "gce"
    kubernetes.io/ingress.allow-http: "false"
    certmanager.k8s.io/cluster-issuer: letsencrypt
    kubernetes.io/tls-acme: "true"
    kubernetes.io/ingress.global-static-ip-name: "<the static reserved ip name>"
spec:
  tls:
  - hosts:
    - <helloworld fqdn>
    secretName: <secret containing Letsencrypt certificate>
  rules:
  - host: <helloworld fqdn>
    http:
      paths:
      - path: /hello
        backend:
          serviceName: helloworld
          servicePort: 5000

This configuration stopped working when I recreated my cluster and reapplied my manifests. I think this is due to the recreation, because I did it with Terraform, which didn't empty the "securityPolicy" field of the BackendConfig object before deletion. Is this the expected behaviour though? What can I do to recover when this happens?

When debugging, I saw that describing the BackendConfig object does not show the state of the load balancer. Is there another way of getting that information?

Finally, I am a bit scared to use the BackendConfig to firewall my services right now, because it can potentially expose my services to the WAN, without throwing an error, even when my desired state explicitly says otherwise.

I will gladly take advice here. Thanks for your help.

bowei added the kind/feature label Jan 23, 2019
bowei (Member) commented Jan 23, 2019

There has been discussion about making load balancer status more extensible via CRDs. The easiest thing to do is to expose it via an annotation.

bowei (Member) commented Jan 23, 2019

/lifecycle frozen

k8s-ci-robot added the lifecycle/frozen label Jan 23, 2019
jpigree (Author) commented Jan 23, 2019

Hi. Thank you @bowei for your quick answer, but I don't think I understand it. The BackendConfig object is already a CRD, right?

Actually, to better explain my problem, I followed this document:
=> https://cloud.google.com/kubernetes-engine/docs/how-to/cloud-armor-backendconfig

My main issues are:

  • I hit a nasty edge case where I set up the BackendConfig securityPolicy without errors, but my services are not firewalled as they should be. And I have trouble finding logs or statuses in Kubernetes and GCP to actually understand what is happening.
  • Is it safe to remove a BackendConfig object with a securityPolicy set and recreate it later, multiple times?
  • The documentation says we should empty the BackendConfig field to detach the policy from the ingress, but what happens when I delete the object entirely?

Anyway, I wasn't able to find reproducible steps yet because the bug is slippery. For example, I kept a cluster running for days on purpose with the edge case from above, but between yesterday and today, without my intervention, it just fixed itself. I guess something happened in the Google-managed part of my cluster.

jpigree changed the title from "BackendConfig security policy" to "BackendConfig security policy not enforced" Jan 23, 2019
MrHohn (Member) commented Jan 23, 2019

Indeed, I often end up in a state where the Cloud Armor policy is not enforced without any change on the BackendConfig object, and the only way to make it work again is to empty the "security policy" field or recreate the BackendConfig object until it works.

This sounds like a bug in the controller. Is your mitigating step just "triggering an update" on the BackendConfig / Ingress?

The documentation says we should empty the BackendConfig field to detach the policy from the ingress, but what happens when I delete the object entirely?

If the BackendConfig object is deleted entirely, the SecurityPolicy will not be detached from the corresponding Load Balancing resource. Deleting a BackendConfig basically means leaving the Load Balancer as is (without resetting the configuration previously provided by that BackendConfig).
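
If the intent is to also detach the policy, clear the field first and only then delete the object. Roughly, a sketch with kubectl (using the internal-http name from your manifests; the exact patch body is just illustrative):

# Clear the securityPolicy so the controller detaches it from the backend service
kubectl patch backendconfig internal-http --type merge -p '{"spec":{"securityPolicy":null}}'
# Give the controller time to sync the load balancer, then delete the object
kubectl delete backendconfig internal-http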

jpigree (Author) commented Jan 23, 2019

Is your mitigating step just "triggering an update" on the BackendConfig / Ingress?

What I did most of the time is:

  • empty the "securityPolicy.name" field in the BackendConfig object
  • set the "securityPolicy.name" field in the BackendConfig back to the Cloud Armor policy name I want enforced

So yeah, I trigger an update in the BackendConfig.
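
Concretely, it is roughly the following (a sketch with kubectl patch; the policy name placeholder is the same one as in my manifests):

# 1. Empty the securityPolicy name
kubectl patch backendconfig internal-http --type merge -p '{"spec":{"securityPolicy":{"name":""}}}'
# 2. Wait a bit, then set it back to the Cloud Armor policy I want enforced
kubectl patch backendconfig internal-http --type merge -p '{"spec":{"securityPolicy":{"name":"<cloud armor policy name>"}}}'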

I do not touch the ingress myself. However, my Helloworld app's continuous deployment pipeline recreates the ingress on every deploy. I imagine that triggers a Load Balancer recreation, right? This may be a factor that obscured the problem since it can trigger at any time. I will disable it during my tests.

I think the policy takes a few minutes to be enforced. This does not help when testing.

This sounds like a bug in the controller. Is your mitigating step just "triggering an update" on the BackendConfig / Ingress?

Looks like it. But I want to be sure. I am trying to isolate the problem again with the info you gave me. I will keep you updated. Thanks for your help.

MrHohn (Member) commented Jan 23, 2019

@jpigree Thanks for the info. It might be helpful to also check the ingress controller logs if you have access to the master (I know many don't).

jpigree (Author) commented Jan 24, 2019

@MrHohn I don't have access to the ingress controller logs, and "kubectl describe backendconfig" does not print any information about the policy enforcement status. This complicates troubleshooting.

I have a question though. I looked at the load balancer created by the gce ingress controller in the GCP console, and I wonder why I have two backend services for the same instance group: one with the security policy activated and the other without.

Could this be an issue? I checked, but I still have those two backend services even when the policy is enforced successfully.

[screenshot: httploadbalancer]

I am still running my reproduction scripts, but the security policy can take a pretty long time to be enforced (> 10 minutes), so I added waits, which really slow my tests down.

jpigree (Author) commented Jan 24, 2019

Hi again. I have a few results. I couldn't reproduce the bug consistently, but I can show logs proving that the BackendConfig does not work very well.

@MrHohn Can I send them to you by email? I will have to anonymise everything otherwise.

MrHohn (Member) commented Jan 24, 2019

@jpigree Sure thing, please send them to zihongz@google.com, thanks!

MrHohn (Member) commented Jan 24, 2019

Could this be an issue? I checked, but I still have those two backend services even when the policy is enforced successfully.

This is expected. The Ingress controller creates one backendService (a Google Cloud resource) for each linked NodePort Service (a k8s resource), and they all share the same unmanaged instance group. In your case, one of the backendServices may have been created for the default backend, which is deployed upon cluster creation.
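
If you want to map them, something like this should help (just a sketch; the generated names typically follow a pattern like k8s-be-<nodePort>--<suffix>, so the nodePort is what ties a backendService back to a k8s Service):

# List the backend services the controller created
gcloud compute backend-services list
# Find the nodePort of your service to match it against the generated name
kubectl get service helloworld -o jsonpath='{.spec.ports[0].nodePort}'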

Will post more updates after reading through the logs.

MrHohn (Member) commented Jan 25, 2019

@jpigree Thanks for the detailed info in the email. I followed similar procedures to your test but wasn't able to reproduce the issue --- I confirmed the security policy is attached after various combinations of actions (detaching or not detaching the backendConfig before Ingress recreation, etc.).

One observation is that in the test log you sent over, I don't see any "Security Policy is not enforced" log printed out. Instead, I only saw "Security Policy was not set" logs, which indicates the test timed out waiting for a 503 code to be returned. This might be because the corresponding load balancing resources took too long to become ready, rather than because the security policy is not enforced.

Can you check if that is the case (e.g. check the backendService resource directly to see if the security policy is attached when the test times out)?
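
For example, something along these lines (just a sketch; <backend-service-name> is a placeholder for the backend service generated for your helloworld service):

# Prints the attached policy's URL; empty output means no Cloud Armor policy is attached
gcloud compute backend-services describe <backend-service-name> --global --format='value(securityPolicy)'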

Though I saw the timeout in the test is set to 20*40=800 seconds. Not sure why the LB resource took so long to be provisioned.

jpigree (Author) commented Jan 25, 2019

@MrHohn Thanks for the review. Yes, I couldn't reproduce with my scripts because I kept hitting timeouts. I did kind of reproduce it one time: I got "HTTP 200" for 10 minutes or so before the security policy kicked in. I don't know if this is considered acceptable. I didn't send you those logs because my scripts weren't finished yet (the code wouldn't match the logs). But after that, I mostly kept hitting timeouts on successive runs despite them being huge.

This is strange because with the nginx-ingress-controller, which spawns a TCP Load Balancer, I never waited more than 5 minutes.

So you tried on another cluster with the same version as mine ("1.11.5-gke.5")? And you didn't have any timeout issues and your policy was successfully attached? Did you test by verifying in the console or by accessing the URL?

Another possibility is that my problem came from Cloud Armor. Because there are many managed parts and very few accessible logs (my company won't activate Stackdriver for this project), it is hard for me to identify what went wrong.

I think I will try again to reproduce tomorrow, and if I can't, I will just close the ticket. However, even if I do reproduce it once, I don't think it will help you much to identify it, as it doesn't happen consistently. I will add a check to my tests using gcloud to know if the API confirms that the policy is attached or not after each curl; see the sketch below.
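
Roughly what I have in mind (a sketch; <backend-service-name> is a placeholder for the generated backend service):

# Probe the URL and, after each curl, ask the API whether the policy is attached
for i in $(seq 1 40); do
  code=$(curl -s -o /dev/null -w '%{http_code}' "https://<helloworld fqdn>/hello")
  policy=$(gcloud compute backend-services describe "<backend-service-name>" --global --format='value(securityPolicy)')
  echo "$(date -u +%H:%M:%S) http=${code} securityPolicy=${policy:-none}"
  sleep 20
done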

Thanks for your time.

bowei added the kind/bug label Jan 28, 2019
jpigree (Author) commented Jan 28, 2019

@MrHohn So I did other tests and I still couldn't reproduce. Actually, I think either it works way better than before (perhaps due to Google Cloud upgrades/fixes) or I just know better what to expect. Indeed, the inertia (sometimes more than 15 minutes) between the k8s object creation and the policy enforcement could have misled me.

Moreover, I also added the policy attachment status to the logs, and it shows clearly that the gce-ingress-controller does its job. My main pain point now is the random HTTP 200, HTTP 502, or network errors I get during the "LB init time". But this is due to the LB or Cloud Armor, so it isn't related to this ticket. I will need to check their SLA.

I attached a few logs to this comment to illustrate all I said.

logs.TXT
logs-containing-http200.TXT

The best way to view them is to download them and do "cat FILES | less".

Just in case, I implemented a cron validation CI job which checks whether my services are firewalled or not. So, if the issue happens again, I will have logs.

I will close the ticket. Thanks for your time.

jpigree closed this as completed Jan 28, 2019
MrHohn (Member) commented Jan 29, 2019

@jpigree Thanks for the updates. Indeed, from your logs I can now see the 200s happen without the security policy enforced. As you mentioned, this likely happened during the initialization of the load balancing resources, and we have had users report before that random codes may be returned before load balancing is ready.

IMHO, the fact that the Google Cloud LB resource doesn't report status also makes adding status to the k8s Ingress harder.

rramkumar1 removed the kind/feature and kind/bug labels Feb 18, 2019