
readinessProbe w/http-admin does not create valid health check on GKE load balancer #113

Closed
NoelJames opened this issue Jan 31, 2020 · 10 comments


@NoelJames

NoelJames commented Jan 31, 2020

Describe the bug

On GKE, the load balancer fails because a valid health check for the public service is not created.

To Reproduce

My fan-out Ingress contains multiple hosts; the Hydra section looks like this:

    - host:  hydra.example.com
      http:
        paths:
        - backend:
            serviceName: releasename-hydra-public
            servicePort: http

Expected behavior

A valid health check would be created on the GKE load balancer.

Environment

  • GKE
  • HELM 3
  • hydra helm chart v0.0.43

Additional context

Changing the readiness probe in deployment.yaml as follows fixes the issue for me:

          readinessProbe:
            httpGet:
              path: /health/ready
              port: http-public
@christian-roggia
Contributor

christian-roggia commented Mar 9, 2020

The problem here is a little more complex than that. GCE / GKE ingress has many limitations, and among them is the problem that it doesn't correctly pick up the readinessProbe path on many occasions. This has bitten me more times than I would like to admit; debugging this issue is extremely time-consuming, and the relevant information is not transparent at all on GCP.

Now, the issue is the following:

  1. It is not possible to provide multiple readinessProbes on a single container.
  2. GCE ingress correctly picks up the health check path /health/ready for http-admin.
  3. GCE doesn't find any readinessProbe rule for http-public and falls back to the default / path.
  4. The default / path returns 404 instead of 200, so the health check fails, the backend is reported UNHEALTHY, and the Ingress won't come online.

Switching the readiness probe to http-public will cause http-admin to fail, I'm afraid. The same problem will occur with the ports switched: one succeeds and the other fails.

Merging the two ingress together in a single ingress doesn't fix the issue.

A solution to this issue might be health check configuration via the BackendConfig CRD (see kubernetes/ingress-gce#1010 and https://cloud.google.com/kubernetes-engine/docs/concepts/backendconfig), which will be included in version 1.10 of the GCE ingress.

Another solution would be to make two separate deployments for hydra-public and hydra-admin.

@aeneasr
Member

aeneasr commented Mar 13, 2020

Wow thank you for the detailed write-up! That sounds really frustrating and should definitely be fixed. I think we can deploy two instances of Hydra to resolve this issue on GKE.

However, this would not work with the in-memory database, which is what some deployments currently use. We are, however, thinking about removing in-memory in favor of SQLite (which also supports in-memory but would use a mount in Helm).

Is there any other way we can work around this for GKE?

Personally, I have to say that I had so many issues with the GKE Ingress, from being very slow to update to not supporting basic features like path rewrites, that we ended up using the NGINX ingress on GKE. While this doesn't support some features like Global Forwarding Rules (I think that's the name?), it doesn't actually cause 20-minute downtimes when the GCE ingress is updating :D

@christian-roggia
Contributor

Hello @aeneasr, I'm glad my insights about this issue were useful!

I agree that GCE ingress isn't where it should be; it's an obsolete piece of software and its development is moving forward at a very slow pace. On the other hand, as you already mentioned, it is the default ingress on GCP and it supports some Google-specific features that NGINX and other ingresses do not.

I am looking forward to the SQLite solution and I think it's a step in the right direction for this specific GKE-related issue.

Is there any other way we can work around this for GKE?

There is this issue kubernetes/ingress-gce#647 that describes a problem similar to this one. Maybe quickly going through the ticket might give some ideas on how to deal with it.

A quick workaround to solve this issue could be what is described in kubernetes/ingress-gce#674: return a 200 HTTP status on the root path / of the application when the User-Agent has the prefix GoogleHC. It's not too ugly or complex, and it's highly unlikely (if not impossible) that this User-Agent will be used by your customers for other purposes.
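For illustration, a minimal sketch of that workaround as HTTP middleware in Go (the function names are my own, not from Hydra; GoogleHC is the User-Agent prefix Google Cloud health check probes actually send):

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// isGoogleHC reports whether a request looks like a Google Cloud
// health check probe hitting the root path.
func isGoogleHC(userAgent, path string) bool {
	return path == "/" && strings.HasPrefix(userAgent, "GoogleHC")
}

// withGoogleHC wraps a handler so GCE health check probes on "/"
// receive 200 OK instead of the application's usual 404.
func withGoogleHC(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isGoogleHC(r.UserAgent(), r.URL.Path) {
			w.WriteHeader(http.StatusOK)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	fmt.Println(isGoogleHC("GoogleHC/1.0", "/")) // true
	fmt.Println(isGoogleHC("curl/7.68.0", "/"))  // false
	// Wire withGoogleHC around your real mux when building the server.
}
```

Real browsers and API clients never send that User-Agent, so ordinary traffic still falls through to the application handler unchanged.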

As an additional note, this issue kubernetes/ingress-gce#42 might also be interesting for this issue.

@NoelJames
Author

Yeah, I considered the status change on / too, but it felt wrong. I'm thinking two Hydra instances might be the simplest approach.

Thanks for the insights on GCE load balancers! I didn't really know that was a thing (but I recall other issues 🤷‍♀).

Thanks

PS: Hydra is awesome!

@christian-roggia
Contributor

A working solution is now available.

/ping @NoelJames @aeneasr

GKE supported versions

NOTE: This solution works only from GKE version 1.17.6-gke.11 according to https://cloud.google.com/kubernetes-engine/docs/how-to/ingress-features - BackendConfig - Custom load balancer health check configuration.

VPC-native and Network Endpoint Group (NEG)

If Network Endpoint Groups (NEG), aka VPC-native networking, are enabled in your cluster, you first need to execute the following command to create a new firewall rule:

gcloud compute firewall-rules create hydra-http-public --source-ranges=130.211.0.0/22,35.191.0.0/16 --network=default --allow=tcp:4444

According to Google, this is a short-term workaround:

The short-term workaround (until automated NEG-based health checks are supported) is to manually deploy a firewall rule to allow Google Cloud health check probes to access NEG IP:port endpoints directly.

Solution

The following resource has to be manually created before the deployment of the Service / Ingress:

apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: http-public
  namespace: hydra
spec:
  healthCheck:
    checkIntervalSec: 5
    timeoutSec: 3
    healthyThreshold: 1
    unhealthyThreshold: 3
    type: HTTP
    requestPath: /health/ready
    port: 4444

The following annotations have to be added to the Service of the public endpoint:

cloud.google.com/neg: '{"ingress": true}'
cloud.google.com/backend-config: '{"default": "http-public"}'
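Putting the pieces together, a sketch of what the annotated public Service might look like (the selector labels are illustrative assumptions; the name, namespace, and port match the examples above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: releasename-hydra-public
  namespace: hydra
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
    cloud.google.com/backend-config: '{"default": "http-public"}'
spec:
  ports:
    - name: http-public
      port: 4444
      targetPort: http-public
      protocol: TCP
  selector:
    app.kubernetes.io/name: hydra  # illustrative; must match your pod labels
```

The backend-config annotation value refers to the BackendConfig resource by name, so the resource must exist in the same namespace before the Ingress is reconciled.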

@aeneasr
Member

aeneasr commented Aug 3, 2020

Awesome, thank you for the update! Does that mean that we need to change something in the chart?

@christian-roggia
Contributor

I would like to work on a PR for this specific issue, but in case I'm not able to work on it in the short term, I'd like to lay down some recommendations for whoever would like to propose a PR before I do:

Limitations

There should be a check on the Kubernetes version; as mentioned above, this feature was introduced only starting from version 1.17.6-gke.11. I am pretty confident that Helm can verify the current cluster version and disable features on unsupported releases.
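As a sketch of such a guard in a chart template (semverCompare and .Capabilities.KubeVersion.Version are standard Helm 3 features; note that the GKE-specific patch suffix gke.11 can't be compared with plain semver, so this can only gate on the minor version):

```yaml
{{- /* Render GKE BackendConfig resources only on clusters new enough to support them. */}}
{{- if semverCompare ">=1.17-0" .Capabilities.KubeVersion.Version }}
# BackendConfig and related annotations go here
{{- end }}
```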

Annotations

The following annotations should be configured automatically, in my opinion, as they are not trivial and it takes quite some time to find them in the GKE documentation (they are buried in some exotic pages about load balancers and Ingresses):

cloud.google.com/neg: '{"ingress": true}'
cloud.google.com/backend-config: '{"default": "http-public"}'

BackendConfig

The following resource should be added to the resources of the chart and should be toggled by the typical enabled flag:

apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: http-public
  namespace: hydra
spec:
  healthCheck:
    checkIntervalSec: 5
    timeoutSec: 3
    healthyThreshold: 1
    unhealthyThreshold: 3
    type: HTTP
    requestPath: /health/ready
    port: 4444

Template

The values.yaml should probably implement something along the lines of:

# The following configuration enables a custom BackendConfig and HealthCheck on GKE.
# This configuration *must* be enabled if you want to use an Ingress on the "public" endpoint on GKE.
# If you want to enable TLS on this port, change the protocol to "HTTPS"; additionally, you will need to add the annotation "cloud.google.com/app-protocols: '{"4444": "HTTPS"}'" to the "public" Service.
# If you are running a VPC-native cluster, please check the issue https://github.com/ory/k8s/issues/113 for current limitations.
backendConfig:
  enabled: false
  path: /health/ready
  port: 4444
  protocol: HTTP
  interval: 60
  timeout: 60
  healthyThreshold: 1
  unhealthyThreshold: 10

The backend-config.yaml should probably look like this:

{{- if .Values.backendConfig.enabled  }}
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: {{ include "hydra.fullname" . }}
spec:
  healthCheck:
    checkIntervalSec: {{ .Values.backendConfig.interval }}
    timeoutSec: {{ .Values.backendConfig.timeout }}
    healthyThreshold: {{ .Values.backendConfig.healthyThreshold }}
    unhealthyThreshold: {{ .Values.backendConfig.unhealthyThreshold }}
    type: {{ .Values.backendConfig.protocol }}
    requestPath: {{ .Values.backendConfig.path }}
    port: {{ .Values.backendConfig.port }}
{{- end }}

Additional information

As mentioned before this CDR does not work properly for VPC-native clusters, therefore I find it appropriate to point out this issue, the documentation of Google, or a warning that mentions the workaround necessary, i.e. an additional firewall rule that has to be configured manually:

gcloud compute firewall-rules create hydra-http-public --source-ranges=130.211.0.0/22,35.191.0.0/16 --network=default --allow=tcp:4444

Finally, if you want to enable TLS, you have to follow these steps:
a) spec.healthCheck.type in the BackendConfig must be set to HTTPS.
b) The Service requires an additional annotation: cloud.google.com/app-protocols: '{"4444": "HTTPS"}', where 4444 is the port number on which TLS has been enabled.
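Putting both steps together, a sketch of the TLS variant (names and values mirror the examples above; the Service spec itself is unchanged from the plain-HTTP setup):

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: http-public
  namespace: hydra
spec:
  healthCheck:
    type: HTTPS              # step a): probe the backend over TLS
    requestPath: /health/ready
    port: 4444
---
apiVersion: v1
kind: Service
metadata:
  name: releasename-hydra-public
  namespace: hydra
  annotations:
    cloud.google.com/backend-config: '{"default": "http-public"}'
    cloud.google.com/app-protocols: '{"4444": "HTTPS"}'  # step b)
# spec omitted; identical to the plain-HTTP Service
```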

The second point cost me an entire day of GKE documentation and tests.
Hopefully, this comment will save some time for somebody else who wants to enable TLS with a custom backend and health checks.

@aeneasr
Member

aeneasr commented Aug 4, 2020

Awesome, thank you for the great write-up! This will certainly help with implementation. I'll also not be able to work on this in the near future, so if anyone wants to pick this up, please do :)

My only suggestion would be to make Values.backendConfig obviously GKE-only, maybe with Values.gkeBackendConfig or something along those lines.

@theFong

theFong commented Jan 15, 2021

Thank you @christian-roggia for the detailed answer and follow-ups! I can tell it saved me a bunch of time :)

@aeneasr
Member

aeneasr commented Sep 18, 2021

I am closing this issue as it has not received any engagement from the community or maintainers in a long time. That does not imply that the issue has no merit. If you feel strongly about this issue:

  • open a PR referencing and resolving the issue;
  • leave a comment on it and discuss ideas how you could contribute towards resolving it;
  • open a new issue with updated details and a plan on resolving the issue.

We clean up issues every now and then, primarily to keep the 4000+ issues in our backlog in check and to prevent maintainer burnout. Burnout in open source maintainership is a widespread and serious issue. It can lead to severe personal and health issues as well as enabling catastrophic attack vectors.

Thank you to anyone who participated in the issue! 🙏✌️
