500 errors with queue-proxy in knative versions higher than 0.24.0 #12387
Comments
Hi @adriangudas, this could be the known Golang issue, but let's make sure you don't see any errors on the user-container side, e.g. the application closing the connection for some reason. Could you try hitting the application endpoint, either locally in that container or via a client, to see if there are any failures?
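A minimal sketch of the kind of direct check being suggested here: repeatedly hit the app endpoint and report any transport errors or 5xx responses. The target URL and request count below are placeholders, not part of the original suggestion.

```go
// probe.go — hammer the app endpoint directly to rule out the user container
// closing connections on its own.
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	const target = "http://127.0.0.1:8080/" // placeholder: the user container's port

	for i := 0; i < 1000; i++ {
		resp, err := http.Get(target)
		if err != nil {
			fmt.Printf("request %d: transport error: %v\n", i, err)
			continue
		}
		if resp.StatusCode >= 500 {
			fmt.Printf("request %d: status %d\n", i, resp.StatusCode)
		}
		io.Copy(io.Discard, resp.Body) // drain so keep-alive connections get reused
		resp.Body.Close()
	}
}
```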
This issue is stale because it has been open for 90 days with no activity.
/triage accepted
I'm on @adriangudas's team and we've been testing this issue on knative version
The application we're testing against has not changed since last time, and was unchanged from when we were using knative
@dprotaso Let me know if you need help obtaining debug information.
I prodded the upstream golang issue - golang/go#40747 (comment). We leverage the
I tried grpc-ping-go with knative serving and hit the exact same problem.
@kahirokunn
Also, here is our relevant revision info. Not sure whether downgrading our Knative revision would resolve this?
$ k get svc -n istio-system knative-local-gateway -oyaml | kubectl-neat
apiVersion: v1
kind: Service
metadata:
  labels:
    experimental.istio.io/disable-gateway-port-translation: "true"
    networking.knative.dev/ingress-provider: istio
    serving.knative.dev/release: v0.26.0
  name: knative-local-gateway
  namespace: istio-system
spec:
  clusterIP:
FWIW - the golang upstream fix should land in go1.21 (comes out in August). Also - I'd retest against the latest knative to make sure this is still a problem.
Adding to the v1.12.0 release since we'll probably be able to switch to go1.21 for that release.
@dprotaso I tested the new fix and it seems to work.
You just need the following modification in the reproducer, and to build with Go 1.21:
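The exact modification isn't quoted above, but based on the linked Go change (golang/go#57786) it presumably amounts to opting the proxying handler in to full-duplex HTTP/1 via http.ResponseController.EnableFullDuplex, which is new in Go 1.21. A minimal sketch, with placeholder ports rather than the real reproducer:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Placeholder for the user container; queue-proxy forwards to 127.0.0.1:8080.
	backend, _ := url.Parse("http://127.0.0.1:8080")
	proxy := httputil.NewSingleHostReverseProxy(backend)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// New in Go 1.21: allow the handler to keep reading the request body
		// after it has started writing the response, instead of the server
		// closing the read side (the source of "use of closed network connection").
		if err := http.NewResponseController(w).EnableFullDuplex(); err != nil {
			log.Printf("full duplex not supported: %v", err)
		}
		proxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8012", handler)) // :8012 is a stand-in for queue-proxy's port
}
```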
A few concerns: having this method enabled by default is probably not what we want, due to the deadlock issue.
It'd be great to know which clients. I also wonder if this concern is mitigated because we have an Envoy proxy (or something else) in front of the activator/queue-proxy.
Good point about this being workload-dependent.
Does it only happen if the user-container is written in Python?
@tikr7 we've been seeing this issue with the
@skonto having the behaviour configurable via an annotation on the revision would be extremely handy. Depending on the defaults chosen, we can pick the affected services on our end and have them opt in/out.
I will take a look. /assign @skonto
@skonto thanks for the tests and repo 🎉 As there is no clear answer as to which clients could break with this (golang/go#57786 (comment)), @dprotaso what do you think about adding this behind a feature flag and then transitioning from disabled to enabled by default, but always keeping a flag to disable this behaviour? This causes a lot of problems for some of our users, and we could ask them to try with full-duplex enabled and see if that causes any issues in a real environment.
@ReToCode my goal, btw, is to have some annotation to enable it per workload and turn it on/off on demand.
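Purely as an illustration of that idea (the annotation key below is made up, not whatever name Knative ends up using): queue-proxy would only opt in to full duplex for revisions that ask for it.

```go
package main

import (
	"fmt"
	"net/http"
)

// Hypothetical annotation key, for illustration only.
const fullDuplexAnnotation = "example.knative.dev/http-full-duplex"

// wrapWithOptionalFullDuplex enables full-duplex HTTP/1 (Go 1.21+) only when the
// revision's annotations opt in, leaving today's behaviour as the default.
func wrapWithOptionalFullDuplex(next http.Handler, annotations map[string]string) http.Handler {
	enabled := annotations[fullDuplexAnnotation] == "Enabled"
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if enabled {
			_ = http.NewResponseController(w).EnableFullDuplex()
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	app := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	// Example: this workload has opted in via its (hypothetical) annotation.
	annotations := map[string]string{fullDuplexAnnotation: "Enabled"}
	http.ListenAndServe(":8012", wrapWithOptionalFullDuplex(app, annotations))
}
```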
@skonto thanks for looking into this! In our Knative cluster we don't generally use the activator for the vast majority of the services (for context, we're running ~40 KN services in the cluster). I presume the proposed changes would still work fine even if the activator is disabled for the services?
Hi @moadi.
Yes, it should. The idea is that without the activator on the path, the QP (queue-proxy) should handle the full-duplex support as well.
What version of Knative?
Expected Behavior
No 500 errors.
Actual Behavior
Sporadic 500 errors being presented to the client. From the queue-proxy logs on the service, we see the following error message:
httputil: ReverseProxy read error during body copy: read tcp 127.0.0.1:51376->127.0.0.1:8080: use of closed network connection
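For context, a rough sketch of the topology that log line comes from (this is not the actual queue-proxy source): queue-proxy acts as an httputil.ReverseProxy in front of the user container on 127.0.0.1:8080, and the "httputil: ReverseProxy read error during body copy" prefix is emitted by net/http/httputil itself when the body copy to or from the backend fails mid-stream.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// The user container, matching the 127.0.0.1:8080 address in the log line above.
	userContainer, _ := url.Parse("http://127.0.0.1:8080")
	proxy := httputil.NewSingleHostReverseProxy(userContainer)
	proxy.ErrorLog = log.Default() // the quoted error is written through this logger when set

	// queue-proxy sits in front of the app; :8012 here is just a stand-in port.
	log.Fatal(http.ListenAndServe(":8012", proxy))
}
```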
Steps to Reproduce the Problem
We were able to reproduce these errors with knative 1.0.0, as well as knative 0.26.0 and 0.25.1. We're not seeing an issue in 0.24.0. I wasn't able to reproduce it using hey and helloworld-go, so it appears workload-dependent to some degree; however, we're not doing anything unusual, except that our responses are understandably larger than what helloworld-go sends out, this being an actual production API workload.
The count of 500s is fairly small (5 occurrences over several thousand requests), but it's higher than the rate of zero errors we were getting before, and we can't explain it. There doesn't appear to be any anomalous behaviour in the app itself that could be causing the issues (memory/CPU are stable, etc.).
Once we moved back to knative 0.24.0, the problem went away. (The knative serving operator makes it really easy to move between versions 💯 )
Some digging in Go's repo resulted in an interesting lead, which might be related, and points to an issue in Go rather than queue-proxy specifically.
I was wondering if this is on the right track, or if anyone else is seeing this in their logs. This is preventing us from being able to upgrade to 1.0.0 in production.
Thanks!