-
-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
500 error for probably cancelled request (must be 499) #3748
Comments
499 is an nginx-ism, not a standard. The error is technically something that happened within the server, even if the root cause is something from the client; but there is no way to know, the same thing could happen if the server is force-closed after a grace period during a shutdown. I don't really like deviating from standards or making up status codes and such; and I think 5xx (and 500 for lack of a better code) is probably the best fit anyway. The good thing about our logs is that they're structured, so you can easily ignore the logs you don't want/like. |
@mholt but we can't see any related fields in structured logs for Here is formatted logs related to one 500-error: {
"level":"error",
"ts":1600849986.420132,
"logger":"http.log.error.log0",
"msg":"making dial info: lookup backend.service.consul on 123.123.123.123:53: dial udp 123.123.123.123:53: operation was canceled",
"request":{
"method":"GET",
"uri":"/favicon.ico",
"proto":"HTTP/1.1",
"remote_addr":"222.222.222.222:38284",
"host":"some-site.com",
"headers":{
"Accept":[
"image/webp,image/apng,image/*,*/*;q=0.8"
],
"Referer":[
"https://some-site.com/en/terms"
],
"Accept-Encoding":[
"gzip, deflate"
],
"Accept-Language":[
"id-ID,en-US;q=0.9"
],
"Cookie":[
"session=ffffffffffffffffffffffffffffffff"
],
"Connection":[
"keep-alive"
],
"User-Agent":[
"Mozilla/5.0 (Linux; Android 10; vivo 1818; wv) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36 VivoBrowser/6.7.0.2"
]
},
"tls":{
"resumed":true,
"version":771,
"ciphersuite":49196,
"proto":"http/1.1",
"proto_mutual":true,
"server_name":"some-site.com"
}
},
"duration":0.000442598
}
{
"level":"error",
"ts":1600849986.420243,
"logger":"http.log.access.log0",
"msg":"handled request",
"request":{
"method":"GET",
"uri":"/favicon.ico",
"proto":"HTTP/1.1",
"remote_addr":"222.222.222.222:38284",
"host":"some-site.com",
"headers":{
"Cookie":[
"session=ffffffffffffffffffffffffffffffff"
],
"Connection":[
"keep-alive"
],
"User-Agent":[
"Mozilla/5.0 (Linux; Android 10; vivo 1818; wv) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36 VivoBrowser/6.7.0.2"
],
"Accept":[
"image/webp,image/apng,image/*,*/*;q=0.8"
],
"Referer":[
"https://some-site.com/en/terms"
],
"Accept-Encoding":[
"gzip, deflate"
],
"Accept-Language":[
"id-ID,en-US;q=0.9"
]
},
"tls":{
"resumed":true,
"version":771,
"ciphersuite":49196,
"proto":"http/1.1",
"proto_mutual":true,
"server_name":"some-site.com"
}
},
"common_log":"222.222.222.222 - - [23/Sep/2020:08:33:06 +0000] \"GET /favicon.ico HTTP/1.1\" 500 0",
"duration":0.000442598,
"size":0,
"status":500,
"resp_headers":{
"Server":[
"Caddy"
],
"X-Request-Id":[
"9d882e60bc434c229094a434729fd32c"
]
}
} |
I'm not sure I follow. What does this have to do with joining two log messages? I thought we were talking about status codes... |
@mholt we have very strange 500 errors from caddy in access logs. This errors trigger false alerts in our monitoring system. We think that these errors are caused by clients closing connection and causing golang context cancellation. If caddy logs this errors as 500 (and not 499) then we need another method for filtering them out from our logs for eliminating false alerts. And currently there is no any other fields in json log which can be used for this filtration. Currently error message exists only in |
Hmm. I feel like this is a shortcoming in the log processing / monitoring app, and not the actual log message, but will reopen for consideration. In the meantime, please ask the developer of your alerting software to support filtering status 500 logs. |
Related to caddyserver#3748 Error message in access log can be used in monitoring/alerting systems. It can be used for example for filtering out "unimportant" 500 errors from access log.
I'm having the same issue regarding the logs... specifically the I agree with @mholt that deviating from standards is not a good path, but I also feel that treating this scenario as a pure 500 is going to mislead a lot of people on troubleshooting as, from my perspective, a 500 is an unknown and unrecoverable error usually generated on the server-side. According to https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#Unofficial_codes some projects and companies use unofficial codes to represent some known failure scenario. Would be great to have a way to either, don't log these client canceled requests at all, or have them categorized differently, for example with 499. |
Adding this here for more contextualization, since it happened in the past and was solved the way it is being suggested here: |
Fixed in #3816 (I don't love it, because of how the Go standard lib hides the actual, standard, exported error value with some unexported garbage, but oh well) |
@mholt @danlsgiga thanks for patch! Probably my PR #3779 can be closed (cancelled) then |
Problem
There are strange 500 errors in caddy access log. After some research we suppose that this errors related to cancelled requests. Currently we have problem to separate cancelled requests from real 500 errors.
Possible solution
Nginx use 499 error code for such situation.
Research
Example logs entries related to 500 error during
GET /favicon.ico
:The main error is
making dial info: lookup backend.service.consul on 123.123.123.123:53: dial udp 123.123.123.123:53: operation was canceled
This is related to code https://github.com/caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/reverseproxy.go#L385-L388:
... and happens only when error occurs during DNS resolving https://github.com/caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/hosts.go#L148-L151:
Code for
LookupSRV
probably returnsoperation cancelled
when provided context was cancelled.Taking into account that request duration was short (it is not timeout) we assume that such error relates to client which suddenly close its request.
The text was updated successfully, but these errors were encountered: