Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

500 error for probably cancelled request (must be 499) #3748

Closed
ont opened this issue Sep 23, 2020 · 9 comments
Closed

500 error for probably cancelled request (must be 499) #3748

ont opened this issue Sep 23, 2020 · 9 comments
Labels
feature ⚙️ New feature or request
Milestone

Comments

@ont
Copy link

ont commented Sep 23, 2020

Problem

There are strange 500 errors in caddy access log. After some research we suppose that this errors related to cancelled requests. Currently we have problem to separate cancelled requests from real 500 errors.

Possible solution

Nginx use 499 error code for such situation.

Research

Example logs entries related to 500 error during GET /favicon.ico:

{"level":"error","ts":1600849986.420132,"logger":"http.log.error.log0","msg":"making dial info: lookup backend.service.consul on 123.123.123.123:53: dial udp 123.123.123.123:53: operation was canceled","request":{"method":"GET","uri":"/favicon.ico","proto":"HTTP/1.1","remote_addr":"222.222.222.222:38284","host":"some-site.com","headers":{"Accept":["image/webp,image/apng,image/*,*/*;q=0.8"],"Referer":["https://some-site.com/en/terms"],"Accept-Encoding":["gzip, deflate"],"Accept-Language":["id-ID,en-US;q=0.9"],"Cookie":["session=ffffffffffffffffffffffffffffffff"],"Connection":["keep-alive"],"User-Agent":["Mozilla/5.0 (Linux; Android 10; vivo 1818; wv) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36 VivoBrowser/6.7.0.2"]},"tls":{"resumed":true,"version":771,"ciphersuite":49196,"proto":"http/1.1","proto_mutual":true,"server_name":"some-site.com"}},"duration":0.000442598} 
{"level":"error","ts":1600849986.420243,"logger":"http.log.access.log0","msg":"handled request","request":{"method":"GET","uri":"/favicon.ico","proto":"HTTP/1.1","remote_addr":"222.222.222.222:38284","host":"some-site.com","headers":{"Cookie":["session=ffffffffffffffffffffffffffffffff"],"Connection":["keep-alive"],"User-Agent":["Mozilla/5.0 (Linux; Android 10; vivo 1818; wv) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36 VivoBrowser/6.7.0.2"],"Accept":["image/webp,image/apng,image/*,*/*;q=0.8"],"Referer":["https://some-site.com/en/terms"],"Accept-Encoding":["gzip, deflate"],"Accept-Language":["id-ID,en-US;q=0.9"]},"tls":{"resumed":true,"version":771,"ciphersuite":49196,"proto":"http/1.1","proto_mutual":true,"server_name":"some-site.com"}},"common_log":"222.222.222.222 - - [23/Sep/2020:08:33:06 +0000] \"GET /favicon.ico HTTP/1.1\" 500 0","duration":0.000442598,"size":0,"status":500,"resp_headers":{"Server":["Caddy"],"X-Request-Id":["9d882e60bc434c229094a434729fd32c"]}}

The main error is making dial info: lookup backend.service.consul on 123.123.123.123:53: dial udp 123.123.123.123:53: operation was canceled

This is related to code https://github.com/caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/reverseproxy.go#L385-L388:

		dialInfo, err := upstream.fillDialInfo(r)
		if err != nil {
			return fmt.Errorf("making dial info: %v", err)
		}

... and happens only when error occurs during DNS resolving https://github.com/caddyserver/caddy/blob/master/modules/caddyhttp/reverseproxy/hosts.go#L148-L151:

		_, records, err := net.DefaultResolver.LookupSRV(r.Context(), "", "", srvName)
		if err != nil {
			return DialInfo{}, err
		}

Code for LookupSRV probably returns operation cancelled when provided context was cancelled.
Taking into account that request duration was short (it is not timeout) we assume that such error relates to client which suddenly close its request.

@mholt
Copy link
Member

mholt commented Sep 25, 2020

499 is an nginx-ism, not a standard. The error is technically something that happened within the server, even if the root cause is something from the client; but there is no way to know, the same thing could happen if the server is force-closed after a grace period during a shutdown.

I don't really like deviating from standards or making up status codes and such; and I think 5xx (and 500 for lack of a better code) is probably the best fit anyway.

The good thing about our logs is that they're structured, so you can easily ignore the logs you don't want/like.

@mholt mholt closed this as completed Sep 25, 2020
@mholt mholt added declined 🚫 Not a fit for this project feature ⚙️ New feature or request labels Sep 25, 2020
@ont
Copy link
Author

ont commented Sep 26, 2020

@mholt but we can't see any related fields in structured logs for http.log.access.log0... Yes we can theoretically join two log messages from http.log.error.log0 and http.log.access.log0 via similarity of ts and fields from request structure and then drop any http.log.access.log0 which has related http.log.error.log0 with msg ~= /.*operation was canceled.*/ but it is very error-prone.

Here is formatted logs related to one 500-error:

{
   "level":"error",
   "ts":1600849986.420132,
   "logger":"http.log.error.log0",
   "msg":"making dial info: lookup backend.service.consul on 123.123.123.123:53: dial udp 123.123.123.123:53: operation was canceled",
   "request":{
      "method":"GET",
      "uri":"/favicon.ico",
      "proto":"HTTP/1.1",
      "remote_addr":"222.222.222.222:38284",
      "host":"some-site.com",
      "headers":{
         "Accept":[
            "image/webp,image/apng,image/*,*/*;q=0.8"
         ],
         "Referer":[
            "https://some-site.com/en/terms"
         ],
         "Accept-Encoding":[
            "gzip, deflate"
         ],
         "Accept-Language":[
            "id-ID,en-US;q=0.9"
         ],
         "Cookie":[
            "session=ffffffffffffffffffffffffffffffff"
         ],
         "Connection":[
            "keep-alive"
         ],
         "User-Agent":[
            "Mozilla/5.0 (Linux; Android 10; vivo 1818; wv) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36 VivoBrowser/6.7.0.2"
         ]
      },
      "tls":{
         "resumed":true,
         "version":771,
         "ciphersuite":49196,
         "proto":"http/1.1",
         "proto_mutual":true,
         "server_name":"some-site.com"
      }
   },
   "duration":0.000442598
}
{
   "level":"error",
   "ts":1600849986.420243,
   "logger":"http.log.access.log0",
   "msg":"handled request",
   "request":{
      "method":"GET",
      "uri":"/favicon.ico",
      "proto":"HTTP/1.1",
      "remote_addr":"222.222.222.222:38284",
      "host":"some-site.com",
      "headers":{
         "Cookie":[
            "session=ffffffffffffffffffffffffffffffff"
         ],
         "Connection":[
            "keep-alive"
         ],
         "User-Agent":[
            "Mozilla/5.0 (Linux; Android 10; vivo 1818; wv) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36 VivoBrowser/6.7.0.2"
         ],
         "Accept":[
            "image/webp,image/apng,image/*,*/*;q=0.8"
         ],
         "Referer":[
            "https://some-site.com/en/terms"
         ],
         "Accept-Encoding":[
            "gzip, deflate"
         ],
         "Accept-Language":[
            "id-ID,en-US;q=0.9"
         ]
      },
      "tls":{
         "resumed":true,
         "version":771,
         "ciphersuite":49196,
         "proto":"http/1.1",
         "proto_mutual":true,
         "server_name":"some-site.com"
      }
   },
   "common_log":"222.222.222.222 - - [23/Sep/2020:08:33:06 +0000] \"GET /favicon.ico HTTP/1.1\" 500 0",
   "duration":0.000442598,
   "size":0,
   "status":500,
   "resp_headers":{
      "Server":[
         "Caddy"
      ],
      "X-Request-Id":[
         "9d882e60bc434c229094a434729fd32c"
      ]
   }
}

@mholt
Copy link
Member

mholt commented Sep 28, 2020

@mholt but we can't see any related fields in structured logs for http.log.access.log0... Yes we can theoretically join two log messages from http.log.error.log0 and http.log.access.log0 via similarity of ts and fields from request structure and then drop any http.log.access.log0 which has related http.log.error.log0 with msg ~= /.operation was canceled./ but it is very error-prone.

I'm not sure I follow. What does this have to do with joining two log messages? I thought we were talking about status codes...

@ont
Copy link
Author

ont commented Sep 29, 2020

I'm not sure I follow. What does this have to do with joining two log messages? I thought we were talking about status codes...

@mholt we have very strange 500 errors from caddy in access logs. This errors trigger false alerts in our monitoring system. We think that these errors are caused by clients closing connection and causing golang context cancellation. If caddy logs this errors as 500 (and not 499) then we need another method for filtering them out from our logs for eliminating false alerts. And currently there is no any other fields in json log which can be used for this filtration.
Probably adding error message to http.log.access.log0 log record will be sufficient. But it is not there.

Currently error message exists only in http.log.error.log0 log message ("msg":"making dial info: lookup backend.service.consul on 123.123.123.123:53: dial udp 123.123.123.123:53: operation was canceled") but for http.log.access.log0 record msg field contains "msg":"handled request".

@mholt
Copy link
Member

mholt commented Oct 5, 2020

Hmm. I feel like this is a shortcoming in the log processing / monitoring app, and not the actual log message, but will reopen for consideration. In the meantime, please ask the developer of your alerting software to support filtering status 500 logs.

@mholt mholt reopened this Oct 5, 2020
ont added a commit to ont/caddy that referenced this issue Oct 6, 2020
Related to caddyserver#3748

Error message in access log can be used in monitoring/alerting systems.
It can be used for example for filtering out "unimportant" 500 errors
from access log.
@danlsgiga
Copy link
Contributor

I'm having the same issue regarding the logs... specifically the http.log.error.log0 that I can't filter nor format. The reason being I want to stop Caddy from leaking sensitive information like Authorization or X-Token-* headers in the errors logs when canceled requests (context canceled) occur.

I agree with @mholt that deviating from standards is not a good path, but I also feel that treating this scenario as a pure 500 is going to mislead a lot of people on troubleshooting as, from my perspective, a 500 is an unknown and unrecoverable error usually generated on the server-side.

According to https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#Unofficial_codes some projects and companies use unofficial codes to represent some known failure scenario. Would be great to have a way to either, don't log these client canceled requests at all, or have them categorized differently, for example with 499.

@danlsgiga
Copy link
Contributor

Adding this here for more contextualization, since it happened in the past and was solved the way it is being suggested here:

#1925

@mholt mholt added this to the v2.3.0 milestone Nov 24, 2020
@mholt mholt removed the declined 🚫 Not a fit for this project label Nov 24, 2020
@mholt
Copy link
Member

mholt commented Nov 25, 2020

Fixed in #3816 (I don't love it, because of how the Go standard lib hides the actual, standard, exported error value with some unexported garbage, but oh well)

@mholt mholt closed this as completed Nov 25, 2020
@ont
Copy link
Author

ont commented Nov 25, 2020

@mholt @danlsgiga thanks for patch! Probably my PR #3779 can be closed (cancelled) then

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature ⚙️ New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants