-
Notifications
You must be signed in to change notification settings - Fork 731
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wsd: reduce noise while draining with ECONNRESET #10400
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think we can simplify =) but otherwise good! thanks.
net/Socket.hpp
Outdated
LOG_SYS_ERRNO(last_errno, "Read failed, have " << _inBuffer.size() | ||
<< " buffered bytes"); | ||
{ | ||
if (closed && last_errno == ECONNRESET) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really not convinced by the complexity of the extra parameter - and there is of course a race between the poll returning and the socket closing - so this will not be a reliable catch here.
We will be able to see the socket closed with a HUP in other logging - to diagnose why.
I would be inclined to just LOGA_INF() or somesuch for all ECONNRESETs in this case - which should simplify the patch a lot.
Is that do-able ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, even better, since we can have the same effect by checking the buffer.
That is, if there is nothing in the buffer, then the connection was idle and the peer recycled the socket.
I also reduced to trace logging to match the others in the function.
Now that we re-use connections, we have revealed a corner-case that causes some noise. Remote connections might not retain HTTP connections past their immediate purpose and rudely reset the connection after a certain timeout. Apache has such a short timeout of between 5 and 15 seconds, by default. And, it seems, HA Proxy is similar. When the connection is not reused by the client within the timeout, it is simply closed. This wakes us up and we get revents with POLLIN, POLLHUP, and POLLERR (that is, 0x19). This patch handles this case by recognizing that we often get POLLHUP and the socket is closed before doing read(2). That is, read(2) *is* expected to fail with ECONNRESET. We handle this case specifically by avoiding reporting it as an error. We will still log an error if we get ECONNRESET while reading, when we haven't been idle, even though technically that is also possible. (Ideally, we would flag whether the socket is reusable or not and also suppress a close immediately after reading all the data.) As such, this is *not* a functional change; we only tollerate draining a dead connection without noisily logging. Here is a sample where socket #38 is used to request a static file through HTTP GET: 16.150238 [ websrv_poll ] TRC #38: Handling poll events of websrv_poll at index 4 (of 7): 0x1| net/Socket.cpp:558 16.150244 [ websrv_poll ] TRC #38: Read 506 bytes in addition to 0 buffered bytes| net/Socket.hpp:1166 16.150246 [ websrv_poll ] TRC #38: Incoming data buffer 506 bytes, read result: 506, events: 0x1 (not closed)| net/Socket.hpp:1350 16.150270 [ websrv_poll ] INF #38: parseHeader: Client HTTP Request: GET, uri `/browser/ebaa6a6/images/lc_addcitationnote.svg` HTTP/1.1, sz[header 506, content -1], offset 506, chunked false, host: office.nc-presales.top / user-agent: Mozilla/5.0 (X11; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0 / accept: image/avif,image/webp,image/png,image/svg+xml,image/*;q=0.8,*/*;q=0.5 / accept-language: en,en-US;q=0.5 / accept-encoding: gzip, deflate, br, zstd / dnt: 1 / sec-fetch-dest: image / sec-fetch-mode: no-cors / sec-fetch-site: same-origin / sec-gpc: 1 / priority: u=5, i / x-real-ip: 176.145.78.103 / x-forwarded-for: 176.145.78.103| net/Socket.cpp:1340 16.150276 [ websrv_poll ] DBG #38: Handling request: /browser/ebaa6a6/images/lc_addcitationnote.svg, closeConnection false| wsd/ClientRequestDispatcher.cpp:670 16.150287 [ websrv_poll ] TRC Fileserver request: /browser/ebaa6a6/images/lc_addcitationnote.svg| wsd/FileServer.cpp:524 16.150303 [ websrv_poll ] TRC #38: Sending compressed : file [/browser/dist/images/lc_addcitationnote.svg]: Date: Thu, 31 Oct 2024 12:25:16 / Server: COOLWSD HTTP Server 24.04.7.3 / Content-Type: image/svg+xml / Content-Encoding: gzip / Content-Length: 489 / Cache-Control: max-age=11059200 / ETag: "ebaa6a6" / X-Content-Type-Options: nosniff| wsd/FileServer.cpp:732 16.150317 [ websrv_poll ] TRC #38: Wrote 248 bytes of 248 buffered data| net/Socket.hpp:1490 16.150323 [ websrv_poll ] TRC #38: Wrote 489 bytes of 489 buffered data| net/Socket.hpp:1490 16.150326 [ websrv_poll ] DBG #38: Handled request: /browser/ebaa6a6/images/lc_addcitationnote.svg, inBuf[sz 506 -> 0, rm 506], connection open true| wsd/ClientRequestDispatcher.cpp:952 16.150341 [ websrv_poll ] TRC #38: setupPollFds getPollEvents: 0x1| net/Socket.hpp:880 16.201424 [ websrv_poll ] TRC #38: Handling poll events of websrv_poll at index 4 (of 7): 0x0| net/Socket.cpp:558 16.201547 [ websrv_poll ] TRC #38: setupPollFds getPollEvents: 0x1| net/Socket.hpp:880 16.211803 [ websrv_poll ] TRC #38: Handling poll events of websrv_poll at index 4 (of 7): 0x0| net/Socket.cpp:558 16.211810 [ websrv_poll ] TRC #38: setupPollFds getPollEvents: 0x1| net/Socket.hpp:880 27.991710 [ websrv_poll ] TRC #38: Handling poll events of websrv_poll at index 4 (of 7): 0x0| net/Socket.cpp:558 27.991836 [ websrv_poll ] TRC #28: setupPollFds getPollEvents: 0x1| net/Socket.hpp:880 27.991838 [ websrv_poll ] TRC #38: setupPollFds getPollEvents: 0x1| net/Socket.hpp:880 27.991840 [ websrv_poll ] TRC ppoll start, timeoutMicroS: 64000000 size 4| net/Socket.cpp:430 32.993143 [ websrv_poll ] TRC Poll completed with 2 live polls max (64000000us)| net/Socket.cpp:448 32.993196 [ websrv_poll ] TRC #38: Handling poll events of websrv_poll at index 3 (of 4): 0x19| net/Socket.cpp:558 32.993201 [ websrv_poll ] ERR #38: Read failed, have 0 buffered bytes (ECONNRESET: Connection reset by peer)| net/Socket.hpp:1156 32.993210 [ websrv_poll ] TRC #38: Incoming data buffer 0 bytes, read result: -1, events: 0x19 (closed)| net/Socket.hpp:1350 32.993212 [ websrv_poll ] DBG #38: Closed after reading| net/Socket.hpp:1373 32.993213 [ websrv_poll ] TRC #38: Closed. Firing onDisconnect.| net/Socket.hpp:1445 Notice how the socket remained open after serving the file for over 15 seconds, until we logged the error. With this patch, the ERR line will become: 32.993201 [ websrv_poll ] TRC #38: Read failed, because the connection is reset by peer, have 0 buffered bytes: ECONNRESET| net/Socket.hpp:1156 Change-Id: Id440fc25ff899b44bc4db0b267cde84cca78282d Signed-off-by: Ashod Nakashian <ashod.nakashian@collabora.co.uk>
91d0e21
to
9546fc2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good - thanks Ash =)
Now that we re-use connections, we have revealed
a corner-case that causes some noise.
Remote connections might not retain HTTP
connections past their immediate purpose and
rudely reset the connection after a certain
timeout.
Apache has such a short timeout of between 5
and 15 seconds, by default. And, it seems,
HA Proxy is similar. When the connection is
not reused by the client within the timeout,
it is simply closed.
This wakes us up and we get revents with
POLLIN, POLLHUP, and POLLERR (that is, 0x19).
This patch handles this case by recognizing
that we got POLLHUP already and the socket
is closed before doing read(2). That is,
read(2) is expected to fail with ECONNRESET.
We handle this case specifically by avoiding
reporting it as an error. We will still log
an error if we get ECONNRESET while reading,
without having woken up with POLLHUP, even
though technically that is also possible.
(Ideally, we would flag whether the socket
is reusable or not and also suppress a close
immediately after reading all the data.)
As such, this is not a functional change;
we only tollerate draining a dead connection.
Here is a sample where socket #38 is used to
request a static file through HTTP GET:
Notice how the socket remained open after
serving the file for over 15 seconds,
until we logged the error.
With this patch, the ERR line will become:
32.993201 [ websrv_poll ] TRC #38: Read failed, because the connection is reset by peer, have 0 buffered bytes: ECONNRESET| net/Socket.hpp:1156