Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wsd: reduce noise while draining with ECONNRESET #10400

Merged
merged 1 commit into from
Nov 2, 2024

Conversation

Ashod
Copy link
Contributor

@Ashod Ashod commented Nov 2, 2024

Now that we re-use connections, we have revealed
a corner-case that causes some noise.

Remote connections might not retain HTTP
connections past their immediate purpose and
rudely reset the connection after a certain
timeout.

Apache has such a short timeout of between 5
and 15 seconds, by default. And, it seems,
HA Proxy is similar. When the connection is
not reused by the client within the timeout,
it is simply closed.

This wakes us up and we get revents with
POLLIN, POLLHUP, and POLLERR (that is, 0x19).

This patch handles this case by recognizing
that we got POLLHUP already and the socket
is closed before doing read(2). That is,
read(2) is expected to fail with ECONNRESET.

We handle this case specifically by avoiding
reporting it as an error. We will still log
an error if we get ECONNRESET while reading,
without having woken up with POLLHUP, even
though technically that is also possible.
(Ideally, we would flag whether the socket
is reusable or not and also suppress a close
immediately after reading all the data.)

As such, this is not a functional change;
we only tollerate draining a dead connection.

Here is a sample where socket #38 is used to
request a static file through HTTP GET:

16.150238 [ websrv_poll ] TRC  #38: Handling poll events of websrv_poll at index 4 (of 7): 0x1| net/Socket.cpp:558
16.150244 [ websrv_poll ] TRC  #38: Read 506 bytes in addition to 0 buffered bytes| net/Socket.hpp:1166
16.150246 [ websrv_poll ] TRC  #38: Incoming data buffer 506 bytes, read result: 506, events: 0x1 (not closed)| net/Socket.hpp:1350
16.150270 [ websrv_poll ] INF  #38: parseHeader: Client HTTP Request: GET, uri `/browser/ebaa6a6/images/lc_addcitationnote.svg` HTTP/1.1, sz[header 506, content -1], offset 506, chunked false, host: office.nc-presales.top / user-agent: Mozilla/5.0 (X11; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0 / accept: image/avif,image/webp,image/png,image/svg+xml,image/*;q=0.8,*/*;q=0.5 / accept-language: en,en-US;q=0.5 / accept-encoding: gzip, deflate, br, zstd / dnt: 1 / sec-fetch-dest: image / sec-fetch-mode: no-cors / sec-fetch-site: same-origin / sec-gpc: 1 / priority: u=5, i / x-real-ip: 176.145.78.103 / x-forwarded-for: 176.145.78.103| net/Socket.cpp:1340
16.150276 [ websrv_poll ] DBG  #38: Handling request: /browser/ebaa6a6/images/lc_addcitationnote.svg, closeConnection false| wsd/ClientRequestDispatcher.cpp:670
16.150287 [ websrv_poll ] TRC  Fileserver request: /browser/ebaa6a6/images/lc_addcitationnote.svg| wsd/FileServer.cpp:524
16.150303 [ websrv_poll ] TRC  #38: Sending compressed : file [/browser/dist/images/lc_addcitationnote.svg]: Date: Thu, 31 Oct 2024 12:25:16 / Server: COOLWSD HTTP Server 24.04.7.3 / Content-Type: image/svg+xml / Content-Encoding: gzip / Content-Length: 489 / Cache-Control: max-age=11059200 / ETag: "ebaa6a6" / X-Content-Type-Options: nosniff| wsd/FileServer.cpp:732
16.150317 [ websrv_poll ] TRC  #38: Wrote 248 bytes of 248 buffered data| net/Socket.hpp:1490
16.150323 [ websrv_poll ] TRC  #38: Wrote 489 bytes of 489 buffered data| net/Socket.hpp:1490
16.150326 [ websrv_poll ] DBG  #38: Handled request: /browser/ebaa6a6/images/lc_addcitationnote.svg, inBuf[sz 506 -> 0, rm 506], connection open true| wsd/ClientRequestDispatcher.cpp:952
16.150341 [ websrv_poll ] TRC  #38: setupPollFds getPollEvents: 0x1| net/Socket.hpp:880
16.201424 [ websrv_poll ] TRC  #38: Handling poll events of websrv_poll at index 4 (of 7): 0x0| net/Socket.cpp:558
16.201547 [ websrv_poll ] TRC  #38: setupPollFds getPollEvents: 0x1| net/Socket.hpp:880
16.211803 [ websrv_poll ] TRC  #38: Handling poll events of websrv_poll at index 4 (of 7): 0x0| net/Socket.cpp:558
16.211810 [ websrv_poll ] TRC  #38: setupPollFds getPollEvents: 0x1| net/Socket.hpp:880
27.991710 [ websrv_poll ] TRC  #38: Handling poll events of websrv_poll at index 4 (of 7): 0x0| net/Socket.cpp:558
27.991836 [ websrv_poll ] TRC  #28: setupPollFds getPollEvents: 0x1| net/Socket.hpp:880
27.991838 [ websrv_poll ] TRC  #38: setupPollFds getPollEvents: 0x1| net/Socket.hpp:880
27.991840 [ websrv_poll ] TRC  ppoll start, timeoutMicroS: 64000000 size 4| net/Socket.cpp:430
32.993143 [ websrv_poll ] TRC  Poll completed with 2 live polls max (64000000us)| net/Socket.cpp:448
32.993196 [ websrv_poll ] TRC  #38: Handling poll events of websrv_poll at index 3 (of 4): 0x19| net/Socket.cpp:558
32.993201 [ websrv_poll ] ERR  #38: Read failed, have 0 buffered bytes (ECONNRESET: Connection reset by peer)| net/Socket.hpp:1156
32.993210 [ websrv_poll ] TRC  #38: Incoming data buffer 0 bytes, read result: -1, events: 0x19 (closed)| net/Socket.hpp:1350
32.993212 [ websrv_poll ] DBG  #38: Closed after reading| net/Socket.hpp:1373
32.993213 [ websrv_poll ] TRC  #38: Closed. Firing onDisconnect.| net/Socket.hpp:1445

Notice how the socket remained open after
serving the file for over 15 seconds,
until we logged the error.

With this patch, the ERR line will become:
32.993201 [ websrv_poll ] TRC #38: Read failed, because the connection is reset by peer, have 0 buffered bytes: ECONNRESET| net/Socket.hpp:1156

@Ashod Ashod requested review from mmeeks, sgothel and caolanm November 2, 2024 13:10
Copy link
Contributor

@mmeeks mmeeks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think we can simplify =) but otherwise good! thanks.

net/Socket.hpp Outdated
LOG_SYS_ERRNO(last_errno, "Read failed, have " << _inBuffer.size()
<< " buffered bytes");
{
if (closed && last_errno == ECONNRESET)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really not convinced by the complexity of the extra parameter - and there is of course a race between the poll returning and the socket closing - so this will not be a reliable catch here.

We will be able to see the socket closed with a HUP in other logging - to diagnose why.

I would be inclined to just LOGA_INF() or somesuch for all ECONNRESETs in this case - which should simplify the patch a lot.

Is that do-able ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, even better, since we can have the same effect by checking the buffer.

That is, if there is nothing in the buffer, then the connection was idle and the peer recycled the socket.

I also reduced to trace logging to match the others in the function.

Now that we re-use connections, we have revealed
a corner-case that causes some noise.

Remote connections might not retain HTTP
connections past their immediate purpose and
rudely reset the connection after a certain
timeout.

Apache has such a short timeout of between 5
and 15 seconds, by default. And, it seems,
HA Proxy is similar. When the connection is
not reused by the client within the timeout,
it is simply closed.

This wakes us up and we get revents with
POLLIN, POLLHUP, and POLLERR (that is, 0x19).

This patch handles this case by recognizing
that we often get POLLHUP and the socket
is closed before doing read(2). That is,
read(2) *is* expected to fail with ECONNRESET.

We handle this case specifically by avoiding
reporting it as an error. We will still log
an error if we get ECONNRESET while reading,
when we haven't been idle, even though
technically that is also possible.
(Ideally, we would flag whether the socket
is reusable or not and also suppress a close
immediately after reading all the data.)

As such, this is *not* a functional change;
we only tollerate draining a dead connection
without noisily logging.

Here is a sample where socket #38 is used to
request a static file through HTTP GET:

16.150238 [ websrv_poll ] TRC  #38: Handling poll events of websrv_poll at index 4 (of 7): 0x1| net/Socket.cpp:558
16.150244 [ websrv_poll ] TRC  #38: Read 506 bytes in addition to 0 buffered bytes| net/Socket.hpp:1166
16.150246 [ websrv_poll ] TRC  #38: Incoming data buffer 506 bytes, read result: 506, events: 0x1 (not closed)| net/Socket.hpp:1350
16.150270 [ websrv_poll ] INF  #38: parseHeader: Client HTTP Request: GET, uri `/browser/ebaa6a6/images/lc_addcitationnote.svg` HTTP/1.1, sz[header 506, content -1], offset 506, chunked false, host: office.nc-presales.top / user-agent: Mozilla/5.0 (X11; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0 / accept: image/avif,image/webp,image/png,image/svg+xml,image/*;q=0.8,*/*;q=0.5 / accept-language: en,en-US;q=0.5 / accept-encoding: gzip, deflate, br, zstd / dnt: 1 / sec-fetch-dest: image / sec-fetch-mode: no-cors / sec-fetch-site: same-origin / sec-gpc: 1 / priority: u=5, i / x-real-ip: 176.145.78.103 / x-forwarded-for: 176.145.78.103| net/Socket.cpp:1340
16.150276 [ websrv_poll ] DBG  #38: Handling request: /browser/ebaa6a6/images/lc_addcitationnote.svg, closeConnection false| wsd/ClientRequestDispatcher.cpp:670
16.150287 [ websrv_poll ] TRC  Fileserver request: /browser/ebaa6a6/images/lc_addcitationnote.svg| wsd/FileServer.cpp:524
16.150303 [ websrv_poll ] TRC  #38: Sending compressed : file [/browser/dist/images/lc_addcitationnote.svg]: Date: Thu, 31 Oct 2024 12:25:16 / Server: COOLWSD HTTP Server 24.04.7.3 / Content-Type: image/svg+xml / Content-Encoding: gzip / Content-Length: 489 / Cache-Control: max-age=11059200 / ETag: "ebaa6a6" / X-Content-Type-Options: nosniff| wsd/FileServer.cpp:732
16.150317 [ websrv_poll ] TRC  #38: Wrote 248 bytes of 248 buffered data| net/Socket.hpp:1490
16.150323 [ websrv_poll ] TRC  #38: Wrote 489 bytes of 489 buffered data| net/Socket.hpp:1490
16.150326 [ websrv_poll ] DBG  #38: Handled request: /browser/ebaa6a6/images/lc_addcitationnote.svg, inBuf[sz 506 -> 0, rm 506], connection open true| wsd/ClientRequestDispatcher.cpp:952
16.150341 [ websrv_poll ] TRC  #38: setupPollFds getPollEvents: 0x1| net/Socket.hpp:880
16.201424 [ websrv_poll ] TRC  #38: Handling poll events of websrv_poll at index 4 (of 7): 0x0| net/Socket.cpp:558
16.201547 [ websrv_poll ] TRC  #38: setupPollFds getPollEvents: 0x1| net/Socket.hpp:880
16.211803 [ websrv_poll ] TRC  #38: Handling poll events of websrv_poll at index 4 (of 7): 0x0| net/Socket.cpp:558
16.211810 [ websrv_poll ] TRC  #38: setupPollFds getPollEvents: 0x1| net/Socket.hpp:880
27.991710 [ websrv_poll ] TRC  #38: Handling poll events of websrv_poll at index 4 (of 7): 0x0| net/Socket.cpp:558
27.991836 [ websrv_poll ] TRC  #28: setupPollFds getPollEvents: 0x1| net/Socket.hpp:880
27.991838 [ websrv_poll ] TRC  #38: setupPollFds getPollEvents: 0x1| net/Socket.hpp:880
27.991840 [ websrv_poll ] TRC  ppoll start, timeoutMicroS: 64000000 size 4| net/Socket.cpp:430
32.993143 [ websrv_poll ] TRC  Poll completed with 2 live polls max (64000000us)| net/Socket.cpp:448
32.993196 [ websrv_poll ] TRC  #38: Handling poll events of websrv_poll at index 3 (of 4): 0x19| net/Socket.cpp:558
32.993201 [ websrv_poll ] ERR  #38: Read failed, have 0 buffered bytes (ECONNRESET: Connection reset by peer)| net/Socket.hpp:1156
32.993210 [ websrv_poll ] TRC  #38: Incoming data buffer 0 bytes, read result: -1, events: 0x19 (closed)| net/Socket.hpp:1350
32.993212 [ websrv_poll ] DBG  #38: Closed after reading| net/Socket.hpp:1373
32.993213 [ websrv_poll ] TRC  #38: Closed. Firing onDisconnect.| net/Socket.hpp:1445

Notice how the socket remained open after
serving the file for over 15 seconds,
until we logged the error.

With this patch, the ERR line will become:
32.993201 [ websrv_poll ] TRC  #38: Read failed, because the connection is reset by peer, have 0 buffered bytes: ECONNRESET| net/Socket.hpp:1156

Change-Id: Id440fc25ff899b44bc4db0b267cde84cca78282d
Signed-off-by: Ashod Nakashian <ashod.nakashian@collabora.co.uk>
@Ashod Ashod force-pushed the private/ash/econnreset branch from 91d0e21 to 9546fc2 Compare November 2, 2024 15:22
Copy link
Contributor

@mmeeks mmeeks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - thanks Ash =)

@mmeeks mmeeks merged commit 6bcced1 into master Nov 2, 2024
13 checks passed
@mmeeks mmeeks deleted the private/ash/econnreset branch November 2, 2024 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants