blacklist servers which are down for several days #5113

richvdh · 2019-04-29T16:30:28Z

Why don't we add servers which fail to respond to federation requests for several days to a blacklist, and stop trying?

This could significantly reduce network traffic, CPU usage, and the amount of cruft that gets logged.

We would need to unblock such servers when we receive a valid request from them.

aaronraimist · 2019-04-29T16:48:32Z

See https://gist.github.com/Sharparam/b144e294189d78ee6c73df0e109ee2af for one week of data showing the number of times per day that @Sharparam's server contacted a bunch of dead servers.

See convo starting from https://matrix.to/#/!HsxjoYRFsDtWBgDQPh:matrix.org/$155655145216EFGaa:matrix.sharparam.com?via=matrix.org&via=chat.weho.st&via=hackerspaces.be

richvdh · 2019-04-29T16:57:26Z

Note that we already have an exponential-backoff algorithm, but it tops out at a 24hr retry period.

After the first failed request, we back off for 10 minutes. We then increase the backoff by a factor of between 4 and 7 after each failed request, until we get to 24 hours. The retry intervals are therefore:

10min
~ 55min
~ 5hr
24hr
24hr
...

An easy way to implement this would be to go to 999 years after 24 hours.
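The schedule above can be sketched roughly as follows. This is an illustration only, not Synapse's actual code: the constant and function names are invented for the example, and the multiplier is simply drawn uniformly from the 4 to 7 range described above.

```python
import random

MIN_RETRY_S = 10 * 60        # first backoff after a failure: 10 minutes
MAX_RETRY_S = 24 * 60 * 60   # cap discussed in this issue: 24 hours


def next_retry_interval(current_s, rng=random):
    """Return the backoff to apply after one more completely failed request."""
    if current_s == 0:
        return MIN_RETRY_S
    # Grow by a random factor in [4, 7], but never beyond the cap.
    return min(int(current_s * rng.uniform(4, 7)), MAX_RETRY_S)


def schedule(failures, rng=random):
    """List the successive backoff intervals (in seconds) for N failures."""
    interval, out = 0, []
    for _ in range(failures):
        interval = next_retry_interval(interval, rng)
        out.append(interval)
    return out
```

With this sketch, "going to 999 years after 24 hours" would amount to replacing the cap with a much larger value once it has been reached.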

richvdh · 2019-04-29T17:46:36Z

I should also mention that each request is retried several times (normally 10), with its own exponential-backoff loop. The per-host exponential backoff is only increased after a request fails completely.
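To make the two levels concrete, here is a minimal sketch, assuming a per-request loop that doubles its delay and a per-host counter that only moves once the whole loop has failed. `send_with_retries`, `HostBackoff`, the base delay and the doubling factor are all hypothetical names and values, not taken from Synapse.

```python
import time


def send_with_retries(send, max_attempts=10, base_delay_s=0.5, sleep=time.sleep):
    """Call send() up to max_attempts times; return True on first success."""
    delay = base_delay_s
    for attempt in range(max_attempts):
        try:
            send()
            return True
        except OSError:
            # Connection-level failure: wait and retry, unless that was the last try.
            if attempt < max_attempts - 1:
                sleep(delay)
                delay *= 2
    return False


class HostBackoff:
    """Per-host failure counter, touched only when a request fails completely."""

    def __init__(self):
        self.failures = 0

    def record(self, ok):
        self.failures = 0 if ok else self.failures + 1
```

A caller would run `host.record(send_with_retries(do_request))`, so ten failed attempts on one request count as a single per-host failure.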

mguentner · 2019-04-29T22:56:37Z

The worst-case backoff time should be worked out, so that homeserver operators know how long they have to fix issues. This default needs to be communicated to homeserver admins.

Also, a server being down for a week or so shouldn't result in it being blacklisted unless the blacklisting is easily reversible.
One way to reverse it would be for the blacklisted server to contact the server that blacklisted it.

An observation:
I moved a homeserver I operate to a dedicated machine a few days ago.
Before, it was running on domain.tld; now it runs on synapse.domain.tld.

To do that, I created a .well-known/matrix/server file and changed the SRV record from domain.tld to synapse.domain.tld.
The main domain (without the host part, synapse), however, now appears in the gist referenced in #5113 (comment).
My homeserver happily federates with the network as far as I can see, and the federation tester is also green.
Adding a .well-known/matrix/server file, or changing or adding an SRV record, mustn't result in a ban, even if cached DNS results or GET requests for .well-known/matrix/server cause requests to fail.

Sharparam · 2019-04-29T23:59:50Z

@mguentner What is your server hostname so I can check the logs more closely for errors related to it?

kroeckx · 2019-05-01T20:44:07Z

@Sharparam: My server (roeckx.be) seems to have high numbers in that file, but the server has always been up.

Sharparam · 2019-05-01T21:33:59Z

@kroeckx Your server is responding with 400: Bad Request to federation attempts.

Edit: The latest entry from the dumped logs:

2019-04-29 15:32:44,450 - synapse.http.matrixfederationclient - 472 - WARNING - federation_transaction_transmission_loop-242514- {PUT-O-173776} [roeckx.be] Request failed: PUT matrix://roeckx.be/_matrix/federation/v1/send/1556427433382: HttpResponseException("400: b'Bad Request'",)

kroeckx · 2019-05-01T21:54:14Z

Looking in my logs, I see lots of:

2019-05-01 21:48:21,978 - synapse.access.https.8448 - 233 - INFO - PUT-4120380- 80.245.199.234 - 8448 - Received request: PUT /_matrix/federation/v1/send/1554208235995
2019-05-01 21:48:21,979 - synapse.http.server - 85 - INFO - PUT-4120380- <SynapseRequest at 0x7f4ad26864a8 method='PUT' uri='/_matrix/federation/v1/send/1554208235995' clientproto='HTTP/1.1' site=8448> SynapseError: 400 - Unrecognized request
2019-05-01 21:48:21,980 - synapse.access.https.8448 - 302 - INFO - PUT-4120380- 80.245.199.234 - 8448 - {None} Processed request: 0.002sec/0.001sec (0.004sec, 0.000sec) (0.000sec/0.000sec/0) 59B 400 "PUT /_matrix/federation/v1/send/1554208235995 HTTP/1.1" "Synapse/0.99.3" [0 dbevts]

Why is it generating that error, and why doesn't the federation tester report any error?

Sharparam · 2019-05-01T21:56:16Z

For the record, some discussion related to this in #synapse:matrix.org starting from this message.

Edit: Looks like you might need to update your Synapse instance @kroeckx. You are on 0.99.2 and the latest Synapse is 0.99.3, apparently something regarding federation was changed.

mguentner · 2019-05-02T10:49:41Z

@Sharparam Aha! My instance was still on 0.99.2 as well but now runs on 0.99.3
It would have been a bad reason to ban my homeserver though ;)

Sharparam · 2019-05-02T11:06:10Z

@mguentner Yeah, my analysis only looks at the errors logged by Synapse and that doesn't take into account that special case where there is a failed call immediately followed by a successful one. The blacklisting code would have to take this into account somehow. (This actually might resolve itself if the servers are removed from blacklist as soon as a successful call is made or received.)

richvdh · 2019-05-02T15:32:03Z

> The blacklisting code would have to take this into account somehow.

the current backoff code ignores 400s, ftr.

Sharparam · 2019-05-02T16:49:50Z

I'm not sure that's a solution though; there could be 400 errors that are not resolved by sending again with a slash.

aaronraimist · 2019-06-07T22:44:33Z

Seems like the things that cause a backoff have not been updated in several years https://github.com/matrix-org/synapse/blob/develop/synapse/util/retryutils.py#L177

I just pulled my last 4 hours of logs from 1.0.0rc1:

33% of the WARNING lines are RequestTransmissionFailed:[Error([('SSL routines', 'CONNECT_CR_CERT', 'certificate verify failed')],)]
16% TimeoutError('',)
15% DNSLookupError('no results for hostname lookup...
12% ConnectionRefusedError('Connection refused',)
10% Some form of DNSMismatch
7% HttpResponseException("400: b'Bad Request'",)

and a few ConnectError('Address not available',), HttpResponseException("403: b'Forbidden'",) and HttpResponseException("405: b'Method Not Allowed'",)

It doesn't seem like these cause a backoff, even though I believe most of them should.
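For anyone wanting to produce a similar breakdown, a tally like this can be sketched as below. It is only an illustration: it assumes the "Request failed: ... : ExceptionName(" line shape of the roeckx.be log entry quoted earlier in this thread (reused here as `SAMPLE`), and `tally_failures` and `percentages` are made-up helper names. Other failure messages may be formatted differently and would need their own pattern.

```python
import re
from collections import Counter

# Log line quoted earlier in this thread, used as a sample of the assumed format.
SAMPLE = """2019-04-29 15:32:44,450 - synapse.http.matrixfederationclient - 472 - WARNING - federation_transaction_transmission_loop-242514- {PUT-O-173776} [roeckx.be] Request failed: PUT matrix://roeckx.be/_matrix/federation/v1/send/1556427433382: HttpResponseException("400: b'Bad Request'",)"""

# Capture the exception class name that follows the request description.
_FAILED = re.compile(r"Request failed: .*?: (\w+)\(")


def tally_failures(lines):
    """Count failed requests per exception class name."""
    counts = Counter()
    for line in lines:
        match = _FAILED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def percentages(counts):
    """Convert raw counts into whole-number percentages of the total."""
    total = sum(counts.values())
    if not total:
        return {}
    return {name: round(100 * n / total) for name, n in counts.items()}
```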

richvdh · 2019-06-09T09:40:52Z

@aaronraimist: what makes you think that those do not cause backoff? As far as I know they all should (most of them do not derive from CodeMessagesException so the line you have linked to is not relevant). In any case it's a separate problem, so please open a new issue.

Essentially the intention here is to end up blacklisting servers which never respond to federation requests. Fixes #5113.

richvdh · 2019-09-12T12:00:37Z

#6026

DMRobertson · 2022-11-22T13:09:39Z

> We would need to unblock such servers when we receive a valid request from them.

Did this happen?

richvdh · 2022-11-22T16:08:14Z

yes, valid requests received will reset the backoff.
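The behaviour described here (uncapped backoff on failure, reset on a valid inbound request) can be sketched as follows. This is an illustration under stated assumptions, not Synapse's implementation: `RetryTable`, its method names, the in-memory dict, and the fixed multiplier of 5 are all hypothetical.

```python
class RetryTable:
    """Per-destination backoff state, kept in memory for illustration."""

    def __init__(self, first_interval_s=600, multiplier=5):
        self.first_interval_s = first_interval_s
        self.multiplier = multiplier
        self._state = {}  # host -> (time of last failure, retry interval in seconds)

    def should_attempt(self, host, now):
        """True if we are not currently backing off from this host."""
        if host not in self._state:
            return True
        failed_at, interval = self._state[host]
        return now >= failed_at + interval

    def record_failure(self, host, now):
        """Grow the interval with no upper cap, so dead hosts fade out."""
        _, interval = self._state.get(host, (0, 0))
        interval = interval * self.multiplier if interval else self.first_interval_s
        self._state[host] = (now, interval)

    def record_valid_inbound(self, host):
        """A valid request from the host clears its backoff entirely."""
        self._state.pop(host, None)
```

With no cap, a server that never responds is contacted less and less often, which is effectively a blacklist that lifts itself as soon as the server sends a valid request.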

aaronraimist mentioned this issue Apr 29, 2019

Does Synapse ever stop trying to send events to dead servers? #4979

Closed

neilisfragile added enhancement z-p2 (Deprecated Label) labels Apr 29, 2019

richvdh mentioned this issue Jun 6, 2019

we need to limit the concurrency of outgoing federation requests #5373

Open

aaronraimist mentioned this issue Jun 9, 2019

Is Synapse backing off on everything it should? #5406

Closed

richvdh mentioned this issue Jun 25, 2019

Guideline for log message level #4751

Closed

richvdh mentioned this issue Jul 8, 2019

Add tuning options for federation client backoff #5556

Closed

hawkowl self-assigned this Jul 11, 2019

hawkowl added the z-outbound-federation-meltdown (Deprecated Label) Synapse melting down by trying to talk to too many servers label Jul 11, 2019

richvdh added a commit that referenced this issue Sep 12, 2019

Remove the cap on federation retry interval.

d0a69bb

Essentially the intention here is to end up blacklisting servers which never respond to federation requests. Fixes #5113.

richvdh mentioned this issue Sep 12, 2019

Remove the cap on federation retry interval. #6026

Merged

richvdh added a commit that referenced this issue Sep 12, 2019

Remove the cap on federation retry interval. (#6026)

3d882a7

Essentially the intention here is to end up blacklisting servers which never respond to federation requests. Fixes #5113.

richvdh closed this as completed Sep 12, 2019

evilham mentioned this issue Feb 12, 2020

Unreachable servers lead to poor performance #6895

Open

benvei mentioned this issue Mar 12, 2021

retry device resync doesn't follow exponentially back off algorithm #9603

Open

thegcat mentioned this issue Nov 22, 2022

Reduce the backoff cliff #14513

Closed

matrixbot mentioned this issue Dec 21, 2023

retry device resync doesn't follow exponentially back off algorithm element-hq/synapse#9603

Open
