Is Synapse backing off on everything it should? #5406

aaronraimist · 2019-06-09T16:53:13Z

Continued in a new issue as requested

what makes you think that those do not cause backoff?

I'd be happy to be wrong but my understanding is I should see a log line like

synapse.util.retryutils - 214 - INFO - federation_transaction_transmission_loop-482 - Connection to matrix.th3v01d.net was unsuccessful (<class 'synapse.api.errors.RequestSendFailed'>(Failed to send request: DNSLookupError: DNS lookup failed: no results for hostname lookup: *******.)); backoff now 86400000

if it was backing off from those errors but I don't see any backoff line for those errors I mentioned.

There is a line like

synapse.http.matrixfederationclient - 502 - DEBUG - federation_transaction_transmission_loop-794 - {PUT-O-758} [***********] Waiting 78.30996245200296s before re-sending...

Is that the same thing? It never seems to go above ~85 seconds so I don't think that is the same thing.

The text was updated successfully, but these errors were encountered:

aaronraimist · 2019-06-09T17:09:53Z

Examples from the last hour of logs:

DNSLookupError('no results for hostname lookup: im.srizbi.com.srizbi.com',)

Jun 09 15:52:52 ubuntu matrix-synapse[26430]: 2019-06-09 15:52:52,642 - synapse.http.matrixfederationclient - 482 - WARNING - federation_transaction_transmission_loop-558 - {PUT-O-616} [srizbi.com] Request failed: PUT matrix://srizbi.com/_matrix/federation/v1/send/1560095475160: DNSLookupError('no results for hostname lookup: im.srizbi.com.srizbi.com',)
Jun 09 15:52:52 ubuntu matrix-synapse[26430]: 2019-06-09 15:52:52,643 - synapse.http.matrixfederationclient - 502 - DEBUG - federation_transaction_transmission_loop-558 - {PUT-O-616} [srizbi.com] Waiting 3.3683070020577666s before re-sending...

in the last hour my server has logged [srizbi.com] Request failed... 22 times.

RequestTransmissionFailed:[Error([('SSL routines', 'CONNECT_CR_CERT', 'certificate verify failed')],)]

Jun 09 16:13:53 ubuntu matrix-synapse[29832]: 2019-06-09 16:13:53,382 - synapse.http.matrixfederationclient - 482 - WARNING - federation_transaction_transmission_loop-999 - {PUT-O-1192} [matrix.wina.be] Request failed: PUT matrix://matrix.wina.be/_matrix/federation/v1/send/1560095829267: RequestTransmissionFailed:[Error([('SSL routines', 'CONNECT_CR_CERT', 'certificate verify failed')],)]
Jun 09 16:13:53 ubuntu matrix-synapse[29832]: 2019-06-09 16:13:53,383 - synapse.http.matrixfederationclient - 502 - DEBUG - federation_transaction_transmission_loop-999 - {PUT-O-1192} [matrix.wina.be] Waiting 3.4019895098393143s before re-sending...

my server logged [matrix.wina.be] Request failed: 21 times.

ConnectError('Address not available',)

Jun 09 16:23:54 ubuntu matrix-synapse[10774]: 2019-06-09 16:23:54,831 - synapse.http.matrixfederationclient - 482 - WARNING - federation_transaction_transmission_loop-672 - {PUT-O-371} [productionservers.net] Request failed: PUT matrix://productionservers.net/_matrix/federation/v1/send/1560097335772: ConnectError('Address not available',)
Jun 09 16:23:54 ubuntu matrix-synapse[10774]: 2019-06-09 16:23:54,832 - synapse.http.matrixfederationclient - 502 - DEBUG - federation_transaction_transmission_loop-672 - {PUT-O-371} [productionservers.net] Waiting 50.39709599439885s before re-sending...

Jun 09 16:25:12 ubuntu matrix-synapse[12436]: 2019-06-09 16:25:12,220 - synapse.http.matrixfederationclient - 401 - INFO - federation_transaction_transmission_loop-311 - {PUT-O-170} [productionservers.net] Sending request: PUT matrix://productionservers.net/_matrix/federation/v1/send/1560097491755; timeout 60.000000s

Jun 09 16:25:28 ubuntu matrix-synapse[12436]: 2019-06-09 16:25:28,495 - synapse.http.federation.matrix_federation_agent - 339 - INFO - federation_transaction_transmission_loop-311 - Error fetching https://productionservers.net/.well-known/matrix/server: An error occurred while connecting: 113: Host is unreachable.

My server logged [productionservers.net] Request failed: 27 times in addition to 28 [productionservers.net] Sending request: PUT matrix://productionservers.net/_matrix/federation/v1/send/...; timeout 60.000000ss

405 looks like

Jun 09 15:50:21 ubuntu matrix-synapse[24430]: 2019-06-09 15:50:21,817 - synapse.http.matrixfederationclient - 433 - INFO - federation_transaction_transmission_loop-907 - {PUT-O-860} [decentralizing.world] Got response headers: 405 Not Allowed
Jun 09 15:50:21 ubuntu matrix-synapse[24430]: 2019-06-09 15:50:21,817 - synapse.http.matrixfederationclient - 517 - WARNING - federation_transaction_transmission_loop-907 - {PUT-O-860} [decentralizing.world] Request failed: PUT matrix://decentralizing.world/_matrix/federation/v1/send/1560095300424: HttpResponseException("405: b'Not Allowed'",)
Jun 09 15:50:21 ubuntu matrix-synapse[24430]: 2019-06-09 15:50:21,818 - synapse.federation.sender.transaction_manager - 123 - INFO - federation_transaction_transmission_loop-907 - TX [decentralizing.world] {1560095300424} got 405 response
Jun 09 15:50:21 ubuntu matrix-synapse[24430]: 2019-06-09 15:50:21,818 - synapse.federation.persistence - 89 - DEBUG - federation_transaction_transmission_loop-907 - Invoked 'delivered' with args: self=<synapse.federation.persistence.TransactionActions..., transaction=(Transaction, {'unrecognized_keys': {}, 'transacti..., response_code=405, response_dict=b'<html>\r\n<head><title>405 Not Allowed</title></...
Jun 09 15:50:21 ubuntu matrix-synapse[24430]: 2019-06-09 15:50:21,818 - synapse.federation.sender.transaction_manager - 130 - DEBUG - federation_transaction_transmission_loop-907 - TX [decentralizing.world] {1560095300424} Marked as delivered

If you need more info let me know.

ara4n · 2019-06-10T14:11:13Z

(see #5414, #5413 and ye olde #1404)

richvdh · 2019-06-12T19:35:19Z

Something that may help to understand the logs:

Each federation request is tried several times. There are two potential per-request retry schedules, a 'regular' one, which tries 4 times with about 2s between requests, and a 'long' one, which tries 11 times with about 60s between requests. The long retry schedule is only used for /federation/v1/send requests. The per-request retry schedules are documented here.

Only once those 4 or 11 attempts have failed, is the per-destination backoff incremented. The backoff starts at 10 seconds [edit 2021/02/09: I think this should be 10 minutes], and then increases by roughly a factor of 5 for each failure, up to a max of 24h [edit 2021/02/09: as of #6026, infinite]. The way the backoff works is that, while the backoff is in place, we won't make any more requests. Once the backoff expires, the next request that we would send will be attempted, and will be retried several times following the per-request retry algorithm.

The reason for the two levels of "backoff" is to distinguish transient errors (the destination server was being restarted) from actual "server is down" situations.

It's also important to understand that some requests (including /key/v1/server, as per #5414) are deliberately excluded from the backoff schedule.

I don't claim that algorithm is perfect: I explain it only in the hope that it will give some clues as to what you are seeing in the logs. What I hope to show is that the retry schedule is at least working as designed.

So, I hope that explains why grepping for 'Request failed' does not give the full picture: 28 hits over an hour seems quite reasonable when each request is attempted 10 times. If you still have concerns, we probably need to see more comprehensive logs, rather than just excerpts.

The 405 handling does seem wrong and I have raised #5442.

aaronraimist · 2019-06-12T23:43:59Z

Alright. I can understand trying to send these for a few days but it seems like most of these are coming from servers which no longer exist or don't intend to upgrade their certificates any time soon. To solve #5113 I believe Synapse should stop all requests, not just the ones included in the backoff schedule. So the blacklist system probably does need to be more a more comprehensive change.

I'll see if I can setup something like https://github.com/grafana/loki so we can more easily see how often each server is being sent requests rather than manually digging through days/weeks of logs.

I'll let you decide if you want to close this in favor of #5113.

neilisfragile · 2019-06-13T13:46:25Z

Okay, I'll close this one in favour #5113

neilisfragile closed this as completed Jun 13, 2019

richvdh mentioned this issue Sep 12, 2019

Fix bug in calculating the federation retry backoff period #6025

Merged

aaronraimist mentioned this issue Feb 12, 2020

Unreachable servers lead to poor performance #6895

Open

richvdh mentioned this issue Feb 9, 2021

retry federation requests on 5xx errors #8915

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is Synapse backing off on everything it should? #5406

Is Synapse backing off on everything it should? #5406

aaronraimist commented Jun 9, 2019

aaronraimist commented Jun 9, 2019

ara4n commented Jun 10, 2019

richvdh commented Jun 12, 2019 •

edited

Loading

aaronraimist commented Jun 12, 2019

neilisfragile commented Jun 13, 2019

Is Synapse backing off on everything it should? #5406

Is Synapse backing off on everything it should? #5406

Comments

aaronraimist commented Jun 9, 2019

aaronraimist commented Jun 9, 2019

Examples from the last hour of logs:

ara4n commented Jun 10, 2019

richvdh commented Jun 12, 2019 • edited Loading

aaronraimist commented Jun 12, 2019

neilisfragile commented Jun 13, 2019

richvdh commented Jun 12, 2019 •

edited

Loading