Is Synapse backing off on everything it should? #5406
Comments
Examples from the last hour of logs:
in the last hour my server has logged
my server logged
My server logged 405 looks like
If you need more info, let me know.
Something that may help to understand the logs: each federation request is tried several times. There are two potential per-request retry schedules: a 'regular' one, which tries 4 times with about 2s between requests, and a 'long' one, which tries 11 times with about 60s between requests. The long retry schedule is only used for

Only once those 4 or 11 attempts have failed is the per-destination backoff incremented. The backoff starts at 10 seconds [edit 2021/02/09: I think this should be 10 minutes], and then increases by roughly a factor of 5 for each failure, up to a max of 24h [edit 2021/02/09: as of #6026, infinite].

The way the backoff works is that, while the backoff is in place, we won't make any more requests. Once the backoff expires, the next request that we would send will be attempted, and will be retried several times following the per-request retry algorithm.

The reason for the two levels of "backoff" is to distinguish transient errors (the destination server was being restarted) from actual "server is down" situations.

It's also important to understand that some requests (including /key/v1/server, as per #5414) are deliberately excluded from the backoff schedule.

I don't claim that the algorithm is perfect: I explain it only in the hope that it will give some clues as to what you are seeing in the logs. What I hope to show is that the retry schedule is at least working as designed.

So, I hope that explains why grepping for 'Request failed' does not give the full picture: 28 hits over an hour seems quite reasonable when each request is attempted 10 times.

If you still have concerns, we probably need to see more comprehensive logs, rather than just excerpts.

The 405 handling does seem wrong and I have raised #5442.
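To make the two levels easier to picture, here is a minimal Python sketch of the scheme as described in this comment. It is not Synapse's actual code: the names `try_request`, `DestinationBackoff`, `REGULAR_SCHEDULE` and `LONG_SCHEDULE` are made up for illustration, the 10-minute starting interval follows the bracketed edit note rather than a confirmed value, and the "roughly a factor of 5" and 24h cap are taken at face value.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# (attempts, seconds between attempts), as described in the comment above.
REGULAR_SCHEDULE = (4, 2.0)
LONG_SCHEDULE = (11, 60.0)


def try_request(
    send: Callable[[], bool],
    attempts: int,
    delay_s: float,
    sleep: Callable[[float], None],
) -> Tuple[bool, int]:
    """Per-request retries: call send() up to `attempts` times.

    Returns (succeeded, attempts_made). Each failed attempt would
    typically produce its own 'Request failed' log line.
    """
    for n in range(1, attempts + 1):
        if send():
            return True, n
        if n < attempts:
            sleep(delay_s)  # wait only between attempts, not after the last
    return False, attempts


@dataclass
class DestinationBackoff:
    """Per-destination backoff, bumped only after a whole request fails."""

    min_interval_s: float = 10 * 60       # assumption: 10 minutes, per the edit note
    multiplier: float = 5.0               # "roughly a factor of 5"
    max_interval_s: float = 24 * 60 * 60  # 24h cap (removed later, per #6026)
    interval_s: float = 0.0               # 0 means no backoff in place

    def record_failure(self) -> float:
        """Start or extend the backoff; return the new interval in seconds."""
        if self.interval_s == 0:
            self.interval_s = self.min_interval_s
        else:
            self.interval_s = min(
                self.interval_s * self.multiplier, self.max_interval_s
            )
        return self.interval_s

    def record_success(self) -> None:
        """Clear the backoff once the destination responds again."""
        self.interval_s = 0.0
```

Under these assumptions, a destination that stays down produces one burst of 4 (or 11) failed attempts each time a backoff window expires, and the window grows between bursts. That is why a count of individual failure lines overstates how many distinct requests were attempted.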
Alright. I can understand trying to send these for a few days, but it seems like most of these are coming from servers which no longer exist or don't intend to upgrade their certificates any time soon. To solve #5113, I believe Synapse should stop all requests, not just the ones included in the backoff schedule. So the blacklist system probably does need to be a more comprehensive change. I'll see if I can set up something like https://github.com/grafana/loki so we can more easily see how often each server is being sent requests, rather than manually digging through days/weeks of logs. I'll let you decide if you want to close this in favor of #5113.
Okay, I'll close this one in favour of #5113.
Continued in a new issue as requested
See #5113 (comment) and #5113 (comment)
I'd be happy to be wrong but my understanding is I should see a log line like
if it was backing off from those errors but I don't see any backoff line for those errors I mentioned.
There is a line like
Is that the same thing? It never seems to go above ~85 seconds so I don't think that is the same thing.