workers stop working after elevated traffic #2738
Comments
apparently this is still happening on 0.27.3 - @erikjohnston did you see anything in this area when painting go-faster stripes on the federation-sender the other day (by parallelising sends?)
fwiw this always correlates with a higher number of events being persisted. I wouldn't be surprised if the federation_sender was spamming get event requests, choking out other traffic.
This was triggered on the ... Unfortunately, the logs don't seem to contain anything useful other than the error given above.
The latest outage (screenshot in the last comment) appears to be directly attributable to Voyager joining HQ (#3337); however, I'm still not sure why the federation sender gave up on life. It appears as though the stream falling behind breaks replication, so the federation sender does nothing. Although the logs don't actually show a 35Hz stream of events being logged, the timing and verbosity of Voyager's join to HQ is oddly convenient.
From that point on, more streams fall behind and eventually the federation sender stops getting traffic.
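(Not Synapse code, but a minimal sketch of what "stops getting traffic" could look like to an outside monitor. The get_master_token / get_sender_position callables and the thresholds are hypothetical placeholders for whatever metrics or admin endpoints a deployment exposes; the idea is simply to alert when the sender's replication position stops keeping up with the master.)

```python
import time


def watch_federation_sender(get_master_token, get_sender_position,
                            poll_interval=30.0, max_lag=1000):
    """Alert when the federation sender's replication position stalls.

    Both callables are placeholders supplied by the deployment (metrics,
    an admin endpoint, log scraping, ...) and return integer stream tokens.
    """
    while True:
        lag = get_master_token() - get_sender_position()
        if lag > max_lag:
            print("federation sender is %d events behind the master; "
                  "it is probably wedged and needs a restart" % lag)
        time.sleep(poll_interval)
```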
Had something similar on my server. Joined 3 largeish rooms in a short period; Events shows 40Hz. Running Synapse develop branch at commit cb8d568.
A bot from my server was invited to a room. Here's one of the stack traces about the federation sender stream failing. The first two error messages were ~1.3MB each (containing the whole latest room state?); maybe something should prevent that?
Same in assorted colours.
the stacktrace has evolved slightly since 2017 as this code has been updated, but the problem is still very much there and biting people today. @tulir's stacktrace in #2738 (comment) is representative.
Certain actions (joining a large room) can require a huge update to be sent over the replication stream to the workers; the huge update is misinterpreted as the workers getting behind and causes an exception. The only way out is to restart the master and all the workers.
This has been fudged around on matrix.org via 1766a5f. I am ... displeased ... to discover this hasn't made it to mainline. @grinapo reported a similar exception.
essentially this happens whenever the homeserver joins a room with more than 10000 state events :/
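(As a rough illustration of that failure mode - a sketch with made-up names, not Synapse's real replication API: if the master serves stream updates in batches capped at a fixed row limit and treats a full batch as "the worker has fallen behind", then a single action that generates more rows than the cap, such as joining a room with a very large state, can never be delivered, and retrying from the same token fails until everything is restarted.)

```python
# Hypothetical sketch of the failure mode, not Synapse's actual code.
ROW_LIMIT = 10_000  # assumed per-batch cap on replication rows


class StreamFellBehindError(Exception):
    """Raised when a full batch is (mis)read as 'the consumer is too far behind'."""


def get_stream_updates(rows, from_token):
    """Return rows with stream_id > from_token, capped at ROW_LIMIT."""
    batch = [r for r in rows if r[0] > from_token][:ROW_LIMIT]
    if len(batch) == ROW_LIMIT:
        # The naive check: a full batch means "too far behind" instead of
        # "there is more to paginate", so an oversized update is fatal.
        raise StreamFellBehindError(
            "stream returned %d rows; assuming the worker has fallen behind"
            % len(batch)
        )
    return batch


if __name__ == "__main__":
    # One room join producing 15,000 state rows in a single transaction:
    rows = [(i, "state event %d" % i) for i in range(1, 15_001)]
    try:
        get_stream_updates(rows, from_token=0)
    except StreamFellBehindError as e:
        # Retrying from the same token can never succeed, matching the
        # symptom here: the sender stays wedged until a full restart.
        print("replication broke:", e)
```

Delivering the oversized update in multiple smaller batches, rather than erroring out on a full batch, removes the dead end.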
hopefully made to go away by #6967. |
Description
There appears to be nothing indicating a problem in the logs; however, there's circumstantial evidence that when synapse receives higher than normal traffic, the federation_sender can stop working (no activity), therefore not federating with remote servers. The federation_sender logs don't seem to have anything out of the ordinary - it just stops sending requests. The main synapse process complains about the events stream falling behind, but this doesn't seem to cause problems until 12 minutes later.

This has happened about 10 times in the past to t2bot.io, and each time the number of events being persisted was elevated (double its normal rate) before the federation_sender stopped working. For t2bot.io "normal" is defined as 2-3Hz. Each time the federation_sender has stopped, the persisted events were going through at >6Hz (this latest being ~6-10Hz).
Here's the timeline for the problem (in UTC): the main process started complaining that the events stream was falling behind. During this time only one error was spat out, repeated every few seconds.
Further, during this time incoming federation was unaffected. Synapse was still processing events and passing them along to appservices. Only outbound federation was affected.
More in-depth logs are available upon request.
Version information