This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

workers stop working after elevated traffic #2738

Closed
turt2live opened this issue Dec 17, 2017 · 11 comments
Labels
A-Workers Problems related to running Synapse in Worker Mode (or replication) z-bug (Deprecated Label)

Comments

@turt2live
Member

turt2live commented Dec 17, 2017

Description

There appears to be nothing in the logs indicating a problem; however, there is circumstantial evidence that when synapse receives higher-than-normal traffic, the federation_sender can stop working (no activity) and therefore stop federating with remote servers. The federation_sender logs don't seem to have anything out of the ordinary - it just stops sending requests. The main synapse process complains about the events stream falling behind, but that doesn't seem to cause problems until 12 minutes later.

This has happened about 10 times in the past to t2bot.io, and each time the number of events being persisted was elevated (double its normal rate) before the federation_sender stopped working. For t2bot.io, "normal" is 2-3Hz. Each time the federation_sender stopped, events were being persisted at >6Hz (most recently ~6-10Hz).

Here's the timeline for the problem (in UTC):

  • 01:56:09 Synapse crosses the 6Hz persisted events line
  • 03:07:28 The main synapse process started complaining that the events stream was falling behind
  • 03:10:03 Synapse falls below the 6Hz persisted events line
  • 03:19:56 The federation_sender officially stopped working
  • 04:27:22 The entire stack was restarted, restoring federation

During this time the only error spat out was (repeated every few seconds):

homeserver - 2017-12-17 03:13:04,014 - twisted - 131 - CRITICAL - - 
Traceback (most recent call last):
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/synapse/replication/tcp/resource.py", line 164, in on_notifier_poke
    updates, current_token = yield stream.get_updates()
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/synapse/replication/tcp/streams.py", line 169, in get_updates
    updates, current_token = yield self.get_updates_since(self.last_token)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/synapse/replication/tcp/streams.py", line 200, in get_updates_since
    raise Exception("stream %s has fallen behined" % (self.NAME))
Exception: stream current_state_deltas has fallen behined

Further, during this time incoming federation was unaffected. Synapse was still processing events and passing them along to appservices. Only outbound federation was affected.

More in-depth logs are available upon request.

Version information

  • Homeserver: t2bot.io
  • Version: 0.26.0-rc1
  • Install method: pip
  • Platform: container, ubuntu host.
@ara4n
Member

ara4n commented Apr 13, 2018

apparently this is still happening on 0.27.3 - @erikjohnston did you see anything in this area when painting go-faster stripes on the federation_sender the other day (by parallelising sends)?

@turt2live
Member Author

fwiw this always correlates with a higher number of events being persisted. I wouldn't be surprised if the federation_sender was spamming get event requests, choking out other traffic.

@silkeh
Contributor

silkeh commented May 9, 2018

This was triggered on the utwente.io homeserver when it persisted about 120k m.room.member events from matrix.org in <2 minutes (according to synapse_storage_events_persisted_events_sep).

Unfortunately, the logs don't seem to contain anything useful other than the error given above.

@turt2live
Member Author

[screenshot: federation sender activity flatlining]

Judging from the fact that the federation sender flatlines, my original hypothesis of it doing a lot of work doesn't stand.

@turt2live
Member Author

The latest outage (screenshot in the previous comment) appears to be directly attributable to Voyager joining HQ (#3337); however, I'm still not sure why the federation sender gave up on life. It appears that the stream falling behind breaks replication, and therefore the federation sender does nothing.

Although the logs don't actually show a 35Hz stream of events, the timing and verbosity of Voyager's join to HQ are oddly convenient.

  • 10:02:26 - Voyager tries to join HQ
  • 10:02:26 - matrix.org replies 200 OK to /make_join
  • 10:02:54 - matrix.org replies 200 OK to /send_join
  • 10:03:20 - Synapse starts complaining about signature checks and tampered events in HQ
  • 10:03:40 - Synapse tries to store HQ again, and fails (see #3374: duplicate key value violates unique constraint "rooms_pkey" error when rejoining a room)
  • 10:03:41 - Synapse asks matrix.org for an event in HQ
  • 10:03:43 - matrix.org replies 200 OK for the event, Synapse starts rejecting events in HQ (due to visibility)
  • 10:19:45 - Voyager tries to join HQ again (timeout caused it to try again 'later')
  • 10:19:45 - Synapse rejects the join due to an AssertionError (room_id is in the room_queue)
  • 10:29:18 - The join completes
  • 10:29:19 - The caches stream falls behind

From that point on, more streams fall behind and eventually the federation sender stops getting traffic.

@jaywink
Member

jaywink commented Jul 9, 2019

Had something similar on my server. Joined 3 largeish rooms in a short period (#archlinux:matrix.org, #manjaro:matrix.org and #general:privacytools.io) and things became very unstable. CPU was at 100% due to replication_notifier being stuck on full throttle. Didn't resolve until I restarted both Synapse and the event_creator - the only worker I run.

[Screenshot from 2019-07-09 12-59-00]

The events graph shows ~40Hz of m.room.member events related to matrix.org when things got wedged after joining the rooms.

[screenshot: Selection_215]

Running Synapse develop branch at commit cb8d568

@tulir
Member

tulir commented Sep 25, 2019

A bot from my server was invited to #fedora:matrix.org, which eventually made federation sending stop working. When I noticed it, synapse was producing logs at around 300 MB/min, so the original errors are long gone.

Here's one of the stack traces about the federation sender stream failing. The first two error messages were ~1.3 MB each (containing the whole latest room state?); maybe something should prevent that?

2019-09-25 18:58:45,371 - synapse.replication.tcp.resource - 200 - INFO - replication_notifier-2176370- Failed to handle stream events
2019-09-25 18:58:45,372 - synapse.metrics.background_process_metrics - 202 - ERROR - replication_notifier-2176370- Background process 'replication_notifier' threw an exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: [(8595496, '!sWugvEqzNjUsmflyPC:matrix.org', 'm.room.member', '@wayne734:matrix.org', '$1506025804571122MsqQl:matrix.org'), 1309125 more characters...

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: [(8595496, '!sWugvEqzNjUsmflyPC:matrix.org', 'm.room.member', '@wayne734:matrix.org', '$1506025804571122MsqQl:matrix.org'), 1309125 more characters...

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: <generator object merge at 0x7f9155c1cf48>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/synapse/metrics/background_process_metrics.py", line 200, in run
    yield func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/local/lib/python3.7/dist-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.7/dist-packages/synapse/replication/tcp/resource.py", line 198, in _run_notifier_loop
    updates, current_token = yield stream.get_updates()
  File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/local/lib/python3.7/dist-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.7/dist-packages/synapse/replication/tcp/streams/_base.py", line 158, in get_updates
    updates, current_token = yield self.get_updates_since(self.last_token)
  File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.7/dist-packages/synapse/replication/tcp/streams/_base.py", line 199, in get_updates_since
    raise Exception("stream %s has fallen behind" % (self.NAME))
Exception: stream events has fallen behind
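
On the "maybe something should prevent that?" point above: a minimal, purely illustrative sketch (not Synapse's actual logging setup) of a formatter that caps oversized records, so a StopIteration carrying an entire room's state can't produce multi-megabyte log lines:

import logging

class TruncatingFormatter(logging.Formatter):
    """Hypothetical formatter that caps the fully formatted record,
    including any attached traceback, at max_len characters."""

    def __init__(self, fmt=None, max_len=10000):
        super().__init__(fmt)
        self.max_len = max_len

    def format(self, record):
        text = super().format(record)
        if len(text) > self.max_len:
            text = text[:self.max_len] + "... [truncated]"
        return text

handler = logging.StreamHandler()
handler.setFormatter(TruncatingFormatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s"))
logging.getLogger().addHandler(handler)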

@grinapo

grinapo commented Nov 29, 2019

Same in assorted colours:

2019-11-29 07:01:30,589 - synapse.replication.tcp.protocol - 532 - ERROR - replication-REPLICATE-events-4534- [synapse.app.synchrotron-iFLjN] Failed to handle REPLICATE command
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: [(1388657, '!yhqiEdqNjyPbxtUjzm:matrix.org', 'm.room.member', '@wking:matrix.org', '$15009018172787008uGnlB:matrix.org'),.......


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: [(1388657, '!yhqiEdqNjyPbxtUjzm:matrix.org', 'm.room.member', '@wking:matrix.org', '$15009018172787008uGnlB:matrix.org'), (1388657, '!yhqiEdqNjyPbxtUjzm:matrix.org', 'm.room.member', '@wkennington:matrix.org', '$1484152256110686udaKk:matrix.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: <generator object merge at 0x7effd5f59660>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/synapse/replication/tcp/protocol.py", line 484, in subscribe_to_stream
    stream_name, token
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3/dist-packages/synapse/util/metrics.py", line 71, in measured_func
    r = yield func(self, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/lib/python3/dist-packages/synapse/replication/tcp/streams/_base.py", line 199, in get_updates_since
    raise Exception("stream %s has fallen behind" % (self.NAME))
Exception: stream events has fallen behind
2019-11-29 07:01:30,629 - synapse.replication.tcp.protocol - 252 - WARNING - replication-REPLICATE-events-4534- [synapse.app.synchrotron-iFLjN] Closing connection
2019-11-29 07:01:30,629 - synapse.replication.tcp.protocol - 373 - INFO - replication-REPLICATE-events-4534- [synapse.app.synchrotron-iFLjN] Connection was closed
2019-11-29 07:01:30,643 - synapse.replication.tcp.protocol - 354 - INFO - - [synapse.app.synchrotron-iFLjN] Replication connection closed: <twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>
2019-11-29 07:01:30,643 - synapse.replication.tcp.protocol - 373 - INFO - - [synapse.app.synchrotron-iFLjN] Connection was closed

@richvdh changed the title from "federation_sender stops working after elevated traffic" to "workers stop working after elevated traffic" on Feb 11, 2020
@richvdh
Member

richvdh commented Feb 11, 2020

the stacktrace has evolved slightly since 2017 as this code has been updated, but the problem is still very much there and biting people today.

@tulir's stacktrace in #2738 (comment) is representative; the hallmarks are "Background process 'replication_notifier' threw an exception", with a huge stacktrace culminating in "Exception: stream events has fallen behind".

Certain actions (joining a large room) can require a huge update to be sent over the replication stream to the workers; the huge update is misinterpreted as the workers getting behind and causes an exception. The only way out is to restart the master and all the workers.
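
For illustration only, here is a minimal sketch of that failure mode, assuming a fixed row limit inside the stream's get_updates_since (MAX_EVENTS_BEHIND, update_function and the Stream class here are assumptions based on the traceback above, not a quote of Synapse's actual code):

MAX_EVENTS_BEHIND = 10000  # assumed cap, matching the "10000 state events" remark below

class Stream(object):
    """Sketch of a replication stream with a hard backlog limit."""

    def __init__(self, name, update_function):
        self.NAME = name
        # update_function(from_token, upto_token, limit) returns rows strictly
        # after from_token, up to and including upto_token, at most `limit` of them.
        self.update_function = update_function
        self.last_token = 0

    def get_updates_since(self, from_token, upto_token):
        # Fetch one row more than the cap so an overflow is detectable.
        rows = self.update_function(from_token, upto_token, limit=MAX_EVENTS_BEHIND + 1)
        if len(rows) > MAX_EVENTS_BEHIND:
            # A single oversized batch (e.g. the state of a large, freshly
            # joined room) trips this even when the worker is keeping up.
            raise Exception("stream %s has fallen behind" % (self.NAME,))
        self.last_token = upto_token
        return rows, upto_token

The point is that the check conflates "the worker is lagging" with "one update happens to be enormous".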

This has been fudged around on matrix.org via 1766a5f. I am ... displeased ... to discover this hasn't made it to mainline.

@grinapo's similar exception with Failed to handle REPLICATE command is due to workers getting behind; the fact they don't recover is #4963.

@richvdh
Member

richvdh commented Feb 11, 2020

essentially this happens whenever the homeserver joins a room with more than 10000 state events :/
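
One possible mitigation, sketched here purely as an illustration (not necessarily what #6967 does), is to drain an oversized backlog in bounded batches and advance the token incrementally rather than raising:

def drain_updates(update_function, from_token, upto_token, batch_size=1000):
    """Hypothetical helper: page through the backlog in bounded chunks
    instead of treating a large backlog as a fatal "fallen behind" error.

    Assumes each row is a tuple whose first element is its stream token,
    as in the StopIteration payloads quoted above.
    """
    rows = []
    token = from_token
    while token < upto_token:
        batch = update_function(token, upto_token, limit=batch_size)
        if not batch:
            break
        rows.extend(batch)
        token = batch[-1][0]  # advance past the last row we received
    return rows, upto_token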

@richvdh
Member

richvdh commented Feb 26, 2020

hopefully made to go away by #6967.

@richvdh closed this as completed on Feb 26, 2020
@MadLittleMods added the A-Workers label on Dec 23, 2022