This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

workers stop working after elevated traffic #2738

Closed
turt2live opened this issue Dec 17, 2017 · 11 comments
Labels
A-Workers Problems related to running Synapse in Worker Mode (or replication) z-bug (Deprecated Label)

Comments

@turt2live
Member

turt2live commented Dec 17, 2017

Description

There appears to be nothing in the logs indicating a problem; however, there is circumstantial evidence that when synapse receives higher-than-normal traffic, the federation_sender can stop working (no activity) and therefore stop federating with remote servers. The federation_sender logs don't seem to have anything out of the ordinary - it just stops sending requests. The main synapse process complains about the events stream falling behind, but that doesn't seem to cause problems until 12 minutes later.

This has happened about 10 times in the past to t2bot.io, and each time the number of events being persisted was elevated (double its normal rate) before the federation_sender stopped working. For t2bot.io, "normal" is 2-3Hz. Each time the federation_sender stopped, events were being persisted at >6Hz (most recently ~6-10Hz).

Here's the timeline for the problem (in UTC):

  • 01:56:09 Synapse crosses the 6Hz persisted events line
  • 03:07:28 The main synapse process started complaining that the events stream was falling behind
  • 03:10:03 Synapse falls below the 6Hz persisted events line
  • 03:19:56 The federation_sender officially stopped working
  • 04:27:22 The entire stack was restarted, restoring federation

During this time the only error spat out was (repeated every few seconds):

homeserver - 2017-12-17 03:13:04,014 - twisted - 131 - CRITICAL - - 
Traceback (most recent call last):
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/synapse/replication/tcp/resource.py", line 164, in on_notifier_poke
    updates, current_token = yield stream.get_updates()
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/synapse/replication/tcp/streams.py", line 169, in get_updates
    updates, current_token = yield self.get_updates_since(self.last_token)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/home/matrix/.synapse/local/lib/python2.7/site-packages/synapse/replication/tcp/streams.py", line 200, in get_updates_since
    raise Exception("stream %s has fallen behined" % (self.NAME))
Exception: stream current_state_deltas has fallen behined

Further, during this time incoming federation was unaffected. Synapse was still processing events and passing them along to appservices. Only outbound federation was affected.

More in-depth logs are available upon request.

Version information

  • Homeserver: t2bot.io
  • Version: 0.26.0-rc1
  • Install method: pip
  • Platform: container, ubuntu host.
@ara4n
Member

ara4n commented Apr 13, 2018

apparently this is still happening on 0.27.3 - @erikjohnston did you see anything in this area when painting go-faster stripes on the federation_sender the other day (by parallelising sends)?

@turt2live
Member Author

fwiw this always correlates with a higher number of events being persisted. I wouldn't be surprised if the federation_sender was spamming get event requests, choking out other traffic.

@silkeh
Contributor

silkeh commented May 9, 2018

This was triggered on the utwente.io homeserver when it persisted about 120k m.room.member events from matrix.org in <2 minutes (according to synapse_storage_events_persisted_events_sep).

Unfortunately, the logs don't seem to contain anything useful other than the error given above.

@turt2live
Member Author

[screenshot: federation sender activity flatlining]

Judging from the fact that the federation sender flatlines, my original hypothesis of it doing a lot of work doesn't stand.

@turt2live
Member Author

The latest outage (screenshot in the previous comment) appears to be directly attributable to Voyager joining HQ (#3337); however, I'm still not sure why the federation sender gave up on life. It appears that the stream falling behind breaks replication, and therefore the federation sender does nothing.

Although the logs don't actually show a 35Hz stream of events, the timing and verbosity of Voyager's join to HQ are oddly convenient.

  • 10:02:26 - Voyager tries to join HQ
  • 10:02:26 - matrix.org replies 200 OK to /make_join
  • 10:02:54 - matrix.org replies 200 OK to /send_join
  • 10:03:20 - Synapse starts complaining about signature checks and tampered events in HQ
  • 10:03:40 - Synapse tries to store HQ again, and fails (see #3374: duplicate key value violates unique constraint "rooms_pkey" error when rejoining a room)
  • 10:03:41 - Synapse asks matrix.org for an event in HQ
  • 10:03:43 - matrix.org replies 200 OK for the event, Synapse starts rejecting events in HQ (due to visibility)
  • 10:19:45 - Voyager tries to join HQ again (timeout caused it to try again 'later')
  • 10:19:45 - Synapse rejects the join due to an AssertionError (room_id is in the room_queue)
  • 10:29:18 - The join completes
  • 10:29:19 - The caches stream falls behind

From that point on, more streams fall behind and eventually the federation sender stops getting traffic.

@jaywink
Member

jaywink commented Jul 9, 2019

Had something similar on my server. Joined 3 largeish rooms in a short period (#archlinux:matrix.org, #manjaro:matrix.org and #general:privacytools.io) and things became very unstable. CPU was at 100% due to replication_notifier being stuck on full throttle. Didn't resolve until I restarted both Synapse and the event_creator - the only worker I run.

[Screenshot from 2019-07-09 12-59-00]

The events graph shows ~40Hz of m.room.member events related to matrix.org when things got wedged after joining the rooms.

[screenshot: Selection_215]

Running Synapse develop branch at commit cb8d568

@tulir
Member

tulir commented Sep 25, 2019

A bot from my server was invited to #fedora:matrix.org, which eventually made federation sending stop working. When I noticed it, synapse was producing logs at around 300 MB/min, so the original errors are long gone.

Here's one of the stack traces about the federation sender stream failing. The first two error messages were ~1.3 MB each (containing the whole latest room state?); maybe something should prevent that?

2019-09-25 18:58:45,371 - synapse.replication.tcp.resource - 200 - INFO - replication_notifier-2176370- Failed to handle stream events
2019-09-25 18:58:45,372 - synapse.metrics.background_process_metrics - 202 - ERROR - replication_notifier-2176370- Background process 'replication_notifier' threw an exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: [(8595496, '!sWugvEqzNjUsmflyPC:matrix.org', 'm.room.member', '@wayne734:matrix.org', '$1506025804571122MsqQl:matrix.org'), 1309125 more characters...

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: [(8595496, '!sWugvEqzNjUsmflyPC:matrix.org', 'm.room.member', '@wayne734:matrix.org', '$1506025804571122MsqQl:matrix.org'), 1309125 more characters...

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: <generator object merge at 0x7f9155c1cf48>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/synapse/metrics/background_process_metrics.py", line 200, in run
    yield func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/local/lib/python3.7/dist-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.7/dist-packages/synapse/replication/tcp/resource.py", line 198, in _run_notifier_loop
    updates, current_token = yield stream.get_updates()
  File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/local/lib/python3.7/dist-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.7/dist-packages/synapse/replication/tcp/streams/_base.py", line 158, in get_updates
    updates, current_token = yield self.get_updates_since(self.last_token)
  File "/usr/local/lib/python3.7/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python3.7/dist-packages/synapse/replication/tcp/streams/_base.py", line 199, in get_updates_since
    raise Exception("stream %s has fallen behind" % (self.NAME))
Exception: stream events has fallen behind
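
On the "maybe something should prevent that?" point above: a minimal, purely illustrative sketch (not Synapse's actual logging setup) of a formatter that caps oversized records, so a StopIteration carrying an entire room's state can't produce multi-megabyte log lines:

import logging

class TruncatingFormatter(logging.Formatter):
    """Hypothetical formatter that caps the fully formatted record,
    including any attached traceback, at max_len characters."""

    def __init__(self, fmt=None, max_len=10000):
        super().__init__(fmt)
        self.max_len = max_len

    def format(self, record):
        text = super().format(record)
        if len(text) > self.max_len:
            text = text[:self.max_len] + "... [truncated]"
        return text

handler = logging.StreamHandler()
handler.setFormatter(TruncatingFormatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s"))
logging.getLogger().addHandler(handler)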

@grinapo

grinapo commented Nov 29, 2019

Same in assorted colours:

2019-11-29 07:01:30,589 - synapse.replication.tcp.protocol - 532 - ERROR - replication-REPLICATE-events-4534- [synapse.app.synchrotron-iFLjN] Failed to handle REPLICATE command
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: [(1388657, '!yhqiEdqNjyPbxtUjzm:matrix.org', 'm.room.member', '@wking:matrix.org', '$15009018172787008uGnlB:matrix.org'),.......


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: [(1388657, '!yhqiEdqNjyPbxtUjzm:matrix.org', 'm.room.member', '@wking:matrix.org', '$15009018172787008uGnlB:matrix.org'), (1388657, '!yhqiEdqNjyPbxtUjzm:matrix.org', 'm.room.member', '@wkennington:matrix.org', '$1484152256110686udaKk:matrix.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration: <generator object merge at 0x7effd5f59660>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/synapse/replication/tcp/protocol.py", line 484, in subscribe_to_stream
    stream_name, token
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3/dist-packages/synapse/util/metrics.py", line 71, in measured_func
    r = yield func(self, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/lib/python3/dist-packages/synapse/replication/tcp/streams/_base.py", line 199, in get_updates_since
    raise Exception("stream %s has fallen behind" % (self.NAME))
Exception: stream events has fallen behind
2019-11-29 07:01:30,629 - synapse.replication.tcp.protocol - 252 - WARNING - replication-REPLICATE-events-4534- [synapse.app.synchrotron-iFLjN] Closing connection
2019-11-29 07:01:30,629 - synapse.replication.tcp.protocol - 373 - INFO - replication-REPLICATE-events-4534- [synapse.app.synchrotron-iFLjN] Connection was closed
2019-11-29 07:01:30,643 - synapse.replication.tcp.protocol - 354 - INFO - - [synapse.app.synchrotron-iFLjN] Replication connection closed: <twisted.python.failure.Failure twisted.internet.error.ConnectionDone: Connection was closed cleanly.>
2019-11-29 07:01:30,643 - synapse.replication.tcp.protocol - 373 - INFO - - [synapse.app.synchrotron-iFLjN] Connection was closed

@richvdh changed the title from "federation_sender stops working after elevated traffic" to "workers stop working after elevated traffic" on Feb 11, 2020
@richvdh
Member

richvdh commented Feb 11, 2020

the stacktrace has evolved slightly since 2017 as this code has been updated, but the problem is still very much there and biting people today.

@tulir's stacktrace in #2738 (comment) is representative; the hallmarks are "Background process 'replication_notifier' threw an exception", with a huge stacktrace culminating in "Exception: stream events has fallen behind".

Certain actions (joining a large room) can require a huge update to be sent over the replication stream to the workers; the huge update is misinterpreted as the workers getting behind and causes an exception. The only way out is to restart the master and all the workers.
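
For illustration only, here is a minimal sketch of that failure mode, assuming a fixed row limit inside the stream's get_updates_since (MAX_EVENTS_BEHIND, update_function and the Stream class here are assumptions based on the traceback above, not a quote of Synapse's actual code):

MAX_EVENTS_BEHIND = 10000  # assumed cap, matching the "10000 state events" remark below

class Stream(object):
    """Sketch of a replication stream with a hard backlog limit."""

    def __init__(self, name, update_function):
        self.NAME = name
        # update_function(from_token, upto_token, limit) returns rows strictly
        # after from_token, up to and including upto_token, at most `limit` of them.
        self.update_function = update_function
        self.last_token = 0

    def get_updates_since(self, from_token, upto_token):
        # Fetch one row more than the cap so an overflow is detectable.
        rows = self.update_function(from_token, upto_token, limit=MAX_EVENTS_BEHIND + 1)
        if len(rows) > MAX_EVENTS_BEHIND:
            # A single oversized batch (e.g. the state of a large, freshly
            # joined room) trips this even when the worker is keeping up.
            raise Exception("stream %s has fallen behind" % (self.NAME,))
        self.last_token = upto_token
        return rows, upto_token

The point is that the check conflates "the worker is lagging" with "one update happens to be enormous".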

This has been fudged around on matrix.org via 1766a5f. I am ... displeased ... to discover this hasn't made it to mainline.

@grinapo's similar exception with Failed to handle REPLICATE command is due to workers getting behind; the fact they don't recover is #4963.

@richvdh
Member

richvdh commented Feb 11, 2020

essentially this happens whenever the homeserver joins a room with more than 10000 state events :/
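
One possible mitigation, sketched here purely as an illustration (not necessarily what #6967 does), is to drain an oversized backlog in bounded batches and advance the token incrementally rather than raising:

def drain_updates(update_function, from_token, upto_token, batch_size=1000):
    """Hypothetical helper: page through the backlog in bounded chunks
    instead of treating a large backlog as a fatal "fallen behind" error.

    Assumes each row is a tuple whose first element is its stream token,
    as in the StopIteration payloads quoted above.
    """
    rows = []
    token = from_token
    while token < upto_token:
        batch = update_function(token, upto_token, limit=batch_size)
        if not batch:
            break
        rows.extend(batch)
        token = batch[-1][0]  # advance past the last row we received
    return rows, upto_token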

@richvdh
Member

richvdh commented Feb 26, 2020

hopefully made to go away by #6967.

@richvdh closed this as completed on Feb 26, 2020
@MadLittleMods added the A-Workers label on Dec 23, 2022