Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

"stream federation has fallen behind" + a bunch of user IDs #4388

Closed
turt2live opened this issue Jan 15, 2019 · 5 comments
Closed

"stream federation has fallen behind" + a bunch of user IDs #4388

turt2live opened this issue Jan 15, 2019 · 5 comments

Comments

@turt2live
Copy link
Member

Unknown cause, happened on t2bot.io within 15 minutes of startup.

homeserver_1 - 2019-01-14 21:12:43,332 - synapse.replication.tcp.resource - 181 - INFO - replication_notifier-384250 - Streaming: events -> 106669963
homeserver_1 - 2019-01-14 21:12:43,586 - synapse.replication.tcp.resource - 171 - INFO - replication_notifier-384251 - Failed to handle stream federation
homeserver_1 - 2019-01-14 21:12:43,586 - synapse.metrics.background_process_metrics - 203 - ERROR - replication_notifier-384251 - Background process 'replication_notifier' threw an exception
Traceback (most recent call last):
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/synapse/storage/_base.py", line 378, in runWithConnection
    defer.returnValue(result)
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: [<<SNIPPED: 11501 User IDs>>]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/synapse/storage/_base.py", line 336, in runInteraction
    defer.returnValue(result)
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: [<<SNIPPED: 11501 User IDs>>]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/synapse/metrics/background_process_metrics.py", line 201, in run
    yield func(*args, **kwargs)
Exception: stream federation has fallen behind

Much shorter errors followed in hoards afterwards:

homeserver_1 - 2019-01-14 21:12:44,227 - synapse.metrics.background_process_metrics - 203 - ERROR - replication_notifier-384255 - Background process 'replication_notifier' threw an exception
Traceback (most recent call last):
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/synapse/storage/_base.py", line 378, in runWithConnection
    defer.returnValue(result)
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: []

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/synapse/storage/_base.py", line 336, in runInteraction
    defer.returnValue(result)
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: []

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/synapse/replication/tcp/streams.py", line 207, in get_updates_since
    defer.returnValue((updates, current_token))
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: ([], 106669967)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/synapse/replication/tcp/streams.py", line 171, in get_updates
    defer.returnValue((updates, current_token))
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/twisted/internet/defer.py", line 1362, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: ([], 106669967)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/matrix/.synapse-py3/lib/python3.5/site-packages/synapse/metrics/background_process_metrics.py", line 201, in run
    yield func(*args, **kwargs)
Exception: stream federation has fallen behind

Reporting because it's Not Good to have ~23k user IDs dropped in a log file for what looks like a Bad Thing. As far as I can tell, everything recovered cleanly after a couple minutes.

@Valodim
Copy link
Contributor

Valodim commented Jan 15, 2019

There is a method that just keeps a list of 11k user IDs around in memory? Is that intentional?

@turt2live
Copy link
Member Author

as a point of interest: I just had a very similar explosion for a different stream and moderately different result set.

Synapse appears to have been doing state resolution on an IRC-bridged room and discovered a few thousand unhandled membership events, which is promptly shoved through replication. This blew out the get_rooms_for_user_with_stream_ordering cache (I think) resulting in DefGen_Return errors very much like the ones above, appearing to dump the contents of the cache into the log.

fwiw, the act of dumping the cache into the log filled 5gb of logs in roughly 5 minutes before the entire homeserver was restarted to speed up the resolution of the problem. The server operated fine after being restarted.

@jo-so
Copy link

jo-so commented Feb 24, 2019

I've facing the same error, but a restart doesn't help. After a few minutes the same show starts again and the service is dead.

@turt2live
Copy link
Member Author

This has definitely gotten worse for me in the last few days: seeing it nearly every 4ish hours. Also seems to have a 50% chance of murdering the homeserver in the most violent way possible: leaving the process running but otherwise non-functional.

Applies to presence streams, events, federation, and pretty much everything else.

@richvdh
Copy link
Member

richvdh commented Feb 11, 2020

I think this is basically a duplicate of #2738 / #4963. The massive spew of User IDs is a somewhat unfortunate consequence of the way that twisted's inlineCallbacks works; the best solution is porting to async/await which is something we are gradually working on across the codebase.

@richvdh richvdh closed this as completed Feb 11, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants