Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

createRoom succeeded, but returned 500, due to replication lag #8625

Open
richvdh opened this issue Oct 22, 2020 · 6 comments
Open

createRoom succeeded, but returned 500, due to replication lag #8625

richvdh opened this issue Oct 22, 2020 · 6 comments
Labels
A-Create-Room S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@richvdh
Copy link
Member

richvdh commented Oct 22, 2020

element-hq/element-web#15512 (comment) reports a createRoom request returning a 500.

The server-side logs say:

2020-10-21 13:29:51,969 - synapse.replication.tcp.client - 232 - INFO - POST-39220 - Waiting for repl stream 'events' to reach 1605923815
...
2020-10-21 13:30:21,987 - synapse.http.server - 85 - ERROR - POST-39220 - Failed handle request via 'RoomCreateRestServlet': <XForwardedForRequest at 0x7fb379ee6cf8 method='POST' uri='/_matrix/client/r0/createRoom' clientproto='HTTP/1.1' site=8080>
Capture point (most recent call last):
...
twisted.internet.defer.TimeoutError: Timed out after 30s
2020-10-21 13:30:22,015 - synapse.access.http.8080 - 311 - INFO - POST-39220 - 109.175.168.80 - 8080 - {@webdevguru_test13:matrix.org} Processed request: 39.142sec/-0.000sec (0.121sec, 0.012sec) (0.241sec/0.546sec/58) 55B 500 "POST /_matrix/client/r0/createRoom HTTP/1.1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86
...
2020-10-21 13:31:30,491 - synapse.replication.tcp.handler - 543 - INFO - process-replication-data-4 - Caught up with stream 'events' to 1605923815

It seems that matrix.org was still catching up after an earlier outage at this time:

image

I'm not sure exactly what can be done about it, but it's certainly unsatisfactory that the room creation completes and then we return a 500 anyway.

@richvdh
Copy link
Member Author

richvdh commented Oct 22, 2020

#7867 might make the failure mode less likely, but doesn't really address the underlying problem.

#8283 might also help.

Generally it feels like processing the replication data should be higher priority than any other work.

@richvdh
Copy link
Member Author

richvdh commented Oct 22, 2020

The reason for the outage on 2020-10-21, incidentally, was #8613.

@richvdh
Copy link
Member Author

richvdh commented Oct 22, 2020

similar failure mode at 2020-10-20 12:39:04,456:

2020-10-20 12:39:04,369 - synapse.http.server - 85 - ERROR - POST-19292158 - Failed handle request via 'RoomCreateRestServlet': <XForwardedForRequest at 0x7f5364e78860 method='POST' uri='/_matrix/client/r0/createRoom' clientproto='HTTP/1.1' site=8080>
Capture point (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/synapse/src/synapse/app/homeserver.py", line 512, in <module>
    main()
  File "/home/synapse/src/synapse/app/homeserver.py", line 508, in main
    run(hs)
  File "/home/synapse/src/synapse/app/homeserver.py", line 499, in run
    logger=logger,
  File "/home/synapse/src/synapse/app/_base.py", line 133, in start_reactor
    run()
  File "/home/synapse/src/synapse/app/_base.py", line 117, in run
    run_command()
  File "/home/synapse/env-py37/lib/python3.7/site-packages/twisted/internet/base.py", line 1283, in run
    self.mainLoop()
  File "/home/synapse/env-py37/lib/python3.7/site-packages/twisted/internet/base.py", line 1292, in mainLoop
    self.runUntilCurrent()
  File "/home/synapse/src/synapse/metrics/__init__.py", line 538, in f
    ret = func(*args, **kwargs)
  File "/home/synapse/env-py37/lib/python3.7/site-packages/twisted/internet/base.py", line 913, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/home/synapse/src/synapse/util/async_helpers.py", line 485, in time_it_out
    deferred.cancel()
  File "/home/synapse/env-py37/lib/python3.7/site-packages/twisted/internet/defer.py", line 545, in cancel
    self.errback(failure.Failure(CancelledError()))
  File "/home/synapse/env-py37/lib/python3.7/site-packages/twisted/internet/defer.py", line 501, in errback
    self._startRunCallbacks(fail)
  File "/home/synapse/env-py37/lib/python3.7/site-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
    self._runCallbacks()
  File "/home/synapse/env-py37/lib/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/synapse/src/synapse/util/async_helpers.py", line 521, in failure_cb
    new_d.errback(val)
  File "/home/synapse/env-py37/lib/python3.7/site-packages/twisted/internet/defer.py", line 501, in errback
    self._startRunCallbacks(fail)
  File "/home/synapse/env-py37/lib/python3.7/site-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
    self._runCallbacks()
  File "/home/synapse/env-py37/lib/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/synapse/env-py37/lib/python3.7/site-packages/twisted/internet/defer.py", line 1475, in gotResult
    _inlineCallbacks(r, g, status)
  File "/home/synapse/env-py37/lib/python3.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/synapse/env-py37/lib/python3.7/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
Traceback (most recent call last):
  File "/home/synapse/src/synapse/http/server.py", line 230, in _async_render_wrapper
    callback_return = await self._async_render(request)
  File "/home/synapse/src/synapse/http/server.py", line 407, in _async_render
    callback_return = await raw_callback_return
  File "/home/synapse/src/synapse/rest/client/v1/room.py", line 98, in on_POST
    requester, self.get_room_config(request)
  File "/home/synapse/src/synapse/handlers/room.py", line 818, in create_room
    last_stream_id,
  File "/home/synapse/src/synapse/replication/tcp/client.py", line 233, in wait_for_stream_position
    await make_deferred_yieldable(deferred)
  File "/home/synapse/env-py37/lib/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/synapse/src/synapse/util/async_helpers.py", line 502, in convert_cancelled
    raise defer.TimeoutError("Timed out after %gs" % (timeout,))
twisted.internet.defer.TimeoutError: Timed out after 30s

image

That one appears to have been due to database contention due to a big device list update:

image

@richvdh
Copy link
Member Author

richvdh commented Oct 22, 2020

one of the reasons this is so disastrous is that clients currently do things like set the m.dm flag once the createRoom call completes. If the createRoom 500s, the room gets created with the wrong settings.

@anoadragon453 anoadragon453 added z-bug (Deprecated Label) p1 A-Performance Performance, both client-facing and admin-facing z-p2 (Deprecated Label) and removed p1 labels Oct 22, 2020
@ara4n
Copy link
Member

ara4n commented Oct 24, 2020

giving this is screwing FTUE as per element-hq/element-web#15512, i'm marking it p1

@ara4n ara4n added p1 and removed z-p2 (Deprecated Label) labels Oct 24, 2020
@anoadragon453 anoadragon453 removed the A-Performance Performance, both client-facing and admin-facing label Oct 29, 2020
@richvdh
Copy link
Member Author

richvdh commented Oct 29, 2020

considering this part of the FTUE work, so removing it from Backend team board

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Create-Room S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

No branches or pull requests

4 participants