Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Federated room cannot be joined again when left and deleted before #9481

Closed
waclaw66 opened this issue Feb 24, 2021 · 23 comments · Fixed by #9498 or #9542
Closed

Federated room cannot be joined again when left and deleted before #9481

waclaw66 opened this issue Feb 24, 2021 · 23 comments · Fixed by #9498 or #9542
Assignees
Labels
A-Federated-Join joins over federation generally suck S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@waclaw66
Copy link

Description

When a federated room (e.g. #synapse:matrix.org) is joined from self hosted federated homeserver, then left, deleted via API from self homeserver, then rejoin fails.

Steps to reproduce

  1. Join some room (e.g. #synapse:matrix.org) from self hosted homeserver (same version of Synapse 1.27.0).
  2. Leave that room
  3. Delete that room using API /_synapse/admin/v1/rooms/[room_id]/delete
  4. Try to join that room again
  5. It fails with following error...
synapse.http.server: [POST-10040] Failed handle request via 'JoinRoomAliasServlet': <XForwardedForRequest at 0x7f57646fa970 method='POST' uri='/_matrix/client/r0/join/%23synapse%3Amatrix.org' clientproto='HTTP/1.1' site='8008'>
    Traceback (most recent call last):
      File "/usr/lib/python3.9/site-packages/synapse/http/server.py", line 252, in _async_render_wrapper
        callback_return = await self._async_render(request)
      File "/usr/lib/python3.9/site-packages/synapse/http/server.py", line 430, in _async_render
        callback_return = await raw_callback_return
      File "/usr/lib/python3.9/site-packages/synapse/rest/client/v1/room.py", line 301, in on_POST
        await self.room_member_handler.update_membership(
      File "/usr/lib/python3.9/site-packages/synapse/handlers/room_member.py", line 333, in update_membership
        result = await self.update_membership_locked(
      File "/usr/lib/python3.9/site-packages/synapse/handlers/room_member.py", line 549, in update_membership_locked
        remote_join_response = await self._remote_join(
      File "/usr/lib/python3.9/site-packages/synapse/handlers/room_member.py", line 1091, in _remote_join
        event_id, stream_id = await self.federation_handler.do_invite_join(
      File "/usr/lib/python3.9/site-packages/synapse/handlers/federation.py", line 1400, in do_invite_join
        max_stream_id = await self._persist_auth_tree(
      File "/usr/lib/python3.9/site-packages/synapse/handlers/federation.py", line 2050, in _persist_auth_tree
        await self.persist_events_and_notify(
      File "/usr/lib/python3.9/site-packages/synapse/handlers/federation.py", line 2925, in persist_events_and_notify
        events, max_stream_token = await self.storage.persistence.persist_events(
      File "/usr/lib/python3.9/site-packages/synapse/storage/persist_events.py", line 262, in persist_events
        ret_vals = await make_deferred_yieldable(
    twisted.internet.defer.FirstError: FirstError[#0, [Failure instance: Traceback: <class 'psycopg2.errors.UniqueViolation'>: duplicate key value violates unique constraint "event_auth_chains_pkey"
    DETAIL:  Key (event_id)=($e9U026auDHIgaZPAqlblvPupACjl7jcZDblP970dJPs) already exists.

    /usr/lib/python3.9/site-packages/synapse/metrics/background_process_metrics.py:208:run
    --- <exception caught here> ---
    /usr/lib/python3.9/site-packages/synapse/storage/persist_events.py:172:handle_queue_loop
    /usr/lib/python3.9/site-packages/synapse/storage/persist_events.py:324:persisting_queue
    /usr/lib/python3.9/site-packages/synapse/storage/persist_events.py:532:_persist_events
    /usr/lib/python3.9/site-packages/synapse/storage/databases/main/events.py:171:_persist_events_and_state_updates
    /usr/lib/python3.9/site-packages/synapse/storage/database.py:661:runInteraction
    /usr/lib/python3.9/site-packages/synapse/storage/database.py:744:runWithConnection
    /usr/lib64/python3.9/site-packages/twisted/python/threadpool.py:250:inContext
    /usr/lib64/python3.9/site-packages/twisted/python/threadpool.py:266:<lambda>
    /usr/lib64/python3.9/site-packages/twisted/python/context.py:122:callWithContext
    /usr/lib64/python3.9/site-packages/twisted/python/context.py:85:callWithContext
    /usr/lib64/python3.9/site-packages/twisted/enterprise/adbapi.py:306:_runWithConnection
    /usr/lib64/python3.9/site-packages/twisted/python/compat.py:464:reraise
    /usr/lib64/python3.9/site-packages/twisted/enterprise/adbapi.py:297:_runWithConnection
    /usr/lib/python3.9/site-packages/synapse/storage/database.py:739:inner_func
    /usr/lib/python3.9/site-packages/synapse/storage/database.py:539:new_transaction
    /usr/lib/python3.9/site-packages/synapse/logging/utils.py:71:wrapped
    /usr/lib/python3.9/site-packages/synapse/storage/databases/main/events.py:379:_persist_events_txn
    /usr/lib/python3.9/site-packages/synapse/storage/databases/main/events.py:472:_persist_event_auth_chain_txn
    /usr/lib/python3.9/site-packages/synapse/storage/databases/main/events.py:630:_add_chain_cover_index
    /usr/lib/python3.9/site-packages/synapse/storage/database.py:896:simple_insert_many_txn
    /usr/lib/python3.9/site-packages/synapse/storage/database.py:274:execute_batch
    /usr/lib/python3.9/site-packages/synapse/storage/database.py:319:_do_execute
    /usr/lib/python3.9/site-packages/synapse/storage/database.py:274:<lambda>
    /usr/lib64/python3.9/site-packages/psycopg2/extras.py:1209:execute_batch
    ]]

The room should be rejoined without any problem.

Version information

  • Homeserver: matrix.bolesiny.net

  • Version: 1.27.0

  • Install method: Fedora rpm package

  • Platform: Fedora 33

@waclaw66
Copy link
Author

Related #9418

@clokep
Copy link
Member

clokep commented Feb 24, 2021

@waclaw66 This seems to be a duplicate of #9418. I'm going to close it unless you feel there's a reason that it is not.

@clokep clokep added the X-Needs-Info This issue is blocked awaiting information from the reporter label Feb 24, 2021
@waclaw66
Copy link
Author

Maybe it's the same bug causing the problem, but those are different cases, which could lead to it.
You can close it in favour of #9418, there is a reference, in case the fix wouldn't work for my problem I can reopen it.

@waclaw66 This seems to be a duplicate of #9418. I'm going to close it unless you feel there's a reason that it is not.

@clokep
Copy link
Member

clokep commented Feb 24, 2021

Ah, I see that this is due to leaving while the other is due to a failure to join. 👍 Hopefully the fix is the same!

@waclaw66
Copy link
Author

waclaw66 commented Feb 25, 2021

Is there any hotfix? Leaving and joining the room again works properly, but when a room is deleted via API and joined again it crashes as shown. So there supposed to be some remain of that deleted room in database which causes that crash after joining it again.

@clokep clokep added S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. and removed X-Needs-Info This issue is blocked awaiting information from the reporter labels Feb 25, 2021
@clokep
Copy link
Member

clokep commented Feb 25, 2021

When you deleted the room you also purged it I'm guessing?

I suspect that purge_room_state needs to be updated to remove from event_auth_chains also?

@waclaw66
Copy link
Author

Right, I purged the history, then I left and finally deleted it.

When you deleted the room you also purged it I'm guessing?

I suspect that purge_room_state needs to be updated to remove from event_auth_chains also?

@waclaw66
Copy link
Author

That's strange
select * from state_groups_state where room_id='!ehXvUhWNASUkSLvAGP:matrix.org'
gives me more than 300k records, but
POST http://localhost:8008/_synapse/admin/v1/purge_history/!ehXvUhWNASUkSLvAGP:matrix.org
with data
{ "delete_local_events": true, "purge_up_to_ts": 1614270664000 }
returns
{"errcode":"M_NOT_FOUND","error":"there is no event to be purged"}

@clokep
Copy link
Member

clokep commented Feb 25, 2021

purge_history is different than purging the room (it purges the events before a certain time). I was asking whether you gave {"purge": true} when calling /_synapse/admin/v1/rooms/[room_id]/delete.

@waclaw66
Copy link
Author

Understood, I used synapse-admin to delete the room finally.
It seems, it doesn't purge...
https://github.com/Awesome-Technologies/synapse-admin/blob/684c44e47057e42fb217688e08706e9f2e6767a7/src/synapse/dataProvider.js#L69

purge_history is different than purging the room (it purges the events before a certain time). I was asking whether you gave {"purge": true} when calling /_synapse/admin/v1/rooms/[room_id]/delete.

@Bubu
Copy link
Contributor

Bubu commented Feb 25, 2021

Oh, interesting. FWIW, in #9418, I probably purged/deleted the room at some point in the past before trying to rejoin it (and then lrunning into this issue).

@waclaw66
Copy link
Author

The question is how to fix it. I suppose the patch from #9498 will fix it for the future. Do I need to run some sql commands to fix current state?

@clokep
Copy link
Member

clokep commented Feb 25, 2021

The question is how to fix it. I suppose the patch from #9498 will fix it for the future. Do I need to run some sql commands to fix current state?

My goal in there would be to fix it for previously purged rooms too, if possible. If not you can hopefully run some of the commands manually.

Oh, interesting. FWIW, in #9418, I probably purged/deleted the room at some point in the past before trying to rejoin it (and then lrunning into this issue).

Sounds like these are probably duplicates then.

@waclaw66
Copy link
Author

waclaw66 commented Feb 26, 2021

I have tested all those sql commands (delete replaced with select) from _purge_room_txn function. But no rows were found. Joining #synapse:matrix.org still fails with the same exception.
Then I replaced the whole purge_events.py in my Synapse server, still the same.
There are some other tables with reference to that room_id left, e.g. state_groups_state as I mentioned, then current_state_delta_stream, receipts_graph, receipts_linearized, state_groups, stream_ordering_to_exterm.

What causes this?

File "/usr/lib/python3.9/site-packages/synapse/storage/persist_events.py", line 262, in persist_events
        ret_vals = await make_deferred_yieldable(
    twisted.internet.defer.FirstError: FirstError[#0, [Failure instance: Traceback: <class 'psycopg2.errors.UniqueViolation'>: duplicate key value violates unique constraint "event_auth_chains_pkey"
    DETAIL:  Key (event_id)=($e9U026auDHIgaZPAqlblvPupACjl7jcZDblP970dJPs) already exists.

@clokep
Copy link
Member

clokep commented Feb 26, 2021

state_groups_state, state_groups, receipts_graph, receipts_linearized, and stream_ordering_to_exterm should have been deleted previously unless an error occurred when you deleted the room. 😕 I believe those have all been part of the purge for at least the last couple of years.

The code currently does not remove data from current_state_delta_stream, however and notes that we might want to.

@waclaw66
Copy link
Author

waclaw66 commented Feb 26, 2021

So can I try to purge those tables you mentioned manually with that room_id without any harm?

I think the exception

DETAIL:  Key (event_id)=($e9U026auDHIgaZPAqlblvPupACjl7jcZDblP970dJPs) already exists.

is the key to the problem. It seems primary key is created multiple times within table event_auth_chains for some reason.

state_groups_state, state_groups, receipts_graph, receipts_linearized, and stream_ordering_to_exterm should have been deleted previously unless an error occurred when you deleted the room. 😕 I believe those have all been part of the purge for at least the last couple of years.

The code currently does not remove data from current_state_delta_stream, however and notes that we might want to.

@waclaw66
Copy link
Author

waclaw66 commented Feb 26, 2021

There occured no error when the room was deleted, it happens for every federated room you leave, delete and re-join. It's not random.
I think the problem is that purge doesn't clean everything, and re-join doesn't count with some already existing database entries for that room. I tested it for more than 5 rooms.

should have been deleted previously unless an error occurred when you deleted the room

@tleydxdy
Copy link

Same problem here, I encountered this problem when using the irc bridge, I failed to join the #artix room, but my irc side is in the room. now all the irc rooms are broken because I tried a !quit, bad idea :(

@clokep
Copy link
Member

clokep commented Mar 3, 2021

Oops, we'll need the background update to fully fix this.

@clokep clokep reopened this Mar 3, 2021
@clokep
Copy link
Member

clokep commented Mar 3, 2021

So can I try to purge those tables you mentioned manually with that room_id without any harm?

I think that should be OK. Note that some of them don't have a room_id column. You should definitely make a database backup before doing this.

@waclaw66
Copy link
Author

waclaw66 commented Mar 8, 2021

I can confirm that problem with leaving and deleting rooms is solved by Synapse 1.29.0. Tested on a room, that wasn't previously deleted by API.
I would really appriteated also a script to purge remains of those rooms deleted with Synapse < 1.28.0. Thanks.

@MacLemon
Copy link

MacLemon commented Mar 8, 2021

We're seeing the same issue and have already updated to 1.29.0. Since it is cause by events in the past before running 1.29.0 the problem persists.

Documentation of a safe way to fix problems with existing rooms would be much appreciated. (If that requires direct access to synapse, the PostGreSQL database, or Admin-API, that would be a viable solution for us.)

@clokep
Copy link
Member

clokep commented Mar 9, 2021

I would really appriteated also a script to purge remains of those rooms deleted with Synapse < 1.28.0. Thanks.

This will happen automatically once #9542 is merged -- a background update should clean-up the data. Unfortunately this didn't make it into v1.29.0 and will be in v1.30.0.

@MadLittleMods MadLittleMods added the A-Federated-Join joins over federation generally suck label Dec 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Federated-Join joins over federation generally suck S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
6 participants