major state reset / corruption bug impacting the main official GrapheneOS room #14481
Comments
Room membership was ~14900 across each server. State calculation somehow screwed up and showed it as ~13300 across each server. It has somehow partially fixed itself now and shows as ~14380 across servers. It seems as if we hit some state calculation bug and after more people joined, it resolved itself. I don't know how this works in synapse or the protocol in a detailed way so I don't know how this could be happening. The issue has half resolved itself but it's still happening. I'm surprised that it partially resolved itself this way. |
Many matrix.org users are still unable to join / rejoin without being invited. |
Ugh, sorry about this, and thanks for reporting it. We'll dig into it asap (but may need some server logs to diagnose; please capture them to stop them rotating if you haven't already). |
Saved the current set of logs. The drop in the number of users and the partial increase back towards the previous amount occurred across all servers we checked including matrix.org so it seems that the issue is at least mostly deterministic and occurred on each server. |
xref #8629 |
I took a quick look at this today. It looks like this room has been firefighting resets for a while. Delving into matrix.org's view of things (every row with an
The first reset that affected join rules looks to be from 11th February 2022 (stream ordering 2689269590), when join rules reset to I haven't yet been able to confirm it, but I suspect that the cause is a mixture of a recently-fixed Synapse bug and known defects in state resolution v2. Unfortunately, the former makes it impact on the room (and continues to do so, even after the bug is fixed). As mentioned in #8629 (comment) there's a long-running project trying to characterise and fix these defects in Synapse. We hope to have an update to share soon. |
In the meantime: you could try upgrading the GrapheneOS room. This is a blunt workaround rather than a comprehensive fix, but it might be the best way to avoid the resets you've been seeing in the short term. If you do so, I'd advise that all privileged users in that room (those able to set power levels and adjust join rules) who use Synapse upgrade their instance to 1.64 or higher to pull in the fix I mentioned. |
I don't really want to do a room upgrade but I don't think we have much choice when it keeps rolling back to 13300 users and blocking people joining. This is a really bad experience. A room upgrade is going to cause us to lose a bunch of users and the room history so that's not great. |
room upgrades do not impact room history, nor do they lose you users. also, what the hell is this? We have burnt a bunch of time investigating this and trying to help you on this today, and in return we get slagged off on twitter?! Perhaps you think this is acceptable behaviour for interacting with an open source project - but as an open source project yourselves you should know better. We have no interest in supporting folks who scream and jeer at bugs when we try to help them. |
@ara4n I don't think the tweet is to be understood in such a way. They're merely communicating to users that they're hitting some protocol-layer bugs, and that it's not a great situation (which is a neutral statement and doesn't imply blame). I'm sure they didn't mean this in a disrespectful manner, and I didn't see them calling the Matrix team out for being unhelpful. Being an open source maintainer myself I understand how such misunderstandings can arise, but please try not to interpret too much into something like that. It just ruins your day for no good reason. Feel free to mark this as off-topic if you think it contributes nothing to the conversation, just trying to mediate. We've all had bad days and took something at face value which wasn't meant that way. Always better to see things in a positive light until proven otherwise. |
@thestinger I just saw your reaction, the same thing applies to you. There's no reason to make this into a fight when it's most likely just a misunderstanding. Calm down, all of you. We're all trying to help each other out here. |
@SplittyDev Are you referring to me closing the issue? |
@thestinger yes, I believe there was a huge misunderstanding. @ara4n understood the tweet to be an offensive statement, which as you clarified, it wasn't. You saw his comment as an attack (which I understand, don't get me wrong), but since this "fight" arose from a misunderstanding I don't think we should come to hasty conclusions. I'm sure everyone here is just as eager to get this resolved. It's a stressful situation for everyone involved. |
We made a post on Twitter with an update on the situation since it occurred again (https://twitter.com/GrapheneOS/status/1593759013061234691). I don't understand what we've done wrong or why we're being attacked for it especially in such a public way. We already have a lot of confused and upset users due to what's happening with the room. As I said, it's not a good situation, and this is not helping. |
I 100% agree. But you know how it is. Matrix or synapse bugs affecting big servers isn't fun for anyone here, the synapse team is trying to figure out how to remedy the situation and felt disrespected by the tweet. I fully understand that the tweet was merely meant as a statement to your users and that there were no ill intentions towards the synapse team. It's a simple misunderstanding, which I'm sure can be resolved in a calm manner. |
@ara4n I've removed both of our threads about the state reset bug on Twitter/Mastodon. The threads were only intended to inform our community about the situation. They weren't intended to be attacks on Matrix as a protocol / platform. The situation is genuinely very frustrating for us and we're not trying to blame or attack Matrix by expressing it. I would greatly appreciate if everything including my comment about not wanting to do a room upgrade (#14481 (comment)) and below could be deleted here. I want to be able to link to this issue to explain why we upgraded the room but feel I can't do that without creating drama here due to this. The room upgrade does preserve the history much better than I remember it previously but it's not seamless. Since people have to notice the room upgrade and manually interact to join the new room, it's inherently going to reduce the room members quite a lot. On the positive side, as far as I can tell doing the room upgrade led to several people leaving and that led to it adding back a bunch of the room state bringing members back to ~14800. |
@thestinger It may reduce numbers, but those wishing to continue to interact will click through; in other words, inactive lurkers are likely the only ones to be lost, but only until they wish to interact again. |
@ThePowerofDreamS Many users are confused about what happened. It's not as clear as it could be and many people miss the notification. It also disrupts the discussion in the room for ages as people gradually move there and are confused about what happened. You can look and see that's happening. It's far from seamless and I really didn't want to upgrade the rooms until years from now when the experience is better. Too late to take it back now and I'm not sure what the alternative would have been. |
We currently have 10% as many room members as before (1500 vs. 15000), which is slowly growing but many of the people who were active did not get any notice about the room upgrade and weren't aware until I had invited them. Some don't accept invites by default, etc. so that's far from perfect. Many are confused about what has happened so it has derailed the discussion in the room. Some people think they did something wrong and were kicked. We would not normally ever do a room upgrade from a room version like 6 because we're very aware of the consequences. I didn't feel we had much choice, but it doesn't change that this is not a good situation for us. Perhaps choosing room version 10 was a slight mistake since a few servers used by a total of maybe 200 users are outdated by many months and unable to join it but that's insignificant compared to it effectively being a new room. As I said on that thread on Twitter, I'm also concerned about hitting the same issues again. We've always had close to fully updated synapse as have the other servers used by our moderators. It currently feels a lot like rebuilding the community after the freenode takeover even though nothing like that happened, just a bug. If room upgrades worked better and servers automatically moved the users, it'd mostly be fine, but that's not the status quo and we were going to wait until that was fully baked before upgrading our rooms. Server admins can join users to a new room with the API but our users aren't on our server. |
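For readers unfamiliar with the API mentioned above: Synapse documents an admin endpoint, `POST /_synapse/admin/v1/join/<room_id_or_alias>`, that joins a user to a room. As far as I know it only works for users local to the admin's own homeserver, which is why it doesn't help when members are spread across other servers. The sketch below only builds the request and sends nothing; the function name, base URL, user ID and token are hypothetical placeholders, not anything taken from this thread.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_admin_join_request(base_url, room_id_or_alias, user_id, admin_token):
    """Build (but do not send) a Synapse admin request joining a local user to a room.

    Assumption: the caller holds a server-admin access token for the homeserver
    at base_url, and user_id belongs to that same homeserver.
    """
    # Room IDs and aliases contain '#', '!' and ':' which must be percent-encoded.
    path = "/_synapse/admin/v1/join/" + quote(room_id_or_alias, safe="")
    body = json.dumps({"user_id": user_id}).encode("utf-8")
    return Request(
        base_url.rstrip("/") + path,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would actually perform the join, so an admin would presumably want to rate-limit this and only apply it to users who were members of the old room.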
This has happened in 2 more of our rooms now, and the project has been significantly impacted by the disruption to the main room already. |
Rooms bricked so far: #grapheneos:grapheneos.org (room upgraded after), #infra:grapheneos.org (room upgraded after), #dev:grapheneos.org (have not done room upgrade yet, probably need to) and potentially others we aren't aware of yet. If any of the newly upgraded rooms end up getting bricked, that's definitely going to be a dealbreaker for us. Same probably applies to our offtopic and community (space) rooms. This is too much. We believe this issue is still happening because nothing was wrong with #dev:grapheneos.org until recently. All of our mods have always had near the latest synapse release for their homeserver. Our rooms get frequently raided and sometimes there are redundant bans where 2 people ban someone at the same time, etc. Perhaps something involved in defending against the raids caused this, particularly raids with mass joins. Maybe it has something to do with server ACLs. For a while we were using allowlist server ACLs to defend against raids. Room upgrades are really bad for us. We have lost a substantial portion of the room members, the room history is not available to anyone not in the old room or has a client unable to transparently search it and we haven't gotten help from any homeserver admins to automatically join users in the old room to the new room. We would not have normally done room upgrades until years from now when the experience is better and users get automatically migrated by their server or client with some kind of anti-DoS limitation such as a limit on how often it can be done automatically (like once per week). Many users thought they were banned/kicked due to the state resets, many users were confused and didn't accept my invite because they didn't realize it was an invite to a new version of the room, etc. It's still regularly a topic of conversation. 
Many people thought it was someone impersonating me and inviting them to a fake GrapheneOS room because of that repeatedly happening to attack our project as part of the raids on our rooms. I would not be surprised if the cause of this was the raids on our rooms which occasionally caused our synapse instance to run out of memory and get killed, or needed to be restarted after blocking a bunch of IPs with nftables or nginx. As an entirely donation-based project, it deeply matters to us to have a very large and active community. This could have been a positive experience where we reported an issue, it was looked into and resolved. Homeserver admins could have helped us move over users since that's still not automated. My experience was that our posts about this, which were based on being quite upset about how bad this is for our project, were taken as an attack on Matrix and the developers. That seems to have ended any discussion about it, but the issue is still there, and now this is part of our experience. Not expecting any response to this but this situation is quite bad for us, is getting worse and the end result is going to be us having to figure out a new approach to our chat platform followed by a huge push to get people to deal with the change. Maybe we'll use XMPP, maybe we'll go back to focusing on IRC or maybe we'll use a proprietary chat platform. I don't know, but this really doesn't work for us. The level of abuse we're targeted with, the fact that most of it comes from matrix.org which we can't defederate, our rooms being bricked and overall terrible experiences with trying to get something done about the abuse and now this. |
This has happened again. I think it's caused by setting the rooms invite-only to deal with raids. Switching to invite-only and back appears to have a high likelihood of bricking the rooms by causing endless state resets. Perhaps it's caused by doing it too quickly without waiting a long time between changes for things to sync properly. We need help fixing our rooms or we're going to be unable to use Matrix anymore. We'll have to move to a new platform and publish an article explaining why we're doing that and why Matrix hasn't worked out for us both due to technical reasons, the extremely toxic overall community on the platform involved in attacks on us and the lack of help dealing with either of these things. |
There were rapid join and ban state events during the raids so it's possible that's what caused it too. It was almost certainly caused by these recent events though. |
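To check the theory about switching to invite-only and back, one rough approach is to list every explicit join-rule change in an export of the room's events. The sketch below assumes the events are already available as a list of dicts in the usual Matrix event shape (`type`, `origin_server_ts`, `content.join_rule`); the helper name is hypothetical. It orders by `origin_server_ts`, which is not the order state resolution uses, and a state reset may not appear as a new event at all, so this only shows deliberate toggles and how close together they were.

```python
def join_rule_changes(events):
    """Return (origin_server_ts, old_rule, new_rule) for each explicit join-rule change.

    Assumption: `events` is a list of Matrix event dicts exported from the room.
    """
    # Keep only join-rule state events, ordered by the sender's timestamp.
    rules = sorted(
        (e for e in events if e.get("type") == "m.room.join_rules"),
        key=lambda e: e["origin_server_ts"],
    )
    changes = []
    previous = None
    for event in rules:
        current = event["content"].get("join_rule")
        # Record a change only when the rule differs from the one before it.
        if previous is not None and current != previous:
            changes.append((event["origin_server_ts"], previous, current))
        previous = current
    return changes
```

Comparing the timestamps of consecutive changes gives a sense of whether the rules were flipped faster than other servers could plausibly have caught up, which is one of the guesses raised above.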
This impacts both our main room (which was recreated after the last state reset cycle) and our offtopic room. |
@SplittyDev - if the historical states of prior buggy versions prevent remedy of current state, can the histories be summarized or a new version of the protocol/state representation applied as a migration of sorts? This looks like a limiting factor for the SoA if there's a data dependency on currently invalid state differentiating "same" classes of objects from each other by the nature of their data (vs function). |
Our main room is having state resets again. I think we'll be leaving Matrix. |
If this doesn't get resolved, we're just going to be sharing our experiences with the platform in an article and moving to a platform without these problems. |
This comment was marked as off-topic.
This comment was marked as off-topic.
@RokeJulianLockhart I didn't report this as a security bug. If people are aware of it being exploitable without malicious moderators / moderator homeservers, I think that should be disclosed rather than covered up. I do think I could make a good list of guidelines of things to do and not to do in order to avoid this happening.
By following those 4 guidelines, the chance of this happening is minimized. If you've already done any of those things differently then it's probably best to make a new room sooner rather than later, especially if join rules have ever been set to invite-only which seems to be the most damaging issue when resets occur since it invalidates tons of joins and bricks the room for those members. I expect we aren't going to run into comparably awful issues again for our new rooms since we're going to follow those rules, although we still have several home servers for mods but we're likely going to reduce it to only grapheneos.org eventually. |
Indeed, @thestinger, I believe it was referred to as a security concern in redacted communication by the notable counterpart to this issue in response to the post on Twitter.
I implore you to more generally paraphrase this at the linked issue. |
@RokeJulianLockhart this issue is already public. Even if it is possible for a non-privileged user to exploit this as a vulnerability, the issue is already public and therefore a disclosure timeline would not apply. |
Description
#grapheneos:grapheneos.org (!SayHlEYXdrpSerhLMC:matrix.org) which was created with room version 6 has been impacted by a major synapse / protocol bug resulting in losing around 1500 members of the room followed by users being unable to join the room. It was acting as if it was set to invite-only mode despite being public. We think we've worked around that issue by setting it to invite-only and back to public. We know little about the Matrix protocol and synapse so we're unable to determine what happened. Ideally we can also get help with restoring the room members. Users reported getting a 403 error message when trying to rejoin and were confused thinking they had been banned.
Steps to reproduce
We don't know how to reproduce the problem.
Homeserver
all
Synapse Version
all
Installation Method
No response
Platform
All
Relevant log output
Anything else that would be useful to know?
No response