Timeline corruption on develop - "Live timeline 0 is no longer live" #8593

Closed
ara4n opened this issue Feb 13, 2019 · 30 comments
Labels
P1 S-Critical Prevents work, causes data loss and/or has no workaround T-Defect

Comments

@ara4n
Member

ara4n commented Feb 13, 2019

I've seen this ~3 times today, and Travis is getting bitten by it repeatedly.

Symptoms are a stack trace like this:

Caught /sync error Error: live timeline 0 is no longer live - it has a neighbouring timeline
    at Room../matrix-js-sdk/lib/models/room.js.Room.addLiveEvents (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:28527:19)
    at SyncApi../matrix-js-sdk/lib/sync.js.SyncApi._processRoomEvents (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:35245:10)
    at _callee9$ (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:34809:54)
    at tryCatch (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:49866:40)
    at Generator.invoke [as _invoke] (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:50100:22)
    at Generator.prototype.(anonymous function) [as next] (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:49918:21)
From previous event:
    at https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:34858:46
From previous event:
    at SyncApi._callee10$ (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:34657:74)
    at tryCatch (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:49866:40)
    at Generator.invoke [as _invoke] (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:50100:22)
    at Generator.prototype.(anonymous function) [as next] (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:49918:21)
From previous event:
    at SyncApi._processSyncResponse (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:34946:22)
    at SyncApi._callee7$ (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:34294:60)
    at tryCatch (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:49866:40)
    at Generator.invoke [as _invoke] (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:50100:22)
    at Generator.prototype.(anonymous function) [as next] (https://riot.im/develop/bundles/c0b74508d5644578d135/bundle.js:49918:21)
consoleObj.(anonymous function) @ rageshake.js:61
16:44:19.277 rageshake.js:61 

Also, a message you try to send gets stuck in the local echo state at the bottom of the timeline (in practice it sends successfully).

@ara4n
Member Author

ara4n commented Feb 13, 2019

@turt2live can you rageshake?

@ara4n
Member Author

ara4n commented Feb 13, 2019

@turt2live
Member

Past related issues (not as common as whatever happened in the last 24h):

@lampholder
Member

We're declaring this not a release blocker since we have seen very limited incidence of it. One to keep an eye on, though.

@turt2live
Member

Latest symptom from https://github.com/matrix-org/riot-web-rageshakes/issues/1238 is that #riot-dev was only showing ~50% of the messages received over /sync, which led me to think I was replying to someone who already had an answer :(

ara4n added the S-Critical label (Prevents work, causes data loss and/or has no workaround) Mar 21, 2019
@ara4n
Member Author

ara4n commented Mar 21, 2019

@richvdh observes:

yeah :/. I suspect it's in code I wrote
and happens when there's a fork in the DAG or something

@lampholder
Member

Rageshake and reload button?

@aaronraimist
Collaborator

aaronraimist commented Mar 22, 2019

Just sent a rageshake on this. I assume I was getting the same issue (the console error is the same, but this issue doesn't describe the user-facing symptoms very well).

From the rageshake notes: Visible symptoms were that there were lots of old messages below the latest message in the room. Clicking the scroll to the bottom button actually scrolled you up because the latest message in the room was not at the bottom of the page but somewhere in the middle.

@turt2live
Member

turt2live commented Mar 22, 2019

That's consistent with the other known (but for some reason not documented) symptom. I assume it felt a bit like a rotating drum rather than a timeline?

Edit: rageshake definitely is the same issue.

@aaronraimist
Collaborator

Yes, something like that. Now, after I sent the rageshake, the entire app has become unresponsive. I can't click on anything, the green read receipt bar doesn't update, and the animations for other users' read receipts are frozen. Do you want another rageshake?

@turt2live
Member

I suspect that symptom is a different issue for sure (one of the "the app sets fire to my computer trying to start up" issues), and a rageshake might not reveal it. Would suggest opening a new issue and rageshaking on that just in case though.

@lampholder
Member

To add more detail to my very brief comment above - "Rageshake and reload button?" means "Can we mitigate this difficult-to-reproduce/investigate bug by catching the error and prompting the user to 'Rageshake and reload'?"

Obviously this is horrible, but it is better than leaving users to discover the app is broken simply by it being super broken, and it does at least present a way out.
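
A minimal sketch of that mitigation, for illustration only. `guardedSync`, `processSync` and `promptRageshakeAndReload` are hypothetical names, not matrix-js-sdk or Riot APIs; the only detail taken from this issue is the "is no longer live" text of the error message.

```javascript
// Hypothetical sketch: catch the timeline-corruption error while processing
// a /sync response and give the user a way out instead of failing silently.
const CORRUPTION_MARKER = "is no longer live";

function isTimelineCorruption(err) {
    return err instanceof Error && err.message.includes(CORRUPTION_MARKER);
}

// processSync: function that may throw while handling a /sync response.
// promptRageshakeAndReload: assumed UI callback offering "Rageshake and reload".
// Returns true if the sync was processed cleanly, false if the prompt was shown.
function guardedSync(processSync, promptRageshakeAndReload) {
    try {
        processSync();
        return true;
    } catch (err) {
        if (isTimelineCorruption(err)) {
            promptRageshakeAndReload(err);
            return false;
        }
        throw err; // unrelated errors keep their normal handling
    }
}
```

Matching on the message text is brittle; it is used here only because the error in the stack trace above carries no more specific type.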

turt2live added a commit to matrix-org/matrix-react-sdk that referenced this issue Mar 23, 2019
Fixes element-hq/element-web#9260
Workaround for element-hq/element-web#8593
Requires matrix-org/matrix-js-sdk#869

We check if any dialogs are open before moving forward because we don't want to risk showing so many dialogs that the user is unable to click a button. We're also not overly concerned if the dialog being shown is irrelevant because whatever the user is doing will likely be unaffected, and we can scream in pain when they're finished.
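
The dialog check described in this commit message could look roughly like the sketch below. `CorruptionPrompt`, `hasOpenDialogs` and `showDialog` are hypothetical stand-ins, not the actual matrix-react-sdk dialog API, and deferring the prompt until the user has finished with the current dialog is an assumption drawn from the wording above.

```javascript
// Hypothetical sketch: only show the corruption prompt when no other dialog
// is open, and remember to show it later otherwise.
class CorruptionPrompt {
    // hasOpenDialogs: () => boolean (assumed hook into the dialog manager)
    // showDialog: () => void (assumed function that opens the prompt)
    constructor(hasOpenDialogs, showDialog) {
        this.hasOpenDialogs = hasOpenDialogs;
        this.showDialog = showDialog;
        this.pending = false;
    }

    // Called when the corruption error is detected.
    onCorruption() {
        if (this.hasOpenDialogs()) {
            this.pending = true; // do not stack dialogs on top of each other
            return false;
        }
        this.pending = false;
        this.showDialog();
        return true;
    }

    // Called when the user has finished with whatever dialog was open.
    onDialogsClosed() {
        if (this.pending && !this.hasOpenDialogs()) {
            this.pending = false;
            this.showDialog();
            return true;
        }
        return false;
    }
}
```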
@turt2live
Member

ftr, tracking the workaround dialog as #9260 to avoid closing this by accident.

@turt2live
Member

Daily rageshake review for explosions:

Overall the four potential causes (in terms of what the logs say) are:

  1. Something to do with splicing the live timeline into a position where it cannot operate (what matrix-org/matrix-js-sdk#877, "Refuse to link live timelines into the forwards/backwards position when either is invalid", and prior attempts try to solve).
  2. Probable side effects of matrix-org/matrix-js-sdk#873, "Refuse splicing the live timeline into a broken position", causing "cannot reset timeline - it has a neighbouring timeline" when handling a limited sync.
  3. New to this batch - possible rapid splicing causing problems.
  4. Also new to this batch - having pagination tokens on timelines, which may be fixed by matrix-org/matrix-js-sdk#877 ("Refuse to link live timelines into the forwards/backwards position when either is invalid").

@turt2live
Member

A few more came in:

https://github.com/matrix-org/riot-web-rageshakes/issues/1371, https://github.com/matrix-org/riot-web-rageshakes/issues/1372, and https://github.com/matrix-org/riot-web-rageshakes/issues/1373 all look like a case of number 3. Timelines are being spliced when they are only a few seconds old.

@turt2live
Member

Another day, another batch of rageshakes. This is the last batch of rageshakes from /app I'll take a look at until we publish a release with some of the fixes - we have more than enough data points.

@turt2live
Member

https://github.com/matrix-org/riot-web-rageshakes/issues/1389 is an explosion on develop which appears to be complaining about pagination tokens. In theory, matrix-org/matrix-js-sdk#885 fixes this.

For context, the error is thrown here:
https://github.com/matrix-org/matrix-js-sdk/blob/b1b49413d0e8c0766bc9fd51ad9acd6da83a41d3/src/models/room.js#L1294-L1297
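
For readers without the source to hand, the check is roughly of the following shape. This is a self-contained illustration, not the SDK code: `FakeTimeline` and `assertLive` are made-up names, the neighbouring-timeline message is copied from the stack trace at the top of this issue, and the pagination-token message is an assumed wording.

```javascript
// Illustrative sketch of the "is the live timeline still live?" invariant.
const FORWARDS = "f";

// Hypothetical stand-in for a timeline: it can have a neighbour and a
// pagination token in each direction.
class FakeTimeline {
    constructor() {
        this.neighbours = {};
        this.tokens = {};
    }
    getNeighbouringTimeline(direction) {
        return this.neighbours[direction] || null;
    }
    getPaginationToken(direction) {
        return this.tokens[direction] || null;
    }
}

// A live timeline sits at the newest end of the room, so there must be
// nothing after it: no forwards pagination token and no forwards neighbour.
function assertLive(liveTimeline, index) {
    if (liveTimeline.getPaginationToken(FORWARDS)) {
        throw new Error(
            "live timeline " + index + " is no longer live - it has a pagination token"
        );
    }
    if (liveTimeline.getNeighbouringTimeline(FORWARDS)) {
        throw new Error(
            "live timeline " + index + " is no longer live - it has a neighbouring timeline"
        );
    }
}
```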

bwindels pushed a commit to matrix-org/matrix-js-sdk that referenced this issue Apr 8, 2019
Credit to Matthew for basically solving this.

Theoretically fixes spontaneous timeline corruption: element-hq/element-web#8593

When the live timeline ends up in a position where it can no longer be live (such as becoming the second timeline in the set, rather than the first) we end up getting neighbouring timeline errors. By refusing to splice the live timeline into such a position, we hopefully keep the live timeline in a position of still being live for when it is next used.

The running theory that leads to this fix is multiple limited syncs coming in, causing holes in the timeline. When trying to patch up the holes, the timeline set would end up splicing all over the place, leading to potentially splicing the live timeline into a broken position.
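
The idea of refusing the splice can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: timelines are modelled as plain objects with `backwards` and `forwards` links, and `makeTimeline` and `spliceUnlessLive` are hypothetical names.

```javascript
// Hypothetical timeline node with one neighbour slot per direction.
function makeTimeline(name) {
    return { name: name, backwards: null, forwards: null };
}

// Join `older` and `newer` so that `newer` follows `older`, unless doing so
// would give the live timeline a forwards neighbour. A live timeline with
// something after it is no longer live, so that splice is refused.
// Returns true if the timelines were linked, false if the splice was refused.
function spliceUnlessLive(older, newer, liveTimeline) {
    if (older === liveTimeline) {
        return false;
    }
    older.forwards = newer;
    newer.backwards = older;
    return true;
}
```

With this guard, patching a hole next to the live timeline is skipped rather than leaving the live timeline in the middle of a chain.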
bwindels pushed a commit to matrix-org/matrix-js-sdk that referenced this issue Apr 8, 2019
See element-hq/element-web#8593 (comment)

Previously (#873) we allowed half-linking timelines to each other if they satisfy the conditions, however this appears to not be helping. Instead, it seems like the timelines are getting stuck in a position where one direction is spliced but the other is broken. To avoid this case, we'll just avoid splicing in both directions when one of the directions is invalid.
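
A small sketch of the difference between half-linking and all-or-nothing linking. The model is hypothetical: timelines are plain objects, and "invalid" is assumed here to mean that the neighbour slot is already occupied, which may not match the exact conditions used in the SDK.

```javascript
// Hypothetical timeline node: at most one neighbour in each direction.
function makeTimeline(name) {
    return { name: name, backwards: null, forwards: null };
}

// Half-linking, as described for the earlier change: set whichever side is
// free, even if the other side cannot be set.
function halfLink(older, newer) {
    if (older.forwards === null) older.forwards = newer;
    if (newer.backwards === null) newer.backwards = older;
}

// All-or-nothing linking: if either direction is invalid, touch nothing.
// Returns true if both links were made, false otherwise.
function linkBothOrNeither(older, newer) {
    if (older.forwards !== null || newer.backwards !== null) {
        return false;
    }
    older.forwards = newer;
    newer.backwards = older;
    return true;
}
```

With `halfLink`, a timeline whose `backwards` slot is already taken ends up pointed at by `older.forwards` without pointing back, which is the "one direction is spliced but the other is broken" state described above. `linkBothOrNeither` leaves both timelines untouched in that case.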
@turt2live
Member

Only one rageshake with the new patches post-release so far: https://github.com/matrix-org/riot-web-rageshakes/issues/1403

Looks like the first error happened in 0004, and the user finally got frustrated enough to send a bug report. Looks to be a case of splicing timelines which are very close to each other, which may be an indicator of a race somewhere:

Already have timeline for $redacted - joining timeline !redacted:2019-03-16T15:11:47.610Z to !redacted:2019-03-16T15:11:47.116Z

That's a roughly 500ms difference between timelines.
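
The gap can be checked directly from the two identifiers in the log line. This snippet assumes each timeline identifier ends in an ISO 8601 UTC timestamp, as in the line above; `timelineTimestamp` and `timelineGapMs` are made-up helper names.

```javascript
// Matches an ISO 8601 UTC timestamp at the end of a timeline identifier.
// Anchoring at the end avoids tripping over colons inside the room ID.
const ISO_SUFFIX = /(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z)$/;

// Extract the timestamp (in ms since the epoch) from a timeline identifier.
function timelineTimestamp(timelineId) {
    const match = ISO_SUFFIX.exec(timelineId);
    if (!match) {
        throw new Error("no timestamp in timeline id: " + timelineId);
    }
    return Date.parse(match[1]);
}

// Absolute gap in milliseconds between two timeline identifiers.
function timelineGapMs(idA, idB) {
    return Math.abs(timelineTimestamp(idA) - timelineTimestamp(idB));
}

console.log(timelineGapMs(
    "!redacted:2019-03-16T15:11:47.610Z",
    "!redacted:2019-03-16T15:11:47.116Z"
)); // 494
```

The exact gap is 494ms, which is the "500ms" quoted above after rounding.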

turt2live self-assigned this Apr 15, 2019
@turt2live
Member

There hasn't been much activity in terms of rageshakes for this (even after we brought the rageshake server back online) - we are planning to presume this fixed unless complaints say otherwise.

@Half-Shot
Member

My Riot just blew up trying to (forward?)fill #irc:matrix.org, though I don't know if it is this bug. Will rageshake, and if one of the Riot devs thinks it's not this bug, please shout.

@turt2live
Member

@Half-Shot your client is missing all of the patches which are supposed to fix this

@Half-Shot
Member

Ahhhhhh kk

turt2live added a commit to matrix-org/matrix-react-sdk that referenced this issue May 1, 2019
Concludes element-hq/element-web#8593

We are no longer seeing this error being triggered, and are considering it fixed. As a result, the dialog can be removed to reduce the amount of dead code in the project.
@turt2live
Member

There haven't been any complaints and no further rageshakes for a long while, which means I'm confident enough to say this is fixed. If the issue persists for people, please open a new issue.

Leaving this open to track the removal of the prompt: matrix-org/matrix-react-sdk#2939


6 participants