Oplog processing stops working after MongoDB replica set election #19905

dkoo761 · 2020-12-19T04:36:43Z

Description:

Oplog tailing / processing stops working correctly after a MongoDB replica set election, resulting in real-time data not being delivered to the browser.

This means:

- When a user posts a message, it gets saved to the DB but appears greyed out to the user that posted it.
- New messages in a channel don't show up until the user refreshes the browser.
- The only way to get the above functionality working again is to restart the Rocket.Chat server.

When using a third-party MongoDB provider such as Atlas, replica set elections can happen quite frequently (a few times a week). There is often no error message in the Rocket.Chat logs indicating that something went wrong, and standard HTTP monitoring tools won't notify the site admin either, since the issue is specific to the oplog.

It appears that this started failing after the new oplog processing using change streams was added in 3.7.0.

WORKAROUND: Setting environment variable USE_NATIVE_OPLOG=true prevents this issue from happening.

I previously left a detailed analysis in this RC forum post: https://forums.rocket.chat/t/messages-not-displaying-and-greyed-out/2035

Steps to reproduce:

1. Use a replica set hosted at MongoDB Atlas as your database in MONGO_URL and MONGO_OPLOG_URL
2. Start the Rocket.Chat server
3. Let it run overnight (or possibly for a few days if it takes that long for a replica set election to occur at Atlas; it's hard to predict when)
4. Try to post a message to a channel or a direct message to a user

Expected behavior:

The message should turn black after posting it. Other users viewing that channel should see the message appear immediately.

Actual behavior:

The message is saved to the DB but appears greyed out to the user that posted it. After refreshing their browser, the user that posted the message will then see that it now appears black. But if they post another message, the same problem will occur.

Other users viewing that channel can't see the new message until they refresh their browser.

Server Setup Information:

Version of Rocket.Chat Server: 3.7.1
Operating System: Ubuntu 18.04
Deployment Method: Manual (https://docs.rocket.chat/installation/manual-installation/ubuntu)
Number of Running Instances: 1
DB Replicaset Oplog: Enabled
NodeJS Version: 12.14.0 - x64
MongoDB Version: 4.2.11

Client Setup Information

Desktop App or Browser Version: Chrome 87.0.4280.88 (Official Build) (x86_64)
Operating System: Mac OS 10.15.7 Catalina

Additional context

Relevant logs:

scratttt · 2020-12-19T10:03:50Z

Same issue, I have a case with support but I have no answer.

Although I think this happens if you have the Mongo with PSA (Primary-Secondary-Arbiter) architecture. With PSS (Primary-Secondary-Secondary) it doesn't happen.

geekgonecrazy · 2020-12-20T05:16:36Z

@rodrigok @sampaiodiego this one is super important for us to fix.

It looks like something in our new oplog / change streams processing.

> Let it run overnight (or possibly for a few days if it takes that long for a replica set election to occur at Atlas - it's hard to predict when)

Alternatively, on a local cluster, just run rs.stepDown().

dkoo761 · 2020-12-20T06:53:44Z

> alternatively local cluster just do rs.stepDown()

Yes, I only have a single-node replica set locally and didn't want to set up a multi-node replica set just to verify this. But if you can set up your replica set to behave the same way Atlas does, you should be able to reproduce it locally.
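For anyone who does have a multi-node replica set locally, the forced election suggested above can also be scripted. This is only a minimal sketch: it assumes `client` is a `pymongo.MongoClient` connected to the replica set, and the helper name `force_election`, the host list, and the set name `rs0` are illustrative. It sends the `replSetStepDown` admin command, which is what `rs.stepDown()` runs in the mongo shell.

```python
from typing import Any, Mapping


def force_election(client: Any, step_down_secs: int = 60) -> Mapping[str, Any]:
    """Ask the current primary to step down so the replica set elects a new one.

    `client` is assumed to be a pymongo.MongoClient connected to a multi-node
    replica set. On a single-node set there is no member to take over, so the
    server is expected to reject the request.
    """
    # replSetStepDown has to run against the admin database on the primary.
    return client.admin.command("replSetStepDown", step_down_secs)


# Usage (assumes pymongo is installed and a local three-node set named rs0):
#   from pymongo import MongoClient
#   client = MongoClient(
#       "mongodb://localhost:27017,localhost:27018,localhost:27019/?replicaSet=rs0"
#   )
#   force_election(client)
```

After the step-down, posting a message in Rocket.Chat should show whether real-time delivery survived the election.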

pierreozoux · 2020-12-31T10:03:52Z

I can confirm I have the exact same issue. Messages keep being marked unread although I've read them.

I decided to investigate so that I could fix forward and use change streams instead of the native oplog.

The result of the investigation is bug report #20017, which includes a workaround.

dkoo761 · 2021-01-14T20:36:46Z

UPDATE: Using the workaround above (USE_NATIVE_OPLOG=true) I haven't had any problems whatsoever with my RocketChat connection to Atlas MongoDB since I posted this issue almost a month ago.

So the workaround seems to be very stable :)

cschockaert · 2022-10-12T12:12:54Z

Hello @dkoo761, you talked about adding readPreference=primaryPreferred to your oplog URI while keeping change stream mode. Does that help with MongoDB Atlas?

dkoo761 · 2022-10-29T04:59:45Z

@cschockaert If I remember right, I don't think readPreference=primaryPreferred made any difference. My workaround of setting USE_NATIVE_OPLOG=true actually turns off the newer oplog processing that uses change streams.

sampaiodiego · 2022-12-07T12:15:25Z

Change streams are a better way of watching the database for real-time data updates, but they have some drawbacks compared to the oplog. You can see them in this document: https://www.mongodb.com/docs/manual/administration/change-streams-production-recommendations/

I consider this the most important difference:

For example, consider a 3-member replica set with two data-bearing nodes and an arbiter. If the secondary goes down, such as due to failure or an upgrade, writes cannot be majority committed. The change stream remains open, but does not send any notifications.

So this is not in fact Rocket.Chat failing to connect properly to MongoDB to receive real-time data; it is MongoDB not sending that data because the writes cannot be majority committed.
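The arithmetic behind the quoted passage can be sketched as follows. This is a simplified model, not MongoDB's actual implementation: it assumes the number of acknowledgements needed for a majority commit is the smaller of the voting majority and the number of data-bearing voting members, and the function names are made up for illustration.

```python
def write_majority(voting_members: int, data_bearing_voting: int) -> int:
    """Acknowledgements needed for a majority commit (simplified model)."""
    return min(voting_members // 2 + 1, data_bearing_voting)


def can_majority_commit(
    voting_members: int, data_bearing_voting: int, data_bearing_up: int
) -> bool:
    """True if enough data-bearing members are up to majority-commit a write."""
    return data_bearing_up >= write_majority(voting_members, data_bearing_voting)


# PSA: 3 voting members, 2 data-bearing; the secondary is down, so 1 is up.
print(can_majority_commit(3, 2, 1))  # False
# PSS: 3 voting members, 3 data-bearing; one secondary is down, so 2 are up.
print(can_majority_commit(3, 3, 2))  # True
```

Under this model a PSA set with its secondary down cannot majority-commit, so a change stream stays open but silent, while a PSS set in the same situation keeps delivering events.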

With this in mind, in Rocket.Chat 5.3.0 we introduced a new endpoint, /health, that actually checks whether the data stream (either oplog or change streams) is working correctly, thanks to PR #27026. The endpoint will reply accordingly if something is not good.
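A monitoring probe for that endpoint might look like the sketch below. It assumes /health is served at the server's base URL and that an unhealthy stream is reported with a non-2xx HTTP status. The exact reply format is not described here, so the body is ignored. The function name and the `opener` parameter (there only to make the probe testable) are illustrative.

```python
import urllib.error
import urllib.request


def stream_is_healthy(
    base_url: str, timeout: float = 5.0, opener=urllib.request.urlopen
) -> bool:
    """Return True if GET <base_url>/health answers with a 2xx status."""
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # urlopen raises HTTPError (a URLError) for 4xx/5xx replies, and
        # URLError or OSError when the server cannot be reached at all.
        return False


# Usage (hypothetical host):
#   if not stream_is_healthy("https://chat.example.com"):
#       print("Rocket.Chat data stream looks unhealthy")
```

Wiring a check like this into existing uptime monitoring would cover the gap described in the original report, where plain HTTP checks kept passing while real-time delivery was broken.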

If for some reason you're not confident that your MongoDB setup can provide reliable change streams, you can set the environment variable IGNORE_CHANGE_STREAMS=true in your deployment. This way we'll use the oplog instead.

Starting with Rocket.Chat 5.0, we no longer recommend using USE_NATIVE_OPLOG=true due to performance issues. We added a warning in version 5.4 to reflect that.

nmagedman · 2023-01-30T14:43:23Z

FYI, the environment variable is IGNORE_CHANGE_STREAM=true (singular, not plural). You can confirm whether it took effect by checking the logs for "[DatabaseWatcher] Using oplog", as opposed to the default configuration of "[DatabaseWatcher] Using change streams".

But in any case, it still does not work. Post-failover, the "stream-room-messages" and "stream-notify-user" collections cease to come over the websocket.
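To check which mode a server actually started in, the two log markers quoted above can be searched for in a captured log. This is a small sketch under the assumption that the log is available as plain text; the function name is illustrative and only the two marker strings come from the server output.

```python
from typing import Optional

OPLOG_MARKER = "[DatabaseWatcher] Using oplog"
CHANGE_STREAM_MARKER = "[DatabaseWatcher] Using change streams"


def watcher_mode(log_text: str) -> Optional[str]:
    """Return "oplog" or "change streams" from the last matching log line.

    Returns None when neither marker appears in the log.
    """
    mode = None
    for line in log_text.splitlines():
        if OPLOG_MARKER in line:
            mode = "oplog"
        elif CHANGE_STREAM_MARKER in line:
            mode = "change streams"
    return mode


# Usage (hypothetical log file path):
#   with open("rocketchat.log", encoding="utf-8") as fh:
#       print(watcher_mode(fh.read()))
```

The last match wins, so a log that spans several restarts reports the mode of the most recent start.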

pierreozoux mentioned this issue Dec 31, 2020

Change stream does not need db admin priviledge #20017

Closed

cschockaert mentioned this issue Oct 12, 2022

Rocket.Chat fails to deliver messages when MongoDB switches the primary #21013

Closed

sampaiodiego closed this as completed Dec 7, 2022
