Oplog processing stops working after MongoDB replica set election #19905

Closed
dkoo761 opened this issue Dec 19, 2020 · 9 comments

Comments

@dkoo761

dkoo761 commented Dec 19, 2020

Description:

Oplog tailing / processing stops working correctly after a MongoDB replica set election, resulting in real-time data not being delivered to the browser.

This means:

  • when a user posts a message, it gets saved to the DB but appears greyed out to the user that posted it
  • new messages in a channel don't show up until the user refreshes the browser
  • the only way to get the above functionality working again is to restart the Rocketchat server

With a third-party MongoDB provider such as Atlas, replica set elections can happen quite frequently (a few times a week). There is often no error message in the RocketChat logs indicating that something went wrong, and standard HTTP monitoring tools won't notify the site admin either, since the issue is specific to the oplog.

It appears that this started failing after the new oplog processing using change streams was added in 3.7.0.

WORKAROUND: Setting environment variable USE_NATIVE_OPLOG=true prevents this issue from happening.

I previously left a detailed analysis in this RC forum post: https://forums.rocket.chat/t/messages-not-displaying-and-greyed-out/2035

Steps to reproduce:

  1. Use a replica set hosted at MongoDB Atlas as your database in MONGO_URL and MONGO_OPLOG_URL
  2. Start Rocketchat server
  3. Let it run overnight (or possibly for a few days if it takes that long for a replica set election to occur at Atlas - it's hard to predict when)
  4. Try to post a message to a channel or a direct message to a user

Expected behavior:

The message should turn black after posting it. Other users viewing that channel should see the message appear immediately.

Actual behavior:

The message is saved to the DB but appears greyed out to the user that posted it. After refreshing their browser, the user that posted the message will then see that it now appears black. But if they post another message, the same problem will occur.

Other users viewing that channel can't see the new message until they refresh their browser.

Server Setup Information:

Client Setup Information

  • Desktop App or Browser Version: Chrome 87.0.4280.88 (Official Build) (x86_64)
  • Operating System: Mac OS 10.15.7 Catalina

Additional context

Relevant logs:

@scratttt

Same issue here. I have a case open with support but have not received an answer.

That said, I think this happens when MongoDB runs with a PSA (Primary-Secondary-Arbiter) architecture. With PSS (Primary-Secondary-Secondary) it doesn't happen.

@geekgonecrazy
Contributor

geekgonecrazy commented Dec 20, 2020

@rodrigok @sampaiodiego this one is super important for us to fix.

looks like something in our new oplog / change streams processing stuff.

Let it run overnight (or possibly for a few days if it takes that long for a replica set election to occur at Atlas - it's hard to predict when)

alternatively, on a local cluster just do rs.stepDown()
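
For anyone who would rather script the reproduction than type it into the shell, here is a minimal Python sketch of the same step-down. It assumes a PyMongo-style client (anything that exposes `client.admin.command(...)`); the helper name `force_election`, the 60-second default, and the connection string in the usage comment are made up for illustration. `replSetStepDown` is the server command behind the shell's step-down helper, and it only succeeds when another member is eligible to become primary.

```python
def force_election(client, step_down_secs: int = 60):
    """Ask the current primary to step down, which triggers an election.

    `client` is assumed to be a pymongo.MongoClient connected to the
    replica set; it is passed in so this sketch has no hard dependency.
    """
    # Sends the replSetStepDown admin command with the step-down period.
    return client.admin.command("replSetStepDown", step_down_secs)


# Hypothetical usage against a local multi-node replica set:
#   from pymongo import MongoClient
#   force_election(MongoClient("mongodb://localhost:27017/?replicaSet=rs0"))
```

After the election finishes, posting a message in Rocket.Chat should show whether real-time delivery survived the failover.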

@dkoo761
Author

dkoo761 commented Dec 20, 2020

alternatively, on a local cluster just do rs.stepDown()

Yes, I only have a single-node replica set locally and didn't want to set up a multi-node replica set just to verify this. But if you can set up your replica set to function the same way that Atlas does, then you should be able to reproduce it locally.

@pierreozoux
Contributor

I can confirm I have the exact same issue. Messages keep being marked unread although I've read them.

I decided to investigate to fix forward and implement change streams instead of the native oplog.

The result of the investigation is this bug report: #20017, with a workaround.

@dkoo761
Author

dkoo761 commented Jan 14, 2021

UPDATE: Using the workaround above (USE_NATIVE_OPLOG=true) I haven't had any problems whatsoever with my RocketChat connection to Atlas MongoDB since I posted this issue almost a month ago.

So the workaround seems to be very stable :)

@cschockaert

Hello @dkoo761, you talked about adding readPreference=primaryPreferred to your oplog URI while keeping change stream mode. Does that help with MongoDB Atlas?

@dkoo761
Author

dkoo761 commented Oct 29, 2022

@cschockaert If I remember right, I don't think readPreference=primaryPreferred made any difference. My workaround of setting USE_NATIVE_OPLOG=true actually turns off the newer oplog processing that uses change streams.

@sampaiodiego
Member

Change streams are a better way of watching the database for real-time data updates, but they have some drawbacks compared to the oplog. You can see them in this document: https://www.mongodb.com/docs/manual/administration/change-streams-production-recommendations/

I consider this the most important difference:

For example, consider a 3-member replica set with two data-bearing nodes and an arbiter. If the secondary goes down, such as due to failure or an upgrade, writes cannot be majority committed. The change stream remains open, but does not send any notifications.

So the problem is not that Rocket.Chat fails to connect to MongoDB properly to receive real-time data; it is MongoDB that is not sending that data, because the writes cannot be majority committed.
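
The arithmetic behind the quoted paragraph, and behind the PSA vs. PSS observation earlier in this thread, can be sketched roughly as follows. This is a simplified model, not MongoDB's actual code. It assumes the required majority is the smaller of "majority of voting members (arbiters included)" and "number of data-bearing voting members", and that only data-bearing members that are up can acknowledge a write. The function name is hypothetical.

```python
def can_majority_commit(voting_members: int,
                        data_bearing_voting: int,
                        data_bearing_up: int) -> bool:
    """Simplified model of whether writes can still be majority committed."""
    # Arbiters vote, so they count toward the total, but they hold no data
    # and therefore can never acknowledge a write.
    required = min(voting_members // 2 + 1, data_bearing_voting)
    return data_bearing_up >= required


# PSA: 3 voting members, 2 data-bearing. Secondary down -> only the primary is left.
print(can_majority_commit(3, 2, 1))  # False
# PSS: 3 voting members, 3 data-bearing. One secondary down -> 2 are left.
print(can_majority_commit(3, 3, 2))  # True
```

Under this model a PSA set loses majority commits as soon as its only secondary is unavailable, while a PSS set tolerates the loss of one secondary.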

With this in mind, in Rocket.Chat 5.3.0 we introduced a new endpoint, /health, that checks whether the data stream (either oplog or change streams) is working correctly, thanks to PR #27026. The endpoint will reply accordingly if something is not good.
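
A monitoring probe for that endpoint could look roughly like the sketch below. The response format is not described in this thread, so the code makes a loud assumption: any non-2xx status, or any connection failure, is treated as "unhealthy". The function name, the `opener` parameter (there only to make the probe testable), and the hostname in the usage comment are all hypothetical.

```python
import urllib.error
import urllib.request


def data_stream_healthy(base_url: str, timeout: float = 5.0,
                        opener=urllib.request.urlopen) -> bool:
    """Return True if <base_url>/health answers with a 2xx status.

    ASSUMPTION: a non-2xx status or a connection error means the data
    stream check failed; the real response contract may differ.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # urlopen raises HTTPError (a URLError subclass) for 4xx/5xx replies.
        return False


# Hypothetical usage from a cron job or monitoring script:
#   if not data_stream_healthy("https://chat.example.com"):
#       ...alert the admin or restart the server...
```

Unlike a plain HTTP check on the home page, a probe like this is meant to catch the case described in this issue, where the site still loads but real-time updates have stopped.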

If for some reason you're not confident that your MongoDB setup can provide reliable change streams, you can set the environment variable IGNORE_CHANGE_STREAMS=true in your deployment; this way we'll use the oplog instead.

Starting with Rocket.Chat 5.0, we no longer recommend using USE_NATIVE_OPLOG=true due to performance issues. We added a warning in version 5.4 to reflect that.

@nmagedman
Contributor

nmagedman commented Jan 30, 2023

FYI, the environment variable is IGNORE_CHANGE_STREAM=true (singular, not plural). You can confirm whether it took effect by checking the logs for "[DatabaseWatcher] Using oplog", as opposed to the default configuration of "[DatabaseWatcher] Using change streams".
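
That log check can be automated with a small scan like the one below. The two marker strings are the ones quoted above; everything else (function name, timestamps and surrounding text in the sample lines) is made up for illustration.

```python
from typing import Iterable, Optional

OPLOG_MARKER = "[DatabaseWatcher] Using oplog"
CHANGE_STREAM_MARKER = "[DatabaseWatcher] Using change streams"


def detect_watcher_mode(log_lines: Iterable[str]) -> Optional[str]:
    """Return "oplog", "change streams", or None if neither marker appears.

    The last marker seen wins, so logs spanning a restart report the
    mode of the most recent start.
    """
    mode = None
    for line in log_lines:
        if CHANGE_STREAM_MARKER in line:
            mode = "change streams"
        elif OPLOG_MARKER in line:
            mode = "oplog"
    return mode


# Hypothetical log excerpt; only the bracketed markers come from this thread.
sample = [
    "2023-01-30T10:00:00Z server starting",
    "2023-01-30T10:00:02Z [DatabaseWatcher] Using oplog",
]
print(detect_watcher_mode(sample))  # oplog
```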

But in any case, it still does not work. Post-failover, the "stream-room-messages" and "stream-notify-user" collections cease to come over the websocket.
