Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CBG-1971: TestReplicationConcurrentPush flaky test fix #6173

Merged
merged 6 commits into from
Apr 20, 2023
Merged

Conversation

gregns1
Copy link
Contributor

@gregns1 gregns1 commented Apr 4, 2023

CBG-1971

Test was flaking on waiting for nodes to be assigned. But also did a few runs where that wasn't included and notcied that it also failed on asserting for the number of docs pushed on the second replicator (rep_DEF). After digging into the losg I noticed that in all cases where it failed for this reaon I saw a log line:
[INF] Replicate: t:TestReplicationConcurrentPush c:sgr-mgr- db:activedb Initializing replication <ud>rep_ABC</ud>
But no corresponding log line for the second created replications (rep_DEF). When looking into the code I noticed this is called when refreshing the sgreplicate config. After adding new print lines everywhere I noticed on failure scenarios the reason the test wasn't actually initialising the second replication was due to StartReplications being called after the first replication was created and inserted into sgreplicate. This function created a new suscription and in this time the second replication config is inserted into sgreplicate. There is a race where this second replication is missed as a new addition as a new channel is created in subscribe replicate.

Pre-review checklist

  • Removed debug logging (fmt.Print, log.Print, ...)
  • Logging sensitive data? Make sure it's tagged (e.g. base.UD(docID), base.MD(dbName))
  • Updated relevant information in the API specifications (such as endpoint descriptions, schemas, ...) in docs/api

Integration Tests

@gregns1 gregns1 self-assigned this Apr 4, 2023
@gregns1 gregns1 marked this pull request as ready for review April 5, 2023 15:13
Copy link
Member

@bbrks bbrks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that we're not just avoiding a real-world race condition here that we'd like to handle better (with CAS retry/update callback)

db/sg_replicate_cfg.go Outdated Show resolved Hide resolved
Comment on lines 940 to 961
activeRT.CreateReplication("rep_ABC", remoteURLString, db.ActiveReplicatorTypePush, []string{"ABC"}, true, db.ConflictResolverDefault)
activeRT.CreateReplication("rep_DEF", remoteURLString, db.ActiveReplicatorTypePush, []string{"DEF"}, true, db.ConflictResolverDefault)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there actually a bug here where we're not able to create two replications at the same time?

I'd expect the CAS retry in sgReplicateManager.updateCluster ensures that both could be created, even if concurrently, as it's something the REST API could potentially allow to happen in the real world.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So my analysis of the problem (I could be wrong) was that we can create two at the same time, but the changes to sg replicate config come through on a channel cfgEvents and we have a race where sometimes where only the first replication is picked up and takes time to initialise. Then by the time the second replication is processed as a cfg change and goes through the refresh config/initialise replication process the test is already asserting against the stats for that replication. I think in real world for customers the replication will initialise eventually. Also think that this test isn't really a problem when running against server I think its more flaking for walrus and I put that down to the added latency when communicating with server helping in the context of the test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing that makes me uncertain of that is that we don't see any logging associated with rep_DEF in the entire test (which runs for ~20 seconds)

https://mobile.jenkins.couchbase.com/job/sgw-unix-build/26219/consoleFull

It's almost like DEF never even gets created (or at least assigned to a node)

@gregns1 gregns1 assigned bbrks and unassigned gregns1 Apr 18, 2023
Copy link
Member

@bbrks bbrks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with this fix after in-person discussion.

@bbrks bbrks merged commit 4b8dfb4 into master Apr 20, 2023
@bbrks bbrks deleted the CBG-1971 branch April 20, 2023 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants