-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CBG-1971: TestReplicationConcurrentPush flaky test fix #6173
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced that we're not just avoiding a real-world race condition here that we'd like to handle better (with CAS retry/update callback)
activeRT.CreateReplication("rep_ABC", remoteURLString, db.ActiveReplicatorTypePush, []string{"ABC"}, true, db.ConflictResolverDefault) | ||
activeRT.CreateReplication("rep_DEF", remoteURLString, db.ActiveReplicatorTypePush, []string{"DEF"}, true, db.ConflictResolverDefault) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there actually a bug here where we're not able to create two replications at the same time?
I'd expect the CAS retry in sgReplicateManager.updateCluster
ensures that both could be created, even if concurrently, as it's something the REST API could potentially allow to happen in the real world.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So my analysis of the problem (I could be wrong) was that we can create two at the same time, but the changes to sg replicate config come through on a channel cfgEvents
and we have a race where sometimes where only the first replication is picked up and takes time to initialise. Then by the time the second replication is processed as a cfg change and goes through the refresh config/initialise replication process the test is already asserting against the stats for that replication. I think in real world for customers the replication will initialise eventually. Also think that this test isn't really a problem when running against server I think its more flaking for walrus and I put that down to the added latency when communicating with server helping in the context of the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing that makes me uncertain of that is that we don't see any logging associated with rep_DEF
in the entire test (which runs for ~20 seconds)
https://mobile.jenkins.couchbase.com/job/sgw-unix-build/26219/consoleFull
It's almost like DEF never even gets created (or at least assigned to a node)
…g (race condition). This commit makes that less likely to be hit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with this fix after in-person discussion.
CBG-1971
Test was flaking on waiting for nodes to be assigned. But also did a few runs where that wasn't included and notcied that it also failed on asserting for the number of docs pushed on the second replicator (rep_DEF). After digging into the losg I noticed that in all cases where it failed for this reaon I saw a log line:
[INF] Replicate: t:TestReplicationConcurrentPush c:sgr-mgr- db:activedb Initializing replication <ud>rep_ABC</ud>
But no corresponding log line for the second created replications (rep_DEF). When looking into the code I noticed this is called when refreshing the sgreplicate config. After adding new print lines everywhere I noticed on failure scenarios the reason the test wasn't actually initialising the second replication was due to
StartReplications
being called after the first replication was created and inserted into sgreplicate. This function created a new suscription and in this time the second replication config is inserted into sgreplicate. There is a race where this second replication is missed as a new addition as a new channel is created in subscribe replicate.Pre-review checklist
fmt.Print
,log.Print
, ...)base.UD(docID)
,base.MD(dbName)
)docs/api
Integration Tests
GSI=true,xattrs=true
https://jenkins.sgwdev.com/job/SyncGateway-Integration/1648/ (for running particular tests 200 times)GSI=true,xattrs=true
https://jenkins.sgwdev.com/job/SyncGateway-Integration/1650/ (full inegration test run)