
Add basic support for group replication #1180

Conversation


@ejortegau ejortegau commented May 31, 2020

Related issue: #1179

Description

This PR adds some initial support for group replication in orchestrator.

ToDo:

  • contributed code is using same conventions as original code
  • code is formatted via gofmt (please avoid goimports)
  • code is built via ./build.sh
  • code is tested via go test ./go/...
  • relevant documentation has been updated.
  • upgrade from previous version (master branch) is successful

Assume a 3-member group, plus an async slave replicating from one of them. Without this PR, Orchestrator shows all three members as separate clusters, since it does not understand group replication and there is no traditional replication configured on any group member. The async slave is shown as a slave in the cluster whose master is the group member it is set up to replicate from.

Orchestrator also shows the async slave as having an errant GTID problem.

With this PR, basic support for single-primary replication groups is added. A replication group that MySQL Shell reports like this:

{
    "clusterName": "test_cluster", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "127.0.0.1:5306", 
        "ssl": "REQUIRED", 
        "status": "OK", 
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
        "topology": {
            "127.0.0.1:4306": {
                "address": "127.0.0.1:4306", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLag": null, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.20"
            }, 
            "127.0.0.1:5306": {
                "address": "127.0.0.1:5306", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "replicationLag": null, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.20"
            }, 
            "127.0.0.1:6306": {
                "address": "127.0.0.1:6306", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLag": null, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.20"
            }
        }, 
        "topologyMode": "Single-Primary"
    }, 
    "groupInformationSourceMember": "127.0.0.1:5306"
}

is now understood by Orchestrator as a single cluster, with the two group secondaries shown as replicating from the primary. In addition, the async replica continues to be shown normally:

(screenshot)

Notice also that the errant GTID problem is not shown.

The UI also shows group membership information, including:

  • An icon showing whether the instance is a group member or not. A different text style is used for
    the primary and secondary instances so that they can be easily identified.
  • Hovering over the icon shows the instance state and role in the replication group.
  • Group members that are not online are flagged as having problems, as shown below:

(screenshot)

On top of this, certain replication operations are prevented from taking place: relocating group secondaries to replicate from outside the group (since they actually replicate from the group primary), and setting up a group primary to replicate from one of its own secondaries.
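For illustration, here is a minimal Go sketch of such a guard, using stand-in types and hypothetical names rather than the PR's exact code:

package main

import "fmt"

// Minimal stand-ins for orchestrator's inst types; illustrative only.
type InstanceKey struct {
	Hostname string
	Port     int
}

type Instance struct {
	Key                                InstanceKey
	ReplicationGroupName               string
	ReplicationGroupMemberRole         string // "PRIMARY" or "SECONDARY"
	ReplicationGroupPrimaryInstanceKey InstanceKey
}

func (i *Instance) IsReplicationGroupSecondary() bool {
	return i.ReplicationGroupName != "" && i.ReplicationGroupMemberRole == "SECONDARY"
}

func (i *Instance) IsReplicationGroupPrimary() bool {
	return i.ReplicationGroupName != "" && i.ReplicationGroupMemberRole == "PRIMARY"
}

// relocateGuard mirrors the two checks described above: a group secondary may
// only "replicate" from its group primary, and a group primary must not be
// made a replica of a member of its own group.
func relocateGuard(replica, candidateMaster *Instance) error {
	if replica.IsReplicationGroupSecondary() && candidateMaster.Key != replica.ReplicationGroupPrimaryInstanceKey {
		return fmt.Errorf("%+v is a group secondary; it implicitly replicates from the group primary %+v",
			replica.Key, replica.ReplicationGroupPrimaryInstanceKey)
	}
	if replica.IsReplicationGroupPrimary() && candidateMaster.ReplicationGroupName == replica.ReplicationGroupName {
		return fmt.Errorf("cannot set group primary %+v to replicate from a member of its own group", replica.Key)
	}
	return nil
}

func main() {
	primary := &Instance{Key: InstanceKey{"127.0.0.1", 5306}, ReplicationGroupName: "test_cluster", ReplicationGroupMemberRole: "PRIMARY"}
	secondary := &Instance{
		Key:                                InstanceKey{"127.0.0.1", 4306},
		ReplicationGroupName:               "test_cluster",
		ReplicationGroupMemberRole:         "SECONDARY",
		ReplicationGroupPrimaryInstanceKey: primary.Key,
	}
	outsider := &Instance{Key: InstanceKey{"127.0.0.1", 7306}}
	fmt.Println(relocateGuard(secondary, outsider)) // refused: secondary follows its group primary
	fmt.Println(relocateGuard(primary, secondary))  // refused: primary cannot follow its own group
}

The point is simply that such relocations are refused up front, since group membership, not orchestrator, dictates where secondaries replicate from.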

@ejortegau
Contributor Author

cc @dveeden , @sjmudd , @luisyonaldo

* Update docs to include information on partial support for GR.

dveeden commented Jun 2, 2020

I think group replication is an important feature for Orchestrator, as it looks to be at the core of the long-term strategy of Oracle MySQL. Group Replication is a core component of InnoDB Cluster.

Note that some info about group replication is available in the same way as for normal replication. There is nothing in SHOW SLAVE STATUS, but performance_schema.replication_connection_status etc. show a group_replication_applier and a group_replication_recovery channel. Other newer/advanced replication features, like multi-source replication and parallel replication, use the same performance_schema tables, so I think these should be used where possible. For GR there are additional tables like replication_group_members.
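As a hedged illustration (not part of this PR), here is a minimal Go sketch in the style of orchestrator's data-access code, reading group membership from performance_schema; the DSN is a placeholder:

package main

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN; member_role exists as of MySQL 8.0 (on 5.7 the role
	// has to be derived from the group_replication_primary_member status
	// variable instead).
	db, err := sql.Open("mysql", "orchestrator:secret@tcp(127.0.0.1:5306)/")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	rows, err := db.Query(`
		SELECT member_id, member_host, member_port, member_state, member_role
		FROM performance_schema.replication_group_members`)
	if err != nil {
		panic(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id, host, state, role string
		var port int
		if err := rows.Scan(&id, &host, &port, &state, &role); err != nil {
			panic(err)
		}
		fmt.Printf("%s:%d state=%s role=%s uuid=%s\n", host, port, state, role, id)
	}
}

In the PR itself, results like these feed the new replication_group_* columns of the database_instance backend table.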

@ejortegau, Could you squash your commits? Or are there multiple on purpose?

@ejortegau
Contributor Author

@ejortegau, Could you squash your commits? Or are there multiple on purpose?

There are multiple on purpose; I have been adding more and more support as I worked on it. I figure they can be squashed once the PR is accepted.


dveeden commented Jun 2, 2020

Would this work ok with a group in multi primary mode?

https://dev.mysql.com/doc/refman/8.0/en/group-replication-multi-primary-mode.html

@ejortegau ejortegau left a comment

Would this work ok with a group in multi primary mode?

https://dev.mysql.com/doc/refman/8.0/en/group-replication-multi-primary-mode.html

It will not. I had missed pushing an update to the FAQ indicating this. For now, this only works with single-primary mode.
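For reference, whether a group runs in single-primary mode can be checked on the member itself; a minimal Go sketch, assuming MySQL 8.0 and a placeholder DSN:

package main

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "orchestrator:secret@tcp(127.0.0.1:5306)/") // placeholder DSN
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// ON in single-primary groups, OFF in multi-primary groups.
	var singlePrimary bool
	if err := db.QueryRow("SELECT @@global.group_replication_single_primary_mode").Scan(&singlePrimary); err != nil {
		panic(err)
	}
	fmt.Println("single-primary mode:", singlePrimary)
}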

@ejortegau ejortegau changed the title from "WIP: Add basic support for group replication" to "Add basic support for group replication" on Jun 3, 2020

dveeden commented Jun 5, 2020

For groups we should draw a line around them and maybe show the name of the group.
This is true for Group Replication, NDB and maybe PXC.

With D3.js this can be done with d3.geom.hull. Example in g1 in the image.

The d3.layout.tree() that is used doesn't allow more complex topologies like groups, multi-source, etc. This could be fixed by moving to d3.layout.force(). Example in g2 in the image.

I also notice that D3.js v3 is used, which is two versions behind the latest major release (D3.js v5).

Screenshot from 2020-06-05 15-21-07

The code for the example is in:
https://gist.github.com/dveeden/8a5cc4c1230f0442bb8b74c8e0eecad7

Some of this could be done outside of this PR, but we should make sure that the nodelist is populated with group info to allow the frontend part to be done later on.

@shlomi-noach shlomi-noach left a comment

Only an initial, partial review. I'll need to revisit.

@shlomi-noach shlomi-noach left a comment

Some further comments. Review is not yet complete.

Thank you!

Comment on lines 1254 to 1273
if instance.ReplicationGroupName != "" && instance.ReplicationGroupMemberState != "ONLINE" {
	instance.Problems = append(instance.Problems, "group_replication_member_not_online")
}
Collaborator

Can we "upgrade" this to be analyzed in instance.go?

Contributor Author

I can move it there, though I thought this was a better place for it. From what I see, instance.go does not populate problems anywhere; instead, they seem to be populated here.

* Add group replication fields to database_instance table.
* Add group replication fields to Instance, and a function to evaluate whether
  a host is member of a replication group or not.
* Modified `Instance.IsMaster()` to not consider a host a master if it is a
  secondary member of a replication group.
* Defined constants for group replication member roles and states.
* `ReadTopologyInstanceBufferable()` now attempts to populate group replication
  fields on Oracle MySQL >= 5.7.
* `DiscoverInstance()` now attempts to discover not only an instance's slave
  hosts but also the members of its replication group.
* Added methods to evaluate whether an instance is the primary or a secondary
  of a replication group.
* Replication group members with secondary role now get their cluster
  attributes populated from the group's primary. The group's primary gets them,
  in turn, from its async/semi-sync master (if any).
* Group replication attributes from instances are now read from the backend DB,
  and therefore, are now correctly returned by the API.
* Replication group secondary members now are shown in the UI as replicating
  from the group primary.
* Stop considering GTIDs coming from the group as errant GTIDs. While at it,
  fix a tiny typo.
* Disallow setting a group primary to replicate from a secondary.
* Split the DB migration that added multiple columns in a single ALTER
  statement, to make CI happy.
* Add instance icon showing whether the host is a replication group member as
  well as its group role and state.
* Replication group members that have been expelled from the group are still
  identified as part of the same cluster and shown in the topology, instead of
  appearing as separate chains.
* Group members that are not ONLINE are now exposed as having problems through
  the API.
* The web UI now shows replication group members that have been expelled by the
  group majority.
* Group members that are not ONLINE are now shown in the web UI in either red
  or orange depending on their state in the group (orange when RECOVERING, red
  when OFFLINE or in ERROR).
* Update docs to include information on partial support for GR.
* Fix unit tests that were broken due to addition of new columns to
  `database_instance` table.
* Address some MR comments and failing test.
* Revert bad change to DB migration.
* Fix SIGSEGV introduced by bad placement of `defer rows.Close()`.
* Fix integration tests that were failing due to a missing value in the
  `INSERT` statement for the non-nullable column `replication_group_members`
  of table `database_instance`, which has no default.
* Instance attribute renamed from `ReplicationGroupPrimaryKey` to
  `ReplicationGroupPrimaryInstanceKey`.
* Use `Instance.Equals()`.
* Easier to read `Instance.IsMaster()`.
* Don't blindly assume that any error coming from the query used to read group
  replication attributes for an instance means that the instance does not
  support group replication. Instead, check error codes to determine whether
  the error means that or something else, and behave differently depending on
  the answer (see the sketch after this commit list).
* Fix issue leading to random detection of group secondaries as non-members:
  while looking up the member in the records of
  `performance_schema.replication_group_members`, the required server UUID
  attribute had not yet been populated. This should also reduce the loss of
  parallelism previously present, as the new WaitGroup only waits for the
  server UUID to be determined instead of for all outstanding routines to
  finish.
* Prevent replication refactoring operations from taking place for group
  secondaries since they make no sense for them.
* Group secondaries with replicas under them are no longer shown with a
  `NoWriteableMasterStructureWarning` problem.
* Remove no longer needed `WaitGroup.wait()`.
@ejortegau ejortegau force-pushed the ejortegau/add_basic_support_for_group_replication branch from 86197c5 to 105cdc6 Compare June 12, 2020 23:47
* Re-order GR constants so the most frequently used ones come first.
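On the error-code triage mentioned above, a hedged Go sketch of the idea (the exact error numbers the PR handles may differ; 1146 is MySQL's ER_NO_SUCH_TABLE):

package main

import (
	"errors"
	"fmt"

	"github.com/go-sql-driver/mysql"
)

// isGroupReplicationUnsupportedError treats only "the GR table isn't there"
// as "this instance does not support group replication"; any other error is
// surfaced to the caller rather than silently swallowed.
func isGroupReplicationUnsupportedError(err error) bool {
	var merr *mysql.MySQLError
	if errors.As(err, &merr) {
		return merr.Number == 1146 // ER_NO_SUCH_TABLE: no performance_schema GR tables
	}
	return false
}

func main() {
	err := &mysql.MySQLError{Number: 1146, Message: "Table 'performance_schema.replication_group_members' doesn't exist"}
	fmt.Println(isGroupReplicationUnsupportedError(err)) // true
}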
@shlomi-noach

Sorry for the delays. I hope to be able to re-review and merge early next week (Sunday/Monday)

@shlomi-noach shlomi-noach left a comment

Approved. I've made a few formatting changes.

AND (
	master_instance.replication_group_name = ''
	OR master_instance.replication_group_member_role = 'PRIMARY'
)
Collaborator

👍


No support has been added (yet) for handling group member failure. If all you have is a single replication group, this is fine, because you don't need it; the group will handle all failures as long as it can secure a majority.

If, however, you have the primary of a group as a replica of another instance; or you have replicas under your group
Collaborator

👍

`
ALTER TABLE
database_instance
ADD COLUMN replication_group_primary_port smallint(5) unsigned NOT NULL DEFAULT 0 AFTER replication_group_primary_host
Collaborator

👍

ReplicationGroupMembers InstanceKeyMap

// Primary of the replication group
ReplicationGroupPrimaryInstanceKey InstanceKey
Collaborator

👍

-	Problems:                []string{},
 	Replicas:                make(map[InstanceKey]bool),
+	ReplicationGroupMembers: make(map[InstanceKey]bool),
+	Problems:                []string{},
Collaborator

👍

 	instance.ReplicationDepth = replicationDepth
 	instance.IsCoMaster = isCoMaster
 	instance.AncestryUUID = ancestryUUID
-	instance.masterExecutedGtidSet = masterExecutedGtidSet
+	instance.masterExecutedGtidSet = masterOrGroupPrimaryExecutedGtidSet
 	return nil
Collaborator

👍

// Group replication problems
if instance.ReplicationGroupName != "" && instance.ReplicationGroupMemberState != GroupReplicationMemberStateOnline {
	instance.Problems = append(instance.Problems, "group_replication_member_not_online")
}
Collaborator

👍

"replication_group_member_role",
"replication_group_members",
"replication_group_primary_host",
"replication_group_primary_port",
Collaborator

👍

args = append(args, instance.ReplicationGroupMemberRole)
args = append(args, instance.ReplicationGroupMembers.ToJSONString())
args = append(args, instance.ReplicationGroupPrimaryInstanceKey.Hostname)
args = append(args, instance.ReplicationGroupPrimaryInstanceKey.Port)
Collaborator

👍

"part of a replication group", instance.Key)
}
}
return nil
Collaborator

👍

@shlomi-noach shlomi-noach merged commit 65cc04d into openark:master Jul 26, 2020

earl86 commented Aug 13, 2020

I find that when MGR changes its primary, orchestrator does not call the PostFailoverProcesses hooks.


earl86 commented Dec 10, 2020

When orchestrator monitors a MySQL group replication (MGR) cluster and the MGR PRIMARY changes, can orchestrator update the Consul key-value info for that MGR cluster?
