
Add basic support for group replication #1180

Conversation


@ejortegau ejortegau commented May 31, 2020

Related issue: #1179

Description

This PR adds some initial support for group replication in orchestrator.

ToDo:

  • contributed code is using same conventions as original code
  • code is formatted via gofmt (please avoid goimports)
  • code is built via ./build.sh
  • code is tested via go test ./go/...
  • relevant documentation has been updated.
  • upgrade from previous version (master branch) is successful

Assume a 3-member group, plus an async slave replicating from one of them. Without this PR, Orchestrator shows all three members as separate clusters, since it does not understand group replication and there is no traditional replication configured on any group member. The async slave is shown as a slave in the cluster whose master is the group member it is set up to replicate from.

Orchestrator also shows the async slave as having an errant GTID problem.

With this PR, basic support for single-primary replication groups is added. A replication group that MySQL Shell reports like this:

{
    "clusterName": "test_cluster", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "127.0.0.1:5306", 
        "ssl": "REQUIRED", 
        "status": "OK", 
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
        "topology": {
            "127.0.0.1:4306": {
                "address": "127.0.0.1:4306", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLag": null, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.20"
            }, 
            "127.0.0.1:5306": {
                "address": "127.0.0.1:5306", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "replicationLag": null, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.20"
            }, 
            "127.0.0.1:6306": {
                "address": "127.0.0.1:6306", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLag": null, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.20"
            }
        }, 
        "topologyMode": "Single-Primary"
    }, 
    "groupInformationSourceMember": "127.0.0.1:5306"
}

is now understood by Orchestrator as a single cluster, with the two group secondaries shown as replicating from the primary. In addition, the async replica continues to be shown normally:

(screenshot)

Notice also that the errant GTID problem is not shown.

The UI also shows group membership information, including:

  • An icon showing whether the instance is a group member or not. A different text style is used for
    the primary and secondary instances so that they can be easily identified.
  • Hovering over the icon shows the instance state and role in the replication group.
  • Group members that are not online are flagged as having problems, as shown below:

(screenshot)

On top of this, certain replication operations are prevented from taking place: relocating group secondaries to replicate from outside the group (since they actually replicate from the group primary), and setting up a group primary to replicate from one of its own secondaries.
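For illustration, here is a minimal Go sketch of such a guard, using stand-in types and hypothetical names rather than the PR's exact code:

package main

import "fmt"

// Minimal stand-ins for orchestrator's inst types; illustrative only.
type InstanceKey struct {
	Hostname string
	Port     int
}

type Instance struct {
	Key                                InstanceKey
	ReplicationGroupName               string
	ReplicationGroupMemberRole         string // "PRIMARY" or "SECONDARY"
	ReplicationGroupPrimaryInstanceKey InstanceKey
}

func (i *Instance) IsReplicationGroupSecondary() bool {
	return i.ReplicationGroupName != "" && i.ReplicationGroupMemberRole == "SECONDARY"
}

func (i *Instance) IsReplicationGroupPrimary() bool {
	return i.ReplicationGroupName != "" && i.ReplicationGroupMemberRole == "PRIMARY"
}

// relocateGuard mirrors the two checks described above: a group secondary may
// only "replicate" from its group primary, and a group primary must not be
// made a replica of a member of its own group.
func relocateGuard(replica, candidateMaster *Instance) error {
	if replica.IsReplicationGroupSecondary() && candidateMaster.Key != replica.ReplicationGroupPrimaryInstanceKey {
		return fmt.Errorf("%+v is a group secondary; it implicitly replicates from the group primary %+v",
			replica.Key, replica.ReplicationGroupPrimaryInstanceKey)
	}
	if replica.IsReplicationGroupPrimary() && candidateMaster.ReplicationGroupName == replica.ReplicationGroupName {
		return fmt.Errorf("cannot set group primary %+v to replicate from a member of its own group", replica.Key)
	}
	return nil
}

func main() {
	primary := &Instance{Key: InstanceKey{"127.0.0.1", 5306}, ReplicationGroupName: "test_cluster", ReplicationGroupMemberRole: "PRIMARY"}
	secondary := &Instance{
		Key:                                InstanceKey{"127.0.0.1", 4306},
		ReplicationGroupName:               "test_cluster",
		ReplicationGroupMemberRole:         "SECONDARY",
		ReplicationGroupPrimaryInstanceKey: primary.Key,
	}
	outsider := &Instance{Key: InstanceKey{"127.0.0.1", 7306}}
	fmt.Println(relocateGuard(secondary, outsider)) // refused: secondary follows its group primary
	fmt.Println(relocateGuard(primary, secondary))  // refused: primary cannot follow its own group
}

The point is simply that such relocations are refused up front, since group membership, not orchestrator, dictates where secondaries replicate from.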

@ejortegau
Contributor Author

cc @dveeden , @sjmudd , @luisyonaldo

* Update docs to include information on partial support for GR.

dveeden commented Jun 2, 2020

I think group replication is an important feature for Orchestrator, as it looks to be at the core of the long-term strategy of Oracle MySQL. Group Replication is a core component of InnoDB Cluster.

Note that some info about group replication is available in the same way as for normal replication. There is nothing in SHOW SLAVE STATUS, but performance_schema.replication_connection_status etc. show a group_replication_applier and a group_replication_recovery channel. Other newer/advanced replication features, like multi-source replication and parallel replication, use the same performance_schema tables, so I think these should be used where possible. For GR there are additional tables like replication_group_members.
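As a hedged illustration (not part of this PR), here is a minimal Go sketch in the style of orchestrator's data-access code, reading group membership from performance_schema; the DSN is a placeholder:

package main

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Placeholder DSN; member_role exists as of MySQL 8.0 (on 5.7 the role
	// has to be derived from the group_replication_primary_member status
	// variable instead).
	db, err := sql.Open("mysql", "orchestrator:secret@tcp(127.0.0.1:5306)/")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	rows, err := db.Query(`
		SELECT member_id, member_host, member_port, member_state, member_role
		FROM performance_schema.replication_group_members`)
	if err != nil {
		panic(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id, host, state, role string
		var port int
		if err := rows.Scan(&id, &host, &port, &state, &role); err != nil {
			panic(err)
		}
		fmt.Printf("%s:%d state=%s role=%s uuid=%s\n", host, port, state, role, id)
	}
}

In the PR itself, results like these feed the new replication_group_* columns of the database_instance backend table.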

@ejortegau, Could you squash your commits? Or are there multiple on purpose?

@ejortegau
Contributor Author

@ejortegau, Could you squash your commits? Or are there multiple on purpose?

There are multiple on purpose; I have been adding more and more support as I worked on it. I figure they can be squashed once the PR is accepted.


dveeden commented Jun 2, 2020

Would this work ok with a group in multi primary mode?

https://dev.mysql.com/doc/refman/8.0/en/group-replication-multi-primary-mode.html

@ejortegau ejortegau left a comment

Would this work ok with a group in multi primary mode?

https://dev.mysql.com/doc/refman/8.0/en/group-replication-multi-primary-mode.html

It will not. I had missed pushing an update to the FAQ indicating this. For now, this only works with single-primary mode.
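For reference, whether a group runs in single-primary mode can be checked on the member itself; a minimal Go sketch, assuming MySQL 8.0 and a placeholder DSN:

package main

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "orchestrator:secret@tcp(127.0.0.1:5306)/") // placeholder DSN
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// ON in single-primary groups, OFF in multi-primary groups.
	var singlePrimary bool
	if err := db.QueryRow("SELECT @@global.group_replication_single_primary_mode").Scan(&singlePrimary); err != nil {
		panic(err)
	}
	fmt.Println("single-primary mode:", singlePrimary)
}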

@ejortegau ejortegau changed the title from "WIP: Add basic support for group replication" to "Add basic support for group replication" on Jun 3, 2020

dveeden commented Jun 5, 2020

For groups we should draw a line around them and maybe show the name of the group.
This is true for Group Replication, NDB and maybe PXC.

With D3.js this can be done with d3.geom.hull. Example in g1 in the image.

The d3.layout.tree() that is used doesn't allow more complex topologies like groups, multi-source, etc. This could be fixed by moving to d3.layout.force(). Example in g2 in the image.

I also notice that D3.js v3 is used, which is two versions behind the latest major release (D3.js v5).

Screenshot from 2020-06-05 15-21-07

The code for the example is in:
https://gist.github.com/dveeden/8a5cc4c1230f0442bb8b74c8e0eecad7

Some of this could be done outside of this PR, but we should make sure that the nodelist is populated with group info to allow the frontend part to be done later on.

@shlomi-noach shlomi-noach left a comment

Only an initial, partial review. I'll need to revisit.

@shlomi-noach shlomi-noach left a comment

Some further comments. Review is not yet complete.

Thank you!

Comment on lines 1254 to 1273
if instance.ReplicationGroupName != "" && instance.ReplicationGroupMemberState != "ONLINE" {
	instance.Problems = append(instance.Problems, "group_replication_member_not_online")
}
Collaborator

Can we "upgrade" this to be analyzed in instance.go?

Contributor Author

I can move it there, though I thought this was a better place for it. From what I see, instance.go does not populate problems anywhere; instead, they seem to be populated here.

* Add group replication fields to database_instance table.
* Add group replication fields to Instance, and a function to evaluate whether
  a host is member of a replication group or not.
* Modified `Instance.IsMaster()` to not consider a host a master if it is a
  secondary member of a replication group.
* Defined constants for group replication member roles and states.
* `ReadTopologyInstanceBufferable()` now attempts to populate group replication
  fields on Oracle MySQL >= 5.7.
* `DiscoverInstance()` now attempts to discover not only an instance's slave
  hosts but also the members of its replication group.
* Added methods to evaluate whether an instance is the primary or a secondary
  of a replication group.
* Replication group members with secondary role now get their cluster
  attributes populated from the group's primary. The group's primary gets them,
  in turn, from its async/semi-sync master (if any).
* Group replication attributes from instances are now read from the backend DB,
  and therefore, are now correctly returned by the API.
* Replication group secondary members now are shown in the UI as replicating
  from the group primary.
* Stop considering GTIDs coming from the group as errant GTIDs. While at it,
  fix a tiny typo.
* Disallow setting a group primary to replicate from a secondary.
* Split the DB migration that added multiple columns in a single ALTER
  statement, to make CI happy.
* Add instance icon showing whether the host is a replication group member as
  well as its group role and state.
* Replication group members that have been expelled from the group are still
  identified as part of the same cluster and shown in the topology, instead of
  appearing as separate chains.
* Group members that are not ONLINE are now exposed as having problems through
  the API.
* The web UI now shows replication group members that have been expelled by the
  group majority.
* Group members that are not ONLINE are now shown in the web UI in either red
  or orange depending on their state in the group (orange when RECOVERING, red
  when OFFLINE or in ERROR).
* Update docs to include information on partial support for GR.
* Fix unit tests that were broken due to addition of new columns to
  `database_instance` table.
* Address some MR comments and failing test.
* Revert bad change to DB migration.
* Fix SIGSEGV introduced by bad placement of `defer rows.Close()`.
* Fix integration tests that were failing due to a missing value in the
  `INSERT` statement for the non-nullable column `replication_group_members`
  of table `database_instance`, which has no default.
* Instance attribute renamed from `ReplicationGroupPrimaryKey` to
  `ReplicationGroupPrimaryInstanceKey`.
* Use `Instance.Equals()`.
* Easier to read `Instance.IsMaster()`.
* Don't blindly assume that any error coming from the query used to read group
  replication attributes for an instance means that the instance does not
  support group replication. Instead, check error codes to determine whether
  the error means that or something else, and behave differently depending on
  the answer (see the sketch after this commit list).
* Fix issue leading to random detection of group secondaries as non-members:
  while looking up the member in the records of
  `performance_schema.replication_group_members`, the required server UUID
  attribute had not yet been populated. This should also reduce the loss of
  parallelism previously present, as the new WaitGroup only waits for the
  server UUID to be determined instead of for all outstanding routines to
  finish.
* Prevent replication refactoring operations from taking place for group
  secondaries since they make no sense for them.
* Group secondaries with replicas under them are no longer shown with a
  `NoWriteableMasterStructureWarning` problem.
* Remove no longer needed `WaitGroup.wait()`.
@ejortegau ejortegau force-pushed the ejortegau/add_basic_support_for_group_replication branch from 86197c5 to 105cdc6 Compare June 12, 2020 23:47
* Re-order GR constants so the most frequently used ones come first.
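On the error-code triage mentioned above, a hedged Go sketch of the idea (the exact error numbers the PR handles may differ; 1146 is MySQL's ER_NO_SUCH_TABLE):

package main

import (
	"errors"
	"fmt"

	"github.com/go-sql-driver/mysql"
)

// isGroupReplicationUnsupportedError treats only "the GR table isn't there"
// as "this instance does not support group replication"; any other error is
// surfaced to the caller rather than silently swallowed.
func isGroupReplicationUnsupportedError(err error) bool {
	var merr *mysql.MySQLError
	if errors.As(err, &merr) {
		return merr.Number == 1146 // ER_NO_SUCH_TABLE: no performance_schema GR tables
	}
	return false
}

func main() {
	err := &mysql.MySQLError{Number: 1146, Message: "Table 'performance_schema.replication_group_members' doesn't exist"}
	fmt.Println(isGroupReplicationUnsupportedError(err)) // true
}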
@shlomi-noach

Sorry for the delays. I hope to be able to re-review and merge early next week (Sunday/Monday)

@shlomi-noach shlomi-noach left a comment

Approved. I've made a few formatting changes.

AND (
	master_instance.replication_group_name = ''
	OR master_instance.replication_group_member_role = 'PRIMARY'
)
Collaborator

👍


No support has been added (yet) for handling group member failure. If all you have is a single replication group, this is fine, because you don't need it; the group will handle all failures as long as it can secure a majority.

If, however, you have the primary of a group as a replica of another instance; or you have replicas under your group
Collaborator

👍

`
ALTER TABLE
database_instance
ADD COLUMN replication_group_primary_port smallint(5) unsigned NOT NULL DEFAULT 0 AFTER replication_group_primary_host
Collaborator

👍

ReplicationGroupMembers InstanceKeyMap

// Primary of the replication group
ReplicationGroupPrimaryInstanceKey InstanceKey
Collaborator

👍

-	Problems:                []string{},
 	Replicas:                make(map[InstanceKey]bool),
+	ReplicationGroupMembers: make(map[InstanceKey]bool),
+	Problems:                []string{},
Collaborator

👍

 	instance.ReplicationDepth = replicationDepth
 	instance.IsCoMaster = isCoMaster
 	instance.AncestryUUID = ancestryUUID
-	instance.masterExecutedGtidSet = masterExecutedGtidSet
+	instance.masterExecutedGtidSet = masterOrGroupPrimaryExecutedGtidSet
 	return nil
Collaborator

👍

// Group replication problems
if instance.ReplicationGroupName != "" && instance.ReplicationGroupMemberState != GroupReplicationMemberStateOnline {
	instance.Problems = append(instance.Problems, "group_replication_member_not_online")
}
Collaborator

👍

"replication_group_member_role",
"replication_group_members",
"replication_group_primary_host",
"replication_group_primary_port",
Collaborator

👍

args = append(args, instance.ReplicationGroupMemberRole)
args = append(args, instance.ReplicationGroupMembers.ToJSONString())
args = append(args, instance.ReplicationGroupPrimaryInstanceKey.Hostname)
args = append(args, instance.ReplicationGroupPrimaryInstanceKey.Port)
Collaborator

👍

"part of a replication group", instance.Key)
}
}
return nil
Collaborator

👍

@shlomi-noach shlomi-noach merged commit 65cc04d into openark:master Jul 26, 2020

earl86 commented Aug 13, 2020

I find that when MGR changes its primary, orchestrator does not call the PostFailoverProcesses hooks.


earl86 commented Dec 10, 2020

When orchestrator monitors a MySQL group replication (MGR) cluster and the MGR PRIMARY changes, can orchestrator update the Consul key-value info for that MGR cluster?
