NOTE: Still in alpha stage. APIs may change.



Overview

zgroup is a Zig library that manages cluster membership and detects member failures. It uses a combination of SWIM's gossip-style information dissemination and Raft's leader election algorithm (minus the log management) to track cluster changes.

On payload size

One of zgroup's main goals is to track clusters whose sizes can change dynamically over time (e.g. Kubernetes Deployments, GCP Instance Groups, AWS Autoscaling Groups, etc.) with minimal dependencies and network load. All of my previous related works so far (see spindle, hedge) depend on some external service and traditional heartbeating to achieve this. That heartbeating technique usually suffers from payload sizes that grow in proportion to the cluster size as clusters get bigger. I wanted a system that doesn't suffer from that side effect. Enter SWIM's infection-style information dissemination, which can use a constant payload size regardless of the cluster size. SWIM uses a combination of PINGs, INDIRECT-PINGs, and ACKs to detect member failures while piggybacking membership updates on those same messages (gossip protocol). Currently, zgroup only uses SWIM's direct probing protocol; it doesn't fully implement the Suspicion sub-protocol (yet).

At the moment, zgroup uses a single, 64-byte payload for all its messages, including leader election (see below).
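
To illustrate how the payload can stay constant, here is a minimal sketch of a fixed-size message that carries a probe plus a single piggybacked membership update. This is not zgroup's actual wire format; the field names, layout, and enum values are assumptions, only the constant 64-byte size comes from zgroup.

const std = @import("std");

// Message kinds used by the failure-detection sub-protocol.
const MessageKind = enum(u8) {
    ping,
    indirect_ping,
    ack,
};

// `extern struct` gives a defined, C-like layout so the struct can be sent
// over UDP as raw bytes. Every field below is hypothetical.
const Message = extern struct {
    term: u64, // election term, reused by the leader-election messages
    src_ip: u32, // sender IPv4 address
    member_ip: u32, // piggybacked gossip: one member update per message
    src_port: u16,
    member_port: u16,
    kind: u8, // a MessageKind value
    member_state: u8, // e.g. alive or faulty
    _pad: [42]u8, // fixed padding; the size never depends on cluster size
};

comptime {
    std.debug.assert(@sizeOf(Message) == 64);
}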

On leader election

I also wanted some sort of leader election capability without depending on an external lock service. At the moment, zgroup uses Raft's leader election sub-protocol (without the log management) to achieve this. Note that Raft's leader election depends on stable membership to work properly, so zgroup's leader election is best-effort only; split-brain can still happen while the cluster size is changing. Additional code guards are in place to minimize split-brain in those scenarios, but it is not completely eliminated. In my use case (and testing), gradual cluster size changes are mostly stable, while sudden changes with huge size deltas are not. For example, a big, sudden jump from three nodes (zgroup's minimum size) to, say, a hundred due to autoscaling would cause a split-brain. Once the target size is reached, however, a single leader will always be elected.

A note on Raft's random timeout range during leader election: zgroup's leader tracks ping latency averages and adjusts the timeout range accordingly to accommodate cluster size changes over time.
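
A minimal sketch of that idea, assuming a simple linear scaling rule (the function name, base values, and factors below are made up and not zgroup's actual tuning):

const std = @import("std");

// Pick a randomized election timeout from a range derived from the
// observed average ping latency. Slower/bigger clusters get a wider,
// longer window, which reduces the chance of split votes.
fn electionTimeoutMs(rand: std.Random, avg_ping_ms: u64) u64 {
    const lo = @max(150, avg_ping_ms * 10); // never go below a sane floor
    const hi = lo * 2;
    return rand.intRangeAtMost(u64, lo, hi);
}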

Join address

For a node to join an existing cluster, it needs a join address. While zgroup exposes a join() function for this, it also provides a callback mechanism that reports the group's current join address to the caller. That address can then be stored in an external store for the other nodes to use. Internally, zgroup uses the node with the highest IPv4 address in the group.
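
The callback wiring might look roughly like the sketch below. The callback name and signature are illustrative only (zgroup's actual API may differ); the point is that the caller persists whatever address it receives to a store the other nodes can read.

const std = @import("std");

// Hypothetical callback: zgroup would invoke something like this whenever
// the join address changes (i.e. the member with the highest IPv4 address).
fn onJoinAddress(addr: []const u8) void {
    // e.g. write `addr` to a key/value store, a ConfigMap, a file, etc.,
    // so that new nodes can read it and pass it to join().
    std.log.info("join address is now {s}", .{addr});
}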

Sample binary

A sample binary is provided to show a way to use the library. There are two ways to run the sample: a) specifying the join address manually, and b) using an external service to get the join address.

To run locally using a), try something like:

# Build the sample binary:
$ zig build --summary all

# Run the 1st process. The expected args look like:
#
#   ./zgroup groupname member_ip:port [join_ip:port]
#

# Run the first process (join to self).
$ ./zig-out/bin/zgroup group1 0.0.0.0:8080 0.0.0.0:8080

# Then you can run additional instances.
# Join through the 1st process/node (different terminal):
$ ./zig-out/bin/zgroup group1 0.0.0.0:8081 0.0.0.0:8080

# Join through the 2nd process/node (different terminal):
$ ./zig-out/bin/zgroup group1 0.0.0.0:8082 0.0.0.0:8081

# Join through the 1st process/node (different terminal):
$ ./zig-out/bin/zgroup group1 0.0.0.0:8083 0.0.0.0:8080

# and so on...

To run locally using b), the sample binary uses a free service, https://keyvalue.immanuel.co/, as a store for the join address.

# Build the sample binary:
$ zig build --summary all

# Generate UUID:
$ uuidgen
{output}

# Run the 1st process. The expected args look like:
#
#   ./zgroup groupname member_ip:port
#

# Run the first process:
$ ZGROUP_JOIN_PREFIX={output} ./zig-out/bin/zgroup group1 0.0.0.0:8080

# Add a second node (different terminal):
$ ZGROUP_JOIN_PREFIX={output} ./zig-out/bin/zgroup group1 0.0.0.0:8081

# Add a third node (different terminal):
$ ZGROUP_JOIN_PREFIX={output} ./zig-out/bin/zgroup group1 0.0.0.0:8082

# Add a fourth node (different terminal):
$ ZGROUP_JOIN_PREFIX={output} ./zig-out/bin/zgroup group1 0.0.0.0:8083

# and so on...

A sample Kubernetes deployment file is also provided. Before deploying though, make sure to update the ZGROUP_JOIN_PREFIX environment variable, like so:

# Generate UUID:
$ uuidgen
{output}

# Update the 'value' part with your output.
  ...
  - name: ZGROUP_JOIN_PREFIX
    value: "{output}"
  ...

# Deploy to Kubernetes:
$ kubectl create -f k8s.yaml

# You will notice some initial errors in the logs.
# Wait for a while before the K/V store is updated.

Getting the list of members

To get the current members of the group, you can try something like:

// getMembers() allocates both the list and each member string,
// so the caller owns and must free them.
const members = try fleet.getMembers(gpa.allocator());
defer members.deinit();

for (members.items, 0..) |v, i| {
    defer gpa.allocator().free(v); // free each member string after use
    log.info("member[{d}]: {s}", .{ i, v });
}

The tricky part of using zgroup is configuring the timeouts to optimize state dissemination and convergence. The current implementation has only been tested within a local network.

TODOs

- Provide callbacks for membership changes
- Provide an API to get the current leader
- Provide an interface for other processes (non-Zig users)
- Use multicast (if available) for the join address

PRs are welcome.