Jepsen testing for Couchbase.
- OpenJDK 11.
- The Leiningen build tool.
- Gnuplot and Graphviz to plot performance graphs and render anomalies.
- A Couchbase package as a deb, rpm or a build directory.
- Vagrant and a virtualisation software package (e.g. VirtualBox).
- The OpenSSH package.
- Utilities such as bc, curl, grep, zip, unzip and expect. A more comprehensive list can be found here.
Note: The provision.sh script starts suitable node virtual-machines using Vagrant and also supports cluster-run style nodes from a local build. Some workloads may be incompatible with cluster-run style nodes.
Jepsen requires a cluster of nodes on which to run Couchbase Server. If you already have nodes available, you need to manually create a nodes file containing their IP addresses. Otherwise, you can use the provision.sh script, which automatically starts suitable Vagrant VMs and creates the corresponding file.
The JVM starting and maximum heap sizes are both set to 32GB by default to prevent garbage collection pauses. This can be configured in project.clj based on the memory available on the given system. Note that a test run may crash if the memory requirements of a particular test exceed the maximum heap size.
The provision script can be used to automatically start suitable VMs.
./provision.sh --type=vagrant --vm-os=ubuntu1604 --action=create --nodes=3
A copy of Couchbase Server is required:
wget http://packages.couchbase.com/releases/6.0.0/couchbase-server-enterprise_6.0.0-ubuntu16.04_amd64.deb
You can then invoke Leiningen to run a test:
lein run test --nodes-file=./nodes --package=./couchbase-server-enterprise_6.0.0-ubuntu16.04_amd64.deb --workload=register --ssh-private-key=./resources/my.key
After the tests are complete, the Vagrant VMs can be torn down with:
./provision.sh --type=vagrant --action=destroy-all
The script run.sh can be used to automatically run multiple tests from a config file:
./provision.sh --type=vagrant --vm-os=ubuntu1604 --action=create --nodes=3
./run.sh --provisioner=vagrant --package=./couchbase-server-enterprise_6.0.0-ubuntu16.04_amd64.deb --suite=./suites/example.conf
The run.sh script can also be used with cluster-run nodes:
./run.sh --provisioner=cluster-run --package=~/dev/source/install --suite=./suites/example.conf --global=node-count:4
Populate the ./nodes file with one node IP per line.
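A malformed nodes file is easy to catch before starting a run. The following is a minimal sketch, not part of this repository: `validate_nodes_file` is a hypothetical helper name, and it assumes nodes are listed as IPv4 addresses. The check is purely syntactic.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check that every non-blank line of a nodes file looks
# like a dotted-quad IPv4 address. Assumption: nodes are given as IPv4
# addresses. Octet ranges are not checked.
validate_nodes_file() {
    local nodes_file="$1" line lineno=0 status=0
    local ipv4_re='^([0-9]{1,3}\.){3}[0-9]{1,3}$'
    while IFS= read -r line || [ -n "$line" ]; do
        lineno=$((lineno + 1))
        if [ -z "$line" ]; then
            continue
        fi
        if ! [[ "$line" =~ $ipv4_re ]]; then
            echo "line $lineno is not an IPv4 address: $line" >&2
            status=1
        fi
    done < "$nodes_file"
    return "$status"
}
```

For example, `validate_nodes_file ./nodes && echo "nodes file looks fine"`.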
Generate SSH keys using PEM as the key format.
ssh-keygen -t rsa -N "" -f ~/my.key -m PEM
Copy the public key to each node using:
ssh-copy-id -i ~/my.key user@host
Tests can then be run with the following command (replacing user with the suitable ssh username for the nodes):
./run.sh --provisioner=vmpool --package=./some-package.deb --suite=./some-suite.conf --global=username:user,ssh-private-key=~/my.key
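With more than a couple of nodes, the key-copy step can be scripted. The sketch below is an illustration rather than part of this repository: `print_copy_key_commands` is a hypothetical helper that reads the nodes file and only prints the `ssh-copy-id` commands, so they can be reviewed before anything is run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one ssh-copy-id command per node listed in a
# nodes file (one IP per line), skipping blank lines. Nothing is executed.
print_copy_key_commands() {
    local nodes_file="$1" user="$2" key="$3" node
    while IFS= read -r node || [ -n "$node" ]; do
        if [ -z "$node" ]; then
            continue
        fi
        printf 'ssh-copy-id -i %s %s@%s\n' "$key" "$user" "$node"
    done < "$nodes_file"
}
```

For example, `print_copy_key_commands ./nodes user ~/my.key` lists the commands; once they look right, the output can be piped to `sh`.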
If the host has a directory containing a build suitable for running the nodes, you can supply the install directory to Jepsen instead of a deb/rpm package.
./run.sh ... --package ~/dev/source/install
If you download and build Couchbase Server from source, cluster-run nodes are supported, providing a simple way to start running tests.
Note that Jepsen will automatically start the required cluster-run nodes. No other cluster-run nodes should be running on the machine, and any leftover data from previous cluster-run nodes will be deleted.
Currently, due to an issue (JDCP-81) with the Java DCP client used by Jepsen,
set workloads crash if the server reports version 0.0.0, as is the default for
custom builds. This can be worked around by setting a product version when
building, for example with:
EXTRA_CMAKE_OPTIONS='-DPRODUCT_VERSION="9.9.9-9999"'
lein run test --cluster-run --node-count 3 --workload register --package ~/dev/source/install
Start a server on http://localhost:8080 to show the results with a simple interface.
lein trampoline run serve
Workloads can be configured with various command line options. To display the help text run the following command:
lein run test --help
Before pushing your code for review it is important to check that the patch will pass our commit validation. To do this use:
chmod +x cv-checks.sh
./cv-checks.sh
This will run the same checks that are used on our Jenkins commit validation job. It's also important, when adding or editing a workload or nemesis, to test it yourself, as the commit validator is unable to do this.
Jepsen testing involves launching client operations to a database across a set of logically single-threaded processes and recording the operation history while introducing faults such as network partitions into the system via a special nemesis process.
The history is then analysed through a checker for correctness against a consistency model. A consistency model defines the set of legal histories that can be observed when the system conforms to the consistency model specification.
The checker detects consistency errors by identifying histories that do not conform to the consistency model when faults are introduced into the system.
For more details see Jepsen, Jepsen project and Jepsen testing at Couchbase.
A workload is a template of how to perform a test with specified nemeses and operations. A test is an instantiation of a workload with defined parameters. The workloads can be found in this directory.
For the register style workloads we model Couchbase Server as independent compare-and-swap registers using a combination of the following checkers given key-value read and write operations:
- Knossos, a linearizability checker.
- Seqchecker, a per-register sequential consistency checker.
The extended set workloads model Couchbase Server as a set in which a key is a member of the set if it has a corresponding value. The extended set checker checks the intended items are present in the set following add and delete operations.
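The idea behind the set check can be shown with a toy example. This is a sketch only, not the project's extended set checker: the history format (`add KEY`, `del KEY`, and a single final `read KEY...` line) and the name `check_set` are assumptions made for illustration, and failed or indeterminate operations are ignored.

```shell
#!/usr/bin/env bash
# Toy set check over a simplified, hypothetical history on stdin:
#   add KEY      the key was added
#   del KEY      the key was deleted
#   read K1 K2   the keys observed by one final read
# Prints "valid" if the final read matches the keys that should be present.
check_set() {
    awk '
        $1 == "add"  { present[$2] = 1 }
        $1 == "del"  { delete present[$2] }
        $1 == "read" {
            seen = 1
            for (i = 2; i <= NF; i++) { observed[$i] = 1 }
        }
        END {
            ok = seen
            # A key that should be present but was not read is lost.
            for (k in present)  { if (!(k in observed)) { ok = 0 } }
            # A key that was read but should be absent is unexpected.
            for (k in observed) { if (!(k in present))  { ok = 0 } }
            if (ok) { print "valid" } else { print "invalid"; exit 1 }
        }
    '
}
```

For example, `printf 'add a\nadd b\ndel a\nread b\n' | check_set` prints `valid`.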
The counter workloads model Couchbase Server as a counter variable. The sanity counter checker checks if the counter holds the correct summation following increment and decrement operations.
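A toy version of the counter check makes the idea concrete. This is a sketch only, not the project's sanity counter checker: the history format (`add N` lines followed by a `read V` line) and the name `check_counter` are assumptions, and failed or indeterminate operations, which a real checker has to account for, are ignored.

```shell
#!/usr/bin/env bash
# Toy counter check over a simplified, hypothetical history on stdin:
#   add N    the counter was changed by N (negative for a decrement)
#   read V   the value observed by the final read
# Prints "valid" if the final read equals the sum of all changes.
check_counter() {
    awk '
        BEGIN { expected = 0 }
        $1 == "add"  { expected += $2 }
        $1 == "read" { observed = $2 + 0; seen = 1 }
        END {
            if (seen && observed == expected) {
                print "valid"
            } else {
                print "invalid"
                exit 1
            }
        }
    '
}
```

For example, `printf 'add 5\nadd -2\nread 3\n' | check_counter` prints `valid`.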
Jepsen testing is not a proof of correctness, but instead detects errors in a subset of the possible histories produced by the implementation. Our project requires independent verification and perhaps additional tests before we can sufficiently make such a claim.
It's important to determine if a test is failing due to a flaw in test code or a consistency error as the former does not indicate a flaw in Couchbase Server. In the case of consistency errors, a bug report would be greatly appreciated.
A result of a flaw in test code looks like the following:
Errors occurred during analysis, but no anomalies found. ಠ~ಠ
Such failures often manifest as Jepsen's nemesis process crashing. They are primarily caused by errors in test code or by insufficient hardware specifications for the selected Couchbase Server configuration. Note that the Vagrant configuration is below the minimum memory and CPU requirements for running Couchbase Server.
A consistency error looks like the following:
Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
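When running many tests, the two outcomes can be told apart mechanically by looking for the messages shown above. The following is a minimal sketch, assuming the output of a run has been saved to a file; `classify_result` and the file name `run.log` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a test run from its output on stdin, using
# the two result messages quoted in this document.
classify_result() {
    local output
    output=$(cat)
    case "$output" in
        *"Analysis invalid!"*)
            echo "consistency-error" ;;
        *"Errors occurred during analysis, but no anomalies found."*)
            echo "test-code-flaw" ;;
        *)
            echo "unknown" ;;
    esac
}
```

For example, `classify_result < run.log` prints `consistency-error` for a run that should be considered for a bug report.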
Under some circumstances the linearizability checker may fail. Couchbase Server does not claim to be linearizable but instead offers sequential consistency. Because the linearizability checker is used by default, it's recommended to check whether the failure also occurs with the seqchecker by re-running the test with --use-checker sequential.
Operations must be configured with the correct durability requirements to offer sequential consistency and a suitable number of replicas to tolerate a certain number of failures.
The --durability L0:L1:L2:L3 option configures the probability of operations at each level.
Operations can have the following durability levels:
Level | Description |
---|---|
L0 | No Synchronous Replication |
L1 | Replicate to Majority |
L2 | Replicate to Majority and Persist to Active |
L3 | Persist to Majority |
For instance, supplying --durability 0:100:0:0 would generate all operations with L1 (Replicate to Majority). Similarly, --durability 0:0:0:100 would generate all operations with L3 (Persist to Majority). Not supplying this option produces all operations with L0 (No Synchronous Replication). Please ensure that the durability level is configured to a minimum of L1 to enable synchronous replication.
See documentation
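To double-check a --durability value before a long run, the string can be expanded into the levels from the table above. This is a sketch, not part of the test harness: `describe_durability` is a hypothetical helper, and it assumes the four values are whole-number percentages that sum to 100, as in the examples above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: explain a --durability value of the form L0:L1:L2:L3.
# Assumption: each value is a whole-number percentage and they sum to 100.
describe_durability() {
    local spec="$1" total=0 i
    local -a parts
    local -a names=(
        "No Synchronous Replication"
        "Replicate to Majority"
        "Replicate to Majority and Persist to Active"
        "Persist to Majority"
    )
    IFS=':' read -r -a parts <<< "$spec"
    if [ "${#parts[@]}" -ne 4 ]; then
        echo "expected four colon-separated values, got: $spec" >&2
        return 1
    fi
    # Validate every field and accumulate the total before printing anything.
    for i in 0 1 2 3; do
        case "${parts[$i]}" in
            ''|*[!0-9]*)
                echo "not a non-negative integer: ${parts[$i]}" >&2
                return 1 ;;
        esac
        total=$((total + 10#${parts[$i]}))
    done
    if [ "$total" -ne 100 ]; then
        echo "values sum to $total, expected 100" >&2
        return 1
    fi
    for i in 0 1 2 3; do
        echo "L$i (${names[$i]}): ${parts[$i]}%"
    done
}
```

For example, `describe_durability 0:100:0:0` reports 100% for L1 (Replicate to Majority) and 0% for the other levels.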
Replicas can be configured using --replicas REPLICAS, with a supported maximum of 2 for synchronous replication.
Under synchronous replication, the following number of failures can be tolerated given a specific number of replicas:
Replicas | Majority | Failures |
---|---|---|
0 | 1 | 0 |
1 | 2 | 1 |
2 | 2 | 1 |
An important distinction is that, in the case of a single failure, there will be no data loss with either replicas=1 or replicas=2, but only the latter will remain write available.
See documentation
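The figures in the table above follow from simple arithmetic. The sketch below reproduces them under stated assumptions: a bucket with R replicas holds R + 1 copies of each document, a majority is more than half of those copies, and the function names are hypothetical. It also computes how many failures leave a majority reachable, which is the write-availability distinction described above.

```shell
#!/usr/bin/env bash
# Assumption: R replicas means R + 1 copies; a majority is more than half.
majority() {
    local replicas="$1"
    echo $(( (replicas + 1) / 2 + 1 ))
}

# A durable write is on `majority` copies, so it survives the loss of all
# but one of them (the "Failures" column of the table).
failures_without_data_loss() {
    echo $(( $(majority "$1") - 1 ))
}

# Failures after which a majority of copies is still reachable, so
# synchronous writes can still complete.
failures_while_write_available() {
    local replicas="$1"
    echo $(( replicas + 1 - $(majority "$replicas") ))
}

print_replica_table() {
    local r
    for r in 0 1 2; do
        printf 'replicas=%d majority=%d no-data-loss=%d write-available=%d\n' \
            "$r" "$(majority "$r")" \
            "$(failures_without_data_loss "$r")" \
            "$(failures_while_write_available "$r")"
    done
}
```

Calling `print_replica_table` shows that replicas=1 and replicas=2 both survive one failure without data loss, but only replicas=2 keeps a majority reachable afterwards.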
In some cases the linearizability checker, Knossos, may take a long time to complete and may even run out of memory. This is expected behaviour and is dependent on the generated history as the search is exponential in terms of concurrency and linear in terms of history length.
See documentation