datastaxdevs/workshop-nosqlbench
Benchmark your Astra DB with NoSQLBench

License: Apache 2.0

Time: 2 hours. Difficulty: Intermediate. Start Building!

The goal of this workshop is to get you familiar with the powerful and versatile tool NoSQLBench. With that, you can perform industry-grade, robust benchmarks aimed at several (distributed) target systems, especially NoSQL databases.

Today you'll be benchmarking Astra DB, a database-as-a-service built on top of Apache Cassandra. Along the way, you will learn the basics of NoSQLBench.

In this repository you will find all material and references you need:

Table of Contents

  1. Before you start
  2. Create your Astra DB instance
  3. Launch Gitpod and set up NoSQLBench
  4. Run benchmarks
  5. Workloads
  6. Homework assignment

Before you start

Heads up: these instructions are available in two forms: a short and to-the-point one (this one), with just the useful commands if you are watching us live; and a longer one, with lots of explanations and details, designed for those who follow this workshop at their own pace. Please choose what best suits you!

FAQ

  • What are the prerequisites?

This workshop is aimed at data architects, solution architects, developers, or anybody who wants to get serious about measuring the performance of their data-intensive system. You should know what a (distributed) database is, and have a general understanding of the challenges of communicating over a network.

  • Do I need to install a database or anything on my machine?

No, no need to install anything. You will do everything in the browser. (That being said, the knowledge you gain today will probably be best put to use once you install NoSQLBench on some client machine to run tests.)

You can also choose to work on your machine instead of using Gitpod: there's no problem with that, just a few setup and operational changes to keep in mind. We will not provide live support in this case, though, assuming you know what you are doing.

  • Is there anything to pay?

No. All materials, services and software used in this workshop are free.

  • Do you cover NoSQLBench 4 or 5?

Ah, I see you are a connoisseur. We focus on the newly released NoSQLBench 5, but we provide tips and remarks for those still using nb4.

Homework

To complete the workshop and get a verified "NoSQLBench" badge, follow these instructions:

  1. Do the hands-on practice, either during the workshop or by following the instructions in this README;
  2. (optional) Complete the "Lab" assignment as detailed here;
  3. Fill in the submission form here. Answer the theory questions and (optionally) provide a screenshot of the completed "Lab" part;
  4. Give us a few days to process your submission: you should receive your well-earned badge in your email inbox!

Create your Astra DB instance

First you must create a database: an instance of Astra DB, which you will then benchmark with NoSQLBench.

Don't worry, you will create it within the "Free Tier", which offers quite a generous free allowance in terms of monthly I/O (about 40M operations per month) and storage (80 GB).

You need to:

  • create an Astra DB instance as explained here, with database name = workshops and keyspace name = nbkeyspace;
  • generate and retrieve a DB Token as explained here (this happens automatically when creating the database). Important: use the role "DB Administrator" if manually creating the token;
  • generate and download a Secure Connect Bundle as explained here.

Moreover, keep the Astra DB dashboard open: it will be useful later. In particular, locate the Health tab and the CQL Console.

Launch Gitpod and set up NoSQLBench

Ctrl-click on the Gitpod button below to spawn your very own environment + IDE:

Open in Gitpod

In a few minutes, a full IDE will be ready in your browser, with a file explorer on the left, a file editor on the top, and a console (bash) below it.

Install NoSQLBench

To download NoSQLBench, type or paste this command in your Gitpod console:

curl -L -O https://github.com/nosqlbench/nosqlbench/releases/latest/download/nb5

then make it executable and move it to a better place:

chmod +x nb5
sudo mv nb5 /usr/local/bin/

Ok, now check that the program starts: invoking

nb5 --version

should output the program version (something like 5.17.3 or higher).

Version used

This workshop is built for the newly-released NoSQLBench 5.

Upload the Secure Connect Bundle to Gitpod

Locate, with the file explorer on your computer, the bundle file that you downloaded earlier (it should be called secure-connect-workshops.zip) and simply drag-and-drop it to the file navigator panel ("Explorer") on the left of the Gitpod view.

Show me

Once you drop it you will see it listed in the file explorer itself. As a check, you can issue the command

ls /workspace/workshop-nosqlbench/secure*zip -lh

so that you get the absolute path to your bundle file (and also verify that it is the correct size, about 12-13 KB).

Show me

Configure the Astra DB parameters

Copy the provided template file to a new one and open it in the Gitpod file editor:

cp .env.sample .env
gp open .env
# (you can also simply locate the file
#  in the Explorer and click on it)

Insert the "Client ID" and "Client Secret" of the DB Token you created earlier and, if necessary, adjust the other variables.

Show me what the .env file would look like

Now, source this file to make the definitions therein available to this shell:

. .env

To check that the file has been sourced, you can try with:

echo ${ASTRA_DB_KEYSPACE_NAME}

and make sure the output is not an empty line.

(Note that you will have to source the file in any new shell you plan to use).

Run benchmarks

Everything is set to start running the tool.

A short dry run

Try launching this very short "dry-run benchmark", which, instead of actually reaching the database, simply prints a series of CQL statements to the console (as specified by the driver=stdout parameter):

nb5 cql-keyvalue2 astra                 \
    driver=stdout                       \
    rampup-cycles=10                    \
    main-cycles=10                      \
    keyspace=${ASTRA_DB_KEYSPACE_NAME}

You will see 21 (fully-formed, valid CQL) statements being printed: one CREATE TABLE, then ten INSERTs, and then ten more statements mixing SELECTs with further INSERTs.

Note: we will use workload cql-keyvalue2 throughout. This is functionally identical to the cql-keyvalue workload but is expressed in the newer syntax for yaml workloads, which comes in handy when we later dissect its content. If you are working with NoSQLBench 4, remember to drop the trailing 2 from the workload name in the following!

Now re-launch the above dry run and look for differences in the output:

nb5 cql-keyvalue2 astra                 \
    driver=stdout                       \
    rampup-cycles=10                    \
    main-cycles=10                      \
    keyspace=${ASTRA_DB_KEYSPACE_NAME}

Is the output identical to the previous run, down to the actual "random" values?
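It should be: NoSQLBench bindings are deterministic, computing each generated value as a pure function of the cycle number, so replaying the same cycles reproduces the same data. A toy Python sketch of the idea (illustrative only, not NoSQLBench's actual binding functions):

```python
import hashlib

def binding(cycle: int) -> str:
    # Derive a stable "random-looking" value purely from the cycle number:
    # same cycle always yields the same value, as with NoSQLBench bindings.
    return hashlib.md5(str(cycle).encode()).hexdigest()[:8]

run_1 = [binding(c) for c in range(10)]
run_2 = [binding(c) for c in range(10)]
assert run_1 == run_2  # two "runs" over the same cycles emit identical data
```

This determinism is what makes the rampup phase meaningful: the main phase can read back exactly the keys that rampup wrote.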

You can also peek at the logs directory now: it is created automatically and populated with some information from the benchmark at each execution of nb5.

Benchmark your Astra DB

It is now time to start hitting the database!

This time you will run with driver=cql to actually reach the database: for that to work, you will provide all connection parameters set up earlier.

The next run will ask NoSQLBench to perform a substantial amount of operations, in order to collect enough statistical support for the results.

Here is the full command to launch:

nb5 cql-keyvalue2                                                         \
    astra                                                                 \
    username=${ASTRA_DB_CLIENT_ID}                                        \
    password=${ASTRA_DB_CLIENT_SECRET}                                    \
    secureconnectbundle=${ASTRA_DB_BUNDLE_PATH}                           \
    keyspace=${ASTRA_DB_KEYSPACE_NAME}                                    \
    cyclerate=50                                                          \
    driver=cql                                                            \
    main-cycles=9000                                                      \
    rampup-cycles=9000                                                    \
    errors='OverloadedException:warn'                                     \
    --progress console:5s                                                 \
    --log-histograms 'histogram_hdr_data.log:.*.main.result.*:20s'        \
    --log-histostats 'hdrstats.log:.*.main.result.*:20s'
Show me the command breakdown

Note that some of the parameters (e.g. keyspace) are workload-specific.

| command | meaning |
| --- | --- |
| cql-keyvalue2 | workload |
| astra | scenario |
| username | authentication |
| password | authentication |
| secureconnectbundle | Astra DB connection parameters |
| keyspace | target keyspace |
| cyclerate | rate limiting (cycles per second) |
| driver=cql | driver to use (CQL, for Astra DB / Cassandra) |
| main-cycles | how many operations in the "main" phase |
| rampup-cycles | how many operations in the "rampup" phase |
| errors | behaviour if errors occur during benchmarking |
| --progress console | frequency of console prints |
| --log-histograms | write data to HDR file (see later) |
| --log-histostats | write some basic stats to a file (see later) |

This way of invoking nb5, the "named scenario" way, is not the only one: it is also possible to gain finer-grained control over which activities run by using the full-fledged CLI scripting syntax.

Note: the syntax of the errors parameter has been improved in NoSQLBench 5 to allow for a finer control (with multiple directives, such as errors='NoNodeAvailable.*:ignore;InvalidQueryException.*:counter;OverloadedException:warn'). On version 4 you should revert to a simpler parameter, such as errors=count, instead of the above.

The benchmark should last about ten minutes, with the progress being printed on the console as it proceeds.
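The duration is easy to estimate: with cyclerate=50 as the bottleneck, each 9000-cycle phase takes 9000 / 50 = 180 seconds. A quick back-of-the-envelope check in Python (schema creation, connection setup and other overhead account for the remaining minutes):

```python
cyclerate = 50        # cycles per second (the rate cap)
rampup_cycles = 9000
main_cycles = 9000

# Each phase runs its cycles at the capped rate.
rampup_s = rampup_cycles / cyclerate   # 180 seconds
main_s = main_cycles / cyclerate       # 180 seconds
total_min = (rampup_s + main_s) / 60

print(total_min)  # 6.0 minutes of rate-capped execution
```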

While this runs, have a look around.

Database contents

Now it's time to find out what is actually being written to the database.

Choose your database in the Astra main dashboard and click on it; next, go to the "CQL Console" tab in the main panel. In a few seconds the console will open in your browser, already connected to your database and waiting for your input.

Show me how to get to the CQL Console in Astra

Start by telling the console that you will be using the nbkeyspace keyspace:

USE nbkeyspace;

Check what tables have been created by NoSQLBench in this keyspace:

DESC TABLES;

You should see table keyvalue listed as the sole output. Look at a few lines from this table:

SELECT * FROM keyvalue LIMIT 20;
Show me what the output looks like

Ok, mystery solved. It looks like the table contains simple key-value pairs, with two columns seemingly of numeric type. Check with:

DESC TABLE keyvalue;

Oh, looks like both the key and the value columns are of type TEXT: good for adapting this ready-made benchmark to other key/value stores.

Show me what the output looks like

Database health

Locate your database in the Astra main dashboard and click on it; next, go to the "Health" tab in the main panel. You will see what essentially is a Grafana dashboard, with a handful of plots being displayed within the tab - all related to how the database is performing in terms of reads and writes.

Show me the Database Health tab in Astra UI

Check the operations per second from the "Requests Combined" plot; then have a look at the "Write Latency" and "Read Latency" plots and take note of some of the percentiles shown there.

Show me "sample values" one could read from the graph

Below is a real-life example of the values that could result from a cql-keyvalue2 benchmark session in the main phase:

| Percentile | Write Latency | Read Latency |
| --- | --- | --- |
| P50 | 709 µs | 935 µs |
| P75 | 831 µs | 1.31 ms |
| P90 | 904 µs | 1.53 ms |
| P95 | 1.04 ms | 1.77 ms |
| P99 | 2.45 ms | 15.6 ms |
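A percentile Pn is simply the latency below which n% of the observed operations completed, which is why P99 (the "tail") is so much larger than the median. A minimal Python sketch of the nearest-rank method, with made-up sample values in milliseconds:

```python
# Hypothetical latency samples in milliseconds (not real benchmark data)
samples = [0.7, 0.8, 0.9, 1.0, 1.1, 1.3, 1.6, 2.1, 4.8, 15.2]

def percentile(data, p):
    # Nearest-rank: the smallest value with at least p% of samples <= it
    data = sorted(data)
    k = max(0, -(-len(data) * p // 100) - 1)  # ceil(n * p / 100) - 1
    return data[int(k)]

print(percentile(samples, 50))  # 1.1 -- the median
print(percentile(samples, 99))  # 15.2 -- dominated by the slowest sample
```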

Final summary in "logs/"

When the benchmark has finished, open the latest *.summary file and look for cqlkeyvalue2_astra_main.result-success.

Under that metric title, you will see something similar to:

cqlkeyvalue2_astra_main.result-success
             count = 15000
         mean rate = 50.00 calls/second
     1-minute rate = 49.94 calls/second
     5-minute rate = 50.29 calls/second
    15-minute rate = 50.57 calls/second
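The mean rate is just the total count divided by elapsed time (the 1/5/15-minute rates are, typically for this kind of metrics library, exponentially weighted moving averages). Checking the numbers above:

```python
count = 15000       # operations counted for this metric
mean_rate = 50.0    # calls/second, as reported in the summary

elapsed_s = count / mean_rate
print(elapsed_s)    # 300.0 seconds, i.e. a five-minute measured window
```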

Additional "histostats" datafile

Use this script to generate a graph of the data collected as "histostats":

./hdr_tool/histostats_quick_plotter.py \
    hdrstats.log \
    -m cqlkeyvalue2_astra_main.result-success

and then open, in the Gitpod editor, the hdrstats.png image just created.

Show me the generated "histostats" plot

The version of the plotter script included in this repo is for educational purposes only: for general use, please head to the official release page.

The timings will be larger than those from the Astra health tab: indeed, these are "as seen on the client side" and include more network hops.

HDR extensive histogram data

Use this script to generate plots from the detailed "HDR histogram data" generated during the benchmark:

./hdr_tool/hdr_tool.py \
    histogram_hdr_data.log \
    -b -c -s \
    -p SampleData \
    -m cqlkeyvalue2_astra_main.result-success
Show me the plots generated by the HDR file

The version of the plotter script included in this repo is for educational purposes only: for general use, please head to the official release page.

Again, the timings are larger than those found on the Astra health tab (i.e. on server-side): these measurements are reported "as seen by the testing client".
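HDR ("High Dynamic Range") histograms record the full latency distribution compactly: instead of keeping every sample, they count occurrences in value buckets and estimate percentiles from the counts. A toy Python version of the bucketing idea (far cruder than the real HdrHistogram format, which uses variable precision):

```python
from collections import Counter

class TinyHistogram:
    """Count latencies in fixed-width buckets; estimate percentiles from counts."""
    def __init__(self, bucket_width_us=100):
        self.width = bucket_width_us
        self.buckets = Counter()
        self.total = 0

    def record(self, latency_us):
        self.buckets[latency_us // self.width] += 1
        self.total += 1

    def percentile(self, p):
        # Walk buckets in ascending order until p% of samples are covered.
        needed = self.total * p / 100
        seen = 0
        for b in sorted(self.buckets):
            seen += self.buckets[b]
            if seen >= needed:
                return (b + 1) * self.width  # upper edge of the bucket
        return None

h = TinyHistogram()
for us in [700, 750, 800, 900, 1500, 2400, 15600]:  # hypothetical samples
    h.record(us)
print(h.percentile(50))  # 1000 -- the bucket containing the median
```

Memory stays proportional to the number of occupied buckets rather than to the number of samples, which is what makes logging a histogram every 20 seconds cheap.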

Metrics, metrics, metrics

Launch a new benchmark, this time having NoSQLBench start a dockerized Grafana/Prometheus stack for metrics (it will take a few more seconds to start):

nb5 cql-keyvalue2                                                         \
    astra                                                                 \
    username=${ASTRA_DB_CLIENT_ID}                                        \
    password=${ASTRA_DB_CLIENT_SECRET}                                    \
    secureconnectbundle=${ASTRA_DB_BUNDLE_PATH}                           \
    keyspace=${ASTRA_DB_KEYSPACE_NAME}                                    \
    cyclerate=50                                                          \
    rampup-cycles=15000                                                   \
    main-cycles=15000                                                     \
    errors='OverloadedException:warn'                                     \
    --progress console:5s                                                 \
    --docker-metrics
Show me the run with Docker metrics

Grafana dashboard

Reach the Grafana container in a new tab, with a URL that has 3000- prefixed to your Gitpod URL (e.g. https://3000-datastaxdevs-workshopnos-[...].gitpod.io).

The default credentials to log in to Grafana are ... admin/admin. Once you're in, don't bother to reset your password (click "Skip"). You'll get to the Grafana landing page. Find the "Dashboards" icon in the leftmost menu bar and pick the "Manage" menu item: finally, click on the "NB4 Dashboard" item you should see listed there. Congratulations, you are seeing the data coming from NoSQLBench.

Show me how to get to the Grafana plots

You may find it convenient to set the update frequency to something like 10 seconds and the displayed time window to 5 minutes or so (upper-right controls).

The dashboard comprises several (interactive) plots, updated in real time.

Show me the dashboard contents

A glance at Prometheus

To reach the Prometheus container, which handles the "raw" data behind Grafana, open a modified URL (this time with 9090-) in a new tab.

Show me the Prometheus UI

Click on the "world" icon next to the "Execute" button in the search bar: in the dialog that appears you can look for specific metrics. Try to look for result_success and confirm, then click "Execute".

Tip: switch to the "Graph" view for a more immediate visualization. The graphs display "raw" data, hence are in units of nanoseconds.

To make sense of the (heterogeneous) results, some filtering is in order -- but we won't go too deep into Prometheus details here.

Just to pique your interest, try pasting these examples and click "Execute":

# filtering by metadata
{__name__="result_success", type="pctile", alias=~".*main.*"}

# aggregation
avg_over_time({__name__="result_success", type="pctile", alias=~".*main.*"}[10m])

# another aggregation, + filtering
max_over_time({__name__="result_success", type="pctile", alias=~".*main.*"}[10m])
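The *_over_time functions aggregate a single series over a trailing time window. The idea in plain Python, with hypothetical (timestamp, value) samples rather than real Prometheus output:

```python
# (timestamp_seconds, value) samples of one series, e.g. a latency percentile
samples = [(0, 1.2), (60, 1.4), (120, 9.0), (180, 1.3), (240, 1.5)]

def over_time(samples, now, window_s, agg):
    # Keep only samples inside the trailing window, then aggregate them.
    window = [v for t, v in samples if now - window_s <= t <= now]
    return agg(window)

now = 240
print(over_time(samples, now, 600, max))                        # like max_over_time[10m] -> 9.0
print(over_time(samples, now, 600, lambda w: sum(w) / len(w)))  # like avg_over_time[10m]
```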

Workloads

This part is about how workloads are defined.

Tip: feel free to interrupt the previous benchmark, if it still runs, with Ctrl-C. You won't need it anymore.

Inspect "cql-keyvalue"

Ask NoSQLBench to dump to a file the yaml defining the workload you just ran:

nb5 --copy cql-keyvalue2

(you can also get a comprehensive list of all available workloads with nb5 --list-workloads, by the way, and a more fine-grained output with nb5 --list-scenarios.)

A file cql-keyvalue2.yaml is created in the working directory. You can open it (clicking on it in the Gitpod explorer or by running gp open cql-keyvalue2.yaml).

Have a look at the file and try to identify its structure and the various phases the benchmark is organized into.

There are profound differences in the way the same workload is expressed in the NoSQLBench 4 yaml file and the NoSQLBench 5 format.

Show me the differences

Play with workloads

A good way to understand workload construction is to start from simple ones.

To run the following examples please go to the appropriate subdirectory:

cd workloads

Example 1: talking about food

Run the first example (and then look at the corresponding simple-workload.yaml) with:

nb5 run driver=stdout workload=simple-workload cycles=12

Look at how bindings connect the sequence of operations to "execute" (in this case, simply print on screen) with the data to be used in them.

Example 2: animal meeting

Run the second example, which is an example of structuring a workload in phases (and then open workload-with-phases.yaml):

nb5 workload-with-phases default driver=stdout

Notable features of this workload are its multi-phase structure (a nearly universal feature of actual benchmarks), the use of the ratio parameter, and the usage of template parameters in the definition.
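The ratio parameter sets how often each operation appears in the stream of cycles: NoSQLBench plans a deterministic interleaving honoring the ratios. Roughly pictured in Python (illustrative, not the actual planner; the op names and ratios are made up):

```python
ops = {"read": 5, "write": 1}   # hypothetical ratios: 5 reads per write

# Build one repeating "bucket" of operations honoring the ratios,
# then assign each cycle its operation deterministically.
sequence = [name for name, ratio in ops.items() for _ in range(ratio)]

def op_for_cycle(cycle: int) -> str:
    return sequence[cycle % len(sequence)]

plan = [op_for_cycle(c) for c in range(12)]
print(plan.count("read"), plan.count("write"))  # 10 2 -- matches the 5:1 ratio
```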

Homework assignment

The "Lab" part of the homework, which requires you to finalize a workload yaml and make it work according to specifications, is detailed on this page.

To submit your homework, please use this form.
