Performance Soak Testing: Faster Loop, More Convenience #9515

blt · 2021-10-08T00:05:35Z

The Vector project today relies on "soak tests" for its all-up, integrated performance testing. What we have today is the "Run Vector Continuously" option from RFC 6531. These tests are run 24/7 on the current nightly build for a set list of configurations. Load is generated with lading and sinks are real. Results from the soak tests have been promising to this point. In the following table we summarize the worked configurations so far:

Soak Test	Throughput MB/s (initial)	Throughput MB/s (current)	Delta
Datadog agent => remap => Datadog logs	3	46	+43
Kafka => JSON parser => Elasticsearch	3	3	+0
Syslog => regex parser => Logs to Metrics => Datadog metrics	12	12	+0
Splunk forwarder => remap => Datadog logs	27	27	+0
File => remap => Blackhole	23	257	+234

Each configuration has seen a steep drop in CPU consumption overall, as well as memory consumption in our soak rig. This work has started us down architectural and operational improvements to Vector -- for instance #9261, #9480, #9477, #9165 -- that would not have been as clearly motivated without these soak tests.

However, the soak tests as they exist today are very much a first pass on an idea and have some serious drawbacks.

How Soak Tests Are Made

A "soak test" is a VM that runs 24/7 in project infrastructure. The VM is specially rigged to run nightly Vector and lading, and has access to services we've spun up in AWS, where the VMs run. Each soak testing VM uses the same base image -- built with ansible -- and all that differs between machines are the configuration files for Vector and lading. The VMs and services are managed with terraform. Deploying a new soak test requires permission to build the ansible image and permission to run terraform, which only a subset of the paid Vector team has in practice. Telemetry is rigged to emit into Datadog and we have a dashboard for viewing the results one test at a time. It takes about 15 minutes to build a new image with ansible and then run the terraform apply before any changes are present in the soak testing infra.

Debugging issues with a new soak test is awkward and there's no convenient local testing -- especially if you work from a Mac -- meaning it can take a day or two to build a new soak test. The infrastructure the soak tests run on is convenient for the project since it's free, but it does make sharing results a hard problem. This whole setup effectively makes our soak tests closed-source, which, uh, we don't love.

What's Tough About our Soak Tests

The biggest drawbacks with our current soak test approach can be summarized pretty simply:

they are too hard to add to,
they don't allow ad-hoc experiments,
they take too long to get results and
they don't attribute regressions well.

Vector developers create new soak tests with a combination of ansible and terraform -- described above -- which then wait to be rolled out into our soak infra, with telemetry from these tests terminating in Datadog for dashboarding. In order for changes to be soaked they must be merged to master: it is inevitable that we'll commit regressions and see them in nightly before they are detected by our current approach. Asking everyone on the team to learn ansible and terraform is a Big Ask, not to mention we have no way of sharing this with the broader community.

Because of the way our soak tests are built we can't run them local to our dev environments, meaning PR work is done in an ad hoc process that is similar to but different from how soak tests run. You can see this in my investigation issues (example). PR authors have to take special, extra steps to interrogate their work and may accidentally regress important use cases outside the scope of their work.

The nightly-only nature of our soak tests means changes have, generally, a 24 hour wait for feedback, by which time you may have already mentally moved on to new work. Moreover, because of this nightly roll-up, regression attribution is challenging today: a developer must notice that performance has dropped, figure out which nightly is at fault, and then look up which commits are present in that nightly and work out which might be involved.
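To make the attribution problem concrete, here is a minimal sketch of the manual process described above: walk the nightly results in order, find the first nightly whose throughput dropped, and list the commits that shipped in it. The `NightlyResult` record, the 5% tolerance and the sample numbers are all hypothetical, made up for illustration; none of this reflects how the soak rig stores its data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NightlyResult:
    """Throughput observed for one nightly build (hypothetical record shape)."""
    nightly: str           # illustrative label for the nightly build
    commits: List[str]     # commits that first shipped in this nightly
    throughput_mbs: float  # soak throughput in MB/s


def first_regressing_nightly(
    results: List[NightlyResult], tolerance: float = 0.05
) -> Optional[NightlyResult]:
    """Return the first nightly whose throughput fell more than `tolerance`
    (as a fraction) below the previous nightly, or None if none did."""
    for previous, current in zip(results, results[1:]):
        if current.throughput_mbs < previous.throughput_mbs * (1.0 - tolerance):
            return current
    return None


# Made-up history: the third nightly drops well below the second.
history = [
    NightlyResult("night-1", ["a1"], 46.0),
    NightlyResult("night-2", ["b2", "b3"], 45.5),
    NightlyResult("night-3", ["c4", "c5", "c6"], 38.0),
]

culprit = first_regressing_nightly(history)
if culprit is not None:
    # Every commit in the nightly is a suspect; the roll-up cannot narrow
    # it down further, which is the attribution problem in a nutshell.
    print(culprit.nightly, "suspect commits:", culprit.commits)
```

Even with the search automated, the answer is a whole nightly's worth of commits rather than one change, which is why per-PR feedback attributes regressions better.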

What We Know Now

When RFC 6531 was written we didn't appreciate a handful of important factors. They are:

In the present state of the project relatively brief runtimes are indicative of throughput performance.
Ad-hoc, uniform performance experiments -- PR based, say -- would be a huge productivity boon.
Waiting a day for feedback is too long.
Active feedback is better than passive feedback.

So What Do?

We can build on our existing soak testing work and make it better for everyone. In this project we see rustc's perf as a kind of north star for where we want to get to, especially in its integration with PRs via bors. We've had a great deal of success with our clippy / GitHub integration and there's every indication we should be able to pull off the same trick with soak testing. We can also, I believe, reduce the cognitive burden of adding a soak test on our developers by eschewing ansible in favor of containerized vector and support infra, where appropriate. If we flesh out our terraform support for soak testing we end up needing to ask people to follow examples, for the most part. Avoiding AMIs as the basic image for soak testing especially opens up the possibility of local soak testing in a project-uniform way. This makes it easier for us to improve our soak test infrastructure, since we no longer have to test live, not to mention the benefit to day-to-day development on Vector itself. Much like rustc's perf we want to capture Vector's performance-relevant telemetry, offer a simple, comparative display of that data and provide PR feedback in addition to the nightly rollup feedback.

To that end we need:

We've seen recent indications that collecting heap information from vector offers useful optimization paths, so being able to backfill soaks with a new methodology or to fill in data for new work would be valuable if it's straightforward to achieve. The biggest wins will be for us to get local soak testing available for the project and, second, for PR feedback to be good and useful.
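As a rough illustration of what comparative PR feedback could look like, the sketch below compares throughput samples from a baseline build against samples from a PR build using Welch's t statistic plus a minimum percentage drop. This is an assumption about one possible approach, not a description of what the project implemented; the function names, thresholds and sample values are hypothetical.

```python
import math
import statistics
from typing import Sequence


def welch_t(baseline: Sequence[float], comparison: Sequence[float]) -> float:
    """Welch's t statistic for two samples with possibly unequal variances.

    Each sample needs at least two values, and the samples must not both
    have zero variance, or the standard error below would be zero.
    """
    mean_b = statistics.mean(baseline)
    mean_c = statistics.mean(comparison)
    var_b = statistics.variance(baseline)
    var_c = statistics.variance(comparison)
    standard_error = math.sqrt(var_b / len(baseline) + var_c / len(comparison))
    return (mean_c - mean_b) / standard_error


def percent_change(baseline: Sequence[float], comparison: Sequence[float]) -> float:
    """Change in mean throughput from baseline to comparison, in percent."""
    mean_b = statistics.mean(baseline)
    return (statistics.mean(comparison) - mean_b) / mean_b * 100.0


def is_regression(
    baseline: Sequence[float],
    comparison: Sequence[float],
    t_threshold: float = -3.0,   # hypothetical cut-off, not a project setting
    min_drop_pct: float = 5.0,   # ignore drops smaller than this
) -> bool:
    """Flag a regression only if the drop is both large and unlikely to be noise."""
    return (
        welch_t(baseline, comparison) < t_threshold
        and percent_change(baseline, comparison) <= -min_drop_pct
    )


# Made-up throughput samples in MB/s for one soak configuration.
baseline_samples = [46.1, 45.8, 46.3, 45.9, 46.0]
pr_samples = [41.2, 41.0, 41.5, 40.9, 41.3]

print("change: %.1f%%" % percent_change(baseline_samples, pr_samples))
print("regression:", is_regression(baseline_samples, pr_samples))
```

Requiring both conditions keeps a noisy but tiny difference from failing a PR, while still catching a large, consistent drop.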

This commit makes zstd an optional dependency, shaving 12 seconds off a `--no-default-features` build. Discovered while working on #9515. Relates to: * #9538 * #9537 * #9535 Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

This pull-request introduces localhost soak testing to the vector project, part of the work to satisfy #9515. The soak introduced here is log aggregation, that is datadog_agent -> remap -> datadog_log. Running the soak is discussed in soak/README.md. Soaks are defined solely in terraform, which I expect will move around some as #9618 is worked. Vector containers are built locally for now, but once #9543 is available we can lean on the CI infrastructure to build containers some. Closes #9616 Closes #9617 Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

Consistent with [this comment](dependabot/dependabot-core#3253 (comment)) the dependabot only has read-only access to the project and can't push images. We're wasting project resources building an image that can't be pushed. REF #9515 Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

This commit introduces a new soak that does not output, intending to test only VRL performance. This is in service to the VRL performance work being done by @StephenWakely. Closes #9831 Closes #9832 REF #9515 Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

* Introduce new datadog-agent -> vrl -> blackhole soak
  This commit introduces a new soak that does not output, intending to test only VRL performance. This is in service to the VRL performance work being done by @StephenWakely. Closes #9831 Closes #9832 REF #9515 Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
* add soak to workflow Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
* adjust naming scheme Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
* update with master Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
* removing del because it's not well supported Signed-off-by: Brian L. Troutwine <brian@troutwine.us>

blt · 2021-12-01T18:46:41Z

With #9926 closed the major work for this epic is now completed. PRs get statistically significant feedback, we have a CI check for serious regressions and there's a path forward to complete #9623 and #9624 as a matter of typing, both of which are low priority. #10001 is also feasible if we find an interested party to do the work.

blt added the "type: task" and "domain: performance" labels Oct 8, 2021

blt self-assigned this Oct 8, 2021

blt mentioned this issue Oct 8, 2021

On demand container images of release + debug symbols Vector #9531

Closed

1 task

blt mentioned this issue Oct 8, 2021

chore: Make zstd an optional dependency #9539

Merged

blt mentioned this issue Oct 19, 2021

chore: Allow for localhost soak testing #9699

Merged

blt mentioned this issue Oct 27, 2021

chore: Disallow dependabot from running soaks #9816

Merged

This was referenced Oct 29, 2021

Develop a soak test for Reference Counting within Vrl #9831

Closed

Develop a soak test for the Vrl Bytecode VM #9832

Closed

blt mentioned this issue Nov 1, 2021

chore: Introduce new datadog-agent -> vrl -> blackhole soak #9849

Merged

This was referenced Nov 5, 2021

Improve soak comment aesthetic #9926

Closed

chore: Introduce splunk_hec -> route -> s3 soak #9942

Merged

blt closed this as completed Dec 1, 2021
