Performance Soak Testing: Faster Loop, More Convenience #9515
Labels
domain: performance
Anything related to Vector's performance
type: task
Generic non-code related tasks
Comments
blt
added
type: task
Generic non-code related tasks
domain: performance
Anything related to Vector's performance
labels
Oct 8, 2021
1 task
blt
added a commit
that referenced
this issue
Oct 8, 2021
blt
added a commit
that referenced
this issue
Oct 9, 2021
blt
added a commit
that referenced
this issue
Oct 9, 2021
blt
added a commit
that referenced
this issue
Oct 19, 2021
This pull-request introduces localhost soak testing to the vector project, part of the work to satisfy #9515. The soak introduced here is log aggregation, that is datadog_agent -> remap -> datadog_log. Running the soak is discussed in soak/README.md. Soaks are defined solely in terraform, which I expect will move around some as #9618 is worked. Vector containers are built locally for now, but once #9543 is available we can lean on the CI infrastructure to build containers some. Closes #9616 Closes #9617 Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
blt
added a commit
that referenced
this issue
Oct 21, 2021
chore: Allow for localhost soak testing This pull-request introduces localhost soak testing to the vector project, part of the work to satisfy #9515. The soak introduced here is log aggregation, that is datadog_agent -> remap -> datadog_log. Running the soak is discussed in soak/README.md. Soaks are defined solely in terraform, which I expect will move around some as #9618 is worked. Vector containers are built locally for now, but once #9543 is available we can lean on the CI infrastructure to build containers some. Closes #9616 Closes #9617 Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
blt
added a commit
that referenced
this issue
Oct 27, 2021
Consistent with [this comment](dependabot/dependabot-core#3253 (comment)) the dependabot only has read-only access to the project and can't push images. We're wasting project resources building an image that can't be pushed. REF #9515 Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
This was referenced Oct 29, 2021
blt
added a commit
that referenced
this issue
Nov 1, 2021
This commit introduces a new soak that does not output, intending to test only VRL performance. This is in service to the VRL performance work being done by @StephenWakely. Closes #9831 Closes #9832 REF #9515 Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
blt
added a commit
that referenced
this issue
Nov 2, 2021
This commit introduces a new soak that does not output, intending to test only VRL performance. This is in service to the VRL performance work being done by @StephenWakely. Closes #9831 Closes #9832 REF #9515 Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
blt
added a commit
that referenced
this issue
Nov 2, 2021
* Introduce new datadog-agent -> vrl -> blackhole soak This commit introduces a new soak that does not output, intending to test only VRL performance. This is in service to the VRL performance work being done by @StephenWakely. Closes #9831 Closes #9832 REF #9515 Signed-off-by: Brian L. Troutwine <brian@troutwine.us> * add soak to workflow Signed-off-by: Brian L. Troutwine <brian@troutwine.us> * adjust naming scheme Signed-off-by: Brian L. Troutwine <brian@troutwine.us> * update with master Signed-off-by: Brian L. Troutwine <brian@troutwine.us> * removing del because it's not well supported Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
lucperkins
pushed a commit
that referenced
this issue
Nov 2, 2021
* Introduce new datadog-agent -> vrl -> blackhole soak This commit introduces a new soak that does not output, intending to test only VRL performance. This is in service to the VRL performance work being done by @StephenWakely. Closes #9831 Closes #9832 REF #9515 Signed-off-by: Brian L. Troutwine <brian@troutwine.us> * add soak to workflow Signed-off-by: Brian L. Troutwine <brian@troutwine.us> * adjust naming scheme Signed-off-by: Brian L. Troutwine <brian@troutwine.us> * update with master Signed-off-by: Brian L. Troutwine <brian@troutwine.us> * removing del because it's not well supported Signed-off-by: Brian L. Troutwine <brian@troutwine.us>
This was referenced Nov 5, 2021
With #9926 closed the major work for this epic is now completed. PRs get statistically significant feedback, we have a CI check for serious regressions and there's a path forward to complete #9623 and #9624 as a matter of typing, both of which are low priority. #10001 is also feasible if we find an interested party to do the work.
The Vector project today relies on "soak tests" for its all-up, integrated performance testing. What we have today is the "Run Vector Continuously" option from RFC 6531. These tests are run 24/7 on the current nightly build for a set list of configurations. Load is generated with lading and sinks are real. Results from the soak tests have been promising to this point. In the following table we summarize the worked configurations so far:
Each configuration has seen a steep drop in CPU consumption overall, as well as memory consumption in our soak rig. This work has started us down architectural and operational improvements to Vector -- for instance #9261, #9480, #9477, #9165 -- that would not have been as clearly motivated without these soak tests.
However, the soak tests as they exist today are very much a first pass on an idea and have some serious drawbacks.
How Soak Tests Are Made
A "soak test" is a VM that runs 24/7 in project infrastructure. The VM is specially rigged to run nightly Vector and lading, and has access to services we've spun up in AWS, where the VMs run. Each soak testing VM uses the same base image -- built with ansible -- and all that differs between machines are the configuration files for Vector and lading. The VMs and services are managed with terraform. Deploying a new soak test requires permission to build the ansible image and permission to run terraform, which only a subset of the paid Vector team practically has. Telemetry is rigged to emit into Datadog and we have a dashboard for viewing the results one test at a time. It takes about 15 minutes to build a new image with ansible and then run the terraform apply before any changes are present in the soak testing infra.
Debugging issues with a new soak test is awkward and there's no convenient local testing -- especially if you work from a Mac -- meaning it can take a day or two to build a new soak test. Telemetry goes to Datadog, which is convenient for the project since it's free but does make sharing results a hard problem. This whole setup effectively makes our soak tests closed-source, which, uh, we don't love.
What's Tough About our Soak Tests
The biggest drawbacks with our current soak test approach can be summarized pretty simply:
* Vector developers create new soak tests with a combination of ansible and terraform -- see "How Soak Tests Are Made" -- which then wait to be rolled out into our soak infra, telemetry from these tests being terminated into Datadog for dashboarding.
* In order for changes to be soaked they must be merged to master: it is inevitable that we'll commit regressions and see them in nightly before they are detected by our current approach.
* Asking everyone on the team to learn ansible and terraform is a Big Ask, not to mention we have no way of sharing this with the broader community.
* Because of the way our soak tests are built we can't run them local to our dev environments, meaning PR work is done in an ad hoc process that is similar to but different from how soak tests run. You can see this in my investigation issues, example. PR authors have to take special, extra steps to interrogate their work and may accidentally regress important use cases outside the scope of their work.
* The nightly-only nature of our soak tests means changes have, generally, a 24 hour wait for feedback, by which time you may have already mentally moved on to new work.
* Because of this nightly roll-up, regression attribution is challenging today: a developer must notice that performance has dropped, figure out which nightly is at fault, then look up which commits are present in that nightly and work out which might be involved.
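To make the attribution problem concrete, here is a minimal sketch of the first manual step: scanning per-nightly throughput to find the nightly where performance dropped. Everything in it is hypothetical and for illustration only: the function name `first_regressed_nightly`, the 5% tolerance, the nightly labels, and the throughput numbers are not taken from the soak rig.

```python
from statistics import mean


def first_regressed_nightly(throughput_by_nightly, tolerance=0.05):
    """Return the label of the first nightly whose mean throughput fell more
    than `tolerance` (a fraction) below the previous nightly, else None.

    `throughput_by_nightly` is an ordered list of (label, samples) pairs.
    """
    previous = None
    for label, samples in throughput_by_nightly:
        current = mean(samples)
        if previous is not None and current < previous * (1.0 - tolerance):
            return label
        previous = current
    return None


# Illustrative, made-up throughput samples; not measurements from Vector.
nightlies = [
    ("nightly-a", [101.0, 99.5, 100.2]),
    ("nightly-b", [100.4, 100.9, 99.8]),
    ("nightly-c", [88.1, 87.6, 89.0]),
    ("nightly-d", [88.4, 87.9, 88.8]),
]

print(first_regressed_nightly(nightlies))
```

Even with this step automated, the result only names a nightly. Someone still has to work out which of the commits rolled into that nightly is responsible, which is why per-PR feedback is the more attractive target.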
What We Know Now
When RFC 6531 was written we didn't appreciate a handful of important factors. They are:
So What Do?
We can build on our existing soak testing work and make it better for everyone. In this project we see rustc's perf as a kind of north star for where we want to get to, especially in its integration with PRs via bors. We've had a great deal of success with our clippy / GitHub integration and there's every indication we should be able to pull off the same trick with soak testing.

We can also, I believe, reduce the cognitive burden of adding a soak test on our developers by eschewing ansible in favor of containerized Vector and support infra, where appropriate. If we flesh out our terraform support for soak testing we end up needing to ask people to follow examples, for the most part. Avoiding AMIs as the basic image for soak testing especially opens up the possibility of local soak testing in a project-uniform way. This makes it easier for us to improve our soak test infrastructure, since we no longer have to test live, not to mention the benefit to day-to-day development on Vector itself.

Much like rustc's perf we want to capture Vector's performance-relevant telemetry, offer a simple, comparative display of that data and provide PR feedback in addition to the nightly rollup feedback.
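As a rough illustration of what comparative PR feedback could look like, the sketch below compares throughput samples from a baseline build against samples from a PR build and reports the percent change alongside a significance check. The choice of a two-sided permutation test is an assumption made here for the sake of a self-contained example; this issue does not specify a statistical method. The names `permutation_p_value` and `soak_verdict`, the `alpha` threshold, and the sample values are all hypothetical.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(baseline, comparison, rounds=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Repeatedly shuffles the pooled samples and counts how often a random
    split yields a difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(_mean(comparison) - _mean(baseline))
    pooled = list(baseline) + list(comparison)
    n = len(baseline)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        if abs(_mean(pooled[n:]) - _mean(pooled[:n])) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (rounds + 1)


def soak_verdict(baseline, comparison, alpha=0.05):
    """Summarize a baseline-vs-PR comparison for display on a PR."""
    change = (_mean(comparison) - _mean(baseline)) / _mean(baseline) * 100.0
    p_value = permutation_p_value(baseline, comparison)
    return {
        "percent_change": change,
        "p_value": p_value,
        "significant": p_value < alpha,
    }


# Illustrative, made-up throughput samples; not measurements from Vector.
baseline_samples = [100, 101, 99, 100, 102, 98, 100, 101]
pr_samples = [90, 91, 89, 90, 92, 88, 90, 91]

print(soak_verdict(baseline_samples, pr_samples))
```

A permutation test is used here only because it needs no distributional assumptions and no third-party libraries. A real harness would also need to decide how many samples to take and how to handle noisy runs.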
To that end we need:
soaks/soak.sh
#9752

We've seen recent indications that collecting heap information from Vector offers useful optimization paths, so being able to backfill soaks with a new methodology or to fill in data for new work would be valuable if it's straightforward to achieve. The biggest wins will be for us to get local soak testing available for the project and, second, for PR feedback to be good and useful.