Feat use tdigest for metrics #546

Open · wants to merge 3 commits into main
Conversation

@nathanielc (Author)

This change replaces the existing metrics collection with a tdigest implementation. This is beneficial for a few reasons:

  1. It is bounded in the amount of memory it uses.
  2. It is a statistically sound approach to approximating the quantiles of a distribution.
  3. A tdigest is serializable, and multiple tdigests can be merged into a single tdigest, making it well suited to gaggle use cases.

This means we can get more accurate quantiles without using more memory than we currently do.

The previous logic would round values in order to save space. Tdigest does something similar, but in a way that minimizes the accumulated error introduced by the rounding: instead of rounding values to predetermined values, it rounds them to the closest point already in the distribution, and it only rounds when there is a need (i.e. when it has reached its space limit).
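
For reference, a minimal sketch of the tdigest crate's documented API (the values here are illustrative only, not from Goose):

    use tdigest::TDigest;

    fn main() {
        // The digest is bounded at 100 centroids, no matter how many
        // samples are recorded into it.
        let digest = TDigest::new_with_size(100);
        let values: Vec<f64> = (1..=1_000_000).map(f64::from).collect();
        let digest = digest.merge_sorted(values);

        // Approximate quantiles, e.g. the 99th percentile.
        println!("p99 ~= {}", digest.estimate_quantile(0.99));

        // Digests from separate workers (the gaggle case) can be merged
        // into a single digest.
        let d1 = TDigest::new_with_size(100).merge_unsorted(vec![1.0, 2.0, 3.0]);
        let d2 = TDigest::new_with_size(100).merge_unsorted(vec![4.0, 5.0, 6.0]);
        let merged = TDigest::merge_digests(vec![d1, d2]);
        println!("median ~= {}", merged.estimate_quantile(0.5));
    }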

This PR is still a WIP, but I wanted to start the conversation around the contribution before continuing.

TODO:

  • Update all tests
  • Update all places that previously used BTreeMap<usize, usize> to use tdigest
  • Rebase against the main branch. We use gaggle, so this PR is a branch off 0.16.4; I'm not really sure how much work it would take to target main.

This PR addresses part of issue #366.

As an aside, we are already using this branch successfully and getting good quantiles from our scenario runs.

@jeremyandrews (Member) left a comment

Overall this looks like a positive step forward, but a few concerns:

  • in order to merge it in, we'll need a patch against the latest development release (yes, this temporarily does not have gaggle support)
  • are there any concerns with tdigest not having had any commits for a couple of years?
  • we'll need to run a 2-3 day load test and confirm that this new data structure continues to scale well without slowing anything down

@jeremyandrews (Member)

In order to merge this PR, it will need to be against the latest development version of the codebase. Please rebase. (I understand you want Gaggle support; there is work underway to add it back in 0.18, which is more likely to be called 1.0.)

@nathanielc (Author)

I'll rebase; it doesn't look like it will be too much work.

Cargo.toml:

serde_cbor = "0.11"
serde_json = "1.0"
simplelog = "0.10"
strum = "0.24"
strum_macros = "0.24"
tdigest = { version = "0.2.3", features = ["serde", "use_serde"] }
@jeremyandrews (Member)

It looks like this library hasn't been updated in 2 years. Is there anything that has replaced it? Does it matter that it's not being updated?
https://github.com/MnO2/t-digest

@nathanielc (Author)

Good question. There is another implementation here: https://docs.rs/pdatastructs/latest/pdatastructs/tdigest/struct.TDigest.html. However, that one doesn't implement serde Serialize/Deserialize, which we would need.

I am not concerned about the crate not getting updates, as this is the kind of thing that, once implemented correctly, doesn't need ongoing maintenance. However, if you would prefer, I can swap out the implementation.

Or, as we discussed, I could make the implementation generic and let users bring their own solution.
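
For example, a purely illustrative shape for that (none of these names exist in Goose today):

    /// Hypothetical trait: Goose would record metrics against this
    /// interface, and users could plug in any digest implementation.
    pub trait QuantileDigest: Send {
        /// Record one observed value (e.g. a response time in milliseconds).
        fn record(&mut self, value: f64);
        /// Estimate the value at quantile `q`, where 0.0 <= q <= 1.0.
        fn quantile(&self, q: f64) -> f64;
        /// Fold another digest of the same type into this one.
        fn merge(&mut self, other: Self);
    }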

@jeremyandrews (Member)

OK, with your effort toward making it generic, I agree there's no huge risk in using the library you have chosen.

src/manager.rs Outdated
@jeremyandrews (Member)

This file is not in the current development branch ...

@nathanielc (Author)

> we'll need to run a 2-3 day load test and confirm that this new data structure continues to scale well without slowing anything down

This sounds great. There is a parameter we can tune to adjust performance: when creating the tdigest you provide it with a max size. I use 100 as a reasonable default, but we could change it as needed. A lower value will mean fewer allocations, and likely better performance as a result.
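
Concretely, the knob looks like this (sizes illustrative):

    use tdigest::TDigest;

    fn main() {
        // max_size caps the number of centroids, and with it both the
        // memory used and the compression work done per merge.
        let coarse = TDigest::new_with_size(50); // fewer allocations, coarser tails
        let fine = TDigest::new_with_size(200); // more memory, better tail accuracy
        let _ = (coarse, fine);
    }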

@nathanielc (Author)

Thanks for the quick feedback. I have updated the PR to be based off the main branch.

Additionally, I updated all the tests and introduced a Digest type. The scenario, transaction, and request types are now all written in terms of the Digest type, which hides TDigest as an implementation detail. This gets us one step closer to allowing a generic digest implementation if we wish.
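
Roughly, the wrapper has this shape (a sketch; the method names here may differ from the actual PR):

    use tdigest::TDigest;

    /// The rest of Goose talks to `Digest`, keeping TDigest as a
    /// swappable implementation detail.
    pub struct Digest(TDigest);

    impl Digest {
        pub fn new() -> Self {
            Digest(TDigest::new_with_size(100))
        }

        /// Record a batch of observed values, returning the updated digest.
        pub fn record(self, values: Vec<f64>) -> Self {
            Digest(self.0.merge_unsorted(values))
        }

        /// Estimate the value at quantile `q` (0.0..=1.0).
        pub fn quantile(&self, q: f64) -> f64 {
            self.0.estimate_quantile(q)
        }
    }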

This change might have introduced some new dead code. I am going to do another pass to see if that is so and delete it accordingly.

@jeremyandrews (Member) commented May 10, 2023

By uncommenting line 236 of tests/controller.rs, I was able to debug the CI build failure. When requesting metrics-json, the result is much longer than expected and overfills the buffer, causing the test to fail several steps later. If there's no way to usefully reduce the size of the returned object, it's probably best to simply remove metrics-json from the controller test.

@nathanielc (Author) commented May 15, 2023

Thanks for the hints on the failing tests. I haven't fixed them yet, but I have learned a few things:

  1. The change to tdigest has caused several failures because its serialized format is large (it overflows the buffer).
    • Removing the metrics-json test bits is not enough. For example, the test step that sets the host and expects a failure fails anyway, because the metrics are returned in JSON and that breaks the message-passing logic.
    • I expect other code paths will also break.
  2. It's not possible to change the serialization format to something more compact, because the tdigest crate does not expose the centroid data directly.

I see a few ways forward:

  1. Submit a PR to the tdigest crate to improve its serialization format.
  2. Switch to this other t-digest implementation: https://docs.rs/pdatastructs/latest/pdatastructs/tdigest/struct.TDigest.html. That crate does not currently have a serialization format, but there is an open issue for Serde support (crepererum-oss/pdatastructs.rs#61), and our implementation could ensure it's compact enough.
  3. Change something in Goose to allow for larger data transfers. I haven't dug into the message-passing logic much yet, but my assumption is that we use a fixed-size buffer to keep the packet size low for telnet? Maybe there is something we can do here?
  4. Something else?

My vote is option 2, as it seems like the best long-term solution: that crate is more active and more likely to accept the change.

Why is the format so large?

Under the hood a t-digest is a set of centroids, and each centroid is two numeric values (the mean and the weight). With max_size set to 100, the tdigest stores at most 100 centroids. However, each centroid is serialized as {"mean": 1.23456789, "weight": 1.23456789}, and repeating that 100x is not efficient. A more efficient implementation would serialize a list of means separately from a list of weights and zip them back together at deserialization time, i.e. {"means": [1, 2, 3, 4, 5], "weights": [6, 7, 8, 9, 0]}. This would save a significant number of bytes in the serialized format. However, do we still expect that two lists of 100 numbers each will be small enough? If not, we might need to consider options like 3, to allow for larger data transfers.
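
For illustration, the compact format could be as simple as a struct-of-arrays (names hypothetical):

    use serde::{Deserialize, Serialize};

    /// Hypothetical compact wire format: one list of means and one list
    /// of weights, instead of a {"mean": .., "weight": ..} object per
    /// centroid.
    #[derive(Serialize, Deserialize)]
    struct CompactDigest {
        max_size: usize,
        means: Vec<f64>,
        weights: Vec<f64>,
    }

Serde would emit that as {"max_size":100,"means":[...],"weights":[...]}, paying the field-name cost once per digest instead of once per centroid.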

The trade-off with tdigest is that it is guaranteed to never use more than 100 centroids, whereas the previous implementation was technically unbounded; however, the tdigest is much more likely to actually use all 100 centroids. That means if we can get the logic working with 100 centroids, we know it will always work. Additionally, the value 100 was chosen somewhat arbitrarily; we could choose a smaller value if needed.

Thanks for working through these details with me. Happy to go in whichever direction.

@jeremyandrews (Member)

The size of the tdigest serialization format shouldn't matter if we remove MetricsJson from the tests... The only reason I'd see for wanting to reduce the size of this object is if it's growing too large for Goose to sustain multi-day load tests. Have you tried extended load tests and seen how large the metrics grow?

To remove it from the tests, I was thinking of something like this patch:

diff --git a/tests/controller.rs b/tests/controller.rs
index a66ebc3..ddacd0d 100644
--- a/tests/controller.rs
+++ b/tests/controller.rs
@@ -560,19 +560,8 @@ async fn run_standalone_test(test_type: TestType) {
                     }
                 }
                 ControllerCommand::MetricsJson => {
-                    match test_state.step {
-                        // Request the running metrics in json format.
-                        0 => {
-                            make_request(&mut test_state, "metrics-json\r\n").await;
-                        }
-                        // Confirm the metrics are returned in json format.
-                        _ => {
-                            assert!(response.starts_with(r#"{"hash":0,"#));
-
-                            // Move onto the next command.
-                            test_state = update_state(Some(test_state), &test_type).await;
-                        }
-                    }
+                    // Move onto the next command.
+                    test_state = update_state(Some(test_state), &test_type).await;
                 }
                 ControllerCommand::Start => {
                     match test_state.step {
@@ -715,9 +704,9 @@ async fn update_state(test_state: Option<TestState>, test_type: &TestType) -> Te
         ControllerCommand::Config,
         ControllerCommand::ConfigJson,
         ControllerCommand::Metrics,
-        ControllerCommand::MetricsJson,
         ControllerCommand::Stop,
         ControllerCommand::Shutdown,
+        ControllerCommand::MetricsJson,
     ];
 
     if let Some(mut state) = test_state {

By moving ::MetricsJson to after ::Shutdown, it is effectively disabled. (Something more/different will need to be done to address the WebSocket tests; I was only looking at the Telnet tests for now.)

That said, there's now an (apparently) unrelated problem in which users aren't starting quickly enough, so we instead fail with the following error:

thread 'test_telnet_controller' panicked at 'assertion failed: goose_metrics.total_users == MAX_USERS', tests/controller.rs:142:5

At a quick glance I'm not sure why this would happen, as --hatch-rate seems to be correctly set to 25 and the test sleeps 1 second, which should allow all 20 users to hatch. Did you already rebase against main, where the tests are working?

@nathanielc (Author)

Thanks for the diff; I did something similar and got past that part of the test failures.

> Did you already rebase against main where tests are working?

Yes, and I ran into the same issue with MAX_USERS. I traced the issue down to these lines: https://github.com/tag1consulting/goose/blob/main/tests/controller.rs#L608-L621. The assertion on line 614 fails, so the user count is never changed to MAX_USERS. The reason the assertion fails has to do with the large size of the metrics in JSON format: when I printed the response, it was a bunch of JSON about the means and weights. I'm not sure why the response to the host command contains the metrics JSON, but it did, and that broke the test's message handling.
