Feat use tdigest for metrics #546

nathanielc · 2023-05-09T14:38:31Z

This change replaces the existing metrics collections with a tdigest implementation. This is beneficial for a few reasons:

It is bounded in the amount of memory space it uses
It is a statistically sound approach to approximating quantiles of a distribution
Tdigest is serializable, and multiple tdigests can be merged into a single tdigest, making it well suited to gaggle use cases.

This means we can get more accurate quantiles without having to use more memory than we currently do.

The previous logic rounded values in order to save space. Tdigest does something similar, but in a way that minimizes the accumulated error introduced by the rounding. Instead of rounding values to predetermined values, it rounds values to the closest existing point already in the distribution, and it only rounds when there is a need (i.e. it has reached its space limit).
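
To make the "round to the closest existing point, and only when full" idea concrete, here is a minimal sketch. It is a deliberate simplification and not the algorithm used by the tdigest crate (a real t-digest also uses a scale function to keep centroids small near the tails). The names `Centroid`, `BoundedSketch`, and `record` are hypothetical and chosen only for illustration; values are assumed to be finite.

```rust
/// A centroid summarises `weight` observations by their running mean.
#[derive(Debug, Clone, PartialEq)]
struct Centroid {
    mean: f64,
    weight: f64,
}

/// Hypothetical, simplified bounded sketch (NOT the real t-digest algorithm).
struct BoundedSketch {
    max_size: usize,
    centroids: Vec<Centroid>, // kept sorted by mean
}

impl BoundedSketch {
    fn new(max_size: usize) -> Self {
        assert!(max_size > 0, "max_size must be at least 1");
        Self { max_size, centroids: Vec::new() }
    }

    /// Record one finite value.
    fn record(&mut self, value: f64) {
        // Index of the first centroid whose mean is >= value.
        let idx = self.centroids.partition_point(|c| c.mean < value);

        if self.centroids.len() < self.max_size {
            // Space left: store the value exactly, no rounding at all.
            self.centroids.insert(idx, Centroid { mean: value, weight: 1.0 });
            return;
        }

        // Full: fold the value into the nearest existing centroid.
        let nearest = if idx == 0 {
            0
        } else if idx == self.centroids.len() {
            idx - 1
        } else {
            let left = value - self.centroids[idx - 1].mean;
            let right = self.centroids[idx].mean - value;
            if left <= right { idx - 1 } else { idx }
        };

        // Update the weighted mean incrementally; the new mean stays between
        // the old mean and `value`, so the vector remains sorted.
        let c = &mut self.centroids[nearest];
        c.weight += 1.0;
        c.mean += (value - c.mean) / c.weight;
    }
}
```

With `max_size` set to 3, recording 10, 20 and 30 stores them exactly; recording 22 afterwards folds it into the centroid at 20, which becomes a centroid of weight 2 with mean 21. Memory stays bounded while the total weight still counts every observation.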

This PR is still WIP, but I wanted to start the conversation around the contribution before continuing.

TODO:

Update all tests
Update all places that previously used BTreeMap<usize, usize> to use tdigest
Rebase against main branch. We use gaggle so this PR is a branch off 0.16.4, not really sure how much work it would need to target main.

This PR addresses part of issue #366.

As an aside, we are already using this branch successfully and getting good quantiles from our scenario runs.

jeremyandrews

Overall this looks like a positive step forward, but a few concerns:

in order to merge it in, we'll need a patch against the latest development release (yes, this temporarily does not have gaggle support)
are there any concerns with tdigest not having commits for a couple of years?
we'll need to run a 2-3 day load test and confirm that this new data structure continues to scale well without slowing anything down

jeremyandrews · 2023-05-09T14:43:26Z

Cargo.toml

In order to be able to merge this PR, it will need to be against the latest development version of the codebase. Please rebase. (I understand you desire the use of Gaggles; there is work underway to add them back in 0.18, more likely to be called 1.0.)

I'll rebase; it doesn't look to be too much work.

jeremyandrews · 2023-05-09T14:46:23Z

Cargo.toml

 serde_cbor = "0.11"
 serde_json = "1.0"
 simplelog = "0.10"
 strum = "0.24"
 strum_macros = "0.24"
+tdigest = { version = "0.2.3", features = ["serde", "use_serde"] }


It looks like this library hasn't been updated in 2 years, is there anything that has replaced it? Does it matter that it's not being updated?
https://github.com/MnO2/t-digest

Good question. There is another implementation here: https://docs.rs/pdatastructs/latest/pdatastructs/tdigest/struct.TDigest.html However, that doesn't implement serde serialize/deserialize, which we would need.

I am not concerned about the crate not getting updates, as this is the kind of thing that, once implemented correctly, doesn't need ongoing maintenance. However, if you would prefer, I can change out the implementation.

Or as we discussed I could make the implementation generic and then users can bring their own solution.

Ok, with your effort toward making it Generic, I agree there's no huge risk using the library you have chosen.

jeremyandrews · 2023-05-09T14:50:11Z

src/manager.rs

This file is not in the current development branch ...

nathanielc · 2023-05-09T18:15:20Z

we'll need to run a 2-3 day load test and confirm that this new data structure continues to scale well without slowing anything down

This sounds great. There is a parameter we can tune to adjust performance: when creating the tdigest you provide it with a max size. I use 100 as a reasonable default, but we could change it as needed. A lower value will mean fewer allocations and likely better performance as a result.

nathanielc · 2023-05-09T19:56:10Z

Thanks for the quick feedback. I have updated the PR to be based on the main branch.

Additionally, I updated all the tests and introduced a Digest type. Now the scenario, transaction and request types are all written in terms of the Digest type, which hides TDigest as an implementation detail. This gets us one step closer to allowing a generic digest implementation if we wish.

This change might have introduced some new dead code. I am going to do another pass to see if that is so and delete it accordingly.
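
As an illustration of what "written in terms of the Digest type" can look like, here is a minimal sketch. It is not the code in this PR: the method names (`record`, `merge`, `count`, `quantile`) are hypothetical, and `ExactBackend` is a deliberately naive stand-in that keeps every sample so the example needs no external crate. In the PR the hidden backend is TDigest.

```rust
/// Naive stand-in backend for this sketch only: keeps every sample, sorted.
/// (Hypothetical; the PR hides a TDigest here instead.)
#[derive(Default)]
struct ExactBackend {
    samples: Vec<f64>,
}

/// Hypothetical wrapper: callers only see this API, so the backend can be
/// swapped without touching scenario, transaction or request metrics code.
#[derive(Default)]
pub struct Digest {
    inner: ExactBackend,
}

impl Digest {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record one observation, keeping the samples sorted.
    pub fn record(&mut self, value: f64) {
        let idx = self.inner.samples.partition_point(|s| *s < value);
        self.inner.samples.insert(idx, value);
    }

    /// Fold another digest into this one (e.g. metrics from another worker).
    pub fn merge(&mut self, other: &Digest) {
        self.inner.samples.extend_from_slice(&other.inner.samples);
        self.inner.samples.sort_by(|a, b| a.total_cmp(b));
    }

    /// Number of observations recorded so far.
    pub fn count(&self) -> usize {
        self.inner.samples.len()
    }

    /// Nearest-rank quantile for `q` in [0, 1]; `None` if empty or out of range.
    pub fn quantile(&self, q: f64) -> Option<f64> {
        let n = self.inner.samples.len();
        if n == 0 || !(0.0..=1.0).contains(&q) {
            return None;
        }
        let rank = (q * n as f64).ceil() as usize;
        Some(self.inner.samples[rank.max(1) - 1])
    }
}
```

Because callers never name the backend, replacing `ExactBackend` with a bounded structure changes memory use and accuracy but not the call sites.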

jeremyandrews · 2023-05-10T14:37:03Z

By uncommenting line 236 of tests/controller.rs I was able to debug the CI build failure. When requesting metrics-json, the result is much longer than expected and overfills the buf, causing the test to fail several steps later. If there's no way to usefully reduce the size of the object returned, it's probably best to simply remove metrics-json from the controller test.

nathanielc · 2023-05-15T20:30:59Z

Thanks for the hints on the failing tests. I haven't fixed them yet, but I have learned a few things:

The change to tdigest has caused several failures because its serialized format is large (it overflows the buffer).
- Removing the metrics-json test bits is not enough. For example, the step in the test that tries to set the host and expects a failure fails because the metrics are returned in JSON, which breaks the message passing logic.
- I expect other code paths will also break.
It's not possible to change the serialization format to something more compact because the tdigest crate does not expose the centroid data directly.

I see a few ways forward.

1. Submit a PR to the tdigest crate to improve its serialization format.
2. Change the t-digest implementation to this other crate: https://docs.rs/pdatastructs/latest/pdatastructs/tdigest/struct.TDigest.html That crate does not currently have a serialization format; however, they have an open issue for it (Serde support crepererum-oss/pdatastructs.rs#61), and our implementation could ensure it's compact enough.
3. Change something in Goose to allow for larger data transfer. I haven't dug into the message passing logic much yet, but my assumption is we have a fixed-size buffer to keep the packet size low for telnet? Maybe there is something we can do here?
4. Something else?

My vote is option 2, as it seems like the best long-term solution since that crate is more active and more likely to accept the change.

Why is the format so large?

Under the hood a t-digest is a set of centroids; each centroid is two numeric values (i.e. the mean and the weight). With the max_size set to 100, the tdigest will store at most 100 centroids. However, each centroid is serialized as {"mean": 1.23456789, "weight": 1.23456789}, and repeating that 100x is not efficient. A more efficient implementation would serialize a list of means separately from a list of weights and zip them back up at deserialization time, i.e. {"means": [1, 2, 3, 4, 5], "weights": [6,7,8,9,0]}. This will save a significant number of bytes in the serialized format. However, do we expect that two lists of 100 numbers each will be small enough? If not, we might need to consider other options, like option 3, to allow for larger data transfers.
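
A rough way to see the difference is to build both encodings by hand and compare their lengths. The sketch below is illustrative only: the function names are hypothetical, the JSON is assembled with `format!` rather than serde, and the exact bytes a real serializer emits (for example, how it prints floats) may differ.

```rust
/// Encode each centroid as its own JSON object (the verbose layout).
fn per_centroid_json(centroids: &[(f64, f64)]) -> String {
    let items: Vec<String> = centroids
        .iter()
        .map(|(mean, weight)| format!("{{\"mean\":{},\"weight\":{}}}", mean, weight))
        .collect();
    format!("[{}]", items.join(","))
}

/// Join numbers with commas, using Rust's default float formatting.
fn join_numbers(values: impl Iterator<Item = f64>) -> String {
    values.map(|v| v.to_string()).collect::<Vec<_>>().join(",")
}

/// Encode means and weights as two parallel lists (the columnar layout).
fn columnar_json(centroids: &[(f64, f64)]) -> String {
    let means = join_numbers(centroids.iter().map(|c| c.0));
    let weights = join_numbers(centroids.iter().map(|c| c.1));
    format!("{{\"means\":[{}],\"weights\":[{}]}}", means, weights)
}

/// Zip the two lists back into centroids when reading the columnar layout.
/// Returns `None` if the lists have different lengths.
fn zip_columns(means: &[f64], weights: &[f64]) -> Option<Vec<(f64, f64)>> {
    if means.len() != weights.len() {
        return None;
    }
    Some(means.iter().copied().zip(weights.iter().copied()).collect())
}
```

In this hand-rolled encoding the key names and braces are paid once instead of once per centroid, which saves about 18 bytes per centroid. The digits themselves are unchanged, so whether 100 centroids fit in the buffer still depends on how many digits each number needs.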

The trade-off with tdigest is that it is guaranteed to never use more than 100 centroids, whereas the previous implementation was technically unbounded; however, tdigest is much more likely to use all 100 centroids. This means that if we can get the logic working for 100 centroids, we know it will always work. Additionally, the value 100 was chosen somewhat arbitrarily; we could choose a smaller value if needed.

Thanks for working through these details with me. Happy to go in whichever direction.

jeremyandrews · 2023-05-16T10:42:16Z

The size of the tdigest serialization format shouldn't matter if we remove MetricsJson from the tests... The only reason I'd see wanting to reduce the size of this object is if it's growing too large for Goose to sustain multi-day load tests. Have you tried extended load tests and seen how large the metrics grow?

To remove from the tests, I was thinking something like this patch:

diff --git a/tests/controller.rs b/tests/controller.rs
index a66ebc3..ddacd0d 100644
--- a/tests/controller.rs
+++ b/tests/controller.rs
@@ -560,19 +560,8 @@ async fn run_standalone_test(test_type: TestType) {
                     }
                 }
                 ControllerCommand::MetricsJson => {
-                    match test_state.step {
-                        // Request the running metrics in json format.
-                        0 => {
-                            make_request(&mut test_state, "metrics-json\r\n").await;
-                        }
-                        // Confirm the metrics are returned in json format.
-                        _ => {
-                            assert!(response.starts_with(r#"{"hash":0,"#));
-
-                            // Move onto the next command.
-                            test_state = update_state(Some(test_state), &test_type).await;
-                        }
-                    }
+                    // Move onto the next command.
+                    test_state = update_state(Some(test_state), &test_type).await;
                 }
                 ControllerCommand::Start => {
                     match test_state.step {
@@ -715,9 +704,9 @@ async fn update_state(test_state: Option<TestState>, test_type: &TestType) -> Te
         ControllerCommand::Config,
         ControllerCommand::ConfigJson,
         ControllerCommand::Metrics,
-        ControllerCommand::MetricsJson,
         ControllerCommand::Stop,
         ControllerCommand::Shutdown,
+        ControllerCommand::MetricsJson,
     ];
 
     if let Some(mut state) = test_state {

By moving ::MetricsJson to after ::Shutdown it effectively is disabled. (Something more/different will need to be done to address the WebSocket tests: I was looking at the Telnet tests for now.)

That said, now there's an (apparently) unrelated problem in which users aren't starting quickly enough and so we instead fail with the following error:

thread 'test_telnet_controller' panicked at 'assertion failed: goose_metrics.total_users == MAX_USERS', tests/controller.rs:142:5

At a quick glance I'm not sure why this would happen, as --hatch-rate seems to correctly be set to 25 and it's sleeping 1 second which should allow all 20 users to hatch. Did you already rebase against main where tests are working?

nathanielc · 2023-05-16T14:24:42Z

Thanks for the diff, I did something similar and got past that part of the test failures.

Did you already rebase against main where tests are working?

Yes, and I ran into the same issue with MAX_USERS. I traced the issue down to these lines: https://github.com/tag1consulting/goose/blob/main/tests/controller.rs#L608-L621 The assertion on line 614 fails, so the user count is never changed to MAX_USERS. The reason the assertion fails has something to do with the large size of the metrics in JSON format. When I printed out the response, it was a bunch of JSON about the means and weights. I am not sure why the response to the host command contains JSON of the metrics, but it did, and that broke the test system.

jeremyandrews requested changes May 9, 2023

nathanielc added 2 commits May 9, 2023 12:18

feat: use tdigest for metrics

ba7f66d

fix: use digest for scenario and tx metrics

3f10d0d

nathanielc force-pushed the fixes branch from 1ba6672 to 3f10d0d Compare May 9, 2023 19:51

fix: remove dead code merge_times

a8f63b4

nathanielc mentioned this pull request Jun 13, 2023

fix: use digest for scenario and tx metrics 3box/goose#3

Open
