
example of how to do a statistical performance regression test #16

Closed
wants to merge 2 commits
Conversation

bengland2

Plan to make this callable, but it can be run today using input files containing samples.
from sys import argv, exit
import math
import numpy
import scipy
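
The lines above are just the opening imports from the diff. As a rough, hedged sketch (not the PR's actual script), the comparison it describes might look like the following, assuming two plain-text input files with one numeric sample per line; the usage string, file names, and the 0.05 significance threshold are illustrative choices only:

#!/usr/bin/env python
# sketch: compare two sets of benchmark samples with an unpaired t-test.
# Assumes each input file contains one numeric sample per line; the 0.05
# significance threshold is an illustrative choice, not part of the PR.
from sys import argv, exit
import numpy
from scipy import stats

def read_samples(path):
    # one floating-point sample per line; blank lines ignored
    with open(path) as f:
        return numpy.array([float(line) for line in f if line.strip()])

if __name__ == '__main__':
    if len(argv) != 3:
        exit('usage: t_test.py baseline_samples.txt candidate_samples.txt')
    baseline = read_samples(argv[1])
    candidate = read_samples(argv[2])
    # Welch's unpaired t-test: does not assume the two sets have equal variance
    t_stat, p_value = stats.ttest_ind(baseline, candidate, equal_var=False)
    print('means: %.3f vs %.3f, t = %.3f, p = %.4f'
          % (baseline.mean(), candidate.mean(), t_stat, p_value))
    if p_value < 0.05:
        print('difference is statistically significant at the 5% level')
    else:
        print('no statistically significant difference detected')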

@aakarshg (Contributor) left a comment

Added a comment about requirements. Moving forward, if we'd like to make use of this we'll need to make changes to touchstone: with the ES module we don't just get samples and then do aggregations, but if I understand correctly we can make use of this t_test if we can expose the samples directly...

@jtaleric
Member

@bengland2 doing some late spring cleaning, do we want to keep this around?

@bengland2
Author

@jtaleric This dates back to a disagreement with aakarsh a year ago. Does benchmark-comparison (formerly touchstone) do a statistically valid comparison of 2 sample sets? It's not valid to conclude that two sample sets are different just by comparing sample-set means; you have to incorporate the variance of the sample sets somehow. I'd have to look closer at benchmark-comparison to see what it does.

@mfleader

mfleader commented Aug 6, 2021

@bengland2 has a point about comparing two sets of samples on an index value like a mean. However, use of the frequentist t-test on any given pair of samples of continuous data is not justified.

@bengland2
Author

There are two kinds of t-test: one uses matched pairs of samples from the two sets, the other does not. I used the latter (an unpaired test). inevity mentioned that a u-test might be better; see this comment.
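
For contrast, a minimal sketch of the three tests being discussed, using scipy; the sample values and variable names below are made up purely for illustration:

# sketch: unpaired t-test vs. paired t-test vs. Mann-Whitney u-test
# on two sets of benchmark samples (values are invented for illustration).
import numpy
from scipy import stats

baseline  = numpy.array([101.2, 99.8, 100.5, 102.1, 98.9])   # e.g. MB/s from run A
candidate = numpy.array([95.3, 96.1, 94.8, 97.0, 95.9])      # e.g. MB/s from run B

# unpaired (independent-samples) t-test, the kind used in this PR's script;
# Welch's variant avoids assuming the two sets have equal variance
print(stats.ttest_ind(baseline, candidate, equal_var=False))

# a paired t-test would only apply if each baseline sample had a matching
# candidate sample (same host, same iteration); shown here for contrast
print(stats.ttest_rel(baseline, candidate))

# Mann-Whitney u-test: the nonparametric alternative inevity suggested,
# which does not assume normally distributed samples
print(stats.mannwhitneyu(baseline, candidate, alternative='two-sided'))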

@jtaleric
Member

jtaleric commented Nov 1, 2021

> @jtaleric This dates back to a disagreement with aakarsh a year ago. Does benchmark-comparison (formerly touchstone) do a statistically valid comparison of 2 sample sets?

Today, it doesn't do inter-uuid comparisons to determine variance between samples.

We would need to define the sample field to do the variance comparison on. We could make some default assumptions for things like FIO, but we would want to allow some generalizations too.

Instead of a stand-alone script, I think it might make more sense to have this be part of the gen_result_dict method?

> It's not valid to conclude that two sample sets are different just by comparing sample-set means; you have to incorporate the variance of the sample sets somehow. I'd have to look closer at benchmark-comparison to see what it does.

Understood. We just need defined ways to capture and compare the samples.

I think we can close this for now, and start an RFE / Issue to discuss how to implement something like this? Thoughts?

@mfleader

mfleader commented Nov 1, 2021

I'm super interested in working on this as soon as I get the cycles. I feel like @whitleykeith has also mentioned being interested in this and how it can integrate with the dashboard.

@jtaleric
Member

jtaleric commented Nov 1, 2021

> I'm super interested in working on this as soon as I get the cycles. I feel like @whitleykeith has also mentioned being interested in this and how it can integrate with the dashboard.

I am not sure I have explained myself well here... There is quite a bit of context you would need in order to tackle this. Not every workload would have "sample".

@mfleader

mfleader commented Nov 1, 2021

That is a good point: workloads would need to output the raw data, not summary statistics like the mean, median, p99, or standard deviation. @bengland2 and I have talked a little bit about this, and some workloads, like FIO, would need to output a histogram of a run's results to facilitate estimating dispersion and extreme values, like p99. Outside of that, we can talk about workloads that do not make raw data available and strategize potential solutions.

@bengland2
Author

Matt Leader ( @mfleader ) showed me this article, which seems like a generalization of what Karl Cronberg and I did with fio histogram logging in 2016. This might be a way to implement this kind of analysis in a more benchmark-independent way, since many perf. benchmarks generate response time data. The advantage of fio's histogram logging method is that it allows you to generate cluster-wide response time percentiles in a statistically valid way, something that does not exist in benchmark-operator today.
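
As a rough sketch of the histogram-merging idea: with per-client histograms that share the same bin edges, a cluster-wide percentile can be read off the merged histogram. The bin edges, counts, and helper function below are invented for illustration and are not fio's actual log format.

# sketch: cluster-wide percentile from per-client latency histograms.
# Assumes every client reports counts over the same bin edges; the numbers
# below are invented for illustration.
import numpy

bin_edges = numpy.array([0.0, 1.0, 2.0, 4.0, 8.0, 16.0])   # latency bins, ms
client_histograms = [
    numpy.array([500, 300, 150, 40, 10]),    # client 1 bin counts
    numpy.array([450, 320, 180, 45,  5]),    # client 2 bin counts
]

# merging histograms is just summing counts bin by bin
merged = numpy.sum(client_histograms, axis=0)

def histogram_percentile(counts, edges, pct):
    # find the bin where the cumulative count crosses the target,
    # then interpolate linearly within that bin
    cumulative = numpy.cumsum(counts)
    target = pct / 100.0 * cumulative[-1]
    i = int(numpy.searchsorted(cumulative, target))
    prev = cumulative[i - 1] if i > 0 else 0
    frac = (target - prev) / counts[i] if counts[i] else 0.0
    return edges[i] + frac * (edges[i + 1] - edges[i])

print('cluster-wide p99 latency: %.2f ms'
      % histogram_percentile(merged, bin_edges, 99))

Computing the percentile from the merged counts is what keeps it statistically meaningful; averaging each client's own p99 would not be.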

@bengland2
Author

I retired.

@bengland2 closed this Apr 25, 2022