example of how to do a statistical performance regression test #16
plan to make this callable, but it can be run today using input files containing samples
from sys import argv, exit
import math
import numpy
import scipy
we'll need to add these to the https://github.com/cloud-bulldozer/touchstone/blob/master/requirements.txt
Added a comment about requirements. Moving forward, if we'd like to make use of this, we'll need to make changes to touchstone: the es module doesn't just fetch samples and then aggregate them. But if I understand correctly, we can make use of this t_test if we can expose the samples directly...
@bengland2 doing some late spring cleaning, do we want to keep this around?
@jtaleric This dates back to a disagreement with aakarsh a year ago. Does benchmark-comparison (formerly touchstone) do a statistically valid comparison of 2 sample sets? It's not valid to conclude that two sample sets are different by just comparing sample set means, you have to incorporate variance of sample sets somehow. I'd have to look closer at benchmark-comparison to see what it does.
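To illustrate the point about variance, here is a minimal sketch using only the Python standard library. It computes Welch's t statistic, which scales the difference in means by the sample variances. The helper name `welch_t` and all sample values are hypothetical and are not taken from the script in this PR or from benchmark-comparison.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom
    for two independent sample sets (hypothetical helper)."""
    # Squared standard error of each sample mean.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up throughput samples: both pairs differ in mean by exactly 5,
# but the second pair is far noisier.
quiet_base = [100, 101, 99, 100, 102, 98]
quiet_new = [105, 106, 104, 105, 107, 103]
noisy_base = [100, 120, 80, 110, 90, 100]
noisy_new = [105, 125, 85, 115, 95, 105]

t_quiet, df_quiet = welch_t(quiet_base, quiet_new)
t_noisy, df_noisy = welch_t(noisy_base, noisy_new)
print(f"quiet pair: t = {t_quiet:.2f}, df = {df_quiet:.1f}")
print(f"noisy pair: t = {t_noisy:.2f}, df = {df_noisy:.1f}")
```

The same difference in means gives a large statistic for the quiet pair and a small one for the noisy pair, so a comparison of means alone cannot tell the two situations apart.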
@bengland2 Has a point about comparing two sets of samples on an index value like a mean. However, use of the frequentist t-test on any given pair of samples of continuous data is not justified.
There are two kinds of t-test, one uses pairs of samples from different sets, one does not. I used the latter. inevity mentioned that a u-test might be better, see this comment.
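For reference, a short sketch of how the tests mentioned here look with `scipy.stats`. The sample values are made up, and the choice of Welch's variant (`equal_var=False`) for the unpaired test is an assumption; the thread does not say which variant the script uses.

```python
from scipy import stats

# Hypothetical response-time samples (ms) from two independent runs.
baseline = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.4, 9.7]
candidate = [11.0, 10.9, 11.4, 10.8, 11.2, 11.1, 10.7, 11.3]

# Unpaired (independent) t-test. equal_var=False selects Welch's variant,
# which does not assume the two sets share a variance.
unpaired = stats.ttest_ind(baseline, candidate, equal_var=False)

# Paired t-test. Only meaningful when sample i of one set is matched
# with sample i of the other; shown here purely for contrast.
paired = stats.ttest_rel(baseline, candidate)

# Mann-Whitney U test: rank-based, with no normality assumption.
u_test = stats.mannwhitneyu(baseline, candidate, alternative="two-sided")

print("unpaired t-test p-value:", unpaired.pvalue)
print("paired t-test p-value:  ", paired.pvalue)
print("Mann-Whitney U p-value: ", u_test.pvalue)
```

Independent benchmark runs generally have no natural pairing between individual samples, which is why the unpaired form fits this use case; the U test is an option when the samples are clearly not normally distributed.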
Today, it doesn't do inter-uuid comparisons to determine variance between samples. We would need to define the sample field to do the variance comparison on. We could make some default assumptions, for things like FIO, but we would want to allow some generalizations too. Instead of a stand-alone script, I think it might make more sense to have this part of the
Understood. We just need defined ways to capture and compare the samples. I think we can close this for now, and start an RFE / Issue to discuss how to implement something like this? Thoughts?
I'm super interested in working on this as soon as I get the cycles. I feel like @whitleykeith has also mentioned being interested in this and how it can integrate with the dashboard.
I am not sure I have explained myself well here... There is quite a bit of context you would need in order to tackle this. Not every workload would have "samples".
That is a good point, workloads would need to output the raw data, and not summary statistics like the mean, median, p99, or standard deviation. @bengland2 and I have talked a little bit about this, and some workloads, like FIO, would need to output a histogram of a run's results to facilitate estimating dispersion and extreme values, like p99. Outside of that, we can talk about workloads that do not make raw data available, and strategize potential solutions.
Matt Leader (@mfleader) showed me this article, which seems like a generalization of what Karl Cronberg and I did with fio histogram logging in 2016. This might be a way to implement this kind of analysis in a more benchmark-independent way, since many performance benchmarks generate response time data. The advantage of fio's histogram logging method is that it allows you to generate cluster-wide response time percentiles in a statistically valid way, something that does not exist in benchmark-operator today.
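A minimal sketch of the idea behind histogram-based cluster-wide percentiles, assuming every host reports counts against the same bucket bounds. The bucket layout, the helper names `merge_histograms` and `percentile_from_histogram`, and the counts are all hypothetical; this is not fio's actual log format or tooling.

```python
from collections import Counter


def merge_histograms(histograms):
    """Sum per-bucket counts across hosts (hypothetical helper).
    All histograms must use the same bucket upper bounds."""
    total = Counter()
    for hist in histograms:
        total.update(hist)
    return total


def percentile_from_histogram(hist, pct):
    """Return the upper bound of the bucket that contains the
    pct-th percentile (hypothetical helper)."""
    n = sum(hist.values())
    seen = 0
    for upper_bound in sorted(hist):
        seen += hist[upper_bound]
        # Integer comparison avoids floating-point rounding at the boundary.
        if seen * 100 >= pct * n:
            return upper_bound
    raise ValueError("empty histogram")


# Made-up data: bucket upper bound (ms) -> number of I/O completions.
host_a = {1: 900, 2: 90, 4: 9, 8: 1}
host_b = {1: 500, 2: 300, 4: 150, 8: 50}

cluster = merge_histograms([host_a, host_b])
print("host A p99 bucket: ", percentile_from_histogram(host_a, 99))
print("host B p99 bucket: ", percentile_from_histogram(host_b, 99))
print("cluster p99 bucket:", percentile_from_histogram(cluster, 99))
```

Bucket counts can be added across hosts, whereas per-host percentiles cannot be averaged into a valid cluster-wide percentile. The resolution of the result is limited by the bucket widths.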
I retired. |