
example of how to do a statistical performance regression test #16

Closed
wants to merge 2 commits
Conversation

bengland2

Plan to make this callable, but it can be run today using input files containing samples.
from sys import argv, exit
import math
import numpy
import scipy
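
The lines above are just the opening imports from the diff. As a rough, hedged sketch (not the PR's actual script), the comparison it describes might look like the following, assuming two plain-text input files with one numeric sample per line; the usage string, file names, and the 0.05 significance threshold are illustrative choices only:

#!/usr/bin/env python
# sketch: compare two sets of benchmark samples with an unpaired t-test.
# Assumes each input file contains one numeric sample per line; the 0.05
# significance threshold is an illustrative choice, not part of the PR.
from sys import argv, exit
import numpy
from scipy import stats

def read_samples(path):
    # one floating-point sample per line; blank lines ignored
    with open(path) as f:
        return numpy.array([float(line) for line in f if line.strip()])

if __name__ == '__main__':
    if len(argv) != 3:
        exit('usage: t_test.py baseline_samples.txt candidate_samples.txt')
    baseline = read_samples(argv[1])
    candidate = read_samples(argv[2])
    # Welch's unpaired t-test: does not assume the two sets have equal variance
    t_stat, p_value = stats.ttest_ind(baseline, candidate, equal_var=False)
    print('means: %.3f vs %.3f, t = %.3f, p = %.4f'
          % (baseline.mean(), candidate.mean(), t_stat, p_value))
    if p_value < 0.05:
        print('difference is statistically significant at the 5% level')
    else:
        print('no statistically significant difference detected')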

@aakarshg (Contributor) left a comment

Added a comment about requirements. Moving forward, if we'd like to make use of this we'll need to make changes to touchstone: with the ES module we don't just get samples and then do aggregations, but if I understand correctly we can make use of this t_test if we can expose the samples directly...

@jtaleric
Member

@bengland2 doing some late spring cleaning, do we want to keep this around?

@bengland2
Author

@jtaleric This dates back to a disagreement with aakarsh a year ago. Does benchmark-comparison (formerly touchstone) do a statistically valid comparison of 2 sample sets? It's not valid to conclude that two sample sets are different just by comparing sample-set means; you have to incorporate the variance of the sample sets somehow. I'd have to look closer at benchmark-comparison to see what it does.

@mfleader

mfleader commented Aug 6, 2021

@bengland2 has a point about comparing two sets of samples on an index value like a mean. However, use of the frequentist t-test on any given pair of samples of continuous data is not justified.

@bengland2
Author

There are two kinds of t-test: one uses matched pairs of samples from the two sets, the other does not. I used the latter (an unpaired test). inevity mentioned that a u-test might be better; see this comment.
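
For contrast, a minimal sketch of the three tests being discussed, using scipy; the sample values and variable names below are made up purely for illustration:

# sketch: unpaired t-test vs. paired t-test vs. Mann-Whitney u-test
# on two sets of benchmark samples (values are invented for illustration).
import numpy
from scipy import stats

baseline  = numpy.array([101.2, 99.8, 100.5, 102.1, 98.9])   # e.g. MB/s from run A
candidate = numpy.array([95.3, 96.1, 94.8, 97.0, 95.9])      # e.g. MB/s from run B

# unpaired (independent-samples) t-test, the kind used in this PR's script;
# Welch's variant avoids assuming the two sets have equal variance
print(stats.ttest_ind(baseline, candidate, equal_var=False))

# a paired t-test would only apply if each baseline sample had a matching
# candidate sample (same host, same iteration); shown here for contrast
print(stats.ttest_rel(baseline, candidate))

# Mann-Whitney u-test: the nonparametric alternative inevity suggested,
# which does not assume normally distributed samples
print(stats.mannwhitneyu(baseline, candidate, alternative='two-sided'))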

@jtaleric
Member

jtaleric commented Nov 1, 2021

> @jtaleric This dates back to a disagreement with aakarsh a year ago. Does benchmark-comparison (formerly touchstone) do a statistically valid comparison of 2 sample sets?

Today, it doesn't do inter-uuid comparisons to determine variance between samples.

We would need to define the sample field to do the variance comparison on. We could make some default assumptions for things like FIO, but we would want to allow some generalizations too.

Instead of a stand-alone script, I think it might make more sense to have this be part of the gen_result_dict method?

> It's not valid to conclude that two sample sets are different just by comparing sample-set means; you have to incorporate the variance of the sample sets somehow. I'd have to look closer at benchmark-comparison to see what it does.

Understood. We just need defined ways to capture and compare the samples.

I think we can close this for now, and start an RFE / Issue to discuss how to implement something like this? Thoughts?

@mfleader

mfleader commented Nov 1, 2021

I'm super interested in working on this as soon as I get the cycles. I feel like @whitleykeith has also mentioned being interested in this and how it can integrate with the dashboard.

@jtaleric
Member

jtaleric commented Nov 1, 2021

> I'm super interested in working on this as soon as I get the cycles. I feel like @whitleykeith has also mentioned being interested in this and how it can integrate with the dashboard.

I am not sure I have explained myself well here... There is quite a bit of context you would need in order to tackle this. Not every workload would have "sample".

@mfleader

mfleader commented Nov 1, 2021

That is a good point: workloads would need to output the raw data, not summary statistics like the mean, median, p99, or standard deviation. @bengland2 and I have talked a little bit about this, and some workloads, like FIO, would need to output a histogram of a run's results to facilitate estimating dispersion and extreme values, like p99. Outside of that, we can talk about workloads that do not make raw data available and strategize potential solutions.

@bengland2
Author

Matt Leader ( @mfleader ) showed me this article, which seems like a generalization of what Karl Cronberg and I did with fio histogram logging in 2016. This might be a way to implement this kind of analysis in a more benchmark-independent way, since many perf. benchmarks generate response time data. The advantage of fio's histogram logging method is that it allows you to generate cluster-wide response time percentiles in a statistically valid way, something that does not exist in benchmark-operator today.
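
As a rough sketch of the histogram-merging idea: with per-client histograms that share the same bin edges, a cluster-wide percentile can be read off the merged histogram. The bin edges, counts, and helper function below are invented for illustration and are not fio's actual log format.

# sketch: cluster-wide percentile from per-client latency histograms.
# Assumes every client reports counts over the same bin edges; the numbers
# below are invented for illustration.
import numpy

bin_edges = numpy.array([0.0, 1.0, 2.0, 4.0, 8.0, 16.0])   # latency bins, ms
client_histograms = [
    numpy.array([500, 300, 150, 40, 10]),    # client 1 bin counts
    numpy.array([450, 320, 180, 45,  5]),    # client 2 bin counts
]

# merging histograms is just summing counts bin by bin
merged = numpy.sum(client_histograms, axis=0)

def histogram_percentile(counts, edges, pct):
    # find the bin where the cumulative count crosses the target,
    # then interpolate linearly within that bin
    cumulative = numpy.cumsum(counts)
    target = pct / 100.0 * cumulative[-1]
    i = int(numpy.searchsorted(cumulative, target))
    prev = cumulative[i - 1] if i > 0 else 0
    frac = (target - prev) / counts[i] if counts[i] else 0.0
    return edges[i] + frac * (edges[i + 1] - edges[i])

print('cluster-wide p99 latency: %.2f ms'
      % histogram_percentile(merged, bin_edges, 99))

Computing the percentile from the merged counts is what keeps it statistically meaningful; averaging each client's own p99 would not be.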

@bengland2
Author

I retired.

@bengland2 closed this Apr 25, 2022