EconomicRatio PostProcessor Enhancements #1763
Conversation
…hecking no need to overload
…nomicRatio and not throw warnings
…rtinoRatio, gainLossRatio, and expectedShortfall
I think tests will need to be modified or added for BasicStatistics, but I don't know which ones. I did not find tests for EconomicRatio, so suggestions on good tests to write are appreciated.
This looks great! Nice implementation of the percentile STE and callback to BasicStatistics.
A couple comments and one change suggestion to review.
You can find the BasicStatistics standard error tests in …. For EconomicRatio, I sent an email to see if anyone knows where one is; it might need to be created new.
Code changes are good, we just need a test to cover EconomicRatio. I note that several tests using BasicStatistics are failing due to the added percentile STE column, so we'll have to regold those after checking they're the same otherwise.
A notification should be sent to plugin developers noting that BasicStatistics now produces more information than it used to, which may fail some tests that use the output from that postprocessor.
@wangcj05 do we want to send an email to the user group about the new percentile standard error? |
@dgarrett622 Email to the user group is optional; we are required to send an email when there is a defect in the code that will cause incorrect results.
```python
              kde = stats.gaussian_kde(group.values, weights=targWeight)
              val = calculatedPercentiles[target].sel(**{'percent': pct, self.pivotParameter: label}).values
              subPercentileSte.append(factor/kde(val)[0])
            percentileSte.append(subPercentileSte)
          da = xr.DataArray(percentileSte, dims=('percent', self.pivotParameter), coords={'percent': percent, self.pivotParameter: self.pivotValue})
          percentileSteSet[target] = da
        else:
          calcPercentiles = calculatedPercentiles[target]
          if targDa.values.min() == targDa.values.max():
            # distribution is a delta function, so no KDE construction
            percentileSte = list(np.zeros(calcPercentiles.shape))
          else:
            # get KDE
            kde = stats.gaussian_kde(targDa.values, weights=targWeight)
            factor = np.sqrt(np.array(percent)*(1.0 - np.array(percent))/en)
            percentileSte = list(factor/kde(calcPercentiles.values))
```
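For anyone reviewing this hunk in isolation, here is a minimal self-contained sketch of the same KDE-based computation; the function name, the effective-sample-size handling, and the test data are my own illustration rather than RAVEN code.

```python
import numpy as np
from scipy import stats

def percentile_ste_kde(samples, percents, weights=None):
    """Asymptotic standard error of sample percentiles, estimating the PDF at
    each percentile with a Gaussian KDE: SE = sqrt(p*(1-p)/n) / f_hat(x_p)."""
    samples = np.asarray(samples, dtype=float)
    p = np.asarray(percents, dtype=float)            # fractions, e.g. [0.1, 0.5, 0.9]
    if weights is None:
        n = samples.size
    else:
        w = np.asarray(weights, dtype=float)
        n = w.sum()**2 / (w**2).sum()                # Kish effective sample size (one common choice)
    pctVals = np.percentile(samples, 100.0*p)        # unweighted percentiles, for simplicity
    if samples.min() == samples.max():
        # degenerate (delta-function) data: no spread, no KDE
        return np.zeros_like(p)
    kde = stats.gaussian_kde(samples, weights=weights)
    return np.sqrt(p*(1.0 - p)/n) / kde(pctVals)

# quick check on synthetic data
rng = np.random.default_rng(42)
print(percentile_ste_kde(rng.normal(10.0, 2.0, size=1000), [0.1, 0.5, 0.9]))
```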
@dgarrett622 I have a question regarding the calculation of the standard error for percentiles. In the reference, the standard normal distribution is used instead of a KDE of the real data. Could you double-check against the reference and let me know what you think?
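For readers without the snapshot, the usual normal-theory expression for the standard error of the $p$-th percentile (a standard result, reproduced here in place of the image) is

$$\mathrm{SE}(\hat{x}_p) \approx \frac{\sigma}{\varphi(z_p)}\sqrt{\frac{p(1-p)}{n}}, \qquad z_p = \Phi^{-1}(p),$$

where $\varphi$ and $\Phi$ are the standard normal PDF and CDF and $\sigma$ is estimated from the data.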
@wangcj05 That reference gives an equation for the asymptotic standard error estimate of a percentile assuming the random variable belongs to a normal distribution. There is a more general form of this asymptotic estimate that can be found in various references such as:
http://www.medicine.mcgill.ca/epidemiology/hanley/bios601/DescriptiveStatistics/Var(percentile).pdf
Here is the more general asymptotic estimate from the reference above, which is what I implemented:
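Since the image is not reproduced here, the general asymptotic estimate from that reference can be written as

$$\mathrm{SE}(\hat{x}_p) \approx \frac{1}{\hat{f}(\hat{x}_p)}\sqrt{\frac{p(1-p)}{n}},$$

where $\hat{f}$ is the estimated density evaluated at the computed percentile; the diff above uses the Gaussian KDE as $\hat{f}$, i.e. the `factor/kde(...)` expression.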
The KDE tries to get the real distribution from the data without making the assumption that the data are normally distributed. The percentile, its standard error, and the KDE perform better with more samples, but that was always going to be a limitation when requesting a percentile.
@dgarrett622 Thanks for the reference. As you mentioned, and as the paper also concludes, the limitation is the number of samples. I would like this information to be included in the user manual.
I do have some questions about your implementation. In the code, you are using gaussian_kde, which is based on Gaussian kernels. In this case, it seems the two equations from the different papers are equivalent: you can transform your fitted PDF to the standard normal distribution and you will get the same equation. If this is the case, I prefer to use the standard normal distribution to compute the error. The reason is that you would not need to fit a gaussian_kde model for each target variable; especially when you have a larger sample size, the fitting will take more time to process.
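To make the suggestion concrete, a minimal sketch of the normal-assumption computation (my own illustration, not code from this PR) avoids the KDE fit entirely:

```python
import numpy as np
from scipy.stats import norm

def percentile_ste_normal(samples, percents):
    """Percentile standard error assuming normally distributed data:
    SE = sigma * sqrt(p*(1-p)/n) / phi(Phi^{-1}(p)); no gaussian_kde fit needed."""
    x = np.asarray(samples, dtype=float)
    p = np.asarray(percents, dtype=float)
    sigma = x.std(ddof=1)
    z = norm.ppf(p)                                  # standard normal quantiles
    return sigma*np.sqrt(p*(1.0 - p)/x.size)/norm.pdf(z)
```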
Do I understand correctly then that you should only use the generic formula when you have a great many samples, which will take forever due to creating the kernel?
I do agree that generally creating the kernel is costly, and we've often found ways to work around it in RAVEN for that reason. I did not check the time impact of the percentile_ste addition; we really need some timing tests in RAVEN.
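In the absence of timing tests, a rough standalone check of the KDE cost (not part of the RAVEN test suite; sample sizes chosen arbitrarily) is easy to run:

```python
import time
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
for n in (1_000, 10_000, 100_000):
    x = rng.normal(size=n)
    t0 = time.perf_counter()
    kde = stats.gaussian_kde(x)              # build the kernel
    kde(np.percentile(x, [5, 50, 95]))       # evaluate at a few percentiles
    print(f"n = {n:>7d}: {time.perf_counter() - t0:.3f} s")
```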
@wangcj05 The two equations are not equivalent. Using a Gaussian kernel is not the same as fitting a Gaussian to the data. The Gaussian kernel is like smoothing or interpolating to fit a distribution to the real data. Here is an example of using a Gaussian kernel to find a distribution that is definitely not Gaussian:
https://gsalvatovallverdu.gitlab.io/python/kernel_density_estimation/
The distribution fit to the data in that example is clearly not Gaussian.
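As a self-contained illustration of that point (synthetic bimodal data, not the data from the linked post), a Gaussian KDE recovers a two-peaked shape that a single fitted Gaussian cannot:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# clearly non-Gaussian (bimodal) sample
x = np.concatenate([rng.normal(-2.0, 0.5, 5000), rng.normal(3.0, 1.0, 5000)])

kde = stats.gaussian_kde(x)              # Gaussian *kernels*, no assumed shape
mu, sigma = x.mean(), x.std(ddof=1)      # single Gaussian *fit* to the same data

for g in np.linspace(-4.0, 6.0, 6):
    print(f"x = {g:5.1f}   kde = {kde(g)[0]:.3f}   single gaussian = {stats.norm.pdf(g, mu, sigma):.3f}")
# the KDE shows peaks near -2 and 3; the single Gaussian smears them into one bump
```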
The number of samples is an issue here, both for accuracy and for computational cost. Determining the percentile itself requires many samples, especially if the requested percentile is in the tail of the distribution. The more samples you take, the more time it takes to compute the percentile, since we sort the data itself. The limitation in the paper applies to the calculation of the percentile itself and to the standard error calculation, whether the standard normal formulation or the KDE method is used, since both depend on the number of samples. However, the standard normal formulation assumes that the random variable is distributed normally, while the KDE method makes no assumption about the distribution.
An alternate method would be to use bootstrapped samples, but again there are concerns about computational time with that method.
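For completeness, a sketch of that bootstrap alternative (resample count and names are my own choices; the cost concern is visible in the loop):

```python
import numpy as np

def percentile_ste_bootstrap(samples, percent, n_boot=1000, seed=0):
    """Standard error of a single percentile from bootstrap resampling.
    Cost scales with n_boot * len(samples), which is the concern noted above."""
    rng = np.random.default_rng(seed)
    x = np.asarray(samples, dtype=float)
    estimates = [np.percentile(rng.choice(x, size=x.size, replace=True), 100.0*percent)
                 for _ in range(n_boot)]
    return np.std(estimates, ddof=1)
```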
Pull Request Description
What issue does this change request address? (Use "#" before the issue to link it, i.e., #42.)
#1752
What are the significant changes in functionality due to this change request?
- percentile_10_ste_x
- VaR_0.05_NPV
- VaR_0.05_ste_NPV
- es_0.10_NPV
- gainLoss_zero_NPV
- sort_median_NPV
For Change Control Board: Change Request Review
The following review must be completed by an authorized member of the Change Control Board.
… <internalParallel> to True.
If tests used as examples (in raven/tests/framework/user_guide and raven/docs/workshop) have been changed, the associated documentation must be reviewed and assured the text matches the example.