adapt RPS #277
At https://www.cawcr.gov.au/projects/verification/verif_web_page.html#RPS, RPS is defined as the mean over all categories. In Wilks (2006) it is the sum over all categories. We have to add a note that clearly specifies which one we use.
I believe the formula on the webpage is wrong.
In the previous version of rps, no member dimension was needed.
Thanks again for doing this @aaronspring! And thanks for the motivation @judithberner.
I agree that there seems to be a discrepancy across definitions of RPS. Most uses and implementations do seem to use the sum (though some don't; for example, the verification package in R takes the mean: http://finzi.psych.upenn.edu/R/library/verification/html/rps.html). I think this change makes us consistent with the majority of the literature and with the other rps-type functions in xskillscore. Also, the ability to have multi-dimensional edges is really great!
Some comments below from just reading the code. Going to play about with actually using the function now and may add some additional comments.
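To make the discrepancy concrete, here is a minimal NumPy sketch (not the xskillscore implementation) that computes the RPS of a single forecast both ways. The helper name `rps_from_probs` and the example probabilities are hypothetical, and the "mean" variant here divides by the number of categories, which is an assumption; some references may normalise differently.

```python
import numpy as np


def rps_from_probs(forecast_probs, obs_probs, reduce="sum"):
    """RPS of one forecast from per-category probabilities (illustrative helper)."""
    # Cumulative distributions over the ordered categories.
    fc_cdf = np.cumsum(forecast_probs)
    obs_cdf = np.cumsum(obs_probs)
    squared_diff = (fc_cdf - obs_cdf) ** 2
    if reduce == "sum":
        return squared_diff.sum()
    # Assumed normalisation: divide by the number of categories.
    return squared_diff.mean()


forecast = np.array([0.2, 0.5, 0.3])  # hypothetical tercile forecast
observed = np.array([0.0, 1.0, 0.0])  # observation fell in the middle category

rps_sum = rps_from_probs(forecast, observed, reduce="sum")
rps_mean = rps_from_probs(forecast, observed, reduce="mean")
```

The two variants differ only by a constant factor for a fixed number of categories, so rankings of forecasts are unaffected, but the reported values are not comparable across conventions.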
@dougiesquire thanks for the review. what about we add |
The new approach yields results identical to the previous histogram algorithm. The only difference is how the upper edge is treated. I see a few ways to make this more straightforward:
Doesn't the current approach already implicitly cover -inf to inf? Because the binning is done with So we could just leave as is and be clear in the docstring that this is the case. Alternatively, if you'd prefer to explicitly add |
Hm. So for now I throw away the first Fc category bin because it corresponds to values smaller than the first category_edges entry. The question is what to expect the user to pass as the argument. As category_edges has to span the full distribution, going from -inf to +inf is required anyway. I tend to just let the user supply the edges in between; that will be an easier interface. The topmost edge is not needed for the rps calculation anyway, as Fc and Oc both have to be 1 there and hence cancel.
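The cancellation argument can be checked with a small NumPy sketch. This is an illustration under assumed names (`cumulative_probs`, the ensemble values, and the observation are all hypothetical), not the code in this PR: adding an explicit `np.inf` top edge contributes a zero term, so the RPS is unchanged.

```python
import numpy as np


def cumulative_probs(values, edges):
    """Fraction of values strictly below each edge (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    edges = np.asarray(edges, dtype=float)
    return (values[:, None] < edges[None, :]).mean(axis=0)


members = np.array([0.1, 0.4, 0.5, 0.9])  # hypothetical ensemble forecast
observation = np.array([0.7])             # hypothetical single observation

inner_edges = [1 / 3, 2 / 3]              # user-supplied tercile edges
with_top = inner_edges + [np.inf]         # same edges plus an explicit top edge

rps_inner = (
    (cumulative_probs(members, inner_edges)
     - cumulative_probs(observation, inner_edges)) ** 2
).sum()
rps_with_top = (
    (cumulative_probs(members, with_top)
     - cumulative_probs(observation, with_top)) ** 2
).sum()
```

Both cumulative distributions equal 1 at the top edge, so `rps_inner` and `rps_with_top` agree.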
Yes sorry, my above comment is only true if we don't throw away the first category_edge. Then, I think the user could input, for example, [1/3, 2/3] for terciles.
Something like this for the docstring: |
Sorry, the last sentence here is misleading. The cdf is really computed for [-inf, 0), [-inf, 1) and [-inf, inf). Maybe change this or remove the last sentence. Also, it's probably more important to emphasise that the edges are right-edge exclusive rather than left-edge inclusive (given that the left edge is always effectively -inf).
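A short sketch of what right-edge exclusive means for values that sit exactly on an edge, assuming the cumulative bins are evaluated with a strict `<` comparison (an assumption for illustration; the variable names are hypothetical):

```python
import numpy as np

category_edges = np.array([0.0, 1.0])
# Cumulative bins: [-inf, 0), [-inf, 1), [-inf, inf)
cdf_edges = np.append(category_edges, np.inf)

values = np.array([-0.5, 0.0, 0.5, 1.0])
# below[i, j] is True when values[i] falls inside the j-th cumulative bin.
below = values[:, None] < cdf_edges[None, :]
```

With this convention a value of exactly 0.0 is excluded from [-inf, 0) but counted in [-inf, 1), and a value of exactly 1.0 is only counted in [-inf, inf).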
Looks great @aaronspring. One more very minor question/comment.
forecasts_category_edge <U38 '[0.0, 0.33), [0.33, 0.66), [0.66, 1.0]'
observations_category_edge <U38 '[0.0, 0.33), [0.33, 0.66), [0.66, 1.0]'
forecasts_category_edge <U38 '[-np.inf, 0.33), [0.33, 0.66), [0.66, np.inf]'
observations_category_edge <U38 '[-np.inf, 0.33), [0.33, 0.66), [0.66, np.inf]'
I'm wondering now if these might be confusing. As in the docstring, the cdf bins are actually all bounded on the left by -np.inf, but these are correct if one is thinking about the pdf. I'm not sure what's clearest for the user: '[-np.inf, 0.33), [0.33, 0.66), [0.66, np.inf]'; or '[-np.inf, 0.33), [-np.inf, 0.66), [-np.inf, np.inf]'.
Happy for you to decide what you think is the clearest, but we should probably be consistent between the docstring and _assign_rps_category_bounds.
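For comparison, the two labelling conventions could be generated along these lines. This is only a sketch: `format_category_bounds` is a hypothetical helper, not `_assign_rps_category_bounds`, and it prints `inf` rather than `np.inf`.

```python
import numpy as np


def format_category_bounds(category_edges, cumulative=False):
    """Build a label string for pdf-style or cdf-style bins (hypothetical helper)."""
    edges = [-np.inf, *category_edges, np.inf]
    labels = []
    for left, right in zip(edges[:-1], edges[1:]):
        # cdf-style bins are all bounded on the left by -inf.
        lower = -np.inf if cumulative else left
        # Only the final bin is closed on the right, mirroring the labels above.
        closing = "]" if right == np.inf else ")"
        labels.append(f"[{lower}, {right}{closing}")
    return ", ".join(labels)


pdf_labels = format_category_bounds([0.33, 0.66])
cdf_labels = format_category_bounds([0.33, 0.66], cumulative=True)
```

Whichever convention is chosen, generating the docstring example and the coordinate labels from the same rule keeps them consistent.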
So the question is whether we think in terms of the cdf or the pdf: cumulative bins or single-size bins? Although rps is computed based on cdfs, I would still call it category_edges and therefore use the first choice of category edges as coords.
Sounds good, but then we should maybe also change the docstring to be consistent. That is, back to "For example, specifying category_edges = [0,1] will compute the rps for bins [-inf, 0), [0, 1) and [1, inf)"
Then I'm happy to merge, thanks!
Description
fair
gist: https://gist.github.com/aaronspring/3c3db6b7d5f39c08643e818b0964ee6c
Closes #275, #266