First runner manipulating statistical model #50
Conversation
Remaining comments that could not be posted as review comments, to avoid the GitHub rate limit:
pep8
alea/runner.py|211 col 1| F811 redefinition of unused 'StatisticalModel' from line 11
alea/runner.py|211 col 1| E402 module level import not at top of file
alea/runner.py|214 col 1| F811 redefinition of unused 'Runner' from line 14
alea/runner.py|214 col 1| WPS338 Found incorrect order of methods in a class
alea/runner.py|214 col 1| WPS230 Found too many public instance attributes: 14 > 6
alea/runner.py|271 col 9| WPS414 Found incorrect unpacking target
alea/runner.py|281 col 1| E800 Found commented out code
alea/runner.py|282 col 1| E800 Found commented out code
alea/runner.py|283 col 1| E800 Found commented out code
alea/runner.py|318 col 9| WPS221 Found line with high Jones Complexity: 15 > 14
alea/runner.py|320 col 22| WPS510 Found in used with a non-set container
alea/runner.py|328 col 48| WPS441 Found control variable used after block: ea
alea/runner.py
Outdated
confidence_interval_threshold: Callable[[float], float] = None,
poi: str = None,
hypotheses: list = None,
common_generate_values: dict = None,
What is the distinction between common_generate_values and true_generate_values? The generate values are always true for the datasets generated with them, no?
Updated in previous commits.
alea/runner.py
Outdated
# pass
return parameter_list, result_list, result_dtype

def _get_generate_values(self):
The naming in here mixes hypothesis and generate values, even when the hypothesis is not "true". I think this is unfortunate; separating them conceptually is key to getting the concepts right, in my mind.
Updated in previous commits.
Pull Request Test Coverage Report for Build 5744366197
💛 - Coveralls
Encourage people to use h5 instead of hdf5
@kdund I am done with your suggestions. Next I will tune the pytest.
vvv nice! Some comments
@@ -245,12 +264,20 @@ def _generate_data(self, **generate_values) -> dict:
        data["generate_values"] = dict_to_structured_array(generate_values)
        return data

    def store_data(self, file_name, data_list, data_name_list=None, metadata=None):
I appreciate this here, but I feel like it is a bit troublesome: we want to say you only need ll and generate_data, but here we are also redefining other functionality, when we could just set the data_name_list at init, I guess?
Do you want to change likelihood_names here to data_name_list?

Lines 214 to 218 in b710149:

    if data_name_list is None:
        if hasattr(self, "likelihood_names"):
            data_name_list = self.likelihood_names
        else:
            data_name_list = ["{:d}".format(i) for i in range(len(_data_list[0]))]
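As a stand-alone illustration of the quoted fallback (the helper name default_data_name_list, the WithNames class, and the "sr0"/"sr1" names are hypothetical, not from alea):

```python
def default_data_name_list(model, data_list):
    # Mirrors the quoted logic: prefer the model's likelihood_names,
    # otherwise number the datasets "0", "1", ...
    if hasattr(model, "likelihood_names"):
        return model.likelihood_names
    return ["{:d}".format(i) for i in range(len(data_list[0]))]


class WithNames:
    likelihood_names = ["sr0", "sr1"]  # hypothetical names


print(default_data_name_list(WithNames(), [[None, None]]))     # -> ['sr0', 'sr1']
print(default_data_name_list(object(), [[None, None, None]]))  # -> ['0', '1', '2']
```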
I think the logic is:

- ll and generate_data are mandatory for a new model; this is already demonstrated at GaussianModel.
- BlueiceExtendedModel needs an overwritten store_data function, because it has the advantage that generate_values is also in self.data, so StatisticalModel.store_data can not be directly applied on BlueiceExtendedModel.
- Someone overwriting store_data does not violate the truth that only ll and generate_data are mandatory. In principle, one can overwrite everything.
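The first point can be sketched with a toy model that becomes usable as soon as ll and generate_data exist; all names here are illustrative only, not the actual alea classes:

```python
import numpy as np


class MinimalGaussianModel:
    # Illustrative sketch: the two mandatory pieces are ll() and generate_data().
    def __init__(self, mu=0.0, sigma=1.0):
        self.mu = mu
        self.sigma = sigma
        self.data = None

    def ll(self, mu=None):
        # Gaussian log-likelihood (up to a constant) of the attached data
        mu = self.mu if mu is None else mu
        d = np.asarray(self.data)
        return float(np.sum(-0.5 * ((d - mu) / self.sigma) ** 2))

    def generate_data(self, n=100, seed=None):
        rng = np.random.default_rng(seed)
        return rng.normal(self.mu, self.sigma, size=n)


model = MinimalGaussianModel(mu=1.0)
model.data = model.generate_data(n=50, seed=1)
print(len(model.data), np.isfinite(model.ll()))
```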
alea/parameters.py
Outdated
    value = [-MAX_FLOAT, MAX_FLOAT]
else:
    if value[0] is None:
        value[0] = -MAX_FLOAT
Why MAX_FLOAT and not inf?
If we set the boundary parameter_interval_bounds to np.inf, we will see errors like:
0%| | 0/3 [00:11<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[31], line 1
----> 1 toydata, results = runner.toy_simulation()
2 runner.write_output(results)
File ~/alea/alea/runner.py:229, in Runner.toy_simulation(self)
227 fit_result['ll'] = max_llh
228 if self._compute_confidence_interval and (self.poi not in hypothesis_values):
--> 229 dl, ul = self.statistical_model.confidence_interval(
230 poi_name=self.poi,
231 best_fit_args=self._hypotheses_values[0],
232 confidence_interval_args=hypothesis_values)
233 else:
234 dl, ul = np.nan, np.nan
File ~/alea/alea/model.py:465, in StatisticalModel.confidence_interval(self, poi_name, parameter_interval_bounds, confidence_level, confidence_interval_kind, best_fit_args, confidence_interval_args)
463 if confidence_interval_kind in {"upper", "central"} and t_best_parameter < 0:
464 if t(parameter_interval_bounds[1]) > 0:
--> 465 ul = brentq(t, best_parameter, parameter_interval_bounds[1])
466 else:
467 ul = np.inf
File /opt/XENONnT/anaconda/envs/XENONnT_2023.07.2/lib/python3.9/site-packages/scipy/optimize/_zeros_py.py:784, in brentq(f, a, b, args, xtol, rtol, maxiter, full_output, disp)
782 if rtol < _rtol:
783 raise ValueError("rtol too small (%g < %g)" % (rtol, _rtol))
--> 784 r = _zeros._brentq(f, a, b, xtol, rtol, maxiter, args, full_output, disp)
785 return results_c(full_output, r)
RuntimeError: Failed to converge after 100 iterations.
This is because scipy.optimize.brentq can not accept np.inf as a boundary. Slack discussion: https://xenonnt.slack.com/archives/C04JRF0AZTP/p1690870295199999

So MAX_FLOAT here is a suggested very large value, > 10^19.
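A quick check of the behaviour discussed here (toy function, assumed setup, not the actual alea test statistic): brentq accepts a finite but huge bound, and because Brent's method mixes secant and inverse-quadratic steps with bisection, convergence stays fast for a well-behaved function.

```python
from scipy.optimize import brentq


def t(x):
    # stand-in for the test statistic: monotone, with its root at x = 2
    return x - 2.0


# Finite but very large upper bound: accepted, and the secant step
# jumps almost directly to the root despite the huge interval.
root, r = brentq(t, 0.0, 1e19, full_output=True)
print(root, r.iterations)
```

This matches the observation above that shrinking MAX_FLOAT from 10^19 to 100 did not change the iteration count, although very large bounds can still be numerically risky for less well-behaved statistics.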
That will make brentq super slow; I think we should rather ask for a sensible range to always be given.
Maybe give a warning?
I guess; do we set it in the test configs?
I would throw an error tbh
Well, even if I change MAX_FLOAT from 10^19 to 100, the iteration count does not change. Checked by:

    ul, r = brentq(t, best_parameter, parameter_interval_bounds[1], full_output=True)
    print(r)

Maybe it is an advantage of brentq.
For the Gaussian example of both? I'm frankly worried about all kinds of numerics at such high values.
Resolved: throw a warning if you do not redefine this.
I just added a data generator of …
I suggest we postpone per-hypothesis switching of whether to compute confidence intervals for now; we seldom need it, and the Neyman computation, which takes most of the time, has no confidence intervals.
Thanks, @dachengx, to me this looks good.
Trying to resolve #3 and #16.

The docstring of Runner will be rendered after #68.

What does the code in this PR do / what does it improve?

The runner manipulates the statistical model. It:
Can you briefly describe how it works?
The changes except in runner.py, models, and model.py are only some format changes, not crucial.

- likelihood_config should be contained in statistical_model_args if needed.
- fit_guesses is defined in Parameters.
- store_data method of BlueiceExtendedModel, to save the generate_values, because store_data of StatisticalModel only saves likelihood_names in the data_name_list by default.
- self.statistical_model.data.
- best_fit_args and confidence_interval_args to method StatisticalModel.confidence_interval. best_fit_args gives you the global best fit. But when calculating CL, the hypothesis can be different from the best fit, so you need confidence_interval_args.

Can you give a minimal working example (or illustrate with a figure)?
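A self-contained toy of the confidence-interval pattern described above (assumed names and a simplified Gaussian setup, not the alea API): the test statistic compares each hypothesis value of the parameter to the global best fit, and brentq locates where it crosses the critical value, which is exactly why the scan needs both a best-fit reference and per-hypothesis arguments.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
data = rng.normal(1.0, 1.0, size=100)  # known sigma = 1
n = len(data)
best = float(np.mean(data))            # global best fit of mu


def ll(mu):
    # Gaussian log-likelihood up to a constant
    return -0.5 * float(np.sum((data - mu) ** 2))


crit = 1.0  # chi2(1) critical value for a 68% central interval


def t(mu):
    # likelihood-ratio test statistic minus the critical value;
    # for this setup t(mu) = n * (mu - best)**2 - crit analytically
    return -2.0 * (ll(mu) - ll(best)) - crit


dl = brentq(t, best - 10.0, best)  # lower limit
ul = brentq(t, best, best + 10.0)  # upper limit
print(dl, ul)                      # best -/+ 1/sqrt(n)
```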
toydata-generating runner:
toydata-reading runner:
What are the potential drawbacks of the code?

Things that should be implemented later:
- fit_guess of Parameters