Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First runner manipulating statistical model #50

Merged
merged 27 commits into from
Aug 4, 2023
Merged

First runner manipulating statistical model #50

merged 27 commits into from
Aug 4, 2023

Conversation

dachengx
Copy link
Collaborator

@dachengx dachengx commented Jul 21, 2023

Trying to resolve #3 and #16

The docstring of Runner will be rendered after #68.

What does the code in this PR do / what does it improve?

The runner manipulates the statistical model.

It:

  1. initializes the statistical model
  2. generates or reads toy data
  3. saves toy data if needed
  4. fits parameters
  5. writes the output file

Can you briefly describe how it works?

The changes except in runner.py, models, and model.py is only some format changes, not crucial.

likelihood_config should be contained in statistical_model_args if needed.
fit_guesses is defined in Parameters.

  1. Define a store_data method of BlueiceExtendedModel, to save the generate_values. Because store_data of StatisticalModel only saves likelihood_names in the data_name_list by default.
  2. The updating of the ancillary terms is implemented when you set self.statistical_model.data.
  3. Add keyword arguments best_fit_args and confidence_interval_args to method StatisticalModel.confidence_interval. best_fit_args gives you the global best fit. But when calculating CL, the hypothesis can be different from the best fit, so you need confidence_interval_args.

Can you give a minimal working example (or illustrate with a figure)?

from alea.runner import Runner

parameter_definition = {
    'mu': {
        'fit_guess': 0.0,
        'fittable': True,
        'nominal_value': 0.0,
        'parameter_interval_bounds': [
            -10,
            10
        ],
    },
    'sigma': {
        'fittable': False,
        'nominal_value': 1.0,
    }
}

toydata-generating runner:

generate_runner = Runner(
    statistical_model='alea.examples.gaussian_model.GaussianModel',
    parameter_definition=config['parameter_definition'],
    compute_confidence_interval=True,
    poi='mu',
    hypotheses=['free', 'null', 'true'],
    true_generate_values={'mu': 1., 'sigma': 1.},
    n_mc=3,
    toydata_file=toydata_file,
    toydata_mode='generate_and_write',
    output_file='test_toymc_generate.hdf5',
)
generate_runner.run()

toydata-reading runner:

read_runner = Runner(
    statistical_model='alea.examples.gaussian_model.GaussianModel',
    parameter_definition=config['parameter_definition'],
    compute_confidence_interval=True,
    poi='mu',
    hypotheses=['free', 'null', 'true'],
    true_generate_values={'mu': 1., 'sigma': 1.},
    n_mc=3,
    toydata_file=toydata_file,
    toydata_mode='read',
    output_file='test_toymc_read.hdf5',
)
read_runner.run()

What are the potential drawbacks of the codes?

Things should be implemented later:

  1. confidential intervals
  2. update fit_guess of Parameters

Please include the following if applicable:

  • Update the docstring(s)
  • Update the documentation
  • Tests to check the (new) code is working as desired.
  • Does it solve one of the open issues on github?

Notes on testing

  • Until the automated tests pass, please mark the PR as a draft.

All italic comments can be removed from this template.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit

pep8

alea/runner.py|211 col 1| F811 redefinition of unused 'StatisticalModel' from line 11
alea/runner.py|211 col 1| E402 module level import not at top of file
alea/runner.py|214 col 1| F811 redefinition of unused 'Runner' from line 14
alea/runner.py|214 col 1| WPS338 Found incorrect order of methods in a class
alea/runner.py|214 col 1| WPS230 Found too many public instance attributes: 14 > 6
alea/runner.py|271 col 9| WPS414 Found incorrect unpacking target
alea/runner.py|271 col 9| WPS414 Found incorrect unpacking target
alea/runner.py|271 col 9| WPS414 Found incorrect unpacking target
alea/runner.py|281 col 1| E800 Found commented out code
alea/runner.py|282 col 1| E800 Found commented out code
alea/runner.py|283 col 1| E800 Found commented out code
alea/runner.py|318 col 9| WPS221 Found line with high Jones Complexity: 15 > 14
alea/runner.py|320 col 22| WPS510 Found in used with a non-set container
alea/runner.py|328 col 48| WPS441 Found control variable used after block: ea

alea/blueice_extended_model.py Outdated Show resolved Hide resolved
alea/blueice_extended_model.py Outdated Show resolved Hide resolved
alea/runner.py Show resolved Hide resolved
alea/runner.py Show resolved Hide resolved
alea/runner.py Outdated Show resolved Hide resolved
alea/runner.py Outdated Show resolved Hide resolved
alea/runner.py Outdated Show resolved Hide resolved
alea/runner.py Outdated Show resolved Hide resolved
alea/runner.py Outdated Show resolved Hide resolved
alea/runner.py Outdated Show resolved Hide resolved
alea/runner.py Outdated
confidence_interval_threshold: Callable[[float], float] = None,
poi: str = None,
hypotheses: list = None,
common_generate_values: dict = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the distinction between common_generate_values and true_generate_values?
the generate values are always true for the datasets generated with them, no?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in previous commits.

alea/runner.py Outdated
# pass
return parameter_list, result_list, result_dtype

def _get_generate_values(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming inside here mixed hypothesis and generate values even for when hypothesis is not "true"-- I think this is unfortunate, separating them conceptually is a key to getting the concepts right in my mind.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in previous commits.

alea/runner.py Outdated Show resolved Hide resolved
@github-actions
Copy link

github-actions bot commented Jul 22, 2023

Pull Request Test Coverage Report for Build 5744366197

  • 167 of 171 (97.66%) changed or added relevant lines in 6 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+9.2%) to 72.302%

Changes Missing Coverage Covered Lines Changed/Added Lines %
alea/parameters.py 18 19 94.74%
alea/utils.py 8 9 88.89%
alea/model.py 24 26 92.31%
Files with Coverage Reduction New Missed Lines %
alea/model.py 1 86.47%
Totals Coverage Status
Change from base Build 5730856253: 9.2%
Covered Lines: 757
Relevant Lines: 1047

💛 - Coveralls

@hammannr hammannr self-requested a review July 25, 2023 06:47
alea/runner.py Show resolved Hide resolved
alea/runner.py Show resolved Hide resolved
alea/runner.py Outdated Show resolved Hide resolved
alea/runner.py Show resolved Hide resolved
alea/runner.py Show resolved Hide resolved
alea/runner.py Outdated Show resolved Hide resolved
alea/runner.py Outdated Show resolved Hide resolved
alea/runner.py Outdated Show resolved Hide resolved
alea/runner.py Outdated Show resolved Hide resolved
alea/runner.py Outdated Show resolved Hide resolved
Encourage people to use h5 instead of hdf5
tests/test_runner.py Outdated Show resolved Hide resolved
tests/test_runner.py Outdated Show resolved Hide resolved
tests/test_runner.py Show resolved Hide resolved
alea/template_source.py Show resolved Hide resolved
alea/template_source.py Show resolved Hide resolved
@dachengx
Copy link
Collaborator Author

dachengx commented Jul 31, 2023

@kdund I am done with your suggestions.

I will then tune the pytest.

@dachengx dachengx marked this pull request as ready for review July 31, 2023 20:35
alea/runner.py Outdated Show resolved Hide resolved
alea/runner.py Show resolved Hide resolved
alea/runner.py Show resolved Hide resolved
alea/runner.py Show resolved Hide resolved
alea/runner.py Show resolved Hide resolved
alea/runner.py Show resolved Hide resolved
alea/runner.py Show resolved Hide resolved
@dachengx dachengx requested a review from kdund July 31, 2023 20:38
tests/test_runner.py Outdated Show resolved Hide resolved
alea/model.py Show resolved Hide resolved
alea/model.py Show resolved Hide resolved
alea/model.py Show resolved Hide resolved
alea/model.py Show resolved Hide resolved
alea/model.py Outdated Show resolved Hide resolved
alea/runner.py Show resolved Hide resolved
alea/runner.py Outdated Show resolved Hide resolved
alea/runner.py Outdated Show resolved Hide resolved
alea/runner.py Outdated Show resolved Hide resolved
alea/parameters.py Outdated Show resolved Hide resolved
alea/parameters.py Outdated Show resolved Hide resolved
alea/runner.py Show resolved Hide resolved
tests/test_gaussian_model.py Outdated Show resolved Hide resolved
tests/test_gaussian_model.py Outdated Show resolved Hide resolved
tests/test_gaussian_model.py Outdated Show resolved Hide resolved
tests/test_gaussian_model.py Outdated Show resolved Hide resolved
tests/test_runner.py Show resolved Hide resolved
alea/model.py Show resolved Hide resolved
alea/parameters.py Outdated Show resolved Hide resolved
alea/parameters.py Show resolved Hide resolved
alea/runner.py Show resolved Hide resolved
alea/runner.py Show resolved Hide resolved
Copy link
Collaborator

@kdund kdund left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vvv nice! Some comments

alea/examples/gaussian_model.py Outdated Show resolved Hide resolved
alea/model.py Show resolved Hide resolved
alea/models/blueice_extended_model.py Outdated Show resolved Hide resolved
@@ -245,12 +264,20 @@ def _generate_data(self, **generate_values) -> dict:
data["generate_values"] = dict_to_structured_array(generate_values)
return data

def store_data(self, file_name, data_list, data_name_list=None, metadata=None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate this here, but I feel like it is a bit troublesome-- we want to say you only need ll, generate_data, but here we're redefining other functionality also, when we could just set the data_name_list at init, I guess?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to change likelihood_names here to data_name_list:

alea/alea/model.py

Lines 214 to 218 in b710149

if data_name_list is None:
if hasattr(self, "likelihood_names"):
data_name_list = self.likelihood_names
else:
data_name_list = ["{:d}".format(i) for i in range(len(_data_list[0]))]
?

I think the logic is:

  1. ll and generate_data are mandatory for a new model, this is already demonstrated at GaussianModel
  2. BlueiceExtendedModel needs an overwritten store_data function because it has an advantage that generate_values is also in the self.data, so that the StatisticalModel.store_data can not be directly applied on BlueiceExtendedModel.
  3. Someone overwriting store_data, does not violate the truth that only ll and generate_data are mandatory. In principle, one can overwrite everything.

value = [-MAX_FLOAT, MAX_FLOAT]
else:
if value[0] is None:
value[0] = -MAX_FLOAT
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why max float and not inf?

Copy link
Collaborator Author

@dachengx dachengx Aug 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If set the boundary parameter_interval_bounds to np.inf, we will see errors like:

  0%|          | 0/3 [00:11<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[31], line 1
----> 1 toydata, results = runner.toy_simulation()
      2 runner.write_output(results)

File ~/alea/alea/runner.py:229, in Runner.toy_simulation(self)
    227 fit_result['ll'] = max_llh
    228 if self._compute_confidence_interval and (self.poi not in hypothesis_values):
--> 229     dl, ul = self.statistical_model.confidence_interval(
    230         poi_name=self.poi,
    231         best_fit_args=self._hypotheses_values[0],
    232         confidence_interval_args=hypothesis_values)
    233 else:
    234     dl, ul = np.nan, np.nan

File ~/alea/alea/model.py:465, in StatisticalModel.confidence_interval(self, poi_name, parameter_interval_bounds, confidence_level, confidence_interval_kind, best_fit_args, confidence_interval_args)
    463 if confidence_interval_kind in {"upper", "central"} and t_best_parameter < 0:
    464     if t(parameter_interval_bounds[1]) > 0:
--> 465         ul = brentq(t, best_parameter, parameter_interval_bounds[1])
    466     else:
    467         ul = np.inf

File /opt/XENONnT/anaconda/envs/XENONnT_2023.07.2/lib/python3.9/site-packages/scipy/optimize/_zeros_py.py:784, in brentq(f, a, b, args, xtol, rtol, maxiter, full_output, disp)
    782 if rtol < _rtol:
    783     raise ValueError("rtol too small (%g < %g)" % (rtol, _rtol))
--> 784 r = _zeros._brentq(f, a, b, xtol, rtol, maxiter, args, full_output, disp)
    785 return results_c(full_output, r)

RuntimeError: Failed to converge after 100 iterations.

This is because scipy.optimize.brentq can not accept np.inf as the boundary. Slack discussion: https://xenonnt.slack.com/archives/C04JRF0AZTP/p1690870295199999

So MAX_FLOAT here is a suggested very large value, > 10^19.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which will make the brentq super slow-- I think we should rather ask for a sensible range to always be given.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe give a warning?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess-- do we set in the test configs?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would throw an error tbh

Copy link
Collaborator Author

@dachengx dachengx Aug 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, even if I change MAX_FLOAT from 10^19 to 100, the iteration time did not change.

Checked by

ul, r = brentq(t, best_parameter, parameter_interval_bounds[1], full_output=True)
print(r)

Maybe it is an advantage of brentq.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the gaussian ex of both? I'm frankly worried about all kinds of numerics also at such high values.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved to throw a warning if you do not redefine this

alea/runner.py Show resolved Hide resolved
alea/runner.py Outdated Show resolved Hide resolved
tests/test_runner.py Show resolved Hide resolved
@dachengx
Copy link
Collaborator Author

dachengx commented Aug 2, 2023

I just added a data generator of Runner, and nothing functionally changed.

Copy link
Collaborator

@kdund kdund left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we postpone per-hypothesis switching on whether to compute confidence intervals for now-- we seldom need it, the neyman computation which takes most of the time has no confidence intervals.

alea/model.py Show resolved Hide resolved
value = [-MAX_FLOAT, MAX_FLOAT]
else:
if value[0] is None:
value[0] = -MAX_FLOAT
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess-- do we set in the test configs?

value = [-MAX_FLOAT, MAX_FLOAT]
else:
if value[0] is None:
value[0] = -MAX_FLOAT
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would throw an error tbh

value = [-MAX_FLOAT, MAX_FLOAT]
else:
if value[0] is None:
value[0] = -MAX_FLOAT
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the gaussian ex of both? I'm frankly worried about all kinds of numerics also at such high values.

value = [-MAX_FLOAT, MAX_FLOAT]
else:
if value[0] is None:
value[0] = -MAX_FLOAT
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved to throw a warning if you do not redefine this

alea/model.py Show resolved Hide resolved
@dachengx dachengx linked an issue Aug 4, 2023 that may be closed by this pull request
Copy link
Collaborator

@kdund kdund left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THanks, @dachengx to me this looks good

@dachengx dachengx merged commit 66538e5 into main Aug 4, 2023
@dachengx dachengx deleted the first_runner branch August 4, 2023 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Runner class Improve naming of toydata_mode
2 participants