Paragraph arguing for fixed-parameter yaml files #3

Closed
jeromekelleher opened this issue Nov 17, 2020 · 5 comments

Comments

@jeromekelleher
Member

The discussion around whether the yaml format should allow for distributions of parameters in the demes repo was quite useful:

popsim-consortium/demes-python#63

The conclusion (IMO) is that the yaml should be seen as a concise and precise way of describing a single, fully specified realisation of a demographic model. Adding the ability to specify distributions on the various estimated parameters is at first attractive, but as you look into the details it actually solves very little while hugely complicating the specification of the models and the implementation of parsers for the demes-yaml format.

Some high-level points to make:

  1. Simulators need a fully specified model to run, so even if the yaml specified distributions to draw parameters from, the simulator would have to sample from those distributions before it could run anything. Every simulator that implemented the demes-yaml format would then need to implement the relevant sampling strategies, which would be a lot of duplicated effort. It would be much simpler to have separate software do this model sampling at a higher level and output demes-yaml descriptions, which are then fed to the simulators.
  2. In the limit of very complex models, where the parameters for different parts of the model are interrelated, outputting a series of samples of the model parameters as demes-yaml is about as good as you can do. No language for describing interrelated distributions of parameters would be as general as this. This is an important point for things like MCMC inference methods, which can describe samples from the posterior distribution of inferred models by outputting lots of demes-yaml files. Since these files are concise and would compress well, it's hard to imagine a more efficient way of doing it.
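
The higher-level sampling step in point 1 could look roughly like the sketch below. It is only an illustration: the template mirrors the small example model used later in this thread, and the uniform distributions, their bounds, and the helper names `sample_model` and `sample_models` are all assumptions. Each returned dict is one fully specified model, which would then be written out with a YAML library and handed to a simulator.

```python
import copy
import random

# Hypothetical template: one fully specified model, shaped like the
# demes-yaml example that appears later in this thread.
TEMPLATE = {
    "time_units": "generations",
    "demes": [
        {
            "name": "deme1",
            "epochs": [
                {"end_time": 400.0, "start_size": 500.0},
                {"end_time": 0, "start_size": 100.0},
            ],
        }
    ],
}


def sample_model(template, rng):
    """Return one fully specified model with sizes drawn from assumed priors."""
    model = copy.deepcopy(template)  # never mutate the template itself
    epochs = model["demes"][0]["epochs"]
    # Assumed distributions, chosen purely for illustration.
    epochs[0]["start_size"] = rng.uniform(200.0, 1000.0)
    epochs[1]["start_size"] = rng.uniform(50.0, 200.0)
    return model


def sample_models(template, n, seed=0):
    """Draw n independent, fully specified models from the template."""
    rng = random.Random(seed)
    return [sample_model(template, rng) for _ in range(n)]


models = sample_models(TEMPLATE, 3)
```

The simulators never see a distribution here, only concrete values, so none of them has to implement any sampling logic.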

The one thing that's a bit unsatisfactory is that there's no way for inference methods that do output a model that consists of point estimates to indicate uncertainty around those values. I guess we could add some optional fields for each parameter value to describe this uncertainty, which simulators would typically ignore?

@grahamgower
Member

Thanks @jeromekelleher, that's a nice summary. Regarding your last point, I would note that each value specified in the yaml has a unique "key", e.g. demes.demeA.epoch[3].initial_size. It strikes me that such keys could be used in external files to describe confidence intervals, or even used as column headers in a file where each row corresponds to a single draw from a posterior distribution.
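
A minimal sketch of the "keys as column headers" idea, under stated assumptions: the key syntax is the `demes[0].epochs[0].start_size` style used later in this thread (no key syntax is defined by the format), and `parse_key`, `get_value`, and `draws_to_csv` are hypothetical helper names. Models are plain dicts, as a YAML parser would return them.

```python
import csv
import io
import re

# A key is a sequence of names and [index] parts separated by dots.
_TOKEN = re.compile(r"([A-Za-z_]\w*)|\[(\d+)\]")


def parse_key(key):
    """Split 'demes[0].epochs[1].start_size' into names and integer indices."""
    return [name if name else int(index) for name, index in _TOKEN.findall(key)]


def get_value(model, key):
    """Look up the value that a key refers to in a nested dict/list model."""
    node = model
    for part in parse_key(key):
        node = node[part]
    return node


def draws_to_csv(models, keys):
    """Write one header row of keys, then one row of values per model."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(keys)
    for model in models:
        writer.writerow([get_value(model, key) for key in keys])
    return buf.getvalue()


def make_model(size, time):
    """Build a tiny single-deme model (hypothetical values)."""
    return {"demes": [{"name": "deme1",
                       "epochs": [{"end_time": time, "start_size": size}]}]}


KEYS = ["demes[0].epochs[0].start_size", "demes[0].epochs[0].end_time"]
DRAWS = [make_model(500.0, 400.0), make_model(520.0, 390.0)]
table = draws_to_csv(DRAWS, KEYS)
```

Each row of `table` is one draw, so a set of posterior samples becomes a single flat file that sits alongside one template yaml.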

@sgravel

sgravel commented Jun 28, 2021

Would it be possible to have a field (such as initial_size_label: N0) to refer to keys like demes.demeA.epoch[3].initial_size in a more user-readable manner? I see two benefits for this.

First, many models are defined using such variable names in the literature. These labels would typically be entered as comments in the model, for readability. Having an explicit field for this, even optional, would make things more consistent.

Second, this would make it easier for users to define optimization functions and manipulate the model.
I have been trying to integrate demes for inference into tracts, and faced the same issue of variable model specification as everyone. So far I have followed @apragsdale's approach to integrating demes into moments, discussed here, which is to create a separate yaml file with instructions about which parameters are to be optimized, and what to call them. E.g.,

parameters:
- name: TA
  description: Time before present of ancestral expansion
  values:
  - demes:
      ancestral:
        epochs:
          0: end_time

This tells moments that the end time of the 0th epoch of the ancestral deme has the label TA. The optimizer can then define reasonably readable functions to manipulate the model, as suggested by @grahamgower (e.g., draw parameters from a distribution, or define functional relationships between the parameters).

But the bit of code above is cryptic and bug-prone. If it could be replaced with a single, clear entry in the demes yaml, I would think it preferable.
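
To make the mapping concrete, here is a rough sketch of how an application might apply such an options file to a model. It is not moments' implementation: it handles only the `demes` / `epochs` form shown in the example, the model is a made-up two-epoch deme, and `apply_parameters` is a hypothetical name. Both structures are plain dicts, as a YAML parser would return them.

```python
import copy

# The options file from the example above, as parsed data.
OPTIONS = {
    "parameters": [
        {
            "name": "TA",
            "description": "Time before present of ancestral expansion",
            "values": [{"demes": {"ancestral": {"epochs": {0: "end_time"}}}}],
        }
    ]
}

# Hypothetical model containing a deme called "ancestral".
MODEL = {
    "time_units": "generations",
    "demes": [
        {
            "name": "ancestral",
            "epochs": [
                {"end_time": 400.0, "start_size": 500.0},
                {"end_time": 0, "start_size": 100.0},
            ],
        }
    ],
}


def apply_parameters(model, options, values):
    """Return a copy of model with each labelled field set from values."""
    updated = copy.deepcopy(model)
    demes_by_name = {deme["name"]: deme for deme in updated["demes"]}
    for param in options["parameters"]:
        new_value = values[param["name"]]
        for target in param["values"]:
            for deme_name, spec in target["demes"].items():
                epochs = demes_by_name[deme_name]["epochs"]
                for epoch_index, field in spec["epochs"].items():
                    epochs[int(epoch_index)][field] = new_value
    return updated


updated = apply_parameters(MODEL, OPTIONS, {"TA": 250.0})
```

The three levels of nesting needed just to reach one field are a fair illustration of why the options file is easy to get wrong.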

@grahamgower
Member

We've agreed to add a metadata field at several scopes (popsim-consortium/demes-spec#65). The mini language in moments for "tagging" parameter names could be put inside the metadata. E.g.

time_units: generations
metadata:
  params:
    - N0: demes[0].epochs[0].start_size
    - N1: demes[0].epochs[1].start_size
    - T: demes[0].epochs[0].end_time
demes:
- name: deme1
  epochs:
  - {end_time: 400.0, start_size: 500.0}
  - {end_time: 0, start_size: 100.0}

This does look similarly error-prone though. It's probably better to include such params metadata "closer" to the field itself, as below. The application software would be responsible for ensuring there are no duplicated parameter names across the different metadata fields.

time_units: generations
demes:
- name: deme1
  epochs:
  - end_time: 400.0
    start_size: 500.0
    metadata:
      params:
        - N0: start_size
        - T: end_time
  - end_time: 0.0
    start_size: 100.0
    metadata:
      params:
        - N1: start_size

Honestly though, I don't see this as a long-term solution. But maybe it's time to open an issue in the spec repository to discuss this further?
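
The duplicate-name check that falls to the application could be sketched as follows. This assumes the epoch-level `metadata: params:` layout from the second example above, which is a convention proposed in this thread rather than part of the format; `collect_params` is a hypothetical name and the model is that example as parsed data.

```python
# The second example above, as a YAML parser would return it.
MODEL = {
    "time_units": "generations",
    "demes": [
        {
            "name": "deme1",
            "epochs": [
                {
                    "end_time": 400.0,
                    "start_size": 500.0,
                    "metadata": {"params": [{"N0": "start_size"}, {"T": "end_time"}]},
                },
                {
                    "end_time": 0.0,
                    "start_size": 100.0,
                    "metadata": {"params": [{"N1": "start_size"}]},
                },
            ],
        }
    ],
}


def collect_params(model):
    """Map each parameter name to (deme index, epoch index, field name)."""
    found = {}
    for i, deme in enumerate(model["demes"]):
        for j, epoch in enumerate(deme["epochs"]):
            for entry in epoch.get("metadata", {}).get("params", []):
                for name, field in entry.items():
                    if name in found:
                        raise ValueError(f"duplicate parameter name: {name}")
                    if field not in epoch:
                        raise ValueError(f"{name} refers to missing field: {field}")
                    found[name] = (i, j, field)
    return found


params = collect_params(MODEL)
```

Because each label sits next to the field it names, the only cross-checks left are uniqueness of names and existence of the field, both of which are cheap to do in one pass.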

@sgravel

sgravel commented Jun 28, 2021

Sorry, I did not understand that the metadata could be used this way. This would work fine for my purposes.

Happy to continue the discussion on spec!

@grahamgower
Member

Another point to make in this section of the paper: even if we did choose to include inference-related features and make the format non-static, we cannot adequately cater for inference applications where non-Demes parameters (such as recombination rate) are going to be jointly estimated with the demography.
