Formalize YAML metadata fields #56

Closed
ischoegl opened this issue Jul 6, 2020 · 8 comments
Labels
feature-request New feature request

Comments

@ischoegl
Member

ischoegl commented Jul 6, 2020

Abstract

Based on a discussion in Cantera/cantera#881, the description of metadata does not appear to be fully formalized. At the moment, ck2yaml creates note and description fields. See @speth's comment:

Just because ck2yaml currently creates a field named note doesn't mean we should formalize that as the way of storing metadata. I know there are multiple pseudo-standards for organizing metadata in Chemkin mechanism comments which could be used to populate more semantically useful fields. For example, I know RMG puts the SMILES for each species as the comment following each species name in the SPECIES section, and information on the source of the reaction rate before each reaction. If ck2yaml gets smart enough to do this, I'd rather the note field be empty except when it doesn't know what else to do.

Motivation

Defining note and description now, while potentially moving to other ways of storing metadata in the future, risks backward-compatibility problems. At the moment, note contains CHEMKIN comments that qualify as metadata if the YAML file is generated by ck2yaml. For example, if someone uses a YAML file generated by ck2yaml in 2.5 and then revisits the problem in, say, 3.1, should the YAML note fields still be supported and readable? I strongly believe that once ck2yaml sets a precedent by defining specific fields, it will be difficult to change in future releases.

With Cantera 2.5 still in the alpha stage, there's a (rapidly closing) window of opportunity to 'future-proof' metadata fields by renaming note and description to meta and treating it as a generic AnyValue field. It can thus be a string (as in the current implementation) for 2.5, or any other structure down the road.

Possible Solutions

Rename note and description to meta in YAML files (sed replacements in current YAML data), update converters and parsers.
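
As a rough illustration of the renaming step, here is a minimal Python sketch that works on already-parsed YAML data (plain dicts and lists) rather than on text via sed. The function name `rename_metadata_fields` and the policy for nodes that carry both note and description are assumptions made for this example; the proposal itself does not specify either.

```python
from typing import Any

# Assumption: these are the two legacy field names discussed in this issue.
LEGACY_KEYS = ("note", "description")


def rename_metadata_fields(node: Any) -> Any:
    """Return a copy of parsed YAML data with legacy keys folded into 'meta'."""
    if isinstance(node, list):
        return [rename_metadata_fields(item) for item in node]
    if not isinstance(node, dict):
        return node
    if "meta" in node:
        # Already has a 'meta' entry: leave this level's keys alone, recurse only.
        return {key: rename_metadata_fields(value) for key, value in node.items()}

    result = {}
    legacy_values = []
    for key, value in node.items():
        if key in LEGACY_KEYS:
            legacy_values.append(value)
        else:
            result[key] = rename_metadata_fields(value)

    if legacy_values:
        # Hypothetical collision policy: one legacy value is kept as-is,
        # several are collected into a list.
        result["meta"] = legacy_values[0] if len(legacy_values) == 1 else legacy_values
    return result


mechanism = {
    "description": "Example mechanism",
    "species": [{"name": "H2", "composition": {"H": 2}, "note": "TPIS78"}],
}
converted = rename_metadata_fields(mechanism)
print(converted)
```

Working on parsed data avoids accidentally rewriting the word "note" inside strings, which a plain text substitution could do.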

References

@ischoegl ischoegl added the feature-request New feature request label Jul 6, 2020
@ischoegl ischoegl changed the title Rename fields describing YAML metadata (replace note by meta) Future-proof YAML metadata fields (replace note by meta) Jul 9, 2020
@speth
Member

speth commented Jul 9, 2020

I don't really see what problem this is solving. Can you give an example of what you think would go in the meta field, and how that's not supported by the current implementation? My thinking has generally been that we should support unrestricted user-defined fields within at least all of the higher-level YAML data structures.

To avoid creating backwards incompatibilities in the example I gave, I think it would be reasonable for that additional comment-parsing behavior of ck2yaml to be a (non-default) option.

@ischoegl
Member Author

ischoegl commented Jul 9, 2020

@speth

I don't really see what problem this is solving. Can you give an example of what you think would go in the meta field, and how that's not supported by the current implementation?

The way I see this, note (as well as description) already constitute metadata, which just happen to be the version that ck2yaml currently understands. If future versions of ck2yaml would interpret annotations differently (or use different field names in default and non-default behavior), this would mean that notes would all of a sudden be empty, while a previously non-existent meta (or other field name) is populated. My point is that it may be more consistent to create a meta field from the get go, which can be populated using a str (current behavior), or some other structure (non-default behavior of future versions).

From a user perspective, this would mean that <ctobject>.meta would always produce a result, i.e. the behavior is predictable. In the default/non-default version you describe, one or even both would not exist (i.e. it could be <ctobject>.note or <ctobject>.meta or neither exists). I believe that it's less confusing to always have meta defined (i.e. as a standard attribute), and just have it return None if nothing had been specified.

Example

To give a specific example, let's assume that I download a YAML mechanism file from group XYZ's website, and am interested in specifics of some reaction rxn. In my proposed revision, I'd simply type

rxn.meta

and get a result (which may be None, a str, or a dict, i.e. the content of some underlying AnyValue). In the alternative scenario, I may have to track down

rxn.some_custom_field_that_contains_metadata

that may or may not exist for this specific reaction (and may only be used by group XYZ), and/or use

rxn.note

in order to get to the desired information.
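
To make the "always defined" behavior concrete, here is a minimal sketch of what such an attribute could look like. The `Reaction` class below is a hypothetical stand-in written for this example, not Cantera's class, and the `source` entry is made-up illustration data.

```python
from typing import Any, Optional


class Reaction:
    """Hypothetical stand-in for a reaction object, not Cantera's class."""

    def __init__(self, equation: str, input_data: Optional[dict] = None):
        self.equation = equation
        self._input_data = dict(input_data or {})

    @property
    def meta(self) -> Any:
        """Return the 'meta' entry, or None if the input did not define one."""
        return self._input_data.get("meta")


plain = Reaction("2 H + H2 <=> 2 H2")
annotated = Reaction("2 H + H2 <=> 2 H2", {"meta": "some description"})
# 'source' is an invented key, used only to show a structured value.
structured = Reaction("2 H + H2 <=> 2 H2", {"meta": {"source": "group XYZ"}})

for rxn in (plain, annotated, structured):
    print(type(rxn.meta).__name__, rxn.meta)
```

In all three cases the attribute lookup succeeds; only the type of the returned value differs.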

Further, I may want to use yaml2ck to generate CK input: how would the converter know what to put as metadata annotation (especially if XYZ’s format doesn’t use one of the ‘official’ structures)?
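
One way a converter could use a single reserved field is sketched below. The helper `meta_to_ck_comments` is hypothetical and is not part of any existing yaml2ck implementation; it relies only on the fact that Chemkin comments start with `!`, and the one-line-per-key layout for dict values is an assumption.

```python
from typing import Any, List


def meta_to_ck_comments(meta: Any) -> List[str]:
    """Format a meta value as Chemkin comment lines (each starting with '!')."""
    if meta is None:
        return []
    if isinstance(meta, dict):
        # Hypothetical layout: one 'key: value' comment per entry.
        return [f"! {key}: {value}" for key, value in meta.items()]
    # Strings (possibly multi-line) and anything else fall back to plain text.
    return [f"! {line}" for line in str(meta).splitlines()]


for example in (None, "some description", {"alias": "hydrogen", "SMILES": "[HH]"}):
    print(meta_to_ck_comments(example))
```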

If a meta field/attribute makes sense as envisioned here, Cantera/cantera#881 would introduce the external interface, where the underlying implementation is a placeholder for #11.

@ischoegl
Member Author

ischoegl commented Jul 10, 2020

PS: I haven't given an actual example of YAML input yet. Keeping with the definition of meta being data describing data (as opposed to auxiliary data), this would be a possibility:

species:
- name: H2
  composition: {H: 2}
  meta:
    alias: hydrogen
    SMILES: '[HH]'
    InChI: InChI=1S/H2/h1H
    InChI-key: UFHFLCQGNIYNRP-UHFFFAOYSA-N
  thermo:
    model: NASA7
    temperature-ranges: [200.0, 1000.0, 3500.0]
    data:
    - [2.34433112, 7.98052075e-03, -1.9478151e-05, 2.01572094e-08, -7.37611761e-12,
      -917.935173, 0.683010238]
    - [3.3372792, -4.94024731e-05, 4.99456778e-07, -1.79566394e-10, 2.00255376e-14,
      -950.158922, -3.20502331]
    meta: TPIS78
[...]

reactions:
- equation: 2 H + H2 <=> 2 H2  # Reaction 40
  rate-constant: {A: 9.0e+16, b: -0.6, Ea: 0.0}
  meta: some description
[...]

As the meta field would be optional, meta mostly replaces note (anticipated use for Cantera 2.5). For H2, I have added a meta field with hierarchical structure as a mock-up for potential Cantera 2.6+ implementations (using data from PubChem). As mentioned above, there wouldn't be restrictions on the content or structure of the meta field. Note that the specific example naturally combines with #14 as data describing data. Imho, the envisioned output of gas.species('H2').meta (a dict) would be informative for cases like this ...

Obviously, a lot of this is perfectly compatible with #11; the only difference is that meta is a reserved field name that guarantees predictable behavior.

@bryanwweber
Member

@ischoegl Thanks for posting the YAML sample, it helps clarify. I think having some sort of hierarchical structure for the data may make sense, but I have some reservations as well. The biggest reservation is that I don't think there's a need for it at the present moment. I think it is possible to leave everything at the top-level of a given definition without sacrificing backwards compatibility. For instance, we can print deprecation warnings and include the contents of a field in two places for a time, if necessary. We can certainly do our best to avoid this, but I think users upgrading, and especially upgrading across a major version change, should expect some changes.
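
The transition path mentioned above (keeping a field readable in two places for a time, with a deprecation warning on the old one) could look roughly like the following sketch. The `Species` class is a hypothetical stand-in, and the field names note and meta are only placeholders borrowed from this thread.

```python
import warnings
from typing import Any, Optional


class Species:
    """Hypothetical stand-in for a species object, not Cantera's class."""

    def __init__(self, name: str, meta: Optional[Any] = None):
        self.name = name
        self.meta = meta  # new location of the content

    @property
    def note(self) -> Any:
        """Old location: still readable, but warns that it will go away."""
        warnings.warn(
            "'note' is deprecated; use 'meta' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.meta


h2 = Species("H2", meta="TPIS78")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_value = h2.note

print(old_value, [str(w.message) for w in caught])
```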

My second reservation is that it is hard to define what qualifies as metadata, so I think that eventually, any meta field will just become a dumping ground for every property that doesn't seem to have another home. For instance, name could be considered metadata, but will presumably remain at the top level. Likewise, I could argue that SMILES is actually data, not metadata, since it is one way of expressing how the molecule is connected together. For Cantera, that may be metadata. For other tools, it would not be. Ideally, this format will be the new interchange between many tools in much the same way that the CK format is right now. As such, I'd prefer to leave as much flexibility as we can until we discover what people are going to do.

I believe what @speth has in mind is much more flexible than the implementation proposed here and in Cantera/cantera#881. My understanding is that users will not need to access the fields of the YAML as attributes/properties of a class (although if I understand correctly, the ability to dynamically generate attributes may be possible, particularly in Python). As such, there's no need to formalize anything right now and we can effectively punt, with the small caveat that yaml2ck proposed in #52 will not be quite as feature-ful until this ability is added.

In all, I am not in favor of formalizing into the user interface any fields in YAML that Cantera doesn't use directly for calculations. Since the YAML data is not available to users right now anyways, I would prefer to wait for a more complete implementation than bake in something that limits future flexibility. Yes, that means it won't be ready for 2.5. Hopefully (🤞) it won't be 2 years before 2.6 is released with this functionality.

@ischoegl
Member Author

ischoegl commented Jul 10, 2020

@bryanwweber ... thanks for the feedback

My understanding is that users will not need to access the fields of the YAML as attributes/properties of a class (although if I understand correctly, the ability to dynamically generate attributes may be possible, particularly in Python).

I thought that this was the point?! Accessing fields as dynamic attributes is straightforward, and would be quite beneficial.

In all, I am not in favor of formalizing into the user interface any fields in YAML that Cantera doesn't use directly for calculations.

One example where meta would be useful to the user is the differentiation of isomers (see discussions in Cantera/cantera#635 and Cantera/cantera#859, which point towards the need for informed user interaction); while metadata may not be used for calculations, it would be very beneficial to have, especially if there's a dedicated, documented attribute.

I believe what @speth has in mind is much more flexible than the implementation proposed here and in Cantera/cantera#881.

No doubt about that, but I never saw the efforts as mutually exclusive (as I tried to point out elsewhere, there is no contradiction). All I am promoting here is to formalize one field (meta), which is not in any conflict with the major goals of #11.

Regardless, I closed Cantera/cantera#881 and hope that my suggestion can be considered at a later point.

@ischoegl ischoegl changed the title Future-proof YAML metadata fields (replace note by meta) Formalize YAML metadata fields (replace note by meta) Jul 10, 2020
@ischoegl ischoegl changed the title Formalize YAML metadata fields (replace note by meta) Formalize YAML metadata fields Jul 10, 2020
@bryanwweber
Member

I thought that this was the point?! Accessing fields as dynamic attributes is straightforward, and would be quite beneficial.

I'm not sure anyone has ever said that YAML fields would/should be accessible as instance attributes, specifically. There are other options, for instance as @speth has said, he is planning to provide complete access as language-natural data types (dictionaries & lists in Python, etc.). Given the flexibility that we imagine as being possible with YAML, IMO users expecting access by instance attributes is not really feasible, although if it is possible to do dynamic generation, that might be appropriate.

One example where meta would be useful to the user is the differentiation of isomers (see discussions in Cantera/cantera#635 and Cantera/cantera#859, which point towards the need for informed user interaction); while metadata may not be used for calculations, it would be very beneficial to have, especially if there's a dedicated, documented attribute.

I agree that metadata is useful. IMO, providing access in a way that is natural for what users know is in their own input files is better than trying to standardize on a dedicated attribute, particularly since the definition of metadata depends on the application.

No doubt about that, but I never saw the efforts as mutually exclusive (as I tried to point out elsewhere, there is no contradiction). All I am promoting here is to formalize one field (meta), which is not in any conflict with the major goals of #11.

Unfortunately, I do believe there is a conflict, in the sense that I don't believe defining a field called meta will achieve the objectives that we envision for YAML, yet. It may in the future, but there is no need to rush into anything here.

@ischoegl
Member Author

ischoegl commented Jul 10, 2020

I agree that getting 2.5 out is probably the highest priority.

providing access in a way that is natural for what users know is in their own input files is better than trying to standardize on a dedicated attribute

I believe most users will use others' input files, so it's not natural to guess. Thus far, I have never had a need to generate my own mechanism (I'm not a kineticist) and rely on whatever is provided; it would be great to take the guesswork out and access whatever is provided via a dedicated attribute. I assume that not being the author of an input file is the norm amongst end users.

[...] for instance as @speth has said, he is planning to provide complete access as language-natural data types (dictionaries & lists in Python, etc.). Given the flexibility that we imagine as being possible with YAML, IMO users expecting access by instance attributes is not really feasible, although if it is possible to do dynamic generation, that might be appropriate.

What you mentioned here is imho not mutually exclusive. Dynamic attributes are straightforward and certainly feasible (see e.g. most attributes and methods of SolutionArray), which led me to the assumption that this would be the goal; on the other hand, language-natural data types are a means to an end regardless of how they would be accessed. Will stop here and wait until @speth provides more insights into #11.
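
To illustrate that the two access styles can coexist, here is a minimal sketch in plain Python. `InputView` is a hypothetical wrapper written for this example and does not reflect how Cantera or SolutionArray is implemented; the mapping of underscores to hyphens is likewise an assumption.

```python
from typing import Any


class InputView:
    """Hypothetical wrapper around parsed YAML data for one object."""

    def __init__(self, input_data: dict):
        self.input_data = input_data  # language-natural access (dict/list)

    def __getattr__(self, name: str) -> Any:
        # Called only when normal attribute lookup fails.
        # Hyphenated YAML keys such as 'rate-constant' map to 'rate_constant'.
        data = self.__dict__.get("input_data", {})
        for key in (name, name.replace("_", "-")):
            if key in data:
                return data[key]
        raise AttributeError(f"no field named {name!r}")


rxn = InputView({
    "equation": "2 H + H2 <=> 2 H2",
    "rate-constant": {"A": 9.0e16, "b": -0.6, "Ea": 0.0},
    "meta": "some description",
})
print(rxn.input_data["meta"])  # dictionary-style access
print(rxn.meta)                # same value through a dynamic attribute
print(rxn.rate_constant["b"])  # hyphenated key reached via underscore
print(hasattr(rxn, "note"))    # fields absent from the input are not attributes
```

The dictionary stays the single source of truth; the dynamic attribute is only a thin convenience layer on top of it.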

@ischoegl
Member Author

Probably no longer necessary.


3 participants