Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Add a statement addressing metadata conflicts #761

Closed

Conversation

effigies
Copy link
Collaborator

@effigies effigies commented Mar 17, 2021

Closes #138.

Closes #1133

sappelhoff
sappelhoff previously approved these changes Mar 17, 2021
Copy link
Collaborator

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some wording tune up

src/02-common-principles.md Outdated Show resolved Hide resolved
Comment on lines 141 to 143
If an exact representation of BIDS metadata is not possible in the format,
then the imaging file metadata SHOULD be undefined
or set to the closest possible approximation of the BIDS metadata.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If an exact representation of BIDS metadata is not possible in the format,
then the imaging file metadata SHOULD be undefined
or set to the closest possible approximation of the BIDS metadata.
If only possibly an exact representation of BIDS metadata is possible in the format,
then the data file metadata SHOULD be set to the closest possible approximation of the BIDS metadata.

I think would make it less ambiguous. I do not think we should allow for completely undefined in such cases. If "representation" is not possible at all -- there is nothing to talk about.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The situation I was considering here is that someone might prefer to zero out a value rather than have an approximate but incorrect value in the header. For instance, suppose you have a slice timing sequence that doesn't fit any defined NIFTI_SLICE_SEQ_* field. I'd rather set it NIFTI_SLICE_UNKNOWN than the closest approximation.

I suspect this type of case is going to be relatively common for binary formats.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Yes, for NIFTI1_SLICE_ORDER indeed should be set to NIFTI_SLICE_UNKNOWN and thus indeed "undefined".

Majority of formats likely to not have nil or None or explicit "unknown" like in your example to specify the "undefined" value. So might be worth explicitly stating to set to "undefined" if explicit value is reserved for that (i.e. not 0 for some int or float field)? and also make it as a measure of the last resort while still keeping "SHOULD" on approximate values to make those files usable with generic tools?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of 679b8e0?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/02-common-principles.md Outdated Show resolved Hide resolved

In cases of conflict, the BIDS metadata is considered authoritative.
If BIDS metadata is defined,
format-specific metadata MUST NOT conflict to the extent permitted by the format.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • how poor validator would be able to check that "approximately" the same, or this MUST NOT will not be validated, and thus not enforced?

  • as much as I like consistency, it might be pragmatically prohibitive to achieve it. E.g. if metadata is contained within 1GB (or 1TB?) data file and then extracted and fixed up in metadata file (might be just a minor fix) -- should 1G/TB be re-uploaded?

With that in mind I just wonder if we should keep it as SHOULD NOT (warning if mismatch; not an error if it was MUST NOT) level?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could go with SHOULD NOT. What about something like:

Generally, if BIDS metadata is defined, format-specific metadata SHOULD NOT conflict, to the extent permitted by the format. Specific metadata fields may impose a strict requirement.

I'm thinking here of RepetitionTime and pixdim[4] which MUST agree.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR: because it should be less of fields we allow to overload than of the fields we do not, may be we should keep original specification, but we would need to 1. come up with the annotation to when "conflict" is allowed (e.g. for some "description" fields etc) in side cars. 2. when allowed in metadata from filename entities (e.g. subject or session).

FWIW My longer thinking/line of argument

It seems then again to require per sidecar field "equality" specification that some could be allowed to disagree (thus it would be only a WARNING if sidecar has different value) whenever others MUST agree (and thus ERROR from validator). Most likely the strict requirements would then be added a explicit checks in the schema, like ATM checks being added for lengths consistency e.g. in https://github.com/bids-standard/bids-specification/blob/HEAD/src/schema/rules/checks/asl.yaml#L26 . That would indeed make it explicit but standard itself IMHO somewhat "inconsistent" since it would allow for some fields to be overloaded but not the others, and "by default" allow for them being different. Note that adding a restriction (from a default of "could different") in principle constitutes backward incompatible change, thus should be avoided.

In general though it is a tough aspect to allow such differences since it mandates that tools must now be able to read that "overloaded" metadata from sidecar files. That is why, ideally, we should really strive for consistency. And that is why e.g. we are working with NWB file format folks to allow for ".overwrite.json" file to accompany ".nwb" file so that metadata could be overloaded at the level of the file format, and thus allowing for "cheap" adjustment of metadata if so desired. For NIfTI, the primary workhorse of BIDS, since most of the tools would need correct base metadata such as pixdim, we better indeed enforce their consistency with side car files.

So, overall, I think we MUST enforce consistency and if so desired, relax only for some sidecar metadata fields AND strive to allow for metadata overload at the original file format whenever possible/necessary.

JSON (see [Key/value files](#keyvalue-files-dictionaries)).
In some cases, this duplicates metadata contained in a data file.

In cases of conflict, the BIDS metadata is considered authoritative.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we clarify what a conflict is? for example, is a change of floating point precision a conflict? one reason we store things in binary is because floating point data maintains some veracity. if we see this outside the nifti lens for other modalities allowed by bids (the eeg/meg or upcoming ephys/microscopy formats, where precision starts mattering).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we clarify what a conflict is?

I can think of a few:

  1. Equivalent fields, different values. RepetitionTime vs pixdim[4]
  2. Disparate representations. PhaseEncodingDirection (enum) vs dim_info (bitfield)
  3. Range of representations. PhaseEncodingDirection (axis + direction) vs dim_info (axis)

for example, is a change of floating point precision a conflict?

I'm interpreting this as a question of tolerance? Yes, there should be some tolerance. I imagine the order of tolerance would probably be a case-by-case question.

one reason we store things in binary is because floating point data maintains some veracity. if we see this outside the nifti lens for other modalities allowed by bids (the eeg/meg or upcoming ephys/microscopy formats, where precision starts mattering).

This seems to be arguing that we should say the highest-precision metadata should be authoritative, whether BIDS or format-internal. Or am I misunderstanding?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be arguing that we should say the highest-precision metadata should be authoritative, whether BIDS or format-internal.

in general yes, but we may want to clarify that for floating point values, the internal format always wins when close. this would perhaps avoid the situations where some has zeroed out something like TR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a hard thing to write a rule for. Here's something:

In general, the more precise metadata representation SHOULD be considered authoritative, with BIDS metadata being presumptively more precise. A notable exception to this presumption is if a value is represented as a floating point value in the internal metadata format and a decimal string in the JSON sidecar; to avoid accumulating error in float-decimal conversions, the floating point value is authoritative.

This actually contravenes existing practice in fMRIPrep and FitLins with regard to RepetitionTime/pixdim[4], and may do so for MNE-Python as well, but I guess that's what standardization is for.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good to me.

ping @satra

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like @effigies paragraph - but i think he still had the question out for the community.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@effigies please digest it for me a little on an example. E.g. if we have a value 0.3(3) (to the floating point precision) in the binary file and have 0.333 (to some arbitrary text precision) in sidecar file, the side car 0.333 would be taken as "authorative" but if we have it in side car as "0" or as "1", only then then the data file would have it as "authorative"?

Copy link
Collaborator Author

@effigies effigies Jul 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yarikoptic I intended to write the paragraph 4 comments up to mean (and still read it as saying) "floating point fields in a binary format take precedence over JSON, which is inherently decimal".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to freshen up: re my comment on floating point representation in text. data representation would also be truncated and it would just be a matter of expressing enough precision in json form.

E.g. specifically for my own example messing around in ipython
In [1]: repr(1/3)
Out[1]: '0.3333333333333333'

In [2]: 0.3333333333333333 * 3
Out[2]: 1.0

In [3]: 3.333333333333333e-1 * 3
Out[3]: 1.0

In [4]: 3.33333333333333e-1 * 3
Out[4]: 0.9999999999999989

In [5]: 1/3 == 3.333333333333333e-1
Out[5]: True

In [6]: 1/3 == 3.33333333333333e-1
Out[6]: False

so may be we should just add somewhere wording that "serialization of floating point numbers in json should be to maximal precision of the underlying data type used in the data file" or something like that?

then the closest possible approximation of the BIDS metadata SHOULD be used.
Special null or undefined values MAY be used when available.
If the format-specific metadata field is defined,
the BIDS metadata SHOULD also be defined.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could someone give an example for the last one here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If NIfTI slice_code != 0, then SliceTiming SHOULD be defined?

Copy link
Collaborator

@yarikoptic yarikoptic Jul 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slice_code != 0 is not just that slice_code is defined but rather meets some criteria (!=0) and looking at https://nifti.nimh.nih.gov/nifti-1/documentation/nifti1fields/nifti1fields_pages/slice_code.html describing it as " slice_code = If this is nonzero, AND if slice_dim is nonzero, AND if slice_duration is positive,..." so it is even more complex than just defined. Moreover from above sentence it somehow suggests that it is just a matter of "mirroring" some value between file and BIDS metadata. But in this case it is not the case since SliceTiming is not even in NIfTI, which only specifies the order and timing is not contained in the file. So I will suggest this

@robertoostenveld robertoostenveld self-requested a review July 13, 2022 08:48
@sappelhoff sappelhoff modified the milestones: 1.7.1, 1.8.0 Jul 13, 2022
@sappelhoff sappelhoff changed the title ENH: Add a statement addressing metadata conflicts [ENH] Add a statement addressing metadata conflicts Jul 13, 2022
@effigies
Copy link
Collaborator Author

effigies commented Jul 13, 2022

I'm not sure what approving reviews are approving at this point. There is an open proposal to completely reverse the priority to have file metadata take precedence over BIDS metadata.

Comment on lines 203 to 204
If the format-specific metadata field is defined,
the BIDS metadata SHOULD also be defined.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If the format-specific metadata field is defined,
the BIDS metadata SHOULD also be defined.
Some BIDS metadata can be REQUIRED to be defined depending on the values of the format-specific metadata fields.

although even then the slice_code target use case is quite hard to guess ;) but it also has nothing to do with the overloading really. May be should just be removed?

@yarikoptic
Copy link
Collaborator

There is an open proposal to completely reverse the priority to have file metadata take precedence over BIDS metadata.

yeah, and it seems that we are arriving that overall there might be no effective precedence and metadata must be in agreement between file metadata and BIDS metadata.

@sappelhoff
Copy link
Member

migrating to 1.9.0 milestone because apparently this needs more discussion.

@sappelhoff sappelhoff modified the milestones: 1.8.1, 1.9.0 Feb 14, 2023
@effigies
Copy link
Collaborator Author

I've updated this PR to master because it's listed in the 1.9.0 milestone. I don't think there's a clear consensus on what to do. I've personally kind of given up hope on it, and would appreciate if a motivated party took over leading the discussion in a new issue/PR.

@effigies
Copy link
Collaborator Author

Abandoning this. Others are welcome to pick it up.

@effigies effigies closed this Aug 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants