-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Add a statement addressing metadata conflicts #761
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some wording tune up
src/02-common-principles.md
Outdated
If an exact representation of BIDS metadata is not possible in the format, | ||
then the imaging file metadata SHOULD be undefined | ||
or set to the closest possible approximation of the BIDS metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If an exact representation of BIDS metadata is not possible in the format, | |
then the imaging file metadata SHOULD be undefined | |
or set to the closest possible approximation of the BIDS metadata. | |
If only possibly an exact representation of BIDS metadata is possible in the format, | |
then the data file metadata SHOULD be set to the closest possible approximation of the BIDS metadata. |
I think would make it less ambiguous. I do not think we should allow for completely undefined in such cases. If "representation" is not possible at all -- there is nothing to talk about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The situation I was considering here is that someone might prefer to zero out a value rather than have an approximate but incorrect value in the header. For instance, suppose you have a slice timing sequence that doesn't fit any defined NIFTI_SLICE_SEQ_*
field. I'd rather set it NIFTI_SLICE_UNKNOWN
than the closest approximation.
I suspect this type of case is going to be relatively common for binary formats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Yes, for NIFTI1_SLICE_ORDER
indeed should be set to NIFTI_SLICE_UNKNOWN
and thus indeed "undefined".
Majority of formats likely to not have nil
or None
or explicit "unknown" like in your example to specify the "undefined" value. So might be worth explicitly stating to set to "undefined" if explicit value is reserved for that (i.e. not 0
for some int or float field)? and also make it as a measure of the last resort while still keeping "SHOULD" on approximate values to make those files usable with generic tools?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of 679b8e0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping @yarikoptic
src/02-common-principles.md
Outdated
|
||
In cases of conflict, the BIDS metadata is considered authoritative. | ||
If BIDS metadata is defined, | ||
format-specific metadata MUST NOT conflict to the extent permitted by the format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
how poor validator would be able to check that "approximately" the same, or this MUST NOT will not be validated, and thus not enforced?
-
as much as I like consistency, it might be pragmatically prohibitive to achieve it. E.g. if metadata is contained within 1GB (or 1TB?) data file and then extracted and fixed up in metadata file (might be just a minor fix) -- should 1G/TB be re-uploaded?
With that in mind I just wonder if we should keep it as SHOULD NOT (warning if mismatch; not an error if it was MUST NOT) level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could go with SHOULD NOT. What about something like:
Generally, if BIDS metadata is defined, format-specific metadata SHOULD NOT conflict, to the extent permitted by the format. Specific metadata fields may impose a strict requirement.
I'm thinking here of RepetitionTime
and pixdim[4]
which MUST agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TL;DR: because it should be less of fields we allow to overload than of the fields we do not, may be we should keep original specification, but we would need to 1. come up with the annotation to when "conflict" is allowed (e.g. for some "description" fields etc) in side cars. 2. when allowed in metadata from filename entities (e.g. subject or session).
FWIW My longer thinking/line of argument
It seems then again to require per sidecar field "equality" specification that some could be allowed to disagree (thus it would be only a WARNING if sidecar has different value) whenever others MUST agree (and thus ERROR from validator). Most likely the strict requirements would then be added a explicit check
s in the schema, like ATM checks being added for lengths consistency e.g. in https://github.com/bids-standard/bids-specification/blob/HEAD/src/schema/rules/checks/asl.yaml#L26 . That would indeed make it explicit but standard itself IMHO somewhat "inconsistent" since it would allow for some fields to be overloaded but not the others, and "by default" allow for them being different. Note that adding a restriction (from a default of "could different") in principle constitutes backward incompatible change, thus should be avoided.
In general though it is a tough aspect to allow such differences since it mandates that tools must now be able to read that "overloaded" metadata from sidecar files. That is why, ideally, we should really strive for consistency. And that is why e.g. we are working with NWB file format folks to allow for ".overwrite.json" file to accompany ".nwb" file so that metadata could be overloaded at the level of the file format, and thus allowing for "cheap" adjustment of metadata if so desired. For NIfTI, the primary workhorse of BIDS, since most of the tools would need correct base metadata such as pixdim, we better indeed enforce their consistency with side car files.
So, overall, I think we MUST enforce consistency and if so desired, relax only for some sidecar metadata fields AND strive to allow for metadata overload at the original file format whenever possible/necessary.
src/02-common-principles.md
Outdated
JSON (see [Key/value files](#keyvalue-files-dictionaries)). | ||
In some cases, this duplicates metadata contained in a data file. | ||
|
||
In cases of conflict, the BIDS metadata is considered authoritative. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we clarify what a conflict is? for example, is a change of floating point precision a conflict? one reason we store things in binary is because floating point data maintains some veracity. if we see this outside the nifti lens for other modalities allowed by bids (the eeg/meg or upcoming ephys/microscopy formats, where precision starts mattering).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we clarify what a conflict is?
I can think of a few:
- Equivalent fields, different values.
RepetitionTime
vspixdim[4]
- Disparate representations.
PhaseEncodingDirection
(enum) vsdim_info
(bitfield) - Range of representations.
PhaseEncodingDirection
(axis + direction) vsdim_info
(axis)
for example, is a change of floating point precision a conflict?
I'm interpreting this as a question of tolerance? Yes, there should be some tolerance. I imagine the order of tolerance would probably be a case-by-case question.
one reason we store things in binary is because floating point data maintains some veracity. if we see this outside the nifti lens for other modalities allowed by bids (the eeg/meg or upcoming ephys/microscopy formats, where precision starts mattering).
This seems to be arguing that we should say the highest-precision metadata should be authoritative, whether BIDS or format-internal. Or am I misunderstanding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be arguing that we should say the highest-precision metadata should be authoritative, whether BIDS or format-internal.
in general yes, but we may want to clarify that for floating point values, the internal format always wins when close. this would perhaps avoid the situations where some has zeroed out something like TR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like a hard thing to write a rule for. Here's something:
In general, the more precise metadata representation SHOULD be considered authoritative, with BIDS metadata being presumptively more precise. A notable exception to this presumption is if a value is represented as a floating point value in the internal metadata format and a decimal string in the JSON sidecar; to avoid accumulating error in float-decimal conversions, the floating point value is authoritative.
This actually contravenes existing practice in fMRIPrep and FitLins with regard to RepetitionTime
/pixdim[4]
, and may do so for MNE-Python as well, but I guess that's what standardization is for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good to me.
ping @satra
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i like @effigies paragraph - but i think he still had the question out for the community.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@effigies please digest it for me a little on an example. E.g. if we have a value 0.3(3)
(to the floating point precision) in the binary file and have 0.333
(to some arbitrary text precision) in sidecar file, the side car 0.333
would be taken as "authorative" but if we have it in side car as "0"
or as "1"
, only then then the data file would have it as "authorative"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yarikoptic I intended to write the paragraph 4 comments up to mean (and still read it as saying) "floating point fields in a binary format take precedence over JSON, which is inherently decimal".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to freshen up: re my comment on floating point representation in text. data representation would also be truncated and it would just be a matter of expressing enough precision in json form.
E.g. specifically for my own example messing around in ipython
In [1]: repr(1/3)
Out[1]: '0.3333333333333333'
In [2]: 0.3333333333333333 * 3
Out[2]: 1.0
In [3]: 3.333333333333333e-1 * 3
Out[3]: 1.0
In [4]: 3.33333333333333e-1 * 3
Out[4]: 0.9999999999999989
In [5]: 1/3 == 3.333333333333333e-1
Out[5]: True
In [6]: 1/3 == 3.33333333333333e-1
Out[6]: False
so may be we should just add somewhere wording that "serialization of floating point numbers in json should be to maximal precision of the underlying data type used in the data file" or something like that?
src/02-common-principles.md
Outdated
then the closest possible approximation of the BIDS metadata SHOULD be used. | ||
Special null or undefined values MAY be used when available. | ||
If the format-specific metadata field is defined, | ||
the BIDS metadata SHOULD also be defined. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could someone give an example for the last one here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If NIfTI slice_code != 0
, then SliceTiming
SHOULD be defined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
slice_code != 0
is not just that slice_code
is defined but rather meets some criteria (!=0
) and looking at https://nifti.nimh.nih.gov/nifti-1/documentation/nifti1fields/nifti1fields_pages/slice_code.html describing it as " slice_code = If this is nonzero, AND if slice_dim is nonzero, AND if slice_duration is positive,..." so it is even more complex than just defined. Moreover from above sentence it somehow suggests that it is just a matter of "mirroring" some value between file and BIDS metadata. But in this case it is not the case since SliceTiming
is not even in NIfTI, which only specifies the order and timing is not contained in the file. So I will suggest this
I'm not sure what approving reviews are approving at this point. There is an open proposal to completely reverse the priority to have file metadata take precedence over BIDS metadata. |
src/02-common-principles.md
Outdated
If the format-specific metadata field is defined, | ||
the BIDS metadata SHOULD also be defined. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the format-specific metadata field is defined, | |
the BIDS metadata SHOULD also be defined. | |
Some BIDS metadata can be REQUIRED to be defined depending on the values of the format-specific metadata fields. |
although even then the slice_code
target use case is quite hard to guess ;) but it also has nothing to do with the overloading really. May be should just be removed?
yeah, and it seems that we are arriving that overall there might be no effective precedence and metadata must be in agreement between file metadata and BIDS metadata. |
My approval was for the original suggestion, I'll withdraw it now and dive back into the discussion.
migrating to 1.9.0 milestone because apparently this needs more discussion. |
077223e
to
3a77356
Compare
I've updated this PR to master because it's listed in the 1.9.0 milestone. I don't think there's a clear consensus on what to do. I've personally kind of given up hope on it, and would appreciate if a motivated party took over leading the discussion in a new issue/PR. |
Abandoning this. Others are welcome to pick it up. |
Closes #138.
Closes #1133