
simpler short-term BIDS metadata extraction #772

Open
satra opened this issue Sep 20, 2021 · 39 comments

@satra
Member

satra commented Sep 20, 2021

@yarikoptic - i wanted to bring this up to ensure that we can do a bit more with bids metadata extraction before full-fledged bids support (#432).

can we simply extract the metadata from a file path when possible (identifiers for subject, session, sample, modality-via-folder, technique-via-suffix, when available)?

@djarecka - is this something you could take a quick look at? dandisets 108 and 26 should provide a good number of examples, and you can play with the path names of files using the API. this should return some structured metadata for the non-NWB part of processing a file. it should look like the session and subject metadata we extract from NWB.

@djarecka djarecka self-assigned this Sep 20, 2021
@djarecka
Member

@TheChymera -ping

@djarecka
Member

@satra - is it ok to add pybids as a new requirement?

@satra
Member Author

satra commented Sep 22, 2021

i would stay away from that for now, mostly because it's a very heavy requirement. since you are parsing the filepath for a single file, a simple function + a dictionary for modality/technique mapping should work for the moment.

but eventually we may probably need pybids as an optional dependency.
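The "simple function plus a dictionary" approach might look roughly like the sketch below; the entity regex, the mapping dictionaries, and the helper name are all illustrative assumptions, not dandi-cli code:

```python
import re

# Illustrative sketch only: the entity regex and the modality/technique
# dictionaries are assumptions for demonstration, not dandi-cli internals.
ENTITY_RE = re.compile(r"([a-z]+)-([A-Za-z0-9]+)")

# datatype folder -> modality (hypothetical mapping)
MODALITIES = {"anat": "anatomy", "func": "functional", "micr": "microscopy"}
# filename suffix -> technique (hypothetical mapping)
TECHNIQUES = {"SPIM": "selective plane illumination microscopy"}

def parse_bids_path(path: str) -> dict:
    """Extract subject/session/sample ids, modality, and technique
    from a BIDS-like file path, purely by string parsing."""
    parts = path.split("/")
    stem = parts[-1].split(".")[0]
    entities = dict(ENTITY_RE.findall(stem))
    meta = {}
    for key, out in (("sub", "subject_id"), ("ses", "session_id"),
                     ("sample", "sample_id")):
        if key in entities:
            meta[out] = entities[key]
    # modality comes from the datatype folder, when present
    if len(parts) > 1 and parts[-2] in MODALITIES:
        meta["modality"] = MODALITIES[parts[-2]]
    # technique comes from the suffix (last underscore-separated chunk,
    # which carries no "key-value" dash)
    suffix = stem.split("_")[-1]
    if "-" not in suffix:
        meta["technique"] = TECHNIQUES.get(suffix, suffix)
    return meta
```

For a path like `sub-01/ses-A/micr/sub-01_ses-A_sample-x_SPIM.ome.zarr` this yields the subject/session/sample ids plus the folder-derived modality and suffix-derived technique, without any heavy dependency.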

@djarecka
Member

djarecka commented Sep 22, 2021

@satra - ok, so I understand I could add specific parsing and assume that this will not change much with new versions etc. (basically ignoring the version info)

@satra
Member Author

satra commented Sep 23, 2021

a few more rules: if it detects a bids dataset (i.e. dataset_description.json) and detects a sub- in filename, then:

  • it should also require a participants.tsv (normal bids doesn't require this)
  • if sample- in filename it should also require a samples.tsv
  • it should extract age, sex, and species and tissue_type

so overall things to extract and inject into metadata are:

  • subject id
  • age
  • sex
  • species
  • genotype (if available)
  • modality (from folder)
  • technique (from suffix)
  • session id (if available)
  • sample id (if available)
  • sample type (when sample id is present)

we should perhaps start subclassing assets in dandischema soonish.
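A rough sketch of these dataset-level rules; the function names and the TSV column names are assumptions for illustration, not the actual implementation:

```python
import csv
from pathlib import Path

def check_dandi_bids_rules(root: str, filename: str) -> list:
    """Hypothetical sketch of the rules above: inside a BIDS dataset,
    require participants.tsv for sub- files and samples.tsv for
    sample- files (stricter than vanilla BIDS)."""
    root = Path(root)
    problems = []
    if not (root / "dataset_description.json").exists():
        return problems  # not a BIDS dataset; rules do not apply
    if "sub-" in filename and not (root / "participants.tsv").exists():
        problems.append("missing participants.tsv")
    if "sample-" in filename and not (root / "samples.tsv").exists():
        problems.append("missing samples.tsv")
    return problems

def participant_metadata(root: str, subject_id: str) -> dict:
    """Pull age/sex/species for one subject from participants.tsv
    (column names assumed for illustration)."""
    with open(Path(root) / "participants.tsv", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row.get("participant_id") == f"sub-{subject_id}":
                return {k: row[k] for k in ("age", "sex", "species")
                        if k in row}
    return {}
```

An analogous lookup against samples.tsv would supply sample type and tissue_type when a sample id is present.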

@djarecka
Member

@satra - I talked to @TheChymera and he will work on extracting this info from the BIDS files, but it is not yet clear to me where we will add the info. Should we create a new schema first?

@satra
Member Author

satra commented Sep 23, 2021

the current asset schema handles all of these fields.

here is an example nwb asset metadata: https://api.dandiarchive.org/api/dandisets/000008/versions/draft/assets/d387991e-f9c5-4790-9152-75885ddf5572/

@yarikoptic
Member

a few more rules: if it detects a bids dataset (i.e. dataset_description.json) and detects a sub- in filename, then:

I would say: detect presence of dataset_description.json file containing BIDSVersion field.
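That detection rule might be sketched like this (hypothetical helper, not dandi-cli code):

```python
import json
from pathlib import Path

def is_bids_dataset(root: str) -> bool:
    """Detect a BIDS dataset per the suggestion above:
    dataset_description.json must exist AND actually contain
    a BIDSVersion field."""
    dd = Path(root) / "dataset_description.json"
    if not dd.is_file():
        return False
    try:
        return "BIDSVersion" in json.loads(dd.read_text())
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False
```

Requiring the field, rather than just the file, avoids false positives on unrelated json files with that name.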

but eventually we may probably need pybids as an optional dependency.

most likely only if it gets lighter -- now it has pandas dependency which brings in lots... there are discussions about some pybids-lite or alike. Also - "schema" component of BIDS specification might get its own library to load/parse/validate (relevant discussion: bids-standard/bids-specification#543) although it might indeed just get implemented within pybids.

@djarecka
Member

@TheChymera - are you still working on this?

@TheChymera
Contributor

@djarecka hi, sorry, no, I got stuck in trying to get some sort of MWE (or already working example) to extract metadata from, and derive from there. I think I only just realized while working on some ephys data that conversion to NWB is different from DANDI metadata extraction (or at least it's provided by another package there). If you could share some sort of example of what we're working on, I could continue from there. Are we working on dandi register /path/to/bids or dandi validate /path/to/bids, or something else?

@TheChymera
Contributor

TheChymera commented Jan 7, 2022

Currently building an XS regex-based BIDS validator as part of the new BIDS schemacode package. The schema directory is parameterized so this can be pointed to our own version, if we have nonstandard datasets (e.g. DANDI:000108).

bids-standard/bids-specification#543 (comment)
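Very roughly, a regex-based filename check of this kind could look like the following; the pattern here is hand-written and only loosely approximates the microscopy filename rule, whereas a real schema-based validator would generate such regexes from the schema YAML:

```python
import re

# Heavily simplified illustration: entity order, allowed suffixes, and
# extensions are assumptions approximating the microscopy (BEP031) rule,
# not generated from the actual schema.
MICR_RE = re.compile(
    r"^sub-[A-Za-z0-9]+"
    r"(_ses-[A-Za-z0-9]+)?"
    r"_sample-[A-Za-z0-9]+"
    r"(_stain-[A-Za-z0-9]+)?"
    r"_(SPIM|TEM|SEM)"
    r"\.(ome\.zarr|ome\.tif|h5|png)$"
)

def filename_is_valid(filename: str) -> bool:
    return MICR_RE.match(filename) is not None
```

Because the schema directory is parameterized, pointing the validator at a DANDI-specific schema branch would swap in different generated patterns without changing the matching code.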

@satra do you know whether our 108 dataset follows a schema version which they use, or is it the case that:

  1. the naming is ad hoc, and
  2. we know that the dataset is valid,

so we should just edit the schema to fit the dataset?

There is a microscopy PR in BIDS, which looks like it might be merged soon, but our dataset does not, in fact, use it.

@djarecka

@satra
Member Author

satra commented Jan 7, 2022

it should technically correspond to that PR except that it will have ngff files. everything else should be aligned with that PR though. filenames should not be ad hoc even now, except for the h5 extension.

@TheChymera
Contributor

TheChymera commented Jan 10, 2022

@satra sadly it does not, currently nothing validates as the datatype is listed as micr by the PR, whereas all files use a microscopy directory for the modalities. There may be other issues as well, but for the time being it seems we would need a divergent schema, unless we want to fix the data.

Or we could also comment on the extension PR, but micr seems to make a lot more sense, analogously to anat and func.

@satra
Member Author

satra commented Jan 10, 2022

that should be fixed (but that PR only recently changed from microscopy to micr). i'm pretty sure @LeeKamentsky was going to change it, but i would suggest waiting till we get zarr upload into the CLI, and then rewrite that dataset with the ngff files, which themselves likely need to be updated to the upcoming spec. it's a large dataset and we can wait a bit to do fixes together.

we will also need additional dandi specific overrides. ngff is not directly supported by the PR as well since that spec is evolving, but we will support it in dandi.

TheChymera added a commit to TheChymera/bids-specification that referenced this issue Jan 10, 2022
@TheChymera
Contributor

@satra it turns out there are quite a few more inconsistencies between BEP031 and DANDI:000108 .
These are the ones I've picked up on thus far, but there might be even more, particularly with regard to entity order and suffix case:
https://github.com/TheChymera/bids-schemadata/commits/dandi

@yarikoptic
Member

I think some should be fixed (renamed) in the dandiset itself, e.g.

  • spim -> SPIM
  • run: although it sounds sensible to add/recommend to BEP, since it is always _run-1 I think we should rename that too
  • it should stay micr not microscopy

So I think only the addition of .h5 should be the customization for us to have. @satra, agree?

We will need to coordinate with @LeeKamentsky on renaming: do you have an entire copy of the dandiset with data as it was uploaded?

@satra
Member Author

satra commented Jan 14, 2022

this will be reuploaded as ngff (see #772 (comment)), so the customization should be for ngff, not .h5. btw, the bids specification doesn't say anything about case sensitivity of the suffix (and there doesn't seem to be consistency between lower and upper case - bold is lower even though it is also an acronym). my preference would be to not introduce case.

@LeeKamentsky

I've got both the NGFF and HDF5 stored locally. I can reorganize the file structure (microscopy -> micr) and rewrite both the NGFF and sidecar metadata as part of the update procedure when we get to the point where we switch from HDF5 to NGFF.

@yarikoptic
Member

That would be great @LeeKamentsky! So for now we can validate using the "adjusted schema" and then, before the ngff upload, switch to the schema which has only the "allow .ngff" modification.

@TheChymera
Contributor

@satra regarding case, going by general unix case handling, the case gets parsed as-written from the YAML (both for the online listing currently done in bids-specification, as well as for the validator I'm writing). This means that for it to be case-insensitive both upper- and lower-case strings would need to be listed. I see no single suffix doing this, so if you want it implemented as case-insensitive it should be done at the validator level, but that would be nonstandard behaviour, so I'm not sure it's a good idea...

The other thing I can do is write a comment on the BEP for it to be made lower-case, which sounds most consistent with BIDS, and include the change ahead of time in our bids-schemadata:DANDI branch. In fact I think I might do it in any case.

TheChymera added a commit to TheChymera/bids-schemadata that referenced this issue Jan 15, 2022
TheChymera added a commit to TheChymera/bids-schemadata that referenced this issue Jan 15, 2022
@TheChymera
Contributor

Let's see what they think about this: bids-standard/bids-specification#881 (comment)

@TheChymera
Contributor

@satra ok, so it appears there's no good reason to make it lowercase, since there are in fact quite a few uppercase suffixes. I think the right approach would be to update our data, rather than branch the schema over case or introduce validator fuzziness. Would you agree?

@TheChymera
Contributor

Also, apparently README and CHANGES are required as per BIDS :( I really don't see the point for CHANGES, but that seems to be the situation. Other than these two files (and sessions.tsv), DANDI:000108 seems to validate: https://ppb.chymera.eu/1d5ef8.log

@satra
Member Author

satra commented Jan 17, 2022

CHANGES is indeed a little weird, especially if there are no changes to start with. but we should be supportive of whatever the bids spec entails. could you also test the validator on 000026 (it has a lot of files)?

@satra
Member Author

satra commented Jan 17, 2022

I think the right approach would be to update our data, rather than branch the schema over case or introduce validator fuzziness. Would you agree?

yes. please create the validator with that in mind + support for ngff directories. please let @LeeKamentsky know how best to test the dataset with the validator, perhaps before it gets into dandi cli (or if that is imminent, then we can wait to just execute the validate function).

@TheChymera
Contributor

TheChymera commented Jan 18, 2022

test the validator on 000026

@satra oh my, there are many (extra) standard divergences here as well.
The one where I think we most need feedback is the _fa-<index> entity.
It seems this is used interchangeably by the dataset with _flip-<index>, but it could also be something else entirely.
See TheChymera/bids-schemadata@bebd54f for the changes required to make it pass.
Detrimentally to standard integrity, flip needs to be downgraded to optional as well...
Are you (we?) in charge of updating the data now, or should I be contacting @gmazzamuto directly?

@TheChymera
Contributor

TheChymera commented Jan 18, 2022

I have also just come across something I cannot really "fix" via schema divergence, namely /sub-I46/ses-MRI/anat/sub-I46_ses-MRI-echo-2_flip-2_VFA.json (in DANDI:000026). Between MRI and echo there is a dash instead of an underscore (took me forever to spot this one...). I assume this is a simple typo, but it needs to be fixed in the data.
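This is exactly the kind of typo a strict entity tokenizer catches. A minimal sketch (helper name and the exact character classes are assumptions):

```python
import re

# Each underscore-separated chunk of a BIDS filename stem (except the
# trailing suffix) must be exactly one "key-value" pair, so a chunk like
# "ses-MRI-echo-2" (dash where an underscore belongs) fails the check.
CHUNK_RE = re.compile(r"^[a-z]+-[A-Za-z0-9]+$")

def malformed_entities(stem: str) -> list:
    chunks = stem.split("_")[:-1]  # last chunk is the suffix
    return [c for c in chunks if not CHUNK_RE.match(c)]
```

Against the stem above, this flags `ses-MRI-echo-2` while letting the correctly separated version pass.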

@TheChymera
Contributor

Additionally, there seems to be some microscopy data which is not using the microscopy/ datatype subdirectory, e.g. /sub-I46/ses-SPIM/sub-I46_ses-SPIM_sample-*.

@satra
Member Author

satra commented Jan 18, 2022

@TheChymera - we should give the contact person a heads up, but i would suggest first making sure that the validator is integrated so that they can check. one thing to note is that each site has a different subset of the data. no site has all the data, and it also doesn't make sense to force people to download all the data. so that may be something additional to look into for the validator.

@yarikoptic
Member

I have also just come across something I cannot really "fix" via schema divergence, namely /sub-I46/ses-MRI/anat/sub-I46_ses-MRI-echo-2_flip-2_VFA.json (in DANDI:000026). Between MRI and echo there is a dash instead of an underscore (took me forever to spot this one...). I assume this is a simple typo, but it needs to be fixed in the data.

that just shows the power of standards and validators -- ability to detect presence of such typos ;) yeah, needs to be fixed in the dataset itself.

@LeeKamentsky

I can remove sub-MITU01_sessions.tsv - I guess I missed the point where that file disappeared from the spec. I'll add a README and see if I can reconstruct and backfill the CHANGES file.

@satra
Member Author

satra commented Jan 19, 2022

@LeeKamentsky - it hasn't disappeared. it's just an optional file: https://bids-specification.readthedocs.io/en/stable/06-longitudinal-and-multi-site-studies.html#sessions-file (i wouldn't remove it).

@TheChymera
Contributor

TheChymera commented Jan 20, 2022

I can remove sub-MITU01_sessions.tsv

@LeeKamentsky
I don't think it has disappeared from the spec, just that it's not represented in the YAML schema data which we use for the validator. So those files are not the issue. But the typo with the underscore and microscopy subdirectory should be fixed at some point.

@gmazzamuto

Hello everyone!

I had a nice chat with @TheChymera yesterday, we'll fix the ses-SPIM directories as soon as we have some time. For the other image modalities I think you can contact

  • for MRI: Leah Morgan, MGH
  • for OCT: Jiarui Yang, BU

@TheChymera
Contributor

TheChymera commented Jan 23, 2022

@satra do we have individual issue trackers for each dataset? I remember I saw something like that once, but can't find it again. It might be better to discuss updates there since there seem to be a lot of different things going on in this thread. Ideally we could discuss:

  1. Things which should be changed in DANDI:000108
  2. Things which should be changed in DANDI:000026
  3. Things which we fix in the schema and keep fixing in the schema until our draft fixes are deprecated by vanilla BIDS.

separately.

@satra
Member Author

satra commented Jan 23, 2022

you can do it in the github.com/dandisets org where each dandiset has its own repo.

@TheChymera
Contributor

@gmazzamuto can you send me the contact details by email, or ping their github accounts? (I searched but couldn't find them.)

@TheChymera
Contributor

TheChymera commented Jan 24, 2022

Issues with the datasets are now tracked here:
https://github.com/dandisets/000108/issues
https://github.com/dandisets/000026/issues

Thus the only relevant schema changes remain:

@TheChymera
Contributor

Ok, so we have a new schema version containing only changes which enhance but do not diverge from BIDS.
Almost all of both datasets fail with this schema, pending issues reported on them.
https://github.com/TheChymera/bids-schemadata/tree/dandi_strict

TheChymera added a commit to TheChymera/bids-specification that referenced this issue Mar 2, 2022
TheChymera added a commit to TheChymera/bids-specification that referenced this issue Apr 8, 2022
TheChymera added a commit to TheChymera/bids-specification that referenced this issue May 16, 2022