
simpler short-term BIDS metadata extraction #772

Open
satra opened this issue Sep 20, 2021 · 39 comments

@satra
Member

satra commented Sep 20, 2021

@yarikoptic - i wanted to bring this up to ensure that we can do a bit more with bids metadata extraction before full-fledged bids support (#432).

can we simply extract the metadata from a file path when possible (identifiers for subject, session, sample, modality-via-folder, technique-via-suffix, when available)?

@djarecka - is this something you could take a quick look at? dandisets 108 and 26 should provide a good number of examples, and you can play with the path names of files using the API. this should return some structured metadata for the non-NWB part of processing a file. it should look like the session and subject metadata we extract from NWB.

@djarecka djarecka self-assigned this Sep 20, 2021
@djarecka
Member

@TheChymera -ping

@djarecka
Member

@satra - is it ok to add pybids as a new requirement?

@satra
Member Author

satra commented Sep 22, 2021

i would stay away from that for now, mostly because it's a very heavy requirement. since you are parsing the filepath for a single file, a simple function + a dictionary for modality/technique mapping should work for the moment.

but eventually we may probably need pybids as an optional dependency.
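The "simple function plus a dictionary" approach might look roughly like the sketch below; the entity regex, the mapping dictionaries, and the helper name are all illustrative assumptions, not dandi-cli code:

```python
import re

# Illustrative sketch only: the entity regex and the modality/technique
# dictionaries are assumptions for demonstration, not dandi-cli internals.
ENTITY_RE = re.compile(r"([a-z]+)-([A-Za-z0-9]+)")

# datatype folder -> modality (hypothetical mapping)
MODALITIES = {"anat": "anatomy", "func": "functional", "micr": "microscopy"}
# filename suffix -> technique (hypothetical mapping)
TECHNIQUES = {"SPIM": "selective plane illumination microscopy"}

def parse_bids_path(path: str) -> dict:
    """Extract subject/session/sample ids, modality, and technique
    from a BIDS-like file path, purely by string parsing."""
    parts = path.split("/")
    stem = parts[-1].split(".")[0]
    entities = dict(ENTITY_RE.findall(stem))
    meta = {}
    for key, out in (("sub", "subject_id"), ("ses", "session_id"),
                     ("sample", "sample_id")):
        if key in entities:
            meta[out] = entities[key]
    # modality comes from the datatype folder, when present
    if len(parts) > 1 and parts[-2] in MODALITIES:
        meta["modality"] = MODALITIES[parts[-2]]
    # technique comes from the suffix (last underscore-separated chunk,
    # which carries no "key-value" dash)
    suffix = stem.split("_")[-1]
    if "-" not in suffix:
        meta["technique"] = TECHNIQUES.get(suffix, suffix)
    return meta
```

For a path like `sub-01/ses-A/micr/sub-01_ses-A_sample-x_SPIM.ome.zarr` this yields the subject/session/sample ids plus the folder-derived modality and suffix-derived technique, without any heavy dependency.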

@djarecka
Member

djarecka commented Sep 22, 2021

@satra - ok, so I understand I could add specific parsing and assume that this will not change much with new versions etc. (basically ignoring the version info)

@satra
Member Author

satra commented Sep 23, 2021

a few more rules: if it detects a bids dataset (i.e. dataset_description.json) and detects a sub- in filename, then:

  • it should also require a participants.tsv (normal bids doesn't require this)
  • if sample- in filename it should also require a samples.tsv
  • it should extract age, sex, and species and tissue_type

so overall things to extract and inject into metadata are:

  • subject id
  • age
  • sex
  • species
  • genotype (if available)
  • modality (from folder)
  • technique (from suffix)
  • session id (if available)
  • sample id (if available)
  • sample type (when sample id is present)

we should perhaps start subclassing assets in dandischema soonish.
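A rough sketch of these dataset-level rules; the function names and the TSV column names are assumptions for illustration, not the actual implementation:

```python
import csv
from pathlib import Path

def check_dandi_bids_rules(root: str, filename: str) -> list:
    """Hypothetical sketch of the rules above: inside a BIDS dataset,
    require participants.tsv for sub- files and samples.tsv for
    sample- files (stricter than vanilla BIDS)."""
    root = Path(root)
    problems = []
    if not (root / "dataset_description.json").exists():
        return problems  # not a BIDS dataset; rules do not apply
    if "sub-" in filename and not (root / "participants.tsv").exists():
        problems.append("missing participants.tsv")
    if "sample-" in filename and not (root / "samples.tsv").exists():
        problems.append("missing samples.tsv")
    return problems

def participant_metadata(root: str, subject_id: str) -> dict:
    """Pull age/sex/species for one subject from participants.tsv
    (column names assumed for illustration)."""
    with open(Path(root) / "participants.tsv", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row.get("participant_id") == f"sub-{subject_id}":
                return {k: row[k] for k in ("age", "sex", "species")
                        if k in row}
    return {}
```

An analogous lookup against samples.tsv would supply sample type and tissue_type when a sample id is present.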

@djarecka
Member

@satra - I talked to @TheChymera and he will work on extracting this info from the BIDS files, but it is not yet clear to me where we will add the info. Should we create a new schema first?

@satra
Member Author

satra commented Sep 23, 2021

the current asset schema handles all of these fields.

here is an example nwb asset metadata: https://api.dandiarchive.org/api/dandisets/000008/versions/draft/assets/d387991e-f9c5-4790-9152-75885ddf5572/

@yarikoptic
Member

a few more rules: if it detects a bids dataset (i.e. dataset_description.json) and detects a sub- in filename, then:

I would say: detect presence of dataset_description.json file containing BIDSVersion field.
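That detection rule might be sketched like this (hypothetical helper, not dandi-cli code):

```python
import json
from pathlib import Path

def is_bids_dataset(root: str) -> bool:
    """Detect a BIDS dataset per the suggestion above:
    dataset_description.json must exist AND actually contain
    a BIDSVersion field."""
    dd = Path(root) / "dataset_description.json"
    if not dd.is_file():
        return False
    try:
        return "BIDSVersion" in json.loads(dd.read_text())
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False
```

Requiring the field, rather than just the file, avoids false positives on unrelated json files with that name.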

but eventually we may probably need pybids as an optional dependency.

most likely only if it gets lighter -- now it has pandas dependency which brings in lots... there are discussions about some pybids-lite or alike. Also - "schema" component of BIDS specification might get its own library to load/parse/validate (relevant discussion: bids-standard/bids-specification#543) although it might indeed just get implemented within pybids.

@djarecka
Member

@TheChymera - are you still working on this?

@TheChymera
Contributor

@djarecka hi, sorry, no, I got stuck in trying to get some sort of MWE (or already working example) to extract metadata from, and derive from there. I think I only just realized while working on some ephys data that conversion to NWB is different from DANDI metadata extraction (or at least it's provided by another package there). If you could share some sort of example of what we're working on, I could continue from there. Are we working on dandi register /path/to/bids or dandi validate /path/to/bids, or something else?

@TheChymera
Contributor

TheChymera commented Jan 7, 2022

Currently building an XS regex-based BIDS validator as part of the new BIDS schemacode package. The schema directory is parameterized so this can be pointed to our own version, if we have nonstandard datasets (e.g. DANDI:000108).

bids-standard/bids-specification#543 (comment)
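Very roughly, a regex-based filename check of this kind could look like the following; the pattern here is hand-written and only loosely approximates the microscopy filename rule, whereas a real schema-based validator would generate such regexes from the schema YAML:

```python
import re

# Heavily simplified illustration: entity order, allowed suffixes, and
# extensions are assumptions approximating the microscopy (BEP031) rule,
# not generated from the actual schema.
MICR_RE = re.compile(
    r"^sub-[A-Za-z0-9]+"
    r"(_ses-[A-Za-z0-9]+)?"
    r"_sample-[A-Za-z0-9]+"
    r"(_stain-[A-Za-z0-9]+)?"
    r"_(SPIM|TEM|SEM)"
    r"\.(ome\.zarr|ome\.tif|h5|png)$"
)

def filename_is_valid(filename: str) -> bool:
    return MICR_RE.match(filename) is not None
```

Because the schema directory is parameterized, pointing the validator at a DANDI-specific schema branch would swap in different generated patterns without changing the matching code.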

@satra do you know whether our 108 dataset follows a schema version which they use, or is it the case that:

  1. the naming is ad hoc, and
  2. we know that the dataset is valid,

so we should just edit the schema to fit the dataset?

There is a microscopy PR in BIDS, which looks like it might be merged soon, but our dataset does not, in fact, use it.

@djarecka

@satra
Member Author

satra commented Jan 7, 2022

it should technically correspond to that PR except that it will have ngff files. everything else should be aligned with that PR though. filenames should not be ad hoc even now, except for the h5 extension.

@TheChymera
Contributor

TheChymera commented Jan 10, 2022

@satra sadly it does not, currently nothing validates as the datatype is listed as micr by the PR, whereas all files use a microscopy directory for the modalities. There may be other issues as well, but for the time being it seems we would need a divergent schema, unless we want to fix the data.

Or we could also comment on the extension PR, but micr seems to make a lot more sense, analogously to anat and func.

@satra
Member Author

satra commented Jan 10, 2022

that should be fixed (but that PR only recently changed from microscopy to micr). i'm pretty sure @LeeKamentsky was going to change it, but i would suggest waiting till we get zarr upload into the CLI, and then rewrite that dataset with the ngff files, which themselves likely need to be updated to the upcoming spec. it's a large dataset and we can wait a bit to do fixes together.

we will also need additional dandi specific overrides. ngff is not directly supported by the PR as well since that spec is evolving, but we will support it in dandi.

TheChymera added a commit to TheChymera/bids-specification that referenced this issue Jan 10, 2022
@TheChymera
Contributor

@satra it turns out there are quite a few more inconsistencies between BEP031 and DANDI:000108 .
These are the ones I've picked up on thus far, but there might be even more, particularly with regard to entity order and suffix case:
https://github.com/TheChymera/bids-schemadata/commits/dandi

@yarikoptic
Member

I think some should be fixed (renamed) in the dandiset itself, e.g.

  • spim -> SPIM
  • run: although it sounds sensible to add/recommend to BEP, since it is always _run-1 I think we should rename that too
  • it should stay micr not microscopy

So I think only the addition of .h5 should be the customization for us to have. @satra, agree?

We will need to coordinate with @LeeKamentsky on renaming: do you have an entire copy of the dandiset with data as it was uploaded?

@satra
Member Author

satra commented Jan 14, 2022

this will be reuploaded as ngff (see #772 (comment)), so the customization should be for ngff, not .h5. btw, the bids specification doesn't say anything about case sensitivity of the suffix (and there doesn't seem to be consistency between lower and upper case - bold is lower even though it is also an acronym). my preference would be to not introduce case.

@LeeKamentsky

I've got both the NGFF and HDF5 stored locally. I can reorganize the file structure (microscopy -> micr) and rewrite both the NGFF and sidecar metadata as part of the update procedure when we get to the point where we switch from HDF5 to NGFF.

@yarikoptic
Member

That would be great @LeeKamentsky! So for now we can validate using the "adjusted schema" and then, before the ngff upload, switch to the schema which has only the "allow .ngff" modification.

@TheChymera
Contributor

@satra regarding case, going by general unix case handling, the case gets parsed as-written from the YAML (both for the online listing currently done in bids-specification, as well as for the validator I'm writing). This means that for it to be case-insensitive both upper- and lower-case strings would need to be listed. I see no single suffix doing this, so if you want it implemented as case-insensitive it should be done at the validator level, but that would be nonstandard behaviour, so I'm not sure it's a good idea...

The other thing I can do is write a comment on the BEP for it to be made lower-case, which sounds most consistent with BIDS, and include the change ahead of time in our bids-schemadata:DANDI branch. In fact I think I might do it in any case.

TheChymera added a commit to TheChymera/bids-schemadata that referenced this issue Jan 15, 2022
TheChymera added a commit to TheChymera/bids-schemadata that referenced this issue Jan 15, 2022
@TheChymera
Contributor

Let's see what they think about this: bids-standard/bids-specification#881 (comment)

@TheChymera
Contributor

@satra ok, so it appears there's no good reason to make it lowercase, since there are in fact quite a few uppercase suffixes. I think the right approach would be to update our data, rather than branch the schema over case or introduce validator fuzziness. Would you agree?

@TheChymera
Contributor

Also, apparently README and CHANGES are required as per BIDS :( I really don't see the point for CHANGES, but that seems to be the situation. Other than these two files (and sessions.tsv), DANDI:000108 seems to validate: https://ppb.chymera.eu/1d5ef8.log

@satra
Member Author

satra commented Jan 17, 2022

CHANGES is indeed a little weird, especially if there are no changes to start with. but we should be supportive of whatever the bids spec entails. could you also test the validator on 000026 (it has a lot of files)?

@satra
Member Author

satra commented Jan 17, 2022

I think the right approach would be to update our data, rather than branch the schema over case or introduce validator fuzziness. Would you agree?

yes. please create the validator with that in mind + support for ngff directories. please let @LeeKamentsky know how best to test the dataset with the validator, perhaps before it gets into dandi cli (or if that is imminent, then we can wait to just execute the validate function).

@TheChymera
Contributor

TheChymera commented Jan 18, 2022

test the validator on 000026

@satra oh my, there are many (extra) standard divergences here as well.
The one where I think we most need feedback is the _fa-<index> entity.
It seems this is used interchangeably by the dataset with _flip-<index>, but it could also be something else entirely.
See TheChymera/bids-schemadata@bebd54f for the changes required to make it pass.
Detrimentally to standard integrity, flip needs to be downgraded to optional as well...
Are you (we?) in charge of updating the data now, or should I be contacting @gmazzamuto directly?

@TheChymera
Contributor

TheChymera commented Jan 18, 2022

I have also just come across something I cannot really "fix" via schema divergence, namely /sub-I46/ses-MRI/anat/sub-I46_ses-MRI-echo-2_flip-2_VFA.json (in DANDI:000026). Between MRI and echo there is a dash instead of an underscore (took me forever to spot this one...). I assume this is a simple typo, but it needs to be fixed in the data.
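This is exactly the kind of typo a strict entity tokenizer catches. A minimal sketch (helper name and the exact character classes are assumptions):

```python
import re

# Each underscore-separated chunk of a BIDS filename stem (except the
# trailing suffix) must be exactly one "key-value" pair, so a chunk like
# "ses-MRI-echo-2" (dash where an underscore belongs) fails the check.
CHUNK_RE = re.compile(r"^[a-z]+-[A-Za-z0-9]+$")

def malformed_entities(stem: str) -> list:
    chunks = stem.split("_")[:-1]  # last chunk is the suffix
    return [c for c in chunks if not CHUNK_RE.match(c)]
```

Against the stem above, this flags `ses-MRI-echo-2` while letting the correctly separated version pass.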

@TheChymera
Contributor

Additionally, there seems to be some microscopy data which is not using the microscopy/ datatype subdirectory, e.g. /sub-I46/ses-SPIM/sub-I46_ses-SPIM_sample-*.

@satra
Member Author

satra commented Jan 18, 2022

@TheChymera - we should give the contact person a heads up, but i would suggest first making sure that the validator is integrated so that they can check. one thing to note is that each site has a different subset of the data. no site has all the data, and it also doesn't make sense to force people to download all the data. so that may be something additional to look into for the validator.

@yarikoptic
Member

I have also just come across something I cannot really "fix" via schema divergence, namely /sub-I46/ses-MRI/anat/sub-I46_ses-MRI-echo-2_flip-2_VFA.json (in DANDI:000026). Between MRI and echo there is a dash instead of an underscore (took me forever to spot this one...). I assume this is a simple typo, but it needs to be fixed in the data.

that just shows the power of standards and validators -- ability to detect presence of such typos ;) yeah, needs to be fixed in the dataset itself.

@LeeKamentsky

I can remove sub-MITU01_sessions.tsv - I guess I missed the point where that file disappeared from the spec. I'll add a README and see if I can reconstruct and backfill the CHANGES file.

@satra
Member Author

satra commented Jan 19, 2022

@LeeKamentsky - it hasn't disappeared. it's just an optional file: https://bids-specification.readthedocs.io/en/stable/06-longitudinal-and-multi-site-studies.html#sessions-file (i wouldn't remove it).

@TheChymera
Contributor

TheChymera commented Jan 20, 2022

I can remove sub-MITU01_sessions.tsv

@LeeKamentsky
I don't think it has disappeared from the spec, just that it's not represented in the YAML schema data which we use for the validator. So those files are not the issue. But the typo with the underscore and microscopy subdirectory should be fixed at some point.

@gmazzamuto

Hello everyone!

I had a nice chat with @TheChymera yesterday, we'll fix the ses-SPIM directories as soon as we have some time. For the other image modalities I think you can contact

  • for MRI: Leah Morgan, MGH
  • for OCT: Jiarui Yang, BU

@TheChymera
Contributor

TheChymera commented Jan 23, 2022

@satra do we have individual issue trackers for each dataset? I remember I saw something like that once, but can't find it again. It might be better to discuss updates there since there seem to be a lot of different things going on in this thread. Ideally we could discuss:

  1. Things which should be changed in DANDI:000108
  2. Things which should be changed in DANDI:000026
  3. Things which we fix in the schema and keep fixing in the schema until our draft fixes are deprecated by vanilla BIDS.

separately.

@satra
Member Author

satra commented Jan 23, 2022

you can do it in the github.com/dandisets org where each dandiset has its own repo.

@TheChymera
Contributor

@gmazzamuto can you send me the contact details by email, or ping their github accounts? (I searched but couldn't find them.)

@TheChymera
Contributor

TheChymera commented Jan 24, 2022

Issues with the datasets are now tracked here:
https://github.com/dandisets/000108/issues
https://github.com/dandisets/000026/issues

Thus the only relevant schema changes remain:

@TheChymera
Contributor

Ok, so we have a new schema version containing only changes which enhance but do not diverge from BIDS.
Almost all of both datasets fail with this schema, pending issues reported on them.
https://github.com/TheChymera/bids-schemadata/tree/dandi_strict

TheChymera added a commit to TheChymera/bids-specification that referenced this issue Mar 2, 2022
TheChymera added a commit to TheChymera/bids-specification that referenced this issue Apr 8, 2022
TheChymera added a commit to TheChymera/bids-specification that referenced this issue May 16, 2022