simpler short-term BIDS metadata extraction #772
@TheChymera - ping
@satra - is it ok to add pybids as a new requirement?
I would stay away from that for the moment, mostly because it's a very heavy requirement. Since you are parsing the file path for a single file at the moment, a simple function plus a dictionary for modality/technique mapping should work for now. Eventually, though, we will probably need pybids as an optional dependency.
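As a sketch of that suggestion (the entity regex and the suffix-to-technique table below are illustrative, not an agreed-upon mapping):

```python
import re

# Hypothetical suffix -> technique table; the real mapping would need to be
# agreed upon for the modalities DANDI actually hosts.
SUFFIX_TO_TECHNIQUE = {
    "SPIM": "selective plane illumination microscopy",
    "ephys": "electrophysiology",
}

def parse_bids_filename(path):
    """Pull BIDS entities (sub-, ses-, sample-) and the suffix out of a filename."""
    stem = path.rsplit("/", 1)[-1].split(".", 1)[0]
    # key-value entities look like "sub-01", "ses-1", "sample-A"
    entities = dict(re.findall(r"([A-Za-z0-9]+)-([A-Za-z0-9]+)", stem))
    # the suffix is the last underscore-separated token, e.g. "SPIM"
    suffix = stem.rsplit("_", 1)[-1] if "_" in stem else None
    return {
        "subject": entities.get("sub"),
        "session": entities.get("ses"),
        "sample": entities.get("sample"),
        "technique": SUFFIX_TO_TECHNIQUE.get(suffix),
    }
```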
@satra - ok, so I understand: I could add specific parsing and assume that this will not change much with new versions, etc. (basically ignoring the version info)
a few more rules: if it detects a BIDS dataset (i.e., dataset_description.json) and detects a
so overall things to extract and inject into metadata are:
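A minimal sketch of the dataset-detection part mentioned above, assuming we just walk up from the asset's path until a dataset_description.json appears:

```python
from pathlib import Path

def find_bids_root(path):
    """Return the nearest ancestor directory containing
    dataset_description.json, or None if the file is not in a BIDS dataset."""
    for parent in Path(path).resolve().parents:
        if (parent / "dataset_description.json").is_file():
            return parent
    return None
```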
we should perhaps start subclassing assets in dandischema soonish.
@satra - I talked to @TheChymera and he will work on extracting this info from the BIDS files, but it is indeed not clear to me where we will add the info. Should we create a new schema first?
the current asset schema handles all of these fields. here is an example nwb asset metadata: https://api.dandiarchive.org/api/dandisets/000008/versions/draft/assets/d387991e-f9c5-4790-9152-75885ddf5572/
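That endpoint serves plain JSON, so the example record can be inspected in a couple of lines (this assumes the URL returns the metadata document directly; requests is used here just for illustration):

```python
import requests

URL = (
    "https://api.dandiarchive.org/api/dandisets/000008/versions/draft"
    "/assets/d387991e-f9c5-4790-9152-75885ddf5572/"
)
metadata = requests.get(URL, timeout=30).json()
print(sorted(metadata))  # top-level fields of the asset metadata record
```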
I would say: detect presence of
most likely only if it gets lighter -- right now it has a pandas dependency, which brings in a lot... there are discussions about some
@TheChymera - are you still working on this?
@djarecka hi, sorry, no. I got stuck trying to get some sort of MWE (or an already-working example) to extract metadata from and build on from there. I think I only just realized, while working on some ephys data, that conversion to NWB is different from DANDI metadata extraction (or at least it's provided by another package there). If you could share some sort of example of what we're working on, I could continue from there. Are we working on
Currently building an XS regex-based BIDS validator as part of the new BIDS schemacode package. The schema directory is parameterized, so this can be pointed to our own version if we have nonstandard datasets (e.g. bids-standard/bids-specification#543 (comment)). @satra do you know whether our 108 dataset has a schema version which they use, or is it:
So we should just edit the schema to fit the dataset? There is a microscopy PR in BIDS, which looks like it might be merged soon, but our dataset does not, in fact, use it.
It should technically correspond to that PR, except that it will have ngff files; everything else should be aligned with that PR. Filenames should not be ad hoc even now, except for the h5 extension.
@satra sadly it does not; currently nothing validates, as the datatype is listed as
Or we could also comment on the extension PR, but
That should be fixed (but that PR only recently changed from
We will also need additional DANDI-specific overrides. ngff is not directly supported by the PR either, since that spec is evolving, but we will support it in DANDI.
And using openneuro for testing, pending: dandi/dandi-cli#772 (comment)
@satra it turns out there are quite a few more inconsistencies between BEP031 and DANDI:000108.
I think some should be fixed (renamed) in the dandiset itself, e.g.
So I think only the addition of
We will need to coordinate with @LeeKamentsky on renaming: do you have an entire copy of the dandiset with data as it was uploaded?
this will be reuploaded as
I've got both the NGFF and HDF5 stored locally. I can reorganize the file structure (microscopy -> micr) and rewrite both the NGFF and sidecar metadata as part of the update procedure when we get to the point where we switch from HDF5 to NGFF.
That would be great, @LeeKamentsky! So for now we can validate using the "adjusted schema", and then, before the ngff upload, switch to the schema which has only the "allow .ngff" modification.
@satra regarding case: going by general Unix case handling, the case gets parsed as-written from the YAML (both for the online listing currently done in bids-specification and for the validator I'm writing). This means that for it to be case-insensitive, both upper- and lower-case strings would need to be listed. I see no single suffix doing this, so if you want it implemented as case-insensitive it would have to be done at the validator level, but that would be nonstandard behaviour, so I'm not sure it's a good idea... The other thing I can do is write a comment on the BEP for it to be made lower-case, which sounds most consistent with BIDS, and include the change ahead of time in our
Let's see what they think about this: bids-standard/bids-specification#881 (comment)
@satra ok, so it appears there's no good reason to make it lowercase, since there are in fact quite a few uppercase suffixes. I think the right approach would be to update our data, rather than branch the schema over case or introduce validator fuzziness. Would you agree?
Also, apparently
Yes. Please create the validator with that in mind, plus support for ngff directories. Please let @LeeKamentsky know how best to test the dataset with the validator, perhaps before it gets into dandi-cli (or, if that is imminent, we can wait and just execute the validate function).
@satra oh my, there are many (extra) standard divergences here as well.
I have also just come across something I cannot really "fix" via schema divergence, namely
Additionally, there seems to be some microscopy data which is not using the
@TheChymera - we should give the contact person a heads-up, but I would suggest first making sure that the validator is integrated so that they can check. One thing to note is that each site has a different subset of the data: no site has all the data, and it also doesn't make sense to force people to download all the data. So that may be something additional to look into for the validator.
That just shows the power of standards and validators -- the ability to detect such typos ;) Yeah, that needs to be fixed in the dataset itself.
I can remove sub-MITU01_sessions.tsv - I guess I missed the point where that file disappeared from the spec. I'll add a README and see if I can reconstruct and backfill the CHANGES file.
@LeeKamentsky - it hasn't disappeared; it's just an optional file: https://bids-specification.readthedocs.io/en/stable/06-longitudinal-and-multi-site-studies.html#sessions-file (I wouldn't remove it).
@LeeKamentsky
Hello everyone! I had a nice chat with @TheChymera yesterday; we'll fix the ses-SPIM directories as soon as we have some time. For the other image modalities I think you can contact
@satra do we have individual issue trackers for each dataset? I remember I saw something like that once, but can't find it again. It might be better to discuss updates there since there seem to be a lot of different things going on in this thread. Ideally we could discuss:
separately.
You can do it in the github.com/dandisets org, where each dandiset has its own repo.
@gmazzamuto can you send me the contact details via email, or ping the GitHub accounts (I searched, but I don't think I found them)?
Issues with the datasets are now tracked here:
Thus the only relevant schema changes remain:
Ok, so we have a new schema version containing only changes which enhance but do not diverge from BIDS.
@yarikoptic - I wanted to bring this up to ensure that we can do a bit more with BIDS metadata extraction before full-fledged BIDS support (#432).
Can we simply extract the metadata from a file path when possible (identifiers for subject, session, sample, modality-via-folder, technique-via-suffix, when available)?
@djarecka - is this something you could take a quick look at? Dandisets 108 and 26 should provide a fair number of examples, and you can play with the path names of files using the API. This should return some structured metadata for the non-NWB part of the processing of a file; it should look like the session metadata and subject metadata info we extract from NWB.
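Tying this back to the filename-parsing sketch earlier in the thread, the structured output for a hypothetical 000108-style path might look like this (the session and sample values are made up):

```python
# Reusing the parse_bids_filename() sketch from above; the path below is
# illustrative, not an actual asset from dandiset 000108.
record = parse_bids_filename(
    "sub-MITU01/ses-20210521/microscopy/"
    "sub-MITU01_ses-20210521_sample-163_SPIM.h5"
)
# -> {'subject': 'MITU01', 'session': '20210521',
#     'sample': '163', 'technique': 'selective plane illumination microscopy'}
```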