stimuli BEP #751

Open
yarikoptic opened this issue Mar 9, 2021 · 20 comments
Labels
BEP opinions wanted Please read and offer your opinion on this matter raw stimuli

Comments

@yarikoptic
Collaborator

As @Remi-Gau hinted in #695, we still lack clarity on how original stimuli should be stored and annotated.

We do have

  • stimuli/ folder which, like sourcedata/, has no prescribed structure.
  • stim_file column in _events.tsv to point to an (unregulated) location under stimuli/, with the description/HED tags for each stim_file then populated in _events.json (bless the inheritance principle).
  • "human wording" in _events.json to point to the origin of a stimulus, possibly some DB
  • _stim.tsv.gz files for "signals related to the stimulus" (but not necessarily the stimulus itself)
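
For illustration, here is a minimal sketch of how the stim_file mechanism listed above can be consumed today: each non-empty stim_file value is resolved relative to the stimuli/ folder. The function name, the example rows, and the file names are made up for this sketch and do not come from any existing tool.

```python
import csv
import io
from pathlib import PurePosixPath


def resolve_stim_files(events_tsv_text, stimuli_root="stimuli"):
    """Return the stimulus paths referenced by the stim_file column."""
    reader = csv.DictReader(io.StringIO(events_tsv_text), delimiter="\t")
    paths = []
    for row in reader:
        value = row.get("stim_file")
        # Skip rows without a stimulus ("n/a" marks a missing value in BIDS TSVs).
        if value and value != "n/a":
            paths.append(str(PurePosixPath(stimuli_root) / value))
    return paths


# Hypothetical events table, used only to exercise the function.
EXAMPLE_EVENTS = (
    "onset\tduration\ttrial_type\tstim_file\n"
    "1.0\t0.5\tface\timages/face01.png\n"
    "3.0\t0.5\tfixation\tn/a\n"
)

print(resolve_stim_files(EXAMPLE_EVENTS))
```

Nothing here says anything about how files under stimuli/ are named or described, which is the gap this issue is about.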

With respect to the first 3 items, and in conjunction with

  • stimuli collections/datasets should be self-sufficient/described
  • incoming requests to store stimuli datasets on DANDI archive

I wondered whether there is an ongoing effort to standardize "stimuli datasets" so that

  • they could be readily reused across studies by simply placing them under stimuli/<name>, avoiding the need to describe stimuli in _events.json since that information could be picked up from their standardized layout
  • derivatives of them could be created and possibly shared alongside (e.g. all the feature extractions done by pliers, thanks to @tyarkoni, @qmac, et al.)

With that in mind I am even thinking such datasets could follow the BIDS mantra, with "participant/subject" and sub- simply renamed to "stimulus"/stim-, while preserving README.md, dataset_description.json, stimuli.tsv, etc.

Worth a BEP/effort, or maybe it is already a "solved problem"? ;) WDYT?

Related:

@effigies
Collaborator

At one point I talked to @Gilles86 about how he was storing stimuli, but don't clearly recall how deep we went. He might have some thoughts here.

Just to comment on one thing, I'm not sure stim-<label> buys much over <label>. It would be worth thinking about which orthogonalish dimensions it would make sense to have entities for. A stimulus class name and an index to distinguish instances of that class are going to be most common. Then you may have some within-class parameterization, but that's going to be really specific to the type of stimulus. For example, if you're interested in speaker-invariant speech representations, you might split your stimuli by speaker, but I don't see an entity that could cover all such parameterizations.

Something like <label>[_desc-<label>][_<index>].<ext> might cover most use cases without unnecessarily adding boilerplate.

@yarikoptic
Collaborator Author

yarikoptic commented Mar 16, 2021

Just to comment on one thing, I'm not sure stim-<label> buys much over <label>.

  • Having a clean prefix avoids collisions of <label>s with other possible directories (e.g. code/, sourcedata/, etc.).
  • By a similar argument, sub-<label>/ and sub-<label>_ IMHO also buy us nothing really, but that is what we have, likely because they provide immediate "metadata" about the domain of the <label> we are talking about.
  • Having a stim-<label> directory and a stim-<label>_ filename prefix allows generalizing the BIDS dataset layout to cover "stimuli BIDS datasets", where the stim entity serves as an analog to the sub entity we have now.

@yarikoptic
Collaborator Author

yarikoptic commented Mar 16, 2021

Something like <label>[_desc-<label>][_<index>].<ext> might cover most use cases without unnecessarily adding boilerplate.

I also hate boilerplate, and indeed many use cases might not need directories at all. BUT I can see stimuli collections where each stimulus has a good number of files (audio, audio/video, images, etc.) associated with that stimulus category; directories would then help organization, navigation, and reuse (a clear "module" for a stimulus at the directory level). So, again, similarly to neuroimaging datasets, where we keep a per-subject directory even when there is just a single T1w image per subject, it might be sensible to have per-label directories. (Moreover, there could be multiple samples of the same label, which is semantically similar to _run-, but that entity is really not a good fit for that, and indeed _desc- could be better.)

PS: although maybe run could even make sense for some stimuli: recordings of the same scene/action taken in sequence and otherwise having no immediate qualitative difference!

@effigies
Collaborator

Hmm. Okay, fair enough. I guess the question is how much is this supposed to be BIDS-like or is it supposed to be BIDS? That matters for what entity names are chosen, since if it is BIDS, then we can't change the meaning of an entity too far. If it's just BIDS-like, then we can choose entities that are appropriate for stimuli with little regard for BIDS' existing definitions or ones that are likely to be claimed by future BEPs.

I would probably prefer BIDS-like, since a subject or a recording session is integral to a lot of definitions.

So here's a notion:

stimuli/
  dataset_description.json
  stim-<label>/
    stim-<label>[_desc-<label>][_item-<index>].<ext>
    stim-<label>[_desc-<label>][_item-<index>].json
  • stim-<label> would be the task-relevant class
  • desc-<label> would be within-class parameterization
  • item-<index> would be like run- with no qualitative difference
  • <ext> can probably indicate data type without resorting to an additional _<suffix>, which is another reason to be BIDS-like, instead of BIDS, where suffix is required.
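
To make the naming pattern above concrete, here is a rough sketch of a parser for it. It assumes labels are alphanumeric and the index is numeric, as in current BIDS entities; the regular expression and the function name are illustrative only, not part of any agreed specification.

```python
import re

# stim-<label>[_desc-<label>][_item-<index>].<ext>, as sketched in this comment.
STIM_NAME = re.compile(
    r"^stim-(?P<stim>[A-Za-z0-9]+)"
    r"(?:_desc-(?P<desc>[A-Za-z0-9]+))?"
    r"(?:_item-(?P<item>[0-9]+))?"
    r"\.(?P<ext>[A-Za-z0-9.]+)$"
)


def parse_stim_name(filename):
    """Split a stimulus file name into its entities; absent ones are None."""
    match = STIM_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a stimulus file name: {filename!r}")
    return match.groupdict()


print(parse_stim_name("stim-word1_desc-sp1normal_item-2.wav"))
print(parse_stim_name("stim-movie.mp4"))
```

Because there is no suffix, the extension is the only hint about the data type, which matches the BIDS-like (rather than BIDS) reading of the proposal.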

An alternative (or addition) to desc- could be a stims.tsv that allowed you to explicitly say that here are relevant factors:

stimulus                                speaker_id  speaker_gender  tone
stim-word1_desc-sp1normal_item-1.wav    1           M               normal
stim-word1_desc-sp1normal_item-2.wav    1           M               normal
stim-word1_desc-sp1strained_item-1.wav  1           M               strained
stim-word1_desc-sp1strained_item-2.wav  1           M               strained

Then we presumably need a stims.json to define columns.
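
As a sketch of what that pairing could look like, the snippet below renders a stims.tsv and a matching stims.json. The Description/Levels keys mirror how BIDS sidecars already describe TSV columns; the description texts themselves are assumptions made for this example.

```python
import csv
import io
import json

# Two of the rows from the table above.
ROWS = [
    {"stimulus": "stim-word1_desc-sp1normal_item-1.wav", "speaker_id": "1",
     "speaker_gender": "M", "tone": "normal"},
    {"stimulus": "stim-word1_desc-sp1strained_item-1.wav", "speaker_id": "1",
     "speaker_gender": "M", "tone": "strained"},
]

# Hypothetical stims.json content: one entry per stims.tsv column.
SIDECAR = {
    "stimulus": {"Description": "File name of the stimulus"},
    "speaker_id": {"Description": "Identifier of the speaker"},
    "speaker_gender": {"Description": "Gender of the speaker",
                       "Levels": {"M": "male"}},
    "tone": {"Description": "Tone of voice",
             "Levels": {"normal": "normal voice", "strained": "strained voice"}},
}


def render_tsv(rows, sidecar):
    """Serialize rows as TSV, using the sidecar keys as the column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(sidecar),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def undescribed_columns(rows, sidecar):
    """List columns used in the rows that the sidecar does not define."""
    used = {column for row in rows for column in row}
    return sorted(used - set(sidecar))


stims_tsv = render_tsv(ROWS, SIDECAR)
stims_json = json.dumps(SIDECAR, indent=2)
print(stims_tsv)
print(undescribed_columns(ROWS, SIDECAR))
```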

@yarikoptic
Collaborator Author

Thank you @effigies !!! I feel like we are on the same page and leaping forward ;)

I guess the question is how much is this supposed to be BIDS-like or is it supposed to be BIDS? That matters for what entity names are chosen, since if it is BIDS, then we can't change the meaning of an entity too far.

Which entities' meanings do you see needing much adjustment? Even for run I feel we would not need much adjustment, although some might already be a bit overdue: filed #760. So I am not sure we really need to introduce _item in place of _run just yet.

An alternative (or addition) to desc- could be a stims.tsv

+1 on that. Additional thoughts:
If we are to retain scans as a term, and aim for "BIDS" (not just "BIDS-like"), then such a file would be analogous to the _scans.tsv we already have, thus stim-word1/stim-word1_scans.tsv...
With the aforementioned #760 in mind, I wonder if with this "stimuli BEP" we could indeed be the first to generalize that into samples (from scans) or some other good generic term?

But then I see the point of having a top-level stims.tsv, analogous to participants.tsv, to describe common high-level attributes for each stimulus <label>, such as

stimulus_id  language  word_class  ...
word1        english   noun

@effigies
Collaborator

Which entities' meanings do you see needing much adjustment? Even for run I feel we would not need much adjustment, although some might already be a bit overdue: filed #760. So I am not sure we really need to introduce _item in place of _run just yet.

It feels like shoehorning an experimental notion into a corpus description. I would rather step back and think about what would make a good corpus standard with minimal reference to BIDS.

Maybe if you're thinking of the generation of the stimuli as a procedure that is repeated multiple times, run works. But perhaps I'm sampling from a larger corpus where the notion doesn't apply (e.g., going back through BBC archives for different pronunciations of words).

An alternative (or addition) to desc- could be a stims.tsv

+1 on that. Additional thoughts:
If we are to retain scans as a term, and aim for "BIDS" (not just "BIDS-like"), then such a file would be analogous to the _scans.tsv we already have, thus stim-word1/stim-word1_scans.tsv...
With the aforementioned #760 in mind, I wonder if with this "stimuli BEP" we could indeed be the first to generalize that into samples (from scans) or some other good generic term?

But then I see the point of having a top-level stims.tsv, analogous to participants.tsv, to describe common high-level attributes for each stimulus <label>, such as

stimulus_id  language  word_class  ...
word1        english   noun

Yeah, stims.tsv and samples.tsv make sense to me.

@Remi-Gau
Collaborator

Quick thought to point out what @sappelhoff mentioned regarding subject specific stimuli here: #750 (comment)

I don't think that it will be such a rare case, and we should probably give it some thought.

If it is just a matter of a raw stimulus being adapted to each participant, this could be treated as derivatives but having a way to describe the subject the stimulus is for would be a good thing.

Use the desc label to do that?

stimuli/
  dataset_description.json
  stim-<label>/
    stim-<label>[_desc-<label>][_item-<index>].<ext>

Reuse the sub entity?

stimuli/
  dataset_description.json
  stim-<label>/
    stim-<label>[_sub-<label>][_desc-<label>][_item-<index>].<ext>

@Remi-Gau
Collaborator

Also does it make sense to have "prefix" or is it really shoehorning too much BIDS into this?

@effigies
Collaborator

RE: sub entities, is the stimulus truly related to the subject, or is it that each subject gets a different stimulus? For a stimulus dataset that needs to be able to be understood in isolation, I'm wary of infecting with a separate notion. For example, maybe I created the stimuli for the subjects in a particular study, but then I want to perform a second study with the same stimuli, and the stim-movie_sub-01.mp4 no longer is viewed by sub-01 in my new study. Or maybe it's viewed by sub-01 and sub-38.

I would suggest that this would be a good use case for item. If sub-01 watches stim-movie_item-01.mp4, sub-02 watches stim-movie_item-02.mp4, and so on, then there's a straightforward mapping, but it is not confusing if it doesn't apply when the same stimulus set is used in a different study.

Also does it make sense to have "prefix" or is it really shoehorning too much BIDS into this?

I don't really understand this question. Could you clarify?

@Remi-Gau
Collaborator

RE: sub entities, is the stimulus truly related to the subject, or is it that each subject gets a different stimulus?

Yes but...

For a stimulus dataset that needs to be able to be understood in isolation, I'm wary of infecting with a separate notion.

I will get specific to better explain.

So in the case I have in mind, the stimuli are literally made for each participant: participants are presented with sounds played from different locations, and the sounds are recorded with microphones placed next to their ears so that they can be replayed to them in the scanner as if they were listening to sound coming from that very specific location. Each person has their own "head-related transfer function" that filters the sound in a given way, so each participant has their own set of sounds.

This is very much related to a given dataset so in most cases it won't work in isolation from the data.

But even if you "ship" the stimuli with the BIDS dataset, I am wondering if it would make sense to worry about this to the level of having an entity that "pairs" a stimulus to a subject. Sort of thinking this is in the 20% of our Pareto principle.

@tyarkoni

My two cents: this feels to me like way more trouble than it's worth. Just give each stimulus a unique stim and/or item label, and then you can map between stimuli and subjects using a .tsv file that maps between them, or by adding a sidecar to each stimulus that indicates which subject they're for. Otherwise it gets very messy because everywhere else in BIDS that sub occurs, it's mandatory. I would honestly even consider sticking with just stim in the filename and doing everything else with a stims.tsv file. The space of potential stimuli and their applications seems to me too wide to plausibly encode in a meaningful way in filenames.
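
A minimal sketch of the mapping-table idea, assuming a two-column .tsv whose column names (participant_id, stimulus) are chosen here only for illustration. The stimulus file names reuse the stim-movie_item-XX.mp4 examples from earlier in the thread.

```python
import csv
import io

# Hypothetical mapping file; it lives with the study, not with the stimulus set.
MAPPING_TSV = (
    "participant_id\tstimulus\n"
    "sub-01\tstim-movie_item-01.mp4\n"
    "sub-02\tstim-movie_item-02.mp4\n"
    "sub-38\tstim-movie_item-01.mp4\n"
)


def stimuli_by_subject(tsv_text):
    """Group stimulus file names by the subject they were presented to."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        mapping.setdefault(row["participant_id"], []).append(row["stimulus"])
    return mapping


print(stimuli_by_subject(MAPPING_TSV))
```

The stimulus file names stay free of any sub entity, so the same stimulus set can be reused in another study with a different mapping file.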

I also think this (i.e., stimulus naming/encoding) is a big and important enough problem it could easily be spun off into its own non-BIDS spec, and just be wrapped later by BIDS.

@effigies
Collaborator

I also think this (i.e., stimulus naming/encoding) is a big and important enough problem it could easily be spun off into its own non-BIDS spec, and just be wrapped later by BIDS.

ReCorDS (Research Corpus Data Structure)?

@Remi-Gau
Collaborator

I don't really understand this question. Could you clarify?

As many files in BIDS are of the form [entity1-<label>][_entityX-<label>]*_<suffix>.<ext>, I was just wondering whether having a suffix in there made any sense.

@Remi-Gau
Collaborator

Remi-Gau commented Apr 14, 2021

My two cents: this feels to me like way more trouble than it's worth. Just give each stimulus a unique stim and/or item label, and then you can map between stimuli and subjects using a .tsv file that maps between them, or by adding a sidecar to each stimulus that indicates which subject they're for.

Makes sense: when I wrote my last reply, I started thinking of an optional "intended for" field or something equivalent instead of an entity.

I also think this (i.e., stimulus naming/encoding) is a big and important enough problem it could easily be spun off into its own non-BIDS spec, and just be wrapped later by BIDS.

Agreed. That is definitely something where I would like to hear the opinion of the psych-DS folks for example.

@yarikoptic
Collaborator Author

Thanks everyone! I think this discussion, along with #750, also resonates with the recent discussion of BEP032, where a sub- top level might often be very suboptimal (e.g. consider a "tissue" or a "cell" as the main domain of differentiation between recordings).
So, to my ear, the "big and important enough problem it could easily be spun off into its own non-BIDS spec" might instead be a generalization of BIDS 1.x (maybe even for BIDS 2.0?), since it seems it could largely be made "backward compatible" (BIDS 1.x datasets would still be "valid"), where

  • a hierarchy would not be "hardcoded" to be sub-[/ses-] but given by a specification (e.g. defined in dataset_description.json):
    • would be based on entities we define in https://github.com/bids-standard/bids-specification/blob/master/src/schema/entities.yaml#L2
      • maybe we should allow only some entities to be promoted to "hierarchy", or have a vetted list of possible "hierarchies". But the mechanism would be generic:
    • in general the specification would be ["<entity#1>", "[<entity#2>]", "..."], where [] would signal optional inclusion (if present).
    • e.g. ["subject", "[session]"] for default/BIDS 1.0
    • and ["stimulus"] (yet to be added as an entity) for stimuli datasets, but it could as well be ["stimulus", "subject"] (or swapped order) if such a dataset has many per-subject stimuli
    • some users of BEP032 will be happy to use ["tissue", "cell"]
  • "lessons learned" consistency introduced:
    • top level includes <entity:plural>.{tsv,json} (we will need to add a "plural" for each entity in entities.yaml, e.g. "stimuli" for the "stimulus" entity)
      • we have sub- but participants.tsv: we can generalize into subjects.tsv. Having participants.tsv while operating on the subject entity is just a pretense with no gain IMHO.
      • so we get stimuli.{tsv,json}
    • _scans.{tsv,json} is generalized into _samples.{tsv,json} or dissolved entirely:
      • so far I see it as a "summary" of metadata which generally should be present in each particular scan/sample sidecar .json file.
I think such a generalization would allow establishing BEPs like this one as easily as adding a few (if any) missing entities and "vetting" a "new" hierarchy layout. With the ongoing effort by @tsalo to formalize the schema, any BIDS tool using that schema would be able to immediately support such a "novel" layout. The interesting and important questions would be about what metadata to include.

Sorry if I derailed a bit ;-)

@yarikoptic
Collaborator Author

oh (sorry for the dump) - I just realized that it generalizes very nicely to what many (myself included) were missing: per-entity-level metadata, which in general is

[ent1-<label>_...]_<ent?:plural>.{tsv,json}

where

  • [ent1-<label>_...] are entities from prior levels, such as, now, sub-<label>[_ses-<label>]
  • <ent?:plural> is the one for the level, such as "participants.tsv" (no prior levels).
    • to some degree this is currently _scans, but that is where seamless generalization breaks on several points:
    • it is the final level, at which we have datatypes, which are not an entity per se ATM and do not follow datatype-<label>/ but just <label>/ naming
    • it does not provide common details across data types but rather lists all individual "samples" across all data types

BUT, it generalizes nicely into

  • sub-<label>/sub-<label>_sessions.{tsv,json}: on so many occasions I wondered, and people asked: "where do I place per-participant information for different sessions?". The current solution is to serialize it within participants.tsv, and BIDS seems to be silent on how to deal with it, so people come up with an ad-hoc cross-product of the two, with a session or session_id column holding either just the id or `ses-` values:
(git)smaug:/mnt/btrfs/datasets/datalad/crawl/openneuro[master]
$> grep -A2 session ds*/participants.tsv
ds001541/participants.tsv:participant_id	session	run1	run2	run3	run4	Viral_infusion_date	MRI_acquisition_date	weight	group	day_post_infusion	gender	viral_vector
ds001541/participants.tsv-562	2	33	100	66	n/a	2014-01-09	2014-03-10	30.8	exp	60	male	ChR2-eYFP
ds001541/participants.tsv-562	1	100	g100	g100	100	2014-01-09	2014-03-11	30.6	exp	61	male	ChR2-eYFP
--
ds001653/participants.tsv:participant_id	session	gender	weight	acquisition_date	breathing_rate	condition
ds001653/participants.tsv-sub-jgrAesAWc11R1L	ses-1	f	20.6	2017-08-11	150	awake
ds001653/participants.tsv-sub-jgrAesAWc12R	ses-1	f	22.4	2017-08-11	240	awake
--
ds001890/participants.tsv:participant_id	session	sex	genotype	Weight	SpO2	HR	Temperature	DOB	Experiment_Date	Age
ds001890/participants.tsv-c1NT	1	M	3xTG	32.3	98	272	35.8	2016-11-22	2017-03-23	3
ds001890/participants.tsv-c1NT	2	M	3xTG	36.2	94	311	35.8	2016-11-22	2017-05-31	6
--
ds002134/participants.tsv:participant_id	session	genotype	virus	age	sex	Weight	Temperature	DOB	Surgery_date	Experiment_Date	run-1	run-2	run-3	run-4	
ds002134/participants.tsv-jgroptoAD100	1	C57BL/6	mCherry	3	M	30	36.3	2018-12-11	2019-04-01	2019-04-20	n/a	10	20	5	
ds002134/participants.tsv-jgroptoAD101	1	C57BL/6	mCherry	3	M	29.6	36.6	2018-12-11	2019-04-01	2019-04-20	n/a	5	10	20	
--
ds002154/participants.tsv:participant_id	session	gender	condition	weight	Experiment_Date
ds002154/participants.tsv-1	1	m	veh	29.3	2015-10-14
ds002154/participants.tsv-1	2	m	psi05	29.3	2015-10-14
--
ds002307/participants.tsv:participant_id	DOB 	rs-fMRI 1	rs-fMRI 2	rs-fMRI 3	rs-fMRI 4	rs-fMRI 5	rs-fMRI 6	rs-fMRI 7	excluded_rs-fMRI_sessions	dMRI
ds002307/participants.tsv-Ey112	20160126	20160415	20160417	20160418	20160419	20160421	20160424	20160425	x	20160426
ds002307/participants.tsv-Ey113	20160126	20160415	20160417	20160418	20160419	20160421	20160424	20160425	x	20160426
--
ds002547/participants.tsv:participant_id	sex	age	validation_session
ds002547/participants.tsv-sub-01	F	24.0	1.0
ds002547/participants.tsv-sub-02	M	21.0	1.0
--
ds002995/participants.tsv:participant_id	weight	age	gender	num_sessions
ds002995/participants.tsv-sub-007	68	24	F	1
ds002995/participants.tsv-sub-008	70	22	F	2
--
ds003416/participants.tsv:participant_id	session_id	sex	age	handedness
ds003416/participants.tsv-cIs1	s1Ax1	male	25	left
ds003416/participants.tsv-cIs1	s1Ax2	male	25	left
--
ds003464/participants.tsv:participant_id	session	genotype	virus	Experiment_Date	sex	weight	delta_preference
ds003464/participants.tsv-jgroptoINS501	2	C57BL/6	ChR2-mCherry	2018-09-11	M	32	n/a
ds003464/participants.tsv-jgroptoINS503	2	C57BL/6	ChR2-mCherry	2018-07-25	M	n/a	0.11
--
ds003470/participants.tsv:participant_id	session_id	age	sex	size	weight
ds003470/participants.tsv-sub-01	ses-1	26	F	1.63	55
ds003470/participants.tsv-sub-02	ses-1	18	M	1.82	67
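
As a sketch of how such combined tables could be split into the per-subject sessions files proposed above, the snippet below groups rows by participant and normalizes the ids. The helper names and the choice of session_id as the output column are assumptions; the sample rows are the ds002154 lines from the grep output.

```python
import csv
import io
from collections import defaultdict


def with_prefix(value, prefix):
    """Add the entity prefix unless the value already carries it."""
    return value if value.startswith(prefix) else prefix + value


def split_sessions(participants_tsv_text):
    """Map 'sub-X/sub-X_sessions.tsv' to that subject's per-session rows."""
    reader = csv.DictReader(io.StringIO(participants_tsv_text), delimiter="\t")
    fieldnames = reader.fieldnames or []
    # The datasets above use either "session" or "session_id" for this column.
    session_column = "session_id" if "session_id" in fieldnames else "session"
    per_subject = defaultdict(list)
    for row in reader:
        subject = with_prefix(row.pop("participant_id"), "sub-")
        row["session_id"] = with_prefix(row.pop(session_column), "ses-")
        per_subject[subject].append(row)
    return {f"{s}/{s}_sessions.tsv": rows for s, rows in per_subject.items()}


COMBINED = (
    "participant_id\tsession\tgender\tcondition\tweight\tExperiment_Date\n"
    "1\t1\tm\tveh\t29.3\t2015-10-14\n"
    "1\t2\tm\tpsi05\t29.3\t2015-10-14\n"
)

tables = split_sessions(COMBINED)
for path, rows in tables.items():
    print(path, [row["session_id"] for row in rows])
```

Columns that are constant per subject (gender here) are left in the per-session rows; deciding which of them belong back in the top-level table would need dataset-specific judgment.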

So overall generalization could be

  • [ent1-<label>_...]_<ent?:plural>.{tsv,json} for a level which has some other entity as a sub-level
  • [ent1-<label>_...]_samples.{tsv,json} if that is the last level in the "hierarchy", which is then followed by "datatypes"... but isn't stim a datatype? (need to think more ;))
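
The naming rule above could be sketched as a small helper. The function name and argument shapes are made up for illustration, and "samples" is only the tentative term floated in this thread.

```python
def level_metadata_name(prior, plural, extension="tsv"):
    """Build [ent1-<label>_...]_<plural>.<extension>.

    prior is a list of (key, label) pairs for the levels above this one,
    e.g. [("sub", "01"), ("ses", "pre")]; it is empty at the top level.
    """
    prefix = "_".join(f"{key}-{label}" for key, label in prior)
    stem = f"{prefix}_{plural}" if prefix else plural
    return f"{stem}.{extension}"


print(level_metadata_name([], "participants"))
print(level_metadata_name([("sub", "01")], "sessions"))
print(level_metadata_name([("sub", "01"), ("ses", "pre")], "samples", "json"))
```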

@sappelhoff added the "BEP" and "opinions wanted" labels on Jan 15, 2022
@yarikoptic
Collaborator Author

Agreed. That is definitely something where I would like to hear the opinion of the psych-DS folks for example.

I should have looked into @psych-ds earlier. Initiated some dialog in the Psych-DS spec Google doc. It might indeed align nicely if we could allow for a different layout (not ["subject", "[session]"]).

@mekline

mekline commented Jun 8, 2022

Psych-DS 'maintainer' here (we have a tech spec, no released validator software yet). Psych-DS is very firmly in the "BIDS-like" rather than "BIDS" category, and one of the main differences, at least in v1, is that we are not enforcing ordering of the key-value pairs in directory or filename structure.

A possible use case would be the ability to take a BIDS dataset and "compile out" the behavioral task data, e.g. for an existing pipeline designed for out-of-scanner analysis of task data, or conversely, "compiling in" task data that's collected in a non-BIDS-like form but that is associated with BIDS data. Psych-DS is scoped primarily for behavioral data rather than stimuli, but I think there's no particular reason there couldn't be other clear parallels.

@mekline

mekline commented Jun 8, 2022

One point to note is that Psych-DS uses/will use JSON-LD metadata, i.e. Schema.org/Dataset. A stimulus-set version of Psych-DS would probably want to use some other combination of CreativeWork, ImageObject, etc.
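
As a rough illustration only, a JSON-LD record for a single image stimulus could look like the output of the sketch below. The properties used (name, contentUrl, encodingFormat) are existing Schema.org properties of ImageObject; the helper name, the file path, and the choice of fields are assumptions, not something Psych-DS has specified.

```python
import json


def image_stimulus_record(name, content_url, encoding_format):
    """Build a Schema.org ImageObject description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "ImageObject",
        "name": name,
        "contentUrl": content_url,
        "encodingFormat": encoding_format,
    }


# Hypothetical stimulus file, used only to show the shape of the record.
record = image_stimulus_record("face01", "stimuli/images/face01.png", "image/png")
print(json.dumps(record, indent=2))
```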

@yarikoptic
Collaborator Author

FWIW, added a stub for possible BIDS 2.0 development: bids-standard/bids-2-devel#54


6 participants