Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Allow for - (dash) (and/or +) in <label> #1165

Open
yarikoptic opened this issue May 9, 2019 · 34 comments · May be fixed by #1926
Open

[ENH] Allow for - (dash) (and/or +) in <label> #1165

yarikoptic opened this issue May 9, 2019 · 34 comments · May be fixed by #1926
Milestone

Comments

@yarikoptic
Copy link
Collaborator

I thought that we already had an issue filed for this, but may be it was just my comments in the google doc version of the BIDS specification, so filing anew. Please close it referencing original one if one already exists.

ATM <label> is as I formalized in yet to be finished PR is allowed to contain only alpha numerics. That makes it really difficult in some cases to place multiple entries (e.g. details for _acq- or abbreviate details of preprocessing in _desc- of BIDS derivatives). In ReproIn heuristic for heudiconv, where we do allow for - we just replace them with an ugly X as a separator to just stay compatible with BIDS.

I have been suggesting to extend the list of allowed characters in <label> to include at least - and possibly others (e.g., +) which otherwise should not introduce any backward incompatibility with BIDS 1.0 -- all BIDS 1.0 names would be compatible with new syntax allowing - in the <labels>.

The original concern (probably at least a year back) which precluded any action was a possible breakage of parsing BIDS filenames for BIDS-supporting software, and was suggested to be included in coming some time in the future BIDS 2.0.

N.B. I have failed to file and issue outlining plans for BIDS 2.0, so I initiated a new milestone (2.0.0) to which I am adding this new issue.

IMHO (and may be we should outline it somewhere) BIDS supporting applications should be coded adhering to Postel's principle which is driving TCP implementations to "be conservative in what you do, be liberal in what you accept from others.". And most likely, in the case of added -, they already are since they are more likely to be splitting based on _ and then on - to separate key from value instead of full rigid regular expressions limiting to the allowed characters (not so obviously) mandated by the BIDS specification. So the only catch here indeed might be splitting on - without restricting that only a single split should happen. That is where breakage could happen if - is added. The possibility of breakage in case of extending the list of allowed characters with + (or even : which I dislike to suggest due to messing up scp/ssh) unlikely to have negative ramifications.

Please share your thoughts/opinions

@effigies
Copy link
Collaborator

effigies commented May 9, 2019

I'm personally okay with - or +, and can't really remember the arguments against them.

@oesteban
Copy link
Collaborator

oesteban commented May 9, 2019

I'm okay with +, and not very fond of - but I understand it could be added. I would not allow - to be present in the suffix entity, though.

@yarikoptic
Copy link
Collaborator Author

I would not allow - to be present in the suffix entity, though.

Great point I haven't thought to exercise @oesteban ! -- I would not allow it as well, makes parsing much more difficult etc!
The only spot I see where <label> is used not within _key-<label> is

(git)hopa:~/proj/bids/bids-specification[derivatives]git
$> git grep '_<label'
src/04-modality-specific-files/04-intracranial-electroencephalography.md:called `electrical_stimulation_<label>`, with labels chosen by the researcher and

so might need analysis/tune up (or not)

@yarikoptic
Copy link
Collaborator Author

ok, after #152 is merged in one way or another we could have a PR extending the list of allowed characters non-ambiguously in a single location ;-)

@nicholst
Copy link
Collaborator

nicholst commented May 10, 2019

Note that the request to include hyphens in key-value labels is currently in the "Suggestions for BIDS 2.0" Google doc:

Allowing hyphens in filename key values

which suggests, as I had recalled, that this was a settled issue. Including hyphens increases the complexity of parsing the filenames, was the reason I recall.

@nbeliy
Copy link

nbeliy commented May 10, 2019

@nicholst, will '-' really increase complexity of parsing?

Each key-value parsed by '_' and everything after the first '-' is a label.

@yarikoptic
Copy link
Collaborator Author

@nicholst Oh, that is where BIDS 2.0 "extensions" are discussed -- thank you for the link! I have added the link to the 2.0.0 milestone description. I think it might be worthwhile to move them all under issues over here at some point, so elderly folks like us could easily find everything relevant in the single location (here).

I would not say that parsing complexity would increase, but indeed it might require some adjustment.

@nicholst
Copy link
Collaborator

nicholst commented May 11, 2019

@nbeliy,

@nicholst, will '-' really increase complexity of parsing?

Each key-value parsed by '_' and everything after the first '-' is a label.

Just reciting the conclusion of a long thread about the issue. Like anything, this issue can be re-raised, but I think there is the basic issue of extra clarity from having the hyphen play exactly one role in file/directory names.

@yarikoptic
Copy link
Collaborator Author

my 1c: This issue by now is probably a few years old (IIRC I have suggested it possibly even during BIDS 1.x initial development/release) . If we acted sooner, it would be no issue but even those years back I have proposed it the main argument against was "we would break the tools" of which there were much fewer. So longer we wait to act, the harder would be to adopt this change.

@oesteban
Copy link
Collaborator

I think at this point the procedure would be sending a draft PR with the suggested changes for a more focused discussion, is that correct @franklin-feingold @sappelhoff ?

@sappelhoff
Copy link
Member

I think at this point the procedure would be sending a draft PR with the suggested changes for a more focused discussion, is that correct @franklin-feingold @sappelhoff ?

Yes, and on top of that I think what this discussion needs are:

  1. concrete use cases where the proposed changes would significantly enhance BIDS (summaries/links to such cases)
  2. a concrete PR showing the diff of the spec prior and post the proposed changes (what @oesteban said)
  3. more participants to voice their opinions - especially developers

Currently this change seems to me as a minor improvement with the potential for a major confusion ... but I am very happy to be convinced otherwise :-)

@nicholst
Copy link
Collaborator

nicholst commented Aug 16, 2019 via email

@robertoostenveld
Copy link
Collaborator

Is UTF8 allowed in BIDS file names, or if not, should it be? If so, there are many dashes to choose from which are not a minus (ASCII decimal 45). For example ‒ – and —. See https://www.w3schools.com/charsets/ref_utf_punctuation.asp

@oesteban
Copy link
Collaborator

oesteban commented Nov 1, 2019

Using alternatives that look like the proposed addition doesn't help much with readability, makes it harder. Imagine a suffix like _inplane‒T2.json (I used your first UTF8 character). _inplane+T2.json breaks the suffix a bit, but is still readable. The limitation of the suffix is also important in this kind of proposal, IMHO.

I would be fine with the + symbol and I think the dash is not worth the pain. Adding the dash would feel to me as useful as allowing spaces (i.e., error-prone and likely confusing).

@yarikoptic
Copy link
Collaborator Author

I think we should explicitly forbid UTF8 and limit to ASCII to avoid such "tricks" which could potentially cause confusion.
So let's move forward with "+"?
I doubt any tool was so strict with regexp that it listed only alpha numerics. and sooner we change, less possible negative effect it would have. This issue already dragged long time (I originally suggested in the times of BIDS being a google doc), and I think it is worth to be accepted to 1.x series without awaiting 2.0 which I am not quite sure would come in any foreseeable future

@robertoostenveld
Copy link
Collaborator

robertoostenveld commented Nov 2, 2019 via email

@yarikoptic
Copy link
Collaborator Author

ok, it seems that the path of the least resistance would be to add + as allowed character. I will look into proposing a PR here and against bids-validator for that (unless someone would beat me to it which I would really appreciate)

@yarikoptic
Copy link
Collaborator Author

oh hoh, it had been a year! Few final remarks and then I will try to find time to compose a PR.

  • + implies composition of multiple things, like _proc-mc+filt -- thing was motion corrected and filtered; whenever - is more of a pure delimiter between multiple words which need to be taken together
  • re _inplane‒T2. example -- "illegit" since - must not be allowed in the suffix, which it must not look like _key-value. This issue only about <label> as used in the _key-<label>, and we never use _<label> as arbitrary suffix (git grep -e '<label>\.' -e '_<label>' comes with no relevant hits)
  • comparison of - to spaces is IMHO also a bit too far stretched. - does not break visual composition of the filename etc.
  • arguments against - because there are UTF8 characters for longer dashes are not really pertinent. Pretty much any character could have some confusingly similar symbol in UTF8, so we just must not allow utf8 in general.

Since I and others mentioned a number of use cases, and IMHO it would be easier to strip away than to add later, I will prepare a PR which would introduce allowing both - and + and then we will see how it goes.

@Remi-Gau Remi-Gau transferred this issue from bids-standard/bids-specification Feb 22, 2021
@yarikoptic
Copy link
Collaborator Author

woohoo -- I found you my darling issue -- you have been moved and I thought that I just lost my "search foo" skill!

With the addition of BIDS URIs (#918) into BIDS specification (aiming for 1.8.0 release) I really want to make a case for this issue to be given yet another consideration:

  • BIDS URIs PR introduced a "breaking" new feature, while deprecating support for prior paths-based IntendedFor and requiring everyone to adjust to use bids:: in those cases. Operation of any tool now which uses IntendedFor and which was created for BIDS prior future 1.8.0 release would break now on a first BIDS 1.8.0 compliant dataset. As to me, that PR establishes a precedence to appeal for this issue/request since we are running into such use cases with quite some regularity and for which we have no good workaround besides "use X instead" or "come up with a shorter oneWord one" (hard it label is just numbers).
  • this is one of the oldest feature requests. If it was filed OVER 3 years ago. If it was timely handled back then, we would not even be speaking over this issue ever again.
  • It is good to have this repository with possible items toward BIDS v2 , but AFAIK there is no clear roadmap/timeline toward v2, so I do not see this issue being addressed in immediate future.
  • per discussion/analyses above allowing for [-+] in the <label> would not even guaranteed to break any tool right away: tool might be just as fine, users would need to come to make use of this "new" functionality (unlike with BIDS URIs - they have to switch to use them, so causing tools to break "earlier")
  • unlike BIDS URIs, adding support for having [-+] in the label should be trivial for any tool/language. And hopefully soon, as more tools use our BIDS schema, fix for this schema would be "automagically adopted".

Now I will really make an effort to file a PR but might need to prep few "preparatory" PRs first.

@effigies effigies transferred this issue from bids-standard/bids-2-devel Jul 26, 2022
@effigies
Copy link
Collaborator

I moved this back. Sorry, not sure exactly why it got moved; perhaps it seemed abandoned, or was lumped in with the BIDS-2 proposal linked above because of the - suggestion.

Reading through the thread, I think there's basically:

  • Rough support (as in some support but no opposition) for +
  • Opposition to - for having multiple meanings
  • Opposition to UTF-8 for deceptive characters

So I think a reasonable proposal is to change the regex for label to [a-zA-Z0-9+], and associated textual changes. It does not seem worth the time to write for both + and -, given that there's known opposition and - was specifically tabled before.

@TheChymera
Copy link
Collaborator

TheChymera commented Jul 26, 2022

While I have no objection to + other than that it looks ugly, - I think is a very bad idea. I think the main issue is not high-level tools such as validators, but basic tools such as user bash scripts and path legibility at a glance. Minus is a special character, and that is good. I would make an analogy to periods in file names, using periods to delimit words in a filename is a bad idea.

The genealogy of this request is also indicative of problems which BIDS could help solve by encouraging better data practices. This --desire generally crops up because operators will put the experiment name, animal name, animal genotype, session date, etc. in the name field to keep it all in one place. That is a poorly annotated experiment and not a string which is just “too fancy” for BIDS. So I think there is something to be said for explicitly not supporting minuses. If you need a minus you're trying to cram 2 concepts in a field, pretty much by definition.

@yarikoptic
Copy link
Collaborator Author

... using periods to delimit words in a filename is a bad idea

and thus we don't use periods in the middle of the filenames (although could), but we do use .tar.gz and others to compose extensions together. In other words, "allowed != always a good idea". But on contrary "disallowed == need to come up with uglier workaround" (.tgz?).

Although I agree that quite often desire to place - within a label as indicator of "underthinking" about separating metadata, it is not always the case and some cases do warrant "multi-word" label, e.g. within _acq- field to indicate e.g. different SMS and GRAPPA factors etc for which there is no (and never would be due to too small of a niche) dedicated entity. Situation becomes even more fun for _desc- in derivative data to provide some meaningful description.

Another analogy -- BIDS doesn't restrict the length of the <label> but nobody does super long labels. The same sanity I expect to be applied to use of - by the user.

@oesteban
Copy link
Collaborator

oesteban commented Jul 26, 2022

Although I agree that quite often desire to place - within a label as indicator of "underthinking" about separating metadata, it is not always the case and some cases do warrant "multi-word" label, e.g. within _acq- field to indicate e.g. different SMS and GRAPPA factors etc for which there is no (and never would be due to too small of a niche) dedicated entity. Situation becomes even more fun for _desc- in derivative data to provide some meaningful description.

Are we not talking about the 20% in the 80/20 principle? How many datasets would need such flexibility to specify acq-? This means, all other entities are the same so you can't disambiguate ?

I tend to agree with Chris' assessment, and, as I mentioned before, once + is allowed, it could be given tokenization semantics. That would turn - redundant as its sole proposed purpose is label tokenization.

@neurorepro
Copy link

I agree with the usefulness of +, is there any update on this issue ?

@effigies
Copy link
Collaborator

effigies commented May 16, 2023

No update. Rereading this, I think my assessment remains the same as before: We can do this. Somebody needs to make a concrete proposal in the form of a PR. I have some new questions that I would like to see addressed in a proposal (possibly they came up and I skimmed too briefly):

  1. Can all labels include +, or should we have labels ([a-zA-Z0-9]+) and multi-labels ([a-zA-Z0-9+]+)?
  2. Is + allowed in index ([0-9]+) or just label?
  3. Can + be the first or last character of a label(/index)? (If no, the regexes will get much uglier.)
  4. Is this generally available or only in derivatives?
  5. Do we define any semantics? e.g., if you combine run-01 and run-02, is run-01+02 now acceptable? Previously derivatives said to remove the run entity. In the other direction, does run-01+02 imply any relationship to run-01 or run-02 per the standard? This will have implications on tooling: if I query layout.get(run=1), should I get run-01 and run-01+02?
  6. What does this mean for suffixes? These aren't "entities", so we could say this is just labels.

I would suggest the following answers, but I'm not 100% set:

  1. Split labels and multi-labels. Subject shouldn't have a multi-label IMO. Possibly others shouldn't as well.
  2. Just label. Indexes have numerical meaning and a list of numbers is less clear how to work with.
  3. Weak no. I could imagine a label like B1+ might be useful enough to justify permitting it.
  4. General.
  5. No semantics. It's just another character in a label for human readability. Tooling should not count on consumers to interpret multi-labels as compositions of discrete meanings. If detailed metadata needs to be communicated, it should be in JSON.
  6. Leave suffixes as a controlled vocabulary. On a case-by-case basis a suffix with a + can be considered.

@neurorepro
Copy link

Thanks @effigies , I'm happy to help on the PR

@yarikoptic
Copy link
Collaborator Author

Are you @effigies to introduce some new semantic behind milti-label - Ie would have some meaning beyond just being a label?

@effigies
Copy link
Collaborator

Hmm. Not intentionally (see 5). Possibly that argues against my position 1. I do think the end result should be logically consistent, however these questions are answered.

@TheChymera
Copy link
Collaborator

TheChymera commented May 22, 2023

The same sanity I expect to be applied to use of - by the user.

Ambitious hope. I think the lack of minus support is usually the point where people are introduced to metadata separation. I'm still explaining to one of our colleagues why the name of his mouse is not part of the session name.

I continue to think this makes BIDS less conducive to good metadata practices. If you really really want to allow something like this, maybe it should be made sufficiently prohibitive so as to give the user pause as to whether it's worth it. We could support some other character. Perhaps the cdot, , it's actually very aesthetic and distinctive from the minus (and very obviously not a period either). This is similar to an older suggestion by @robertoostenveld , except the cdot is visually distinct enough from the dash/minus family (em-dash , en-dash , and minus -) to not cause visual confusion.

The prohibitiveness of comes from most people not knowing where it is on their keyboards.

@oesteban
Copy link
Collaborator

@effigies could you confirm that here #1165 (comment) you are referring to only + ? I believe one way of pushing this forward faster is dropping - from proposals now, see + added and if someone is really interested in also permitting - then iterate on a new PR.

Another weak idea regarding 3 (and trying to get regexes even uglier) is: should two + be allowed? Something like: acq-O++O.

Overall, I totally agree with @effigies' assessment, but also agree with @yarikoptic on why this should get added semantics (e.g., allow for subject labels).

@effigies
Copy link
Collaborator

Yes, just +. Let's consider - as dropped from this proposal.

@yarikoptic
Copy link
Collaborator Author

Given the absent concrete action (from me) on this, I am thinking that we should bundle this one into BIDS 2.0. Hence I will announce this closed and let's continue on

@tsalo
Copy link
Member

tsalo commented Sep 19, 2024

The BIDS maintainers have discussed this issue and I think we've converged on allowing +, but not -.

@tsalo tsalo linked a pull request Sep 20, 2024 that will close this issue
@yarikoptic
Copy link
Collaborator Author

FWIW, as #1926 shows, change is "substantial" enough to also allow for -, I would even might have postponed it until BIDS 2.0 frankly: any BIDS compliant tool might need to adjust the way they parse filenames, especially if they do not use schema yet. IMHO - is dominantly already used in various use cases for separation of words in the labels, and + is more of "something which is not -" but without really a big reason: I think BIDS should encourage consistent parsing of file names based on simple algorithm of "split by _ into entities + suffix; then split each entity into name and the value by splitting on the first -". That would make all tools robust, and it would make them work with any regex we adopt as allowed for the BIDS: tools IMHO should not rely on any specific regex, it should be just a job of the validator to ensure compliant values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants