Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

acquisition date anonymization #538

Open
guiomar opened this issue Jul 23, 2020 · 27 comments
Open

acquisition date anonymization #538

guiomar opened this issue Jul 23, 2020 · 27 comments

Comments

@guiomar
Copy link
Collaborator

guiomar commented Jul 23, 2020

Hi!

After discussion at:
mne-tools/mne-bids-pipeline#160

From the BIDS specs:

"For anonymization purposes all dates within one subject should be shifted by a randomly chosen (but consistent across all runs etc.) number of days. This way relative timing would be preserved, but chances of identifying a person based on the date and time of their scan would be decreased. Dates that are shifted for anonymization purposes should be set to a year 1925 or earlier to clearly distinguish them from unmodified data. Shifting dates is RECOMMENDED, but not required."

Summarizing:

  • specs recommend to anonymize the acquisition date and to distinguish it from unmodified data, set a year 1925 or earlier
  • this date was initially set to 1900, however because FIF files doesn't accept such old dates (int-32 issue), the spec was changed to be year "1925 or earlier".

Since currently the spec remains a bit open, and people could still use earlier than 1925 dates, we should at least warn in the specs, that this may produce errors when working with FIF files (e.g. for M/EEG).

Alternatively, talking to Marc he suggested to use an additional field which indicate wether a dated was randomized. This will potentially end these discussions around the acq dates issue. I think this is an elegant solution. I wonder if there is a reason why it has not been already implemented?

The possible solutions I see are:

  1. Impose 1925 (remove "or earlier" from the specs)
  2. Add a warning on the specs to indicate that earlier dates may crash FIF files when working with M/EEG
  3. Add an additional field to indicate whether a date was randomized and keep usual acq date ranges

What do you think?

@hoechenberger @jasmainak @agramfort @sappelhoff @alexrockhill @robertoostenveld @Moo-Marc

@Moo-Marc
Copy link
Contributor

As Guio mentioned, my suggestion is to remove the specific date recommendation and simply add a flag like "AnonymizedAcqTime = true" in dataset_description.json.

@jasmainak
Copy link
Collaborator

yes, I think that's a nice solution. I think it's also worth mentioning that fif files do form a big market in the MEG community (not sure what percentage), not just a format that needs to be fixed. I believe there are hurdles for that to happen any time soon.

@sappelhoff
Copy link
Member

simply add a flag like "AnonymizedAcqTime = true" in dataset_description.json.

Can somebody explain in detail how that would help, please?

@jasmainak
Copy link
Collaborator

jasmainak commented Jul 23, 2020

Can somebody explain in detail how that would help, please?

The problem is fif files cannot have dates before 1901. If you remove the restriction of 1925, it makes it much easier to anonymize. You only need to anonymize up to a year rather than figure out a crazy big number that is not too big.

@Moo-Marc
Copy link
Contributor

You no longer have to look at the year to know if it's a real date or not. And it removes the unnecessary restriction. You can now have a real or anonymized date of 2000-01-01 for example.

@sappelhoff
Copy link
Member

sappelhoff commented Jul 24, 2020

Ahhh okay thanks! Let me rephrase:

  • Right now BIDS implicitly communicates that dates are anonymized by enforcing ridiculously early dates (prior 1925)
  • With an additional field anonymized in dataset_description.json, we can be explicit about anonymization and relax the "prior 1925" rule

Sounds good to me in principle. Let's collect more opinions and if we converge, we can do a small PR!

@effigies
Copy link
Collaborator

tl;dr This discussion seems to have focused on a technical solution to a communication problem. Of the three options presented by @guiomar, only # 2 (Add a warning on the specs to indicate that earlier dates may crash FIF files when working with M/EEG) seems to address the problem.


So, if I'm reading the original thread correctly, the problem is that a dataset might anonymize to a pre-1901 date. If the data needs to be converted to FIF (or FIF-encoded M/EEG data added in a later session), then it may be impossible to correctly encode dates across the dataset without changing all of the dates.

If this is correct, the solution does not seem to match the problem. Asserting anonymization occurred doesn't modify the need to shift the anonymization dates post-1901 if there's any risk a dataset or its derivatives might include FIF data. And for datasets yet to be collected or curated, the thing that needs to happen is that curators need to encode dates post-1901, regardless of any anonymization assertion flag.

It has always been possible to shift dates into some other range, and anonymization is simply something the dataset user has to trust. The idea with the pre-1925 dates is that it is impossible to have collected the data on those dates, and therefore it is unambiguous and directly validatable. Putting AnonymizedAcqTime or similar into the dataset_description.json would simply be encoding "trust me". This doesn't have obvious value to me.

@jasmainak
Copy link
Collaborator

jasmainak commented Jul 27, 2020

Thanks for your thoughts @effigies !

I disagree that # 2 is necessarily the best solution. This is because it assumes that everyone who works with BIDS data reads the specification. More often than not, people use some software for anonymization, thus keeping this weird 1925 date means every MEG software that does anonymization for the purposes of sharing with the BIDS standard will have to add warnings and checks to prevent users from using pre-1901 dates.

Note that HIPAA does not require year to be anonymized and for this reason many packages implement the anonymization in terms of number of days to shift rather than via a specific date. This in turn means that we have to calculate the number of days until 1925, or alternatively change how anonymization is done. All this seems too much of a hassle when we should be in fact making it easier for folks to share data.

My final argument is that group studies can have a range of dates and it may be necessary to keep the relative times. In that case, one has to figure out the number of days that takes you back up to 1925 but not further than 1901 for each recording and then take their minimum/maximum.

Note that dates are not the only thing one has to anonymize, thus we still have to go the "trust me" route even with the 1925 criteria.

@Moo-Marc
Copy link
Contributor

Putting AnonymizedAcqTime or similar into the dataset_description.json would simply be encoding "trust me".

Anonymization does not imply shifting dates, it is not required by all ethics boards, and it is only recommended in BIDS. So it has value to me to specify this. By your logic, having the year shifted very early would also be of no value.

Of course, you are correct that to directly address the issue, a warning could be added wherever we talk about shifting dates.

@sappelhoff
Copy link
Member

I pushed a commit in #546 that should take care of a proper "warning".

Please review

@Moo-Marc
Copy link
Contributor

Moo-Marc commented Aug 5, 2020

While the warning is taken care of, any objection to going ahead with an explicit flag instead of an early date recommendation?
Should it be REQUIRED when dates are shifted? (seems logical to me, but also not essential)
Any better name suggestions? Not sure "AnonymizedAcqTime" is that clear. "ShiftedDates" or "DateShifting" maybe?

@robertoostenveld
Copy link
Collaborator

Both the name of the "key" in the JSON but also the potential values need to be considered. I see multiple options:

  • dates are correct, times are correct (i.e. matching the truth)
  • dates are modified, times are correct (to know the time-of-day, which may be relevant for certain studies, e.g. fatigue)
  • dates are correct, times are modified (seasonal information may be relevant for certain studies, e.g. mood disorders)
  • both dates and times have been modified

There are multiple ways that the date and/or time can be modified:

  • replaced by a date and/or time from the past, e.g. 1925-01-01T00:00:00
  • shifted with a certain amount, such that relative differences in multiple scans/sessions from the same participant are still correct
  • shifted with a certain amount, such that relative differences in scans/sessions from different participants are still correct

I don't think that these cases allow themselves to be described properly with a single key-value pair like AnonymizedAcqTime=true/false

@robertoostenveld
Copy link
Collaborator

Oh, and I don't think that dataset_description.json is the right place to store this. According to https://bids-specification.readthedocs.io/en/stable/03-modality-agnostic-files.html#scans-file the acq_time information goes into scans.tsv, which means that this should be documented in a corresponding scans.json data dictionary.

There can be a separate scans.json for each subject (theoretically allowing for the acquisition time of some subjects to be anonymized, while that of others not) or using the inheritance principle there can be a scans.json at the top level.

@Moo-Marc
Copy link
Contributor

Moo-Marc commented Aug 6, 2020

There are multiple ways that the date and/or time can be modified

Sure, but I wouldn't go into too much detail. I think for most cases a simple note that the dates are shifted will alert the users not to conclude too much from them. I see little use in knowing date vs time details for example. Of course, within subject relative timing can be important as is already noted in the spec. We could have a few values:
AcqTimeShift: None, Global, PerSubject, PerScan.
I guess hyperscanning is one case where shifting globally might be preferred.

Regarding where to put the key, can it be in scans.json? I thought those files can only contain a description of the columns. It could be mentioned in the acq_time Description field, but that wouldn't be standardized. Is it possible to have the following?

"acq_time" : {
	"Description" : "Date/time of the scan",
	"Shift" : "PerSubject"
},

If so, I like that solution.

@hoechenberger
Copy link
Collaborator

@robertoostenveld

There can be a separate scans.json for each subject (theoretically allowing for the acquisition time of some subjects to be anonymized, while that of others not) or using the inheritance principle there can be a scans.json at the top level.

I very much like this idea. This would also enable us to lift the "pre-1925" limitation, as by looking at scans.json it will become clear whether the acq_time in scans.tsv was anonymized in some way or not.

@Moo-Marc

We could have a few values:
AcqTimeShift: None, Global, PerSubject, PerScan.

I'm a little confused. What would "Global" imply, and why do you think it would only be relevant in hyperscanning?
What would be the difference between PerSubject and PerScan?

Regarding where to put the key, can it be in scans.json? I thought those files can only contain a description of the columns. It could be mentioned in the acq_time Description field, but that wouldn't be standardized. Is it possible to have the following?

The current spec says:

Note however, when a JSON file is used as an accompanying sidecar file for a TSV file, the keys linking a TSV column with their description in the JSON file need to follow the exact formatting as in the TSV file.

Therefore,

Is it possible to have the following?

"acq_time" : {
	"Description" : "Date/time of the scan",
	"Shift" : "PerSubject"
},

this should very well be possible.

I like this solution very much (albeit I still need to understand the meaning of the value @Moo-Marc proposed!), and adopting the spec accordingly would still be backward-compatible with existing anonymized datasets (i.e. if scans.json doesn't exist, one could check whether acq_time is pre-1925 to conclude the dataset is anonymized).

What do you all think about this? If we can reach a consensus, I would start drafting a proposal for a BIDS standard amendment.

@robertoostenveld
Copy link
Collaborator

You can include additional columns in the scans.tsv as I believe is the case in every other BIDS tsv file. From the specification:

Additional fields can include ...

where you should interpret "fields" as "columns". Column headings must be in snake_case (whereas in the json files BIDS uses CamelCase) and the columns can be explained in more detail in a json file that accompanies the tsv.

So the _scans.tsv could be

filename                                          acq_time            acq_time_shift
anat/sub-04_ses-01_T1w.nii.gz                     1800-05-21T11:18:59 true
func/sub-04_ses-01_task-nback_run-01_bold.nii.gz  1800-05-21T11:23:59 true
func/sub-04_ses-01_task-nback_run-02_bold.nii.gz  1800-05-21T11:38:59 true
func/sub-04_ses-01_task-rest_bold.nii.gz          1800-05-21T11:53:59 true

and the corresponding _scans.json

{
  "filename": {
      "Description": "filename corresponding to the scan"
  },
  "acq_time": {
      "Description": "time of acquisition of the scan, formatted according to RFC3339"
  },
  "acq_time_shift": {
      "Description": "whether the acquisition time has been shifted for anonymization purposes",
      "Levels": {
        "true",
        "false"
    }
  } 
}

Although you could also drop the acq_time_shift column from the tsv and document the columns in _scans.json like this

{
  "filename": {
      "Description": "filename corresponding to the scan"
  },
  "acq_time": {
      "Description": "date and time of acquisition, formatted according to RFC3339 and shifted for anonymization purposes"
  }
}

Using the inheritance principle, you could choose to put a single _scans.json at the top level.

@hoechenberger
Copy link
Collaborator

@robertoostenveld

So the _scans.tsv could be

filename                                          acq_time            acq_time_shift
anat/sub-04_ses-01_T1w.nii.gz                     1800-05-21T11:18:59 true
func/sub-04_ses-01_task-nback_run-01_bold.nii.gz  1800-05-21T11:23:59 true
func/sub-04_ses-01_task-nback_run-02_bold.nii.gz  1800-05-21T11:38:59 true
func/sub-04_ses-01_task-rest_bold.nii.gz          1800-05-21T11:53:59 true

and the corresponding _scans.json

{
  "filename": {
      "Description": "filename corresponding to the scan"
  },
  "acq_time": {
      "Description": "time of acquisition of the scan, formatted according to RFC3339"
  },
  "acq_time_shift": {
      "Description": "whether the acquisition time has been shifted for anonymization purposes",
      "Levels": {
        "true",
        "false"
    }
  } 
}

I like this. 👌

@agramfort
Copy link
Collaborator

solution via _scans.tsv is ok for me. 2 remarks:

  • it really means that we "trust" the authors of the data to have actually anonymized if they scans say so. It means it becomes impossible to guarantee from the files
  • the time should be possibly up to microsecond precision.

@sappelhoff
Copy link
Member

the time should be possibly up to microsecond precision.

that will be possible starting with the next release, see latest spec:

Date time information MUST be expressed in the following format YYYY-MM-DDThh:mm:ss[.000000][Z] (year, month, day, hour (24h), minute, second, optional fractional seconds, and optional UTC time indicator). This is almost equivalent to the RFC3339 "date-time" format, with the exception that UTC indicator Z is optional and non-zero UTC offsets are not indicated. If Z is not indicated, time zone is always assumed to be the local time of the dataset viewer. No specific precision is required for fractional seconds, but the precision SHOULD be consistent across the dataset. For example 2009-06-15T13:45:30.

The validator has already been adjusted

@hoechenberger
Copy link
Collaborator

hoechenberger commented Aug 31, 2020

@agramfort

  • it really means that we "trust" the authors of the data to have actually anonymized if they scans say so. It means it becomes impossible to guarantee from the files

The current specs say:

Dates that are shifted for anonymization purposes should be set to a year 1925 or earlier to clearly distinguish them from unmodified data. Shifting dates is RECOMMENDED, but not required.

So it's already only a SHOULD, not a MUST. I believe it's only an implementation detail in MNE-BIDS that we enforce pre-1925 dates for anonymization.

Therefore, I believe we already have to trust users to do the right thing 👍 and the changes proposed here would actually improve the situation.

I think we could just keep this sentence, capitalize "SHOULD", and your issue would be addressed, right?

Dates that are shifted for anonymization purposes SHOULD be set to a year 1925 or earlier ...

@agramfort
Copy link
Collaborator

agramfort commented Aug 31, 2020 via email

sappelhoff added a commit to sappelhoff/bids-specification that referenced this issue Aug 31, 2020
@sappelhoff
Copy link
Member

I think we could just keep this sentence, capitalize "SHOULD", and your issue would be addressed, right?

see: 9b1ae95

@hoechenberger
Copy link
Collaborator

see: 9b1ae95

Great, thanks!

@Moo-Marc
Copy link
Contributor

We could have a few values:
AcqTimeShift: None, Global, PerSubject, PerScan.

I'm a little confused. What would "Global" imply, and why do you think it would only be relevant in hyperscanning?
What would be the difference between PerSubject and PerScan?

@hoechenberger These were meant to refer to the shift value itself, not simply whether they are shifted or not. Global means the shift is the same throughout the dataset, per subject means the same shift is applied within a subject, per scan means the shift is different for each file. So for one global constant shift, I see this as being useful only in hyperscanning where you may want to be able to confirm which participants were scanned together. In any other case, I don't see the use of knowing the relative dates between participants.

While the extra column idea of @robertoostenveld works, I think most cases won't have a mix of anonymized and non-anonymized scans. I therefore prefer a way to specify it globally, and something standardized, not just written in the acq_time field description.

@adam2392
Copy link
Member

adam2392 commented Jan 4, 2021

So the _scans.tsv could be

filename                                          acq_time            acq_time_shift
anat/sub-04_ses-01_T1w.nii.gz                     1800-05-21T11:18:59 true
func/sub-04_ses-01_task-nback_run-01_bold.nii.gz  1800-05-21T11:23:59 true
func/sub-04_ses-01_task-nback_run-02_bold.nii.gz  1800-05-21T11:38:59 true
func/sub-04_ses-01_task-rest_bold.nii.gz          1800-05-21T11:53:59 true

and the corresponding _scans.json

{
  "filename": {
      "Description": "filename corresponding to the scan"
  },
  "acq_time": {
      "Description": "time of acquisition of the scan, formatted according to RFC3339"
  },
  "acq_time_shift": {
      "Description": "whether the acquisition time has been shifted for anonymization purposes",
      "Levels": {
        "true",
        "false"
    }
  } 
}

Adding to the chorus here that a solution along these lines is desirable.

Just FYI there is an issue with the 1925 also in EDF (not just FIF). Raised in #698, this issue also came up with EDF files, where the file format itself prevents anyone from shifting dates to < 1985.

Link here related chat with Teuniz (maintainer of EDFBrowser): https://gitlab.com/Teuniz/EDFbrowser/-/issues/26https://gitlab.com/Teuniz/EDFbrowser/-/issues/26

@guiomar
Copy link
Collaborator Author

guiomar commented Feb 4, 2021

Hi everyone!

Shall we create a PR to deal with this following the changes suggested by @robertoostenveld ?

I also like a lot the values suggested by @Moo-Marc, and they could be easily integrated in the suggestion

None, Global, PerSubject, PerScan.

@Moo-Marc
Copy link
Contributor

Moo-Marc commented Mar 2, 2021

I second that though unfortunately I wouldn't have the time to do it soon.
I much prefer being explicit about having shifted dates. And while the suggestion is already BIDS compliant (I think), it would be best to standardize the name of the added column in _scans.tsv and its possible values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants