ENH: Split ICA into multiple steps #865
Conversation
mne_bids_pipeline/_run.py
Outdated
# We need to make the logs more compact to be able to write Excel format
# (32767 char limit)
I don't understand this, how does compressing the logs help with writing the (then-uncompressed) data to Excel?
Xlsx has a 32767 char limit for each cell. The cell data for entries in what was formerly the cfg column in the spreadsheet gets compressed first at the JSON level (removing indentation and unnecessary whitespace) and then with zlib. This compressed data is now written in a cfg_zlib column instead.
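For illustration only, a minimal sketch of the idea (not the pipeline's actual code; the function names and the base64 step are assumptions) showing how compact JSON plus zlib could keep the data under the per-cell limit:

```python
import base64
import json
import zlib

XLSX_CELL_LIMIT = 32767  # hard per-cell character limit of the xlsx format


def compress_cfg(cfg: dict) -> str:
    """Return a compact, cell-safe string for a config dictionary (hypothetical helper)."""
    # Compact JSON: no indentation, minimal separators
    compact = json.dumps(cfg, separators=(",", ":"), sort_keys=True)
    # zlib-compress, then base64-encode so the bytes survive as spreadsheet text
    cell = base64.b64encode(zlib.compress(compact.encode("utf-8"))).decode("ascii")
    assert len(cell) <= XLSX_CELL_LIMIT, "still too large for a single cell"
    return cell


def decompress_cfg(cell: str) -> dict:
    """Invert compress_cfg to recover the original config dictionary."""
    return json.loads(zlib.decompress(base64.b64decode(cell)).decode("utf-8"))
```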
does that mean that it's not human-readable anymore?
Co-authored-by: Richard Höchenberger <richard.hoechenberger@gmail.com>
mne_bids_pipeline/_run.py
Outdated
# We need to make the logs more compact to be able to write Excel format
# (32767 char limit per cell), in particular the "cfg" column has very large
# cells, so replace the "cfg" column with a "cfg_zip" column (parentheticals
# contain real-world example numbers used to illustrate the compression):
@hoechenberger better?
it helps my understanding, thanks! but does that mean that this column isn't human-readable anymore?
Yep, no longer human readable. On main, part of it is human readable, but the vast majority of it is just completely missing!
... at least that's the case for the ERP_CORE dataset I was looking at, which is probably a worst case. There are probably some steps for which compression is not necessary. But it's hard to know which will be which.
I think a big part of it might have been the support for a custom Montage -- the json_tricks dump of that was super long.
Could we, perhaps, create a new sheet and write one line per row (easy) or one config setting per row (more difficult), instead? This would surely keep us below the character limit per cell and retain readability, as we wouldn't need to use compression.
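A hypothetical sketch of that "one config setting per row" idea (the sheet name, engine, and example settings are made up; this is not what the pipeline does):

```python
import json
import pandas as pd

# Example settings; values are JSON-encoded so any type fits in a text cell
cfg = {"ch_types": ["eeg"], "decim": 4, "ica_n_components": 0.8}
cfg_df = pd.DataFrame(
    {"setting": list(cfg), "value": [json.dumps(v) for v in cfg.values()]}
)

# Write the config to its own sheet, one setting per row, so no cell gets huge
with pd.ExcelWriter("logs.xlsx", engine="openpyxl") as writer:
    cfg_df.to_excel(writer, sheet_name="config", index=False)
```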
So it looks like right now in main:

- Each row in the spreadsheet is a log from a (usually parallel) function call
- Each col is an entry like exec_params, subject, session, etc.
- Each sheet is a pipeline step

So for example (with my changes here)

So I'm not quite sure how to map the config -- which could change from step to step and also potentially call to call -- to new sheet(s).

Maybe instead it could be N new cfg.<attr> columns for the N config attributes for the given step? Presumably these would nearly (but not quite necessarily I think!) match across rows (runs/subjects/whatever), and they would differ quite a bit across sheets (pipeline steps). This would require un-serializing the JSON, getting a list of attributes, and mapping these into the Series, but that's not too bad I think. It makes the SimpleNamespace object much harder to un-serialize, but the params (much) more human readable I think.
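A rough sketch of what those N cfg.<attr> columns could look like (the pandas calls and column names here are assumptions for illustration, not the pipeline's code):

```python
import json
import pandas as pd

# One row per (usually parallel) function call, with the per-step config as a JSON string
rows = [
    {"subject": "01", "session": "a", "cfg": json.dumps({"ch_types": ["eeg"], "decim": 4})},
    {"subject": "02", "session": "a", "cfg": json.dumps({"ch_types": ["eeg"], "decim": 4})},
]
df = pd.DataFrame(rows)

# Un-serialize the JSON and fan each attribute out into its own "cfg.<attr>" column
cfg_cols = pd.json_normalize(df["cfg"].map(json.loads).tolist()).add_prefix("cfg.")
df = pd.concat([df.drop(columns="cfg"), cfg_cols], axis=1)
print(df.columns.tolist())  # ['subject', 'session', 'cfg.ch_types', 'cfg.decim']
```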
Thanks for looking into this!
I'm questioning whether we need this level of "differentiability" across participants, sessions, tasks, and steps.
I claim that currently, the only really reliable and "sound" way to use the pipeline is by processing all data with the same configuration. If different settings are supposed to be used for different sessions or tasks, a new, separate derivatives folder should be created.
If we follow this premise, we could create a single Excel sheet where one row corresponds to one line or one setting in the config file that was used.
WDYT?
One thing to keep in mind is that the whole config isn't currently available at this level. This isn't everything from config.py -- it's only the stuff used by the current step. And it can include things like which runs are being used (which could in principle vary by participant someday). So it isn't different config.py files that make this differ; rather, given a static config.py, the set of params (and even potentially stuff like which runs are used for a given subject, someday maybe?) could differ.

So yes, we could in principle do what you want, but it's different information from what's in there now.
Got you. In that case, let's maybe keep it the way it is right now.
Before merging …
(docs/source/changes.md)

Closes #864
Closes #861
Closes #857
Closes #804
Sample output
- Splits ICA into: 1) fitting + epoch creation, 2) detection of ecg/eog, 3) applying. To do this, two ICAs (and two reports) now get saved: one for the original fit (proc="icafit"), and a second one for after the automated detection. (A rough illustrative sketch of this split follows at the end of this list.)
- Updates our docs with some stuff about caching (figured I'd take a stab at it while updating docs for ICA).
- Pretty sure this closes "Logger reports subject, session, run even if not requested" #804 by cleaning up run/session etc. a bit, but if you think it's related to stuff other than ICA feel free to remove that from the closes lines above @hoechenberger.
- Fixes a bug where we put _components.tsv in the out_files dict. Our caching code looks to make sure that not just the input files but also the output files have the expected hashes. Since it's expected that users will modify this file, it should not be in out_files in the _06 fit/detect step(s), otherwise it will be a cache miss when users modify the file, and the step will re-run and overwrite their changes (!), e.g. on main:

  on this PR:
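As referenced in the first bullet, here is a rough MNE-Python sketch of the fit / detect / apply split (illustrative only: the file names, proc entities, and parameters are made up, it assumes the recording has ECG/EOG channels, and it is not the pipeline's actual code):

```python
import mne

raw = mne.io.read_raw_fif("sub-01_task-rest_raw.fif", preload=True)
raw.filter(l_freq=1.0, h_freq=None)  # high-pass before ICA fitting

# Step 1: fit ICA and save the initial result (the first of the two saved ICAs)
ica = mne.preprocessing.ICA(n_components=0.8, random_state=42)
ica.fit(raw)
ica.save("sub-01_task-rest_proc-icafit_ica.fif", overwrite=True)

# Step 2: automated ECG/EOG detection; save the updated ICA as a second file
ica = mne.preprocessing.read_ica("sub-01_task-rest_proc-icafit_ica.fif")
ecg_inds, _ = ica.find_bads_ecg(raw)
eog_inds, _ = ica.find_bads_eog(raw)
ica.exclude = sorted(set(ecg_inds + eog_inds))
ica.save("sub-01_task-rest_proc-ica_ica.fif", overwrite=True)

# Step 3: apply the ICA (with possibly user-edited exclusions) to the data
ica = mne.preprocessing.read_ica("sub-01_task-rest_proc-ica_ica.fif")
raw_clean = ica.apply(raw.copy())
```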