Adds ParallelGroupAFQ #1124
Conversation
Hello @teresamg! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2024-05-15 23:45:53 UTC
This is a really neat implementation now! My main suggestion is to change ParticipantAFQ so that the names of kwargs and attributes are well-matched; that would let you make this even neater.
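For illustration, a minimal sketch of what well-matched names would buy us; the attribute and kwarg names below are hypothetical, not the current ParticipantAFQ API:

```python
from AFQ.api.participant import ParticipantAFQ

# Hypothetical: if every ParticipantAFQ constructor kwarg is stored under an
# attribute of the same name, a per-subject object can be rebuilt (e.g. inside
# a worker process) with no manual name translation.
PARTICIPANT_KWARG_NAMES = ["bval_file", "bvec_file", "output_dir"]  # assumed subset

def rebuild_participant_afq(pafq):
    kwargs = {name: getattr(pafq, name) for name in PARTICIPANT_KWARG_NAMES}
    # dwi_data_file is also assumed to be stored under its kwarg name.
    return ParticipantAFQ(pafq.dwi_data_file, **kwargs)
```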
The code looks quite neat now! @36000: could you take a quick look when you get a chance? Any ideas about how we can test this? I guess that we'd have to set this up as a nightly test, possibly running this parallelized with concurrent futures across two HBN subjects?
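Something along these lines might work for a nightly test. This is only a sketch: it assumes ParallelGroupAFQ mirrors the GroupAFQ constructor, that `bids_path` points at a small preprocessed BIDS dataset with two HBN subjects already on disk, that the backend is selected via a `parallel_params`-style dict, and that the concurrent-futures run blocks until all subjects finish; the kwarg names and output path are assumptions.

```python
import os.path as op

from AFQ.api.group import ParallelGroupAFQ  # assuming it lives alongside GroupAFQ

def test_parallel_group_afq_nightly(bids_path):
    # Run the full pipeline for both subjects in parallel with the
    # concurrent-futures backend (kwarg names are assumptions).
    pafq = ParallelGroupAFQ(bids_path, parallel_params={"engine": "cf"})
    pafq.export_all()
    # The group-level merge should leave a combined profiles CSV in the
    # AFQ derivatives folder (exact path assumed).
    assert op.exists(op.join(
        bids_path, "derivatives", "afq", "tract_profiles.csv"))
```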
My thoughts; I will implement these:
I made some minor changes here. I made pydra a required library for pyAFQ (it's pip-installable with few dependencies, so this will just be easier for most users). We now try to catch that error Ariel mentioned (let me know if this doesn't work). And I added an
I wrote this test:
and I am getting this error:
Do either of you know what causes this?
Could you please push that test to this PR? I'm not sure what's up and would like to try to debug locally.
Sure, I figured out this was due to me using a different pydra version (the latest, 0.23). But now I am getting pickling problems. I will push some of the changes I have made, but I think maybe we will have to talk about this in person at some point. I am running into a few issues.
@teresamg mind trying this to see if it works?
Looks like it ran successfully, both locally and on Hyak!
Did it generate the combined tract profiles file? I also ran a test on hyak:
which worked great at the individual subject level, but I can't find the final combined tract profiles file, which I was expecting to find.
It did for me... does each subject have the *_desc-profiles_dwi.csv?
**pAFQ_kwargs.kwargs)
pAFQ.export_all(viz, xforms, indiv)

for dir in finishing_params["output_dirs"]:
I am a little confused about what these lines here are trying to do.
The idea behind this is that we want to generate a group-level csv file that contains the merged tract profiles. But we only want to do this once, so each task checks whether all of the other tasks have completed. If they have not, it returns and nothing happens. But if it is the last task running, the condition is always fulfilled and the code continues to lines 1124-1126, which run one last GroupAFQ, which would generate the combined tract profile. I worry that this is not super robust, though, and could potentially run into some funky race conditions. In my own test on hyak, I did not get the combined tract profile. So, maybe we need to do something a bit more direct to create the group-level derivatives.
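Roughly, the pattern being described is the following (a sketch, not the exact PR code; `finishing_params` and its keys mirror the names in the diff above, but the helper name, the stored kwargs, and the final export call are assumptions):

```python
import os.path as op
from glob import glob

from AFQ.api.group import GroupAFQ

def _maybe_export_group_profiles(finishing_params):
    for out_dir in finishing_params["output_dirs"]:
        # If any subject is still missing its per-subject profiles CSV,
        # another task will be the last one to finish; just return.
        if not glob(op.join(out_dir, "*_desc-profiles_dwi.csv")):
            return
    # All per-subject profiles exist, so this is (effectively) the last task:
    # re-create the GroupAFQ object and export, which produces the combined
    # tract_profiles.csv (the minimal export call needed is an assumption).
    gafq = GroupAFQ(**finishing_params["kwargs"])
    gafq.export_all()
```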
Line 1120 checks whether the process is the last one running by trawling for each subject's *_desc-profiles_dwi.csv. If all are present, it exports the GroupAFQ object to create the final tract_profiles.csv.
OK that makes sense
I wonder if pydra has some way to do this through their API
Ariel, if all csvs are present, perhaps you can print "output_dirs" and double check that none of the paths are wonky or null? Is there anything unusual in test.out or test.err?
Oh yeah, good call:
slurmstepd: error: *** JOB 18301898 ON g3030 CANCELLED AT 2024-05-14T22:33:02 DUE TO TIME LIMIT ***
I will try this again with a max time limit longer than two hours.
I think it would be tricky to do this through pydra, because we'd have to somehow leave the submitting process running for the duration of all the sub-processes, and that may become very cumbersome for very large datasets. I think that, assuming we don't leave the program running until all tasks are completed, we have two options:
- Something like what is being done here.
- Don't merge, and let users do that in a separate program.
I suggest that if my current test works and we get what we expect, we go ahead and merge this PR as is, including this bit of code. I can follow up with a documentation example, based on my current experiments with the HBN data. Users need to know that the final merge into a tract profiles file will fail if any of the sub-tasks fail, so maybe we just need to clarify this in the documentation.
Or is there anything else we need to address before merging?
I agree. This looks good to me!
I'm not sure what you mean about the submitting process... it ends as soon as the jobs are submitted, not completed, and it's the last job that does the final merge (all of them check for it). Plus, GroupAFQ also fails to produce tract_profiles.csv if anything fails.
I don't know if this is directly related to the content of this PR, but I am now getting an error in visualizing the standard set of bundles, where the visualization code is raising:
Presumably because it's looking for a tract that is no longer part of the default set of tracts (because we're using the more granular set of CC tracts). @36000: any chance this is related to changes you introduced here to how data is passed between GroupAFQ and ParticipantAFQ? If you think this is unrelated, I think we can probably merge this PR, and fix this issue elsewhere.
Not sure what would cause this, but I think it is unrelated, as removing overlapping bundle definitions happens in the init method of the BundleDict, so I don't think a race condition is causing this error.
Creates a ParallelGroupAFQ class inheriting from GroupAFQ to allow for easier parallelization through pydra. Does not currently work; in progress.
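For context, the general shape of the pattern this class enables, sketched with pydra's 0.x task/Submitter API; the helper name, the contents of the per-subject kwargs, and how they are gathered are assumptions, not the actual pyAFQ code:

```python
import pydra

from AFQ.api.participant import ParticipantAFQ

@pydra.mark.task
def run_participant(participant_kwargs):
    # One pydra task per subject: build and export a ParticipantAFQ.
    pafq = ParticipantAFQ(**participant_kwargs)
    pafq.export_all()
    return participant_kwargs["output_dir"]

# Hypothetical per-subject kwargs; in practice these would be derived from
# the BIDS layout.
per_subject_kwargs = [
    {"dwi_data_file": "sub-01_dwi.nii.gz", "output_dir": "afq/sub-01"},
    {"dwi_data_file": "sub-02_dwi.nii.gz", "output_dir": "afq/sub-02"},
]

# Split one task over the list of kwarg dicts and submit with the
# concurrent-futures plugin ("cf"); on a SLURM cluster such as Hyak,
# plugin="slurm" would be used instead.
task = run_participant(participant_kwargs=per_subject_kwargs)
task = task.split("participant_kwargs")
with pydra.Submitter(plugin="cf") as submitter:
    submitter(task)
```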