🐛 timeout: montage_a & montage_s #1338

Closed
shnizzedy opened this issue Aug 7, 2020 · 12 comments
@shnizzedy (Member) commented Aug 7, 2020

Describe the bug

A SLURM job running C-PAC in Singularity with the following sbatch options times out with 5 montage_a and 5 montage_s tasks "running" for 12 hours.

#SBATCH -c 10
#SBATCH --time=11:55:00
#SBATCH --mem-per-cpu=20gb

One run:

200802-02:49:04,214 nipype.workflow INFO:
	 [MultiProc] Running 10 tasks, and 77 jobs ready. Free memory (GB): 18.00/20.00, Free processors: 0/10.
                     Currently running:
                       * resting_preproc_18_movie.montage_csf_gm_wm_1.montage_a
                       * resting_preproc_18_movie.montage_csf_gm_wm_1.montage_s
                       * resting_preproc_18_movie.montage_mni_anat_0.montage_a
                       * resting_preproc_18_movie.montage_mni_anat_0.montage_s
                       * resting_preproc_18_movie.montage_mni_anat_1.montage_a
                       * resting_preproc_18_movie.montage_mni_anat_1.montage_s
                       * resting_preproc_18_movie.qc_skullstrip_0.montage_skull.montage_a
                       * resting_preproc_18_movie.qc_skullstrip_0.montage_skull.montage_s
                       * resting_preproc_18_movie.qc_skullstrip_1.montage_skull.montage_a
                       * resting_preproc_18_movie.qc_skullstrip_1.montage_skull.montage_s

Another run:

200803-15:00:55,213 nipype.workflow INFO:
	 [MultiProc] Running 10 tasks, and 77 jobs ready. Free memory (GB): 18.00/20.00, Free processors: 0/10.
                     Currently running:
                       * resting_preproc_17_movie.montage_csf_gm_wm_1.montage_a
                       * resting_preproc_17_movie.montage_csf_gm_wm_1.montage_s
                       * resting_preproc_17_movie.montage_mni_anat_0.montage_a
                       * resting_preproc_17_movie.montage_mni_anat_0.montage_s
                       * resting_preproc_17_movie.montage_mni_anat_1.montage_a
                       * resting_preproc_17_movie.montage_mni_anat_1.montage_s
                       * resting_preproc_17_movie.qc_skullstrip_0.montage_skull.montage_a
                       * resting_preproc_17_movie.qc_skullstrip_0.montage_skull.montage_s
                       * resting_preproc_17_movie.qc_skullstrip_1.montage_skull.montage_a
                       * resting_preproc_17_movie.qc_skullstrip_1.montage_skull.montage_s

Expected behavior

The C-PAC run either completes or throws an error.

Versions

  • C-PAC: 1.7.0
  • Container Platform:
    • Singularity: 2.5.2

Additional context

  • 1 subject
  • 1 anat image
  • about 2 hours of BOLD data broken into ~15min chunks (TR = 2s)

Possibly related:

@sgiavasis (Collaborator)

Aha! Someone else has finally seen this. Some of my big regression test runs have stalled at this point, but I was never able to replicate it on command.

@sgiavasis (Collaborator) commented Aug 7, 2020

I suspect this behavior won't show itself if the montage nodes are run individually or outside a Nipype workflow, but I am going to try a unit test for them just in case. I'll let you know if I end up with any useful info.
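For reference, a minimal standalone harness along these lines would exercise montage-style Function nodes under MultiProc with the same scheduler settings (a sketch only; make_montage and the input path are hypothetical stand-ins, not the actual C-PAC montage code or unit test):

# Sketch of a standalone MultiProc harness; `make_montage` is a hypothetical
# stand-in for the real montage_a/montage_s callables in CPAC.qc.
from nipype.pipeline import engine as pe
from nipype.interfaces.utility import Function

def make_montage(in_file, out_name):
    # placeholder body; the real C-PAC functions build the PNG montage here
    import os
    out_path = os.path.abspath(out_name)
    open(out_path, 'a').close()
    return out_path

wf = pe.Workflow(name='montage_unit_test', base_dir='/tmp/montage_wd')
for idx in range(5):
    for suffix in ('a', 's'):
        node = pe.Node(
            Function(input_names=['in_file', 'out_name'],
                     output_names=['out_file'],
                     function=make_montage),
            name='montage_%s_%d' % (suffix, idx))
        node.inputs.in_file = '/path/to/anat.nii.gz'   # hypothetical input
        node.inputs.out_name = 'montage_%s_%d.png' % (suffix, idx)
        wf.add_nodes([node])

# same scheduler settings as the stalled runs: 10 procs, 20 GB
wf.run(plugin='MultiProc', plugin_args={'n_procs': 10, 'memory_gb': 20})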

@shnizzedy (Member, Author)

Update: this issue seems not to occur when running with just 1 CPU, so it is possibly related to #1130.

@sgiavasis (Collaborator)

Okay, yeah, then I'm going to test it via a Nipype workflow.

@sgiavasis (Collaborator)

Have you seen any montage nodes hang other than these?

montage_csf_gm_wm
montage_mni_anat
montage_skull

@sgiavasis (Collaborator)

@shnizzedy, in your run, are the resample_o/resample_u nodes completing successfully? Do they have full folders (report.rst, etc.) in the working directory?

They would be in /working/resting_preproc_{sub}/montage_csf_gm_wm_x.
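Something like this would list which of those node folders finished (a rough sketch; the path is a placeholder, replace {sub} with the participant label):

# Sketch: report which node folders under the montage workflows have a cached
# result (i.e. finished) and which do not. The working directory is a placeholder.
import glob
import os

working = '/working/resting_preproc_{sub}'
for node_dir in sorted(glob.glob(os.path.join(working, 'montage_*', '*'))):
    if not os.path.isdir(node_dir):
        continue
    done = bool(glob.glob(os.path.join(node_dir, 'result_*.pklz')))
    print('%-50s %s' % (os.path.relpath(node_dir, working),
                        'finished' if done else 'no result yet'))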

@shnizzedy (Member, Author)

Are they supposed to have both? It looks like yes for resample_u, no for resample_o:

resting_preproc_{sub}_{ses}/montage_csf_gm_wm_0
└── resample_u
    ├── _0x6a9dace6b9f322c50d67214868841a1b.json
    ├── _inputs.pklz
    ├── _node.pklz
    ├── _report
    │   └── report.rst
    ├── result_resample_u.pklz
    └── sub-{sub}_ses-{ses}_T1w_resample_calc_1mm.nii.gz
resting_preproc_{sub}_{ses}/montage_csf_gm_wm_1
└── resample_u
    ├── _0x6a9dace6b9f322c50d67214868841a1b.json
    ├── _inputs.pklz
    ├── _node.pklz
    ├── _report
    │   └── report.rst
    ├── result_resample_u.pklz
    └── sub-{sub}_ses-{ses}_T1w_resample_calc_1mm.nii.gz
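For what it's worth, the cached result of the node that did finish can be inspected directly (a sketch; the path is a placeholder for the actual working-directory folder):

# Sketch: inspect the cached result of the finished resample_u node.
from nipype.utils.filemanip import loadpkl

res = loadpkl('/working/resting_preproc_sub_ses/montage_csf_gm_wm_0/'
              'resample_u/result_resample_u.pklz')
print(res.inputs)    # the resolved inputs the node ran with
print(res.outputs)   # e.g. the path to the 1mm-resampled T1w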

@sgiavasis (Collaborator) commented Aug 12, 2020

Yes, for the csf_gm_wm montages, you'll get three resample_o's:

resample_o_csf
resample_o_gm
resample_o_wm

I noticed that some of the other participants from my test run had resample_u but not the overlay ones. However, those had actual crash files for segmentation upstream, so there was a real reason for the missing folders and no hanging.

I made a small workflow that only runs these montage nodes on whatever data you give it, with MultiProc enabled and basically the same environment as the whole pipeline; I haven't been able to replicate any hanging or stalling yet.

That might still be a clue, though: for the stalling runs we can check the upstream intermediates (the tissue files, the skull-strip, etc.), but those all come from such different processes that I doubt this has anything to do with upstream nodes stalling.
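If any upstream nodes did crash, the crash files can be read directly rather than hunting through logs (a sketch; the crash-file location is a placeholder and depends on where the run writes its crash dumps):

# Sketch: read nipype crash files from upstream nodes and print their tracebacks.
import glob
from nipype.utils.filemanip import loadcrash

for crash_file in sorted(glob.glob('/outputs/logs/crash-*.pklz')):
    crash = loadcrash(crash_file)
    print(crash_file)
    print(''.join(crash['traceback']))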

@shnizzedy (Member, Author)

At least sometimes (maybe always), tracebacks like this appear when this behavior occurs:

200913-21:10:20,7 nipype.workflow ERROR:
	 Node sinker_13_raw_functional.b7 failed to run on host node055.
200913-21:10:20,7 nipype.workflow ERROR:
Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/multiproc.py", line 69, in run_node
    result['result'] = node.run(updatehash=updatehash)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 471, in run
    result = self._run_interface(execute=True)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 555, in _run_interface
    return self._run_command(execute)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 635, in _run_command
    result = self._interface.run(cwd=outdir)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/interfaces/base/core.py", line 523, in run
    outputs = self.aggregate_outputs(runtime)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/interfaces/base/core.py", line 597, in aggregate_outputs
    predicted_outputs = self._list_outputs()
  File "/code/CPAC/utils/interfaces/datasink.py", line 586, in _list_outputs
    use_hardlink=use_hardlink)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/utils/filemanip.py", line 443, in copyfile
    os.unlink(newfile)
FileNotFoundError: [Errno 2] No such file or directory: '/outputs/output/pipeline_analysis_freq-filter_nuisance/sub-04_ses-movie/raw_functional/sub-04_ses-movie_task-movie_run-8_bold.nii.gz'
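One reading of that traceback is a check-then-act race in copyfile: the destination exists when it is checked, but another sinker removes it before os.unlink runs. A race-tolerant version of that step would look roughly like this (a sketch of the pattern only, not the actual nipype/C-PAC code):

# Sketch of the defensive pattern, assuming the FileNotFoundError comes from
# two sinker nodes racing to replace the same destination file.
import contextlib
import os
import shutil

def replace_file(originalfile, newfile):
    """Copy originalfile to newfile, tolerating a concurrent unlink."""
    with contextlib.suppress(FileNotFoundError):
        os.unlink(newfile)    # another process may already have removed it
    shutil.copyfile(originalfile, newfile)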

@shnizzedy (Member, Author)

The same problem does not seem to exist for registration:

def create_wf_calculate_ants_warp(name='create_wf_calculate_ants_warp', num_threads=1, reg_ants_skull=1):
    # ...
    calculate_ants_warp.interface.num_threads = num_threads

if reg_option == 'ANTS':
    # linear + non-linear registration
    func_to_epi_ants = \
        create_wf_calculate_ants_warp(
            name='func_to_epi_ants',
            num_threads=1,
            reg_ants_skull=1)

ants_reg_anat_mni = \
    create_wf_calculate_ants_warp(
        'anat_mni_ants_register_%s_%d' % (strat_name, num_strat),
        num_threads=num_ants_cores,
        reg_ants_skull=c.regWithSkull
    )

ants_reg_anat_mni = \
    create_wf_calculate_ants_warp(
        f'anat_mni_ants_register_{num_strat}',
        num_threads=num_ants_cores,
        reg_ants_skull=c.regWithSkull
    )
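For context, nipype's MultiProc plugin budgets each node using the Node-level n_procs and mem_gb hints; a minimal sketch of annotating a node that way follows (the node and values are a hypothetical example, not a change to the actual C-PAC montage workflow):

# Sketch only: the per-node resource hints that nipype's MultiProc scheduler
# reads. `make_montage` and the values below are hypothetical, not C-PAC code.
from nipype.pipeline import engine as pe
from nipype.interfaces.utility import Function

def make_montage(in_file):
    return in_file

montage_a = pe.Node(
    Function(input_names=['in_file'], output_names=['out_file'],
             function=make_montage),
    name='montage_a',
    n_procs=1,    # slots this node occupies in MultiProc's processor budget
    mem_gb=3.0)   # estimated memory, counted against MultiProc's memory_gb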

@shnizzedy (Member, Author)

Closing as a duplicate of #1404. Will reopen if this still occurs after that issue is resolved.
