The library (rank 0) writes several files from the compute node that the SCR run scripts then reference from the job script. When using NFS, client-side caching can lead to strange behavior. For example, consider the following sequence of commands executed in a job script:
```shell
jsrun -r 1 ./test_api    # rank 0 writes to .scr/halt.scr from a compute node
rm -f .scr/halt.scr
scr_halt --list `pwd`    # attempts to read .scr/halt.scr
```
When `SCR_Finalize()` is called, rank 0 in `test_api` writes an `SCR_FINALIZE_CALLED` entry to the `.scr/halt.scr` file from the compute node where rank 0 runs. The subsequent `rm` command should remove the halt file after the run completes, so that the following `scr_halt` command should not find it.
However, on some systems `scr_halt` does find `.scr/halt.scr` in the state that rank 0 left it. My best guess is that this happens because the NFS client on the compute node where rank 0 runs flushes its cached state to the NFS server after the file has already been deleted by the `rm` command executed on the node that runs the job script, effectively recreating the file.
One can often work around this problem by adding a `sleep` between the run and the `scr_halt` command.
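A minimal sketch of such a job script follows. The 60-second duration is an assumption for illustration; the appropriate value depends on the NFS attribute-cache settings (e.g., `actimeo`) on the rank 0 compute node:

```shell
# Run the application; rank 0 writes SCR_FINALIZE_CALLED to .scr/halt.scr
jsrun -r 1 ./test_api

# Remove the halt file left by the run
rm -f .scr/halt.scr

# Wait out the NFS cache timeout on the rank 0 compute node
# (60 s is a placeholder; tune to your system's NFS mount options)
sleep 60

# scr_halt should now no longer find .scr/halt.scr
scr_halt --list `pwd`
```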
That `sleep` must wait long enough for the NFS cache timeout to expire on the rank 0 compute node.