NFS caching between compute node and node running job script leads to inconsistent behavior #337

adammoody · 2021-05-26T21:57:08Z

The library (rank 0) writes to several files from the compute node that the SCR run scripts then reference from the job script. When using NFS, caching can lead to strange behavior. For example, consider the following sequence of commands executed in a job script:

jsrun -r 1 ./test_api (rank 0 writes to .scr/halt.scr from a compute node)
rm -f .scr/halt.scr
scr_halt --list `pwd` (attempts to read .scr/halt.scr)

When SCR_Finalize() is called, rank 0 in test_api writes an entry to the .scr/halt.scr file from the compute node where rank 0 runs indicating SCR_FINALIZE_CALLED. The subsequent rm command should remove the halt file after the run completes, so that the following scr_halt command should not find it.

However, on some systems scr_halt does find .scr/halt.scr in the state that rank 0 left it. My best guess is that this happens because the NFS client on the compute node where rank 0 runs flushes its state to the NFS server after the file has already been deleted by the rm command, which executes on the node that runs the job script.
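One way to probe that guess is to watch whether the halt file comes back after the rm, and roughly how long that takes. The following is only a diagnostic sketch: `watch_for_reappearance` is a hypothetical helper, not part of SCR, and the default timeout and interval are arbitrary assumptions. The existence check on the job script node may itself be affected by NFS attribute caching, so the reported time is approximate at best.

```shell
#!/bin/sh
# Hypothetical diagnostic (not part of SCR): after deleting FILE, poll to see
# whether it comes back, and report roughly how many seconds that took.
# usage: watch_for_reappearance FILE [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# INTERVAL_SECONDS must be a positive integer, otherwise the loop never ends.
watch_for_reappearance() {
    file=$1
    timeout=${2:-120}   # assumed default, adjust to the NFS cache settings
    interval=${3:-1}
    elapsed=0
    while [ "$elapsed" -le "$timeout" ]; do
        if [ -e "$file" ]; then
            echo "reappeared after ~${elapsed}s: $file"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "did not reappear within ${timeout}s: $file"
    return 1
}
```

In the job script, this could be called right after `rm -f .scr/halt.scr`, e.g. `watch_for_reappearance .scr/halt.scr 120 1`. A reappearance some seconds after the rm would be consistent with a late flush from the compute node.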

One can often work around this problem by adding a sleep, e.g.,

jsrun -r 1 ./test_api
sleep 60
rm -f .scr/halt.scr
scr_halt --list `pwd`

That sleep must wait long enough for the NFS cache timeout to expire on the rank 0 compute node.
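A variation on the sleep workaround is to delete the file repeatedly over the same window, so that a late flush from the compute node is also cleaned up. This is a sketch under assumptions, not a verified fix: `remove_until_stable` is a hypothetical helper, not part of SCR, and the defaults of 6 passes at 10 seconds each are chosen only to match the 60 second sleep above.

```shell
#!/bin/sh
# Hypothetical helper (not part of SCR): delete FILE several times, pausing
# between passes, in case a delayed NFS flush from another client recreates it.
# usage: remove_until_stable FILE [PASSES] [INTERVAL_SECONDS]
remove_until_stable() {
    file=$1
    passes=${2:-6}      # assumed default: 6 passes
    interval=${3:-10}   # assumed default: 10 seconds between passes
    i=0
    while [ "$i" -lt "$passes" ]; do
        rm -f -- "$file"
        sleep "$interval"
        i=$((i + 1))
    done
    # Succeed only if the file is absent after the final wait
    [ ! -e "$file" ]
}
```

In the job script this would replace the `sleep 60` and `rm -f` pair with `remove_until_stable .scr/halt.scr` before calling scr_halt. The total wait is the same as the fixed sleep, so it still relies on the window being longer than the NFS cache timeout.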

adammoody · 2021-05-26T22:02:53Z

SCR already calls fsync when closing its files, but this doesn't seem to matter.

Also, I tried adding read/write locks using flock, hoping that would force a time-ordered sequence: #336. But so far, that hasn't helped either.

adammoody · 2021-05-26T22:21:24Z

Maybe we need to fsync the .scr/ directory that contains the file, as well?

Tried that, too. Nope.
