NFS caching between compute node and node running job script leads to inconsistent behavior #337

Open
adammoody opened this issue May 26, 2021 · 2 comments

adammoody commented May 26, 2021

Rank 0 of the library writes several files from its compute node, and the SCR run scripts read those files from the node running the job script. When those files live on NFS, client-side caching can lead to inconsistent behavior. For example, consider the following sequence of commands executed in a job script:

jsrun -r 1 ./test_api    # rank 0 writes to .scr/halt.scr from a compute node
rm -f .scr/halt.scr
scr_halt --list `pwd`    # attempts to read .scr/halt.scr

When SCR_Finalize() is called, rank 0 in test_api writes an entry indicating SCR_FINALIZE_CALLED to the .scr/halt.scr file from the compute node where it runs. The subsequent rm command should remove the halt file after the run completes, so the following scr_halt command should not find it.

However, on some systems scr_halt does find .scr/halt.scr in the state that rank 0 left it. My best guess is that the NFS client on the compute node where rank 0 runs flushes its cached state to the NFS server after the file has already been deleted by the rm command, which executes on the node that runs the job script. That late flush would recreate the file.

One can often work around this problem by adding a sleep, e.g.,

jsrun -r 1 ./test_api
sleep 60
rm -f .scr/halt.scr
scr_halt --list `pwd`

The sleep must last long enough for the NFS cache timeout to expire on the rank 0 compute node.
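
A fixed sleep either wastes time or is too short, so one alternative is to keep removing the file until it stays gone for a few checks in a row. The following is only a rough sketch of that idea, not something SCR provides: the helper name remove_until_gone and its default interval, check count, and retry limit are all made up for illustration. It also only observes the view from the node running the job script, so it cannot prove that the compute node's NFS client has finished flushing.

```shell
#!/bin/sh
# Hypothetical helper (not part of SCR): remove a file and keep removing it
# until it has stayed absent for `stable` consecutive checks, or give up
# after `max_tries` checks. All default values below are assumptions.
remove_until_gone() {
    file=$1
    interval=${2:-5}     # seconds to wait between checks
    stable=${3:-3}       # consecutive "absent" checks required for success
    max_tries=${4:-24}   # total number of checks before giving up

    absent=0
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        if [ -e "$file" ]; then
            # The file exists (or reappeared after a late flush): remove it
            # and restart the count of consecutive absent checks.
            rm -f "$file"
            absent=0
        else
            absent=$((absent + 1))
            if [ "$absent" -ge "$stable" ]; then
                return 0
            fi
        fi
        tries=$((tries + 1))
        sleep "$interval"
    done
    return 1
}
```

In the job script this would replace the sleep and rm lines, e.g. calling remove_until_gone .scr/halt.scr 5 3 24 between the jsrun and scr_halt commands and printing a warning if it returns nonzero. The total wait is still bounded by the interval times the retry limit, so those values would need to cover the NFS cache timeout on the system in question.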


adammoody commented May 26, 2021

SCR already calls fsync when closing its files, but this doesn't seem to matter.

Also, I tried adding read/write locks using flock, hoping that would force a time-ordered sequence: #336. But so far, that hasn't helped either.


adammoody commented May 26, 2021

Maybe we need to fsync the .scr/ directory that contains the file, as well?

Tried that, too. Nope.
