Lock folder not cleaned up after (Slurm) job is killed #3280
Comments
Copy from a discussion with Todd Gamblin on Slack: "we use fcntl locks if you want to steal ours, here you go: https://github.com/spack/spack/blob/develop/lib/spack/llnl/util/lock.py"
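The problem with fcntl locks is that not all filesystems support them though, so it's not a perfect solution either. If we can figure out a way to detect whether fcntl locks can be used, then it seems that's a better solution, but I'm not sure that can be done reliably... I've also been bitten by the problem that @Flamefire reported, it's certainly annoying. How about installing a signal handler that cleans up the locks that were created in that EasyBuild session in case a SIGTERM signal (and possibly other signals) is received?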
Does a crash necessarily get a signal?
Not all signals can be caught either.
What about implementing both, and letting the configuration decide which implementation is used?
…
On Apr 18, 2020 at 8:27 AM, <Kenneth Hoste ***@***.***)> wrote:
The problem with fcntl locks is that not all filesystems support them though, so it's not a perfect solution either. If we can figure out a way to detect whether fcntl locks can be used, then it seems that's a better solution, but I'm not sure that can be done reliably...
I've also been bitten by the problem that @Flamefire (https://github.com/Flamefire) reported, it's certainly annoying.
How about installing a signal handler that cleans up the locks that were created in that EasyBuild session in case a SIGTERM signal (and possibly other signals) is received?
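
One way the "detect whether fcntl locks can be used" idea could be probed is sketched below. This is only an illustration, not EasyBuild or Spack code: the function name `fcntl_locks_usable` and the probe-file prefix are made up here. It simply tries to take and release a non-blocking exclusive lock on a scratch file in the directory where locks would live. A successful call only shows that the filesystem accepts the request locally; it does not prove the lock is honoured across nodes, which is the reliability concern raised above.

```python
import fcntl
import os
import tempfile


def fcntl_locks_usable(directory):
    """Best-effort probe (hypothetical helper): try to take and release an
    exclusive fcntl lock on a scratch file created in `directory`.

    Returns False if the filesystem rejects the lock request.
    """
    # mkstemp opens the file read/write, which an exclusive lockf needs.
    fd, path = tempfile.mkstemp(prefix=".lockprobe-", dir=directory)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    except OSError:
        # e.g. ENOLCK on filesystems without lock support
        return False
    finally:
        os.close(fd)
        os.unlink(path)
```

A configuration option could then pick between fcntl locks and lock directories based on the result, along the lines of the "implement both" suggestion.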
I'm working on the auto-cleanup after receiving a signal part, which is relatively easy to implement. We can definitely also implement support for using |
With the changes in #3291, locks are cleaned up if the EasyBuild session gets a That doesn't seem to help in the context of Slurm jobs that get cancelled or run into a timeout though... @Flamefire: Are you up for testing the changes in #3291 in the context of Slurm jobs? |
The only way I could get the signal handler in |
Shouldn't be closed yet, since #3291 doesn't actually fix this yet...
The following happened:
Afterwards the lock was still there, so the next build failed too and the lock had to be removed manually.
@boegel @mboisson