slurm-send-mail causes error if array job cancelled before any tasks start #141

Closed
mamiller615 opened this issue Aug 28, 2024 · 7 comments
Labels: bug (Something isn't working), confirmed (An issue that has been confirmed to exist), good first issue (Good for newcomers)

mamiller615 commented Aug 28, 2024

Versions

OS version: Rocky Linux 9.4
Slurm version: 22.05.9-1
Slurm Mail version: 4.20

Describe the bug

We have seen that if a user submits an array job and cancels it before any tasks start, slurm-send-mail logs an error in /var/log/slurm-mail/slurm-send-mail.log and the spool file for the job under /var/spool/slurm-mail is not deleted. In our case, thousands of files accumulated over several months, and slurm-send-mail kept trying to reprocess them. We ended up deleting the older spool files.

To replicate this, I submitted a simple shell script as an array job and immediately canceled the job:

[USER@jhpce01 class-scripts]$ sbatch --array=1-5 --mail-type=FAIL,END --mail-user=MYEMAIL script1
Submitted batch job 9575212
[USER@jhpce01 class-scripts]$ scancel 9575212
[USER@jhpce01 class-scripts]$ squeue --me
[USER@jhpce01 class-scripts]$

The following messages then appeared in the slurm-send-mail.log file:

. . .
2024/08/28 12:29:00:INFO: processing: /var/spool/slurm-mail/9575210_1724862485.197255.mail
2024/08/28 12:29:00:INFO: Sending e-mail to: ANOTHERUSER using ANOTHERUSER-EMAIL for job 9575210 (Ended) via SMTP server localhost:25
2024/08/28 12:29:00:INFO: Deleting: /var/spool/slurm-mail/9575210_1724862485.197255.mail
2024/08/28 12:29:00:INFO: processing: /var/spool/slurm-mail/9575212_1724862516.499688.mail
2024/08/28 12:29:00:ERROR: Failed to process: /var/spool/slurm-mail/9575212_1724862516.499688.mail
2024/08/28 12:29:00:ERROR: list index out of range
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/slurmmail/cli.py", line 971, in send_mail_main
    __process_spool_file(f, smtp_conn, options)
  File "/usr/lib/python3.9/site-packages/slurmmail/cli.py", line 361, in __process_spool_file
    jobs = [jobs[0]]
IndexError: list index out of range

It seemed like the "jobs" Python list was empty in this situation, so I was able to fix (or at least avoid) the issue by modifying line 360 in /usr/lib/python3.9/site-packages/slurmmail/cli.py from:

    if array_summary or len(jobs) == 1:

to:

    if ( ( array_summary and (len(jobs) != 0) ) or ( len(jobs) == 1) ):

With this change in place, the problematic spool file was processed, no errors arose, and no email was sent, which I think is fine.

2024/08/28 12:31:49:INFO: processing: /var/spool/slurm-mail/9575212_1724862516.499688.mail
2024/08/28 12:31:49:INFO: Deleting: /var/spool/slurm-mail/9575212_1724862516.499688.mail

Further testing shows that Slurm Mail is working as it should.

While the change I made works, there may be a more intelligent way to deal with the situation.
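One possible alternative, sketched here only as an illustration, is to check for the empty list explicitly before indexing into it, so the "nothing to report" case is visible instead of being folded into the condition. The function name `select_jobs` and the use of plain dicts for jobs are assumptions made for this sketch; they are not taken from the Slurm-Mail source, and this is not necessarily how the fix in 4.21 was implemented.

```python
from typing import Any, Dict, List


def select_jobs(jobs: List[Dict[str, Any]], array_summary: bool) -> List[Dict[str, Any]]:
    """Hypothetical helper: pick the jobs that should generate e-mails.

    An empty input (for example, an array job cancelled before any task
    started) returns an empty list instead of raising IndexError.
    """
    if not jobs:
        # Nothing was found for this spool file, so there is nothing to
        # send. The caller can still delete the spool file afterwards.
        return []
    if array_summary or len(jobs) == 1:
        # Same behaviour as the original line: keep only the first job.
        return [jobs[0]]
    return jobs


if __name__ == "__main__":
    print(select_jobs([], array_summary=True))
    print(select_jobs([{"id": 1}, {"id": 2}], array_summary=True))
```

With an early return like this, the caller can treat an empty result as "no mail to send" and carry on to delete the spool file, which matches the behaviour seen after the one-line change above.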

Logs

Same as above example...

. . .
2024/08/28 12:29:00:INFO: processing: /var/spool/slurm-mail/9575210_1724862485.197255.mail
2024/08/28 12:29:00:INFO: Sending e-mail to: lthuytra using EMAIL_ADDRESS for job 9575210 (Ended) via SMTP server localhost:25
2024/08/28 12:29:00:INFO: Deleting: /var/spool/slurm-mail/9575210_1724862485.197255.mail
2024/08/28 12:29:00:INFO: processing: /var/spool/slurm-mail/9575212_1724862516.499688.mail
2024/08/28 12:29:00:ERROR: Failed to process: /var/spool/slurm-mail/9575212_1724862516.499688.mail
2024/08/28 12:29:00:ERROR: list index out of range
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/slurmmail/cli.py", line 971, in send_mail_main
    __process_spool_file(f, smtp_conn, options)
  File "/usr/lib/python3.9/site-packages/slurmmail/cli.py", line 361, in __process_spool_file
    jobs = [jobs[0]]
IndexError: list index out of range

Thanks for all of your work in putting out a great email tool for Slurm!

@mamiller615 mamiller615 added the bug Something isn't working label Aug 28, 2024
@neilmunday
Owner

Hi,

Thanks for reporting the issue. I will take a look and get back to you.

@neilmunday
Owner

neilmunday commented Aug 28, 2024

Note: there is an unmasked e-mail address in your last log file snippet.

Edit: I have edited your message to remove it

@neilmunday
Owner

Bug confirmed - I have created a new integration test case that demonstrates the bug in the current version of Slurm-Mail.

I am working on a fix.

@neilmunday neilmunday added good first issue Good for newcomers confirmed An issue that has been confirmed to exist labels Aug 29, 2024
@neilmunday neilmunday added this to the 4.21 milestone Aug 29, 2024
@neilmunday
Owner

Interestingly, the Slurm job ID for a cancelled job array that never dispatched is of the form X_[Y-Z], e.g.:

1_[1-5]
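To make the shape of that ID concrete, here is a small sketch of a parser that tells a plain job ID, a single array task ID (X_Y), and the bracketed form (X_[Y-Z]) apart. The names `ArrayJobId`, `parse_job_id`, and `is_undispatched_array` are hypothetical and invented for this example; the sketch assumes only the ID shapes mentioned in this thread and is not taken from the Slurm-Mail code.

```python
import re
from typing import NamedTuple, Optional


class ArrayJobId(NamedTuple):
    """Hypothetical container for the parts of a Slurm job ID string."""
    job_id: int
    task_spec: Optional[str]  # None, a task number such as "3", or "[1-5]"


# Matches "123", "123_4", or "123_[1-5]" (anything inside the brackets).
_JOB_ID_RE = re.compile(r"^(\d+)(?:_(\d+|\[[^\]]+\]))?$")


def parse_job_id(raw: str) -> ArrayJobId:
    """Split a job ID string into the base ID and optional task part."""
    match = _JOB_ID_RE.match(raw.strip())
    if match is None:
        raise ValueError(f"unrecognised job ID: {raw!r}")
    return ArrayJobId(int(match.group(1)), match.group(2))


def is_undispatched_array(raw: str) -> bool:
    """True when the task part is the bracketed range form, e.g. 1_[1-5]."""
    spec = parse_job_id(raw).task_spec
    return spec is not None and spec.startswith("[")


if __name__ == "__main__":
    for example in ("1_[1-5]", "9575212_3", "9575212"):
        print(example, parse_job_id(example), is_undispatched_array(example))
```

A lookup that expects only the X_Y form would not match the bracketed ID, which may be why the job list came back empty for the cancelled array.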

@neilmunday
Owner

Issue fixed in release 4.21.

Thanks again for reporting the issue.

@mamiller615
Author

Thanks for the quick response!

@neilmunday
Owner

Many thanks for the sponsorship - that's my first one ever!
