Fix submission failed handler on bad host select #2631

matthewrmshin
Contributor

To reproduce:

[scheduling]
    [[dependencies]]
        graph = t1
[runtime]
    [[t1]]
        script = true
        [[[remote]]]
            host = $(false)
        [[[events]]]
            submission failed handler = echo %(id)s

On current master, the failed remote host select will cause a KeyError that will take down the suite with something like this:

2018-04-20T12:27:36+01 CRITICAL - Traceback (most recent call last):
	  File "/home/matt/cylc.git/lib/cylc/scheduler.py", line 231, in start
	    self.run()
	  File "/home/matt/cylc.git/lib/cylc/scheduler.py", line 1377, in run
	    self.process_task_pool()
	  File "/home/matt/cylc.git/lib/cylc/scheduler.py", line 1212, in process_task_pool
	    self.suite, itasks, self.run_mode == 'simulation')
	  File "/home/matt/cylc.git/lib/cylc/task_job_mgr.py", line 195, in submit_task_jobs
	    prepared_tasks, bad_tasks = self.prep_submit_task_jobs(suite, itasks)
	  File "/home/matt/cylc.git/lib/cylc/task_job_mgr.py", line 171, in prep_submit_task_jobs
	    check_syntax=check_syntax)
	  File "/home/matt/cylc.git/lib/cylc/task_job_mgr.py", line 727, in _prep_submit_task_job
	    suite, itask, dry_run, '(remote host select)', exc)
	  File "/home/matt/cylc.git/lib/cylc/task_job_mgr.py", line 780, in _prep_submit_task_job_error
	    self.poll_task_jobs)
	  File "/home/matt/cylc.git/lib/cylc/task_events_mgr.py", line 360, in process_message
	    self._process_message_submit_failed(itask, event_time)
	  File "/home/matt/cylc.git/lib/cylc/task_events_mgr.py", line 687, in _process_message_submit_failed
	    'job %s' % self.EVENT_SUBMIT_FAILED)
	  File "/home/matt/cylc.git/lib/cylc/task_events_mgr.py", line 425, in setup_event_handlers
	    self._setup_custom_event_handlers(itask, event, message)
	  File "/home/matt/cylc.git/lib/cylc/task_events_mgr.py", line 821, in _setup_custom_event_handlers
	    user_at_host = itask.summary['job_hosts'][itask.submit_num]
	KeyError: 1
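
The failing lookup on the last frame can be reproduced in isolation. What follows is a minimal sketch, assuming that `job_hosts` is a dict keyed by submit number which only gains an entry once a host has been selected. `FakeTask`, `get_user_at_host` and the `matt@localhost` value are hypothetical stand-ins for illustration, not cylc code, and the guarded lookup is one possible way to avoid the error, not necessarily the change made in this PR.

```python
class FakeTask(object):
    """Hypothetical stand-in for a task proxy; not the real cylc class."""

    def __init__(self, submit_num, job_hosts=None):
        self.submit_num = submit_num
        # Assumption: maps submit number -> "user@host", and is filled in
        # only after a remote host has been selected successfully.
        self.summary = {'job_hosts': dict(job_hosts or {})}


def get_user_at_host(itask, default=None):
    """Return the job host for the current submit, or default if unset."""
    return itask.summary['job_hosts'].get(itask.submit_num, default)


# Host selection failed, so no entry was recorded for submit number 1.
failed = FakeTask(submit_num=1)
try:
    failed.summary['job_hosts'][failed.submit_num]
except KeyError as exc:
    # Same failure mode as the unguarded lookup in the traceback.
    missing_key = exc.args[0]

# Illustrative value only: a task whose host was selected normally.
ok = FakeTask(submit_num=1, job_hosts={1: 'matt@localhost'})
```

With the guarded helper, a handler set up for the failed task would receive `None` (or a chosen default) instead of raising.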

@matthewrmshin matthewrmshin added the bug Something is wrong :( label Apr 20, 2018
@matthewrmshin matthewrmshin added this to the next release milestone Apr 20, 2018
@matthewrmshin matthewrmshin self-assigned this Apr 20, 2018
Collaborator

@sadielbartholomew sadielbartholomew left a comment

Recreated the KeyError on master as outlined. It is resolved on the PR branch: the failed handler is now flagged as an ERROR, stalling the suite instead of shutting it down, which (I think?) is the desired/correct behaviour. The new test is suitable and passes locally.

New log viewer report:

2018-04-20T15:07:56+01 ERROR - [remote-host-select cmd] timeout 10 bash -c false
	[remote-host-select ret_code] 1
2018-04-20T15:07:56+01 ERROR - false: host selection failed:
	COMMAND FAILED (1): false
	
2018-04-20T15:07:56+01 ERROR - [jobs-submit cmd] (remote host select)
	[jobs-submit ret_code] 1
	[jobs-submit err]
	false: host selection failed:
	COMMAND FAILED (1): false
2018-04-20T15:07:56+01 ERROR - [t1.1] -submission failed
2018-04-20T15:07:56+01 WARNING - suite stalled

Collaborator

@sadielbartholomew sadielbartholomew left a comment

Oops, scratch that! The functionality is fine as per my original comment, but there is a subtlety in the GUI: one gets a ValueError when trying to access an item on the 'View Job Logs (Viewer)' menu for the submit-failed task with the failed handler (t1.1 in this example suite):

Traceback (most recent call last):
  File "/net/home/h06/sbarth/cylc.git/lib/cylc/gui/app_gcylc.py", line 1271, in view_task_logs
    self._popup_logview(task_id, task_state_summary, choice)
  File "/net/home/h06/sbarth/cylc.git/lib/cylc/gui/app_gcylc.py", line 2210, in _popup_logview
    nsubmits, self.get_remote_run_opts())
  File "/net/home/h06/sbarth/cylc.git/lib/cylc/gui/combo_logviewer.py", line 47, in __init__
    logviewer.__init__(self)
  File "/net/home/h06/sbarth/cylc.git/lib/cylc/gui/logviewer.py", line 37, in __init__
    self.create_gui_panel()
  File "/net/home/h06/sbarth/cylc.git/lib/cylc/gui/combo_logviewer.py", line 69, in create_gui_panel
    combobox2.set_active(snums.index(self.nsubmit))
ValueError: list.index(x): x not in list

Tracing this through the code, it is due to nsubmits = len(task_state_summary.get('job_hosts', {})) (app_gcylc.py, line 2207), which evaluates to 0 because the default is an empty dict. In the ComboLogViewer class in combo_logviewer.py, the argument of the same name then sets self.nsubmit = nsubmits = 0, and snums is created as an empty list on line 65 of that file, so snums.index(self.nsubmit) raises the ValueError. A small change will be needed to fix this issue.
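
The chain described above can be sketched outside the GUI. This is a minimal illustration, assuming that `snums` holds the submit numbers 1 to `nsubmits`; `build_submit_numbers` and `active_index` are hypothetical helpers, not functions from combo_logviewer.py, and the guard shown is only one possible shape for the "small change".

```python
def build_submit_numbers(nsubmits):
    """Assumed shape of the combo-box entries: submit numbers 1..nsubmits."""
    return list(range(1, nsubmits + 1))


def active_index(snums, nsubmit):
    """Return the index to select, or None when there is nothing to select."""
    if nsubmit in snums:
        return snums.index(nsubmit)
    return None


# A submit-failed task with no 'job_hosts' entry in its state summary.
task_state_summary = {}
nsubmits = len(task_state_summary.get('job_hosts', {}))
snums = build_submit_numbers(nsubmits)

# The unguarded call mirrors combobox2.set_active(snums.index(self.nsubmit)).
try:
    snums.index(nsubmits)
    raised = False
except ValueError:
    raised = True
```

Returning `None` leaves the caller free to skip the `set_active` call when no submit has a recorded job host.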

The KeyError would take down the suite.
@matthewrmshin matthewrmshin force-pushed the fix-submission-failed-handler-on-bad-host-select branch from 3aad403 to a89bd2b Compare April 20, 2018 15:23
@matthewrmshin
Contributor Author

GUI issue addressed. (You will still be unable to see any log files, but at least we are not going to bring down the GUI any more.)

Collaborator

@sadielbartholomew sadielbartholomew left a comment

GUI issue resolved by the squashed change. Attempted log file access from the 'viewer' menu now displays the same as in the 'editor', i.e. ERROR: file not found: <file>.

All good now.

Member

@hjoliver hjoliver left a comment

Good.

@hjoliver hjoliver merged commit 1d931e1 into cylc:master Apr 23, 2018
@matthewrmshin matthewrmshin deleted the fix-submission-failed-handler-on-bad-host-select branch April 23, 2018 06:00
Labels
bug Something is wrong :( small