Event handler enhancements #1503

Merged
merged 22 commits into from
Jul 7, 2015

matthewrmshin
Contributor

New and additional settings to suite.rc:

  • Configurable remote task job logs retrieval.
  • Configurable task event email.
  • General task event handlers.
  • Task event handlers can now be script templates.
  • Event handlers can now retry on failure.

Improved diagnostics in job-activity.log:

  • Report command and return code as well as out and err of subprocess activities.
  • Clearer labels for activities.

Suite run time database, new task_job_logs table:

  • Store location, size and modified time of the job logs.
  • (Location column can be updated by our anticipated future job log housekeeping functionality.)
  • It is an improved version of what is currently updated by rose suite-hook rose_prune and rose suite-log --update, and viewed by Rose Bush.
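A minimal sketch of what recording a job log in such a table might look like. The column names and the `record_job_log` helper are assumptions made for illustration; the schema in this branch may well differ.

```python
import os
import sqlite3
import tempfile

# Hypothetical column names; the real task_job_logs schema may differ.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS task_job_logs("
    "cycle TEXT, name TEXT, submit_num INTEGER, filename TEXT, "
    "location TEXT, mtime TEXT, size INTEGER)"
)


def record_job_log(conn, cycle, name, submit_num, path):
    """Insert one row describing the job log file found at `path`."""
    info = os.stat(path)
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT INTO task_job_logs VALUES(?, ?, ?, ?, ?, ?, ?)",
        (cycle, name, submit_num, os.path.basename(path), path,
         str(info.st_mtime), info.st_size),
    )
    conn.commit()


def demo():
    """Record a throwaway file and return the stored (filename, size)."""
    conn = sqlite3.connect(":memory:")
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "job.out")
        with open(path, "wb") as handle:
            handle.write(b"hello\n")
        record_job_log(conn, "20150808T0000Z", "foo", 1, path)
    return conn.execute(
        "SELECT filename, size FROM task_job_logs").fetchone()
```

Keeping the location in its own column is what would let a later housekeeping step update it when a log is moved or archived.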

Other re-factor and tidy up:

  • Store submit/run and event handler (re)try info in context objects.
  • Store the context and result of a shell command (for invoking under cylc.mp_pool) in a single object.
    • All information in one place.
    • Easy to log.
    • Easy to retry command.
  • Use table name constants for task proxy DB inserts/updates.
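To illustrate the "single object" idea above, here is a rough sketch of a context object that keeps a command together with its result. The class name `CommandContext`, its attributes and the log layout are hypothetical and are not taken from the branch.

```python
import subprocess
import sys


class CommandContext(object):
    """Hold a command plus its outcome (hypothetical class name)."""

    def __init__(self, label, cmd):
        self.label = label      # activity label used in the log entry
        self.cmd = cmd          # argument list, kept so the command can be retried
        self.ret_code = None
        self.out = None
        self.err = None

    def run(self):
        """Run (or re-run) the command and capture its result."""
        proc = subprocess.Popen(
            self.cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            universal_newlines=True)
        self.out, self.err = proc.communicate()
        self.ret_code = proc.returncode
        return self

    def __str__(self):
        """Format command, return code, out and err as one log entry."""
        return "[%s cmd] %s\n[%s ret_code] %s\n[%s out] %s\n[%s err] %s" % (
            self.label, " ".join(self.cmd),
            self.label, self.ret_code,
            self.label, (self.out or "").strip(),
            self.label, (self.err or "").strip())
```

Because the command and its result live in one place, logging is a matter of `str(ctx)` and a retry is simply another `ctx.run()`.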

This closes #181, addresses bullet points 1 to 3 of #992, and begins the 1st step for #1052.

@matthewrmshin matthewrmshin self-assigned this Jun 16, 2015
@matthewrmshin matthewrmshin added this to the soon milestone Jun 16, 2015
@matthewrmshin
Contributor Author

(This is not ready yet, but it needs some initial sound bites.)

# N.B. "scp" does not have a "max-size" option.
check_call([
    "rsync", "-a", "--rsh=" + ssh_tmpl, "--max-size=" + opts.max_size,
    source + "/", target])
Member

Do we need to worry about this: #1494 (comment)

Contributor Author

It should be OK with our normal naming conventions.

Member

We can now have ":" in filenames: YYYY-MM-DDThh:mm

Contributor Author

I think we only use the basic format YYMMDDThhmm under log/job/, or don't we?

Member

% find ~/cylc-run/bar/log/job/
/home/oliverh/cylc-run/bar/log/job/2015-08-08T00:00
/home/oliverh/cylc-run/bar/log/job/2015-08-08T00:00/foo
/home/oliverh/cylc-run/bar/log/job/2015-08-08T00:00/foo/NN
/home/oliverh/cylc-run/bar/log/job/2015-08-08T00:00/foo/01
/home/oliverh/cylc-run/bar/log/job/2015-08-08T00:00/foo/01/job.status
/home/oliverh/cylc-run/bar/log/job/2015-08-08T00:00/foo/01/job.err
/home/oliverh/cylc-run/bar/log/job/2015-08-08T00:00/foo/01/job.out
/home/oliverh/cylc-run/bar/log/job/2015-08-08T00:00/foo/01/job
/home/oliverh/cylc-run/bar/log/job/2015-08-08T00:00/foo/01/job-activity.log

Contributor Author

I guess you have something like [cylc]cycle point format defined in your suite?

(The : character would probably make PATH-like search rather interesting?)

Member

cycle point format = %Y-%m-%dT%H:%M

That's the official ISO 8601 "extended format" with minutes, which is nice and readable. My point is that we're not disallowing this, or transforming it in log paths (maybe we should?)

Contributor Author

I'll create a test for this to ensure this is okay.

Contributor Author

Test added. I think rsync only has a problem if we have a colon right after a string that can be a valid URI scheme. It should not happen in our case here.
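A small sketch of a related check, offered as a heuristic only. It assumes that rsync, like scp, reads an argument as HOST:PATH when a colon appears before the first slash, which is a slightly different rule from the URI-scheme one suggested above. Under that assumption, absolute paths such as the ones in the `find` listing are unaffected. The helper names are hypothetical.

```python
def looks_remote(arg):
    """Return True if rsync/scp would likely read `arg` as HOST:PATH.

    Heuristic only: a colon that appears before the first slash.
    """
    colon = arg.find(":")
    slash = arg.find("/")
    return colon != -1 and (slash == -1 or colon < slash)


def make_local(arg):
    """Prefix a relative path with "./" if it could be taken as remote."""
    return "./" + arg if looks_remote(arg) else arg
```

For example, `make_local("2015-08-08T00:00/foo")` gives `./2015-08-08T00:00/foo`, while a path starting with `/home/...` is returned unchanged.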

@hjoliver
Member

I get a traceback on shutdown:

Processing 1 queued command(s)
  + set_stop_cleanly
Traceback (most recent call last):
  File "/home/oliverh/cylc/cylc.git/lib/cylc/run.py", line 75, in main
    server.run()
  File "/home/oliverh/cylc/cylc.git/lib/cylc/scheduler.py", line 957, in run
    proc_pool.handle_results_async()
  File "/home/oliverh/cylc/cylc.git/lib/cylc/mp_pool.py", line 168, in handle_results_async
    callback(value)
  File "/home/oliverh/cylc/cylc.git/lib/cylc/task_proxy.py", line 630, in job_submission_callback
    for line in result.out.splitlines(True):
AttributeError: 'NoneType' object has no attribute 'splitlines'
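The traceback suggests the callback received a result whose output was never populated. One possible guard, sketched here with a hypothetical helper name and not necessarily the fix adopted in this branch, is to treat a missing output as empty:

```python
def iter_output_lines(out):
    """Return the lines of `out`, treating None as empty output."""
    return (out or "").splitlines(True)
```

With this, `for line in iter_output_lines(result.out):` simply does nothing when the command produced no captured output.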

'mail retry delays' : vdr( vtype='interval_minutes_list', default=[] ),
'mail smtp' : vdr( vtype='string' ),
'mail to' : vdr( vtype='string' ),
},
Member

This is like the original cylc event hooks config - one handler and list of events for it - which we changed because it doesn't allow different handlers for different events. Is this intended to replace the current system (just left in for back compat) or do you want to keep both?

Contributor Author

I intend to keep both, so we can have the best of both worlds.

Contributor

Can you add setting of default values for "handlers" and "handler events", as has been done recently for hooks under #1501?

Contributor Author

Yes, in theory. However, this change is an add-on to the [runtime][TASK][event hooks] section, which does not currently have the equivalent of #1501, so I am a bit reluctant to bundle this in as part of this change. I agree that we should have a follow-on to implement #1501 functionality for [runtime][TASK][event hooks] and [runtime][TASK][events].

Contributor

Raised as #1529

@matthewrmshin
Contributor Author

All problems raised should be addressed. I'll document the proposed change and add some new tests next.

@hjoliver
Member

Matt, is this now done as far as you're concerned? If so, I'll take another look (was distracted by zombie processes today, sorry).

@matthewrmshin
Contributor Author

> Matt, is this now done as far as you're concerned? If so, I'll take another look (was distracted by zombie processes today, sorry).

The functionality should be OK. I have yet to find time to add something to the CUG and some automated tests for the new functionality.

@matthewrmshin
Contributor Author

Branch re-based. I have now added docs and some automated tests.

@matthewrmshin
Contributor Author

@hjoliver this is now ready for review.

@hjoliver
Member

(sorry for the delay - the authentication branch just absorbed another day ... I should get on to this tomorrow)

@matthewrmshin
Contributor Author

> (sorry for the delay - the authentication branch just absorbed another day ... I should get on to this tomorrow)

Don't worry. It has taken me forever to get this branch done in the first place.

@matthewrmshin
Contributor Author

Quick summary of what's new:

[runtime]
    [[root]]
        script = true
        [[[remote]]]
            host = my-favourite-host
            # Switch on job log retrieval for this family/task
            retrieve job logs = True
            # Specify initial + retry delays, default is to run immediately
            retrieve job logs retry delays = PT5S, PT30S, PT1M
        [[[events]]]
            # Email on these events
            mail events = submission failed, submission retry, failed, retry
            # You can even modify "from:", "to:" and the SMTP server's "host:port"
            # mail from = notifications@$HOSTNAME
            # mail to = $USER
            # mail smtp = localhost:25
            # Can specify an initial + retry delays
            # mail retry delays = PT0S
            #
            # Generic event handlers for specified events
            handler = hello-event-handler '%(point)s' '%(name)s' '%(submit_num)s' '%(event)s'
            handler events = succeeded, failed
            # Can specify an initial + some retry delays
            # This setting applies to specific event handlers as well.
            # handler retry delays = PT0S
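The "retry delays" settings above take a list of ISO 8601 durations. A rough sketch of how such a list could drive the attempts, reading the first entry as the initial delay and the rest as delays before each retry (this reading, and both function names, are assumptions; the parser only covers simple `PT[nH][nM][nS]` forms):

```python
import re

_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_seconds(text):
    """Convert a simple ISO 8601 duration such as PT1M to seconds.

    Only whole-number PT[nH][nM][nS] forms are handled here.
    """
    match = _DURATION.match(text.strip())
    if match is None or not any(match.groups()):
        raise ValueError("unsupported duration: %r" % text)
    hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def run_with_delays(action, delays, sleep):
    """Call `action` after each delay until it returns True.

    An empty `delays` list means one immediate attempt.
    """
    for delay in delays or [0]:
        sleep(delay)
        if action():
            return True
    return False
```

Passing `sleep` in as an argument (for example `time.sleep`) keeps the sketch easy to exercise without actually waiting.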

Each event handler can be a template script or a command like before. For a template, the following sub-strings will be substituted:

  • %(event)s: event name
  • %(suite)s: suite name
  • %(point)s: cycle point
  • %(name)s: task name
  • %(submit_num)s: job submit number associated with event
  • %(id)s: task ID, i.e. task name dot cycle point
  • %(message)s: event message

(The classic interface is equivalent to command '%(event)s' '%(suite)s' '%(id)s' '%(message)s'.)
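The `%(key)s` placeholders listed above match Python's mapping-based `%` string formatting, so the substitution can be sketched as below. The `render_handler` helper and the sample values are hypothetical, and the actual substitution code in the branch may differ.

```python
TEMPLATE = (
    "hello-event-handler '%(point)s' '%(name)s' "
    "'%(submit_num)s' '%(event)s'"
)


def render_handler(template, **context):
    """Fill a handler template using Python's mapping-based % formatting."""
    return template % context


command = render_handler(
    TEMPLATE,
    event="succeeded", suite="bar", point="20150808T0000Z",
    name="foo", submit_num=1, id="foo.20150808T0000Z", message="succeeded",
)
```

Keys that the template does not mention are ignored, which is why every handler can be given the same full context. Values are substituted as-is in this sketch, so shell quoting would need separate attention.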

@matthewrmshin
Contributor Author

(Just broken the tests slightly by my latest change. Will fix soon.)

@matthewrmshin
Contributor Author

Branch re-based, broken tests fixed.

@hjoliver
Member

The test battery passes in my environment.

Although ... tests/events/09-task-event-mail.t gets screwed by a $HOME/.mailrc file that sets an smtp server ... can this be handled somehow in the test?
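One way this might be handled, assuming the mail is sent through a mailx-style client that honours the POSIX `MAILRC` environment variable (an assumption about the setup, not something confirmed in this thread), is to run the test with `MAILRC` pointing at an empty file. The helper name is hypothetical.

```python
import os


def isolated_mail_env(base_env=None):
    """Copy the environment with MAILRC pointed at an empty file."""
    env = dict(os.environ if base_env is None else base_env)
    env["MAILRC"] = os.devnull
    return env
```

In a shell-based test the equivalent would be exporting `MAILRC=/dev/null` before the suite is run.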

the suite host.
although they can be retrieved by right-clicking on the task in the GUI. If you
want the job logs pulled back to the suite host automatically, you can set
\lstinline@[[[remote]]]retrieve job log=True@:
Member

This looks funny when processed because it is directly above a long-format code block. Maybe expand it more in text, "you can set 'retrieve job log = True' under the task [[[remote]]] section:"?

@matthewrmshin
Contributor Author

Long hyperlink fixed. Branch re-based.

@matthewrmshin
Contributor Author

I have updated my what's new quick summary.

@matthewrmshin
Contributor Author

@arjclark please review-2.

Ensure that event handler entries go to the job-activity.log with the
correct submit number.
@@ -173,9 +173,69 @@ class CylcSuiteDAO(object):
MAX_TRIES = 100
TABLE_BROADCASTS = "broadcasts"
TABLE_TASK_JOBS = "task_jobs"
TABLE_TASK_JOB_LOGS = "task_job_logs"
Contributor

Is this new table going to pose a problem for restarts on older suites, or will it be created automatically if missing?

Contributor Author

Tables are created automatically if they are missing.

Contributor

Cool, thought so but needed to confirm.
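For readers wondering what "created automatically if missing" can look like on restart of an older database, here is a minimal sketch using SQLite's `sqlite_master` catalogue. The `ensure_tables` helper and the column specs in the usage note are illustrative assumptions, not the code in `CylcSuiteDAO`.

```python
import sqlite3


def ensure_tables(conn, schemas):
    """Create any table in `schemas` (name -> column spec) that is missing.

    Returns the names of the tables that had to be created.
    """
    existing = set(
        row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"))
    created = []
    for name in sorted(schemas):
        if name not in existing:
            conn.execute("CREATE TABLE %s(%s)" % (name, schemas[name]))
            created.append(name)
    conn.commit()
    return created
```

Called against a database that already has `task_jobs` but not `task_job_logs`, this would create only the latter and leave existing data untouched.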

@arjclark
Contributor

arjclark commented Jul 6, 2015

Getting an error from the test battery under ./tests/database/00-simple.t, looks like the test just needs updating:

-CREATE TABLE task_jobs(cycle TEXT, name TEXT, submit_num INTEGER, is_manual_submit INTEGER, try_num INTEGER, time_submit TEXT, time_submit_exit TEXT, submit_status TEXT, time_run TEXT, time_run_exit TEXT, run_signal TEXT, run_status TEXT, user_at_host TEXT, batch_sys_name TEXT, batch_sys_job_id TEXT);
+CREATE TABLE task_jobs(cycle TEXT, name TEXT, submit_num INTEGER, is_manual_submit INTEGER, try_num INTEGER, time_submit TEXT, time_submit_exit TEXT, submit_status INTEGER, time_run TEXT, time_run_exit TEXT, run_signal TEXT, run_status INTEGER, user_at_host TEXT, batch_sys_name TEXT, batch_sys_job_id TEXT);

@arjclark
Contributor

arjclark commented Jul 6, 2015

^ presumably accidentally broken by matthewrmshin@c015453

@matthewrmshin
Contributor Author

Test fixed.

@arjclark
Contributor

arjclark commented Jul 7, 2015

Looks good to me. Tests are passing in my environment. Follow-on work is captured in issues #1528 and #1529.

arjclark added a commit that referenced this pull request Jul 7, 2015
@arjclark arjclark merged commit 5938836 into cylc:master Jul 7, 2015
@matthewrmshin matthewrmshin modified the milestones: next release, soon Jul 7, 2015
@matthewrmshin matthewrmshin deleted the hook branch July 7, 2015 12:44
Development

Successfully merging this pull request may close these issues.

Extend the argument list supplied to task event handlers
3 participants