job file: improve style #2141

matthewrmshin · 2017-02-02T16:26:06Z

Separate the job script into static part and generated part.

Static part now lives in lib/cylc/job.sh, which will be loaded by the job file in run time.
Generated part will continue to be generated by cylc.job_file. It is now quite small and contains:
- The header, unchanged.
- Essential environment variables to provide an initial environment, at the top.
- Logic fragments in shell functions for cylc environment, user environment, user script, etc.
- A line to load the static job script library, and a line to call its main function.
- The footer, unchanged.

Reduce number of environment variables in job file.

Mark old variables as duplicates.
Move communication related variables to contact file.
Remove variables that should not be needed by anything.

Improve diagnostics.

Job script and cylc message no longer repeat one another.
Report only relevant items for jobs. E.g. don't report suite host, port, etc.
New err-script allows user defined scripts to run at the end of the error trap. Close Dump environment to stderr on job fail? #2150.

Quote and brace shell and environment variable values.

Safer - in case there are spaces and other special characters in file system paths.
Logic easier to read.

Remove space from keys in "job_conf".

(Rationale behind this change is that I want to review the logic for job script generation before I move on to look at #2043.)

matthewrmshin · 2017-02-02T20:00:22Z

(I'd also quite like to be able to separate the job script into little functions. Static parts can then reside somewhere in CYLC_DIR and sourced in by the generated job script at run time. The generated part of the job script should then become quite small, with only the essential parts visible to users.)

hjoliver · 2017-02-05T19:40:54Z

Brilliant 💥

matthewrmshin · 2017-02-06T16:15:16Z

Please feel free to raise any concerns, objections, etc that you may feel with this change. In particular:

Shell is now restricted to bash only.
STDOUT for a job will look a bit different.
Task will now fail if an existing suite has pre-script, script and post-script fragments that only work if they are concatenated together. E.g.: if we have the start of a block in pre-script which only ends in post-script then the new logic will fail.
Is there anything else that we ought to change or discard?
Is it worth adding some syntax checking of the generated script (with bash -n)?

matthewrmshin · 2017-02-06T19:29:03Z

Having another thought about this.

E.g.: if we have the start of a block in pre-script which only ends in post-script then the new logic will fail.

I have not heard of a use case for this, but real users often do the unexpected.

I have separated pre-script, script and post-script in 3 functions. There is no reason for doing this, except for making them stand alone as logic units. I can combine the 3 functions into one, so exotic logic like the above can continue to work.

Is it worth adding bash syntax checking of the generated script (with bash -n)?

This is now essential because I have positioned error trapping in the static main function. This means bash syntax error in user scripts will not be trapped. A quick bash -n command on the job file after it is generated should not cost too much, but can be quite nice - so the job does not have to enter a batch system queue and then fail due to a bash syntax error. Alternatively, if we are happy to keep pre-script, script and post-script separated, then we can in theory run bash -n on these settings at validation - which is more efficient for the suite - as we only have to do it once at start up.

Please comment on preference.

oliver-sanders

Looks good to me. I think that syntax checking is a good idea. Will approve once finalised.

oliver-sanders · 2017-02-07T14:53:07Z

lib/cylc/job_file.py

@@ -89,21 +84,32 @@ def write(self, local_job_file_path, job_conf):
        os.chmod(local_job_file_path, mode)

    @classmethod


These class methods could be static methods

hjoliver

This is looking really good, but I'm not quite done yet, sorry (including thinking about your preference question above...)

hjoliver · 2017-02-08T05:48:50Z

lib/cylc/job.sh

+    if declare -F "cylc::job::inst::${NAME}" 1>'/dev/null' 2>&1; then
+        "cylc::job::inst::${NAME}"
+    fi
+}


Insert blank line, for consistency?

matthewrmshin · 2017-02-08T11:38:19Z

(I have now added bash -n syntax checking at the point when the job file is generated.)

hjoliver · 2017-02-08T20:37:58Z

Shell is now restricted to bash only.

I can still set [[[job]]]shell=/bin/ksh, so I guess that item (and its documentation) needs to be removed. (BTW I've seen a lot of ksh scripts from the MO over the years - do your users not set that in their suites?)

hjoliver · 2017-02-08T20:56:30Z

E.g.: if we have the start of a block in pre-script which only ends in post-script then the new logic will fail.

I have not heard of a use case for this, but real users often do the unexpected.

I can't think of a use case either, so I'm fine with this so long as we document that the script items have to be valid on their own.

hjoliver · 2017-02-08T21:26:09Z

Alternatively, if we are happy to keep pre-script, script and post-script separated, then we can in theory run bash -n on these settings at validation - which is more efficient for the suite - as we only have to do it once at start up.

I like this idea, but perhaps only as a non-default option - it might really slow down validation of very large suites?

Actually we might want run time checking anyway, to handle changes via broadcast or edit runs.

hjoliver · 2017-02-08T21:30:26Z

This means bash syntax error in user scripts will not be trapped.

I was concerned about this, but it seems you do mean literally just syntax errors, which are now handled by bash -n; any command with non-zero exit status is still trapped as normal.

hjoliver

Just the ksh question above, otherwise all good here.

matthewrmshin · 2017-02-09T09:00:43Z

On ksh, we have advertised to our users that cylc job script should all be written in bash, so I hope we don't have any suites doing [[[job]]]shell=/bin/ksh. I'll remove reference to ksh from the docs.

On trapping syntax error... Yes, only bash syntax error is essential until the error trap is switched on - because everything is wrapped in functions, nothing actually gets executed until the main function is called. However, if bash fails to parse the script, it will not get to the main function.

trwhitcomb · 2017-02-09T20:21:52Z

ACK! We still are using Korn shell, although most of the scripts should validate as bash scripts as we're not doing anything really fancy. For the sake of our operational partners, we also have the -S set in the job submission tasks.

Apologies that this is late to the game - I've been behind and decided to drop in and see what was new with pull requests!

hjoliver · 2017-02-10T05:12:50Z

@matthewrmshin - looks like @trwhitcomb has thrown a spanner in the works! Our own fault for support ksh in the first place I suppose.

Tim, if you look at a job script generated by a suite running with this branch, it's a beautiful thing - all the boiler-plate code is gone, leaving just what users want to see. But as the job file becomes more sophisticated it gets harder to support both shells because they have less in common in terms of function syntax etc.

If your users don't use screeds of inlined shell scripting (?do they?) it's very likely their tasks will still work as-is under bash. What do you think - is this a justifiable change to force on users?

matthewrmshin · 2017-02-10T09:24:34Z

@trwhitcomb Thanks for the info. Some quick questions:

(As @hjoliver already asked...) Is the usage of ksh in the job script just out of habit (like some of our users)? Or are they using some advanced ksh93-only syntax? If the former, the script is likely sh, ksh88 and bash compatible and should just port to bash with no issue. The latter is trickier, but it should still be possible to make some simple change to port the logic into bash. Alternatively, one can always move the logic out into its own command in the suite's bin/ directory. What do you think?
What does set -S do? It does not appear to be documented in man ksh.

Reduce number of environment variables in job file. * Mark old variables as duplicates. * Move communication related variables to contact file. * Remove variables that are pointless. Improve diagnostics. * Job script and `cylc message` no longer repeat one another. * Report only relevant items for jobs. E.g. don't report suite host, port, etc. Quote shell and environment variable values. Remove space from keys in "job_conf".

Static part of the job script now pull out, and loaded into the job script in runtime. This allows the generated part of the job script to be reasonably small.

matthewrmshin · 2017-02-13T09:32:28Z

if the worst comes to the worst, would it be hard to make this work under ksh too?

Anything is possible. If we only need to consider ksh93 and not ksh88, I'll only need to change:

local to typeset when declaring variables in functions.
Function names can no longer use :: as package delimiters. (This is really for readability only, so no big deal.) I'll use double underscore __ instead.
I'll leave PS4 prompt as default + in ksh.
typeset -f instead of declare -F to check for existence of a function.

matthewrmshin · 2017-02-13T12:18:52Z

I have now brought back support for ksh (but only ksh93).

I have also introduced err-script to allow users to configure custom logic to call at the end of the error trap.

oliver-sanders · 2017-02-13T12:40:57Z

doc/suiterc.tex

+This is any custom script to be invoked at the end the error trap.
+The output of this will always be sent to STDERR and \lstinline=$1= is set to
+the name of the signal caught by the error trap. The script should fast and
+uses very little system resource to ensure that the error trap can return


oliver-sanders · 2017-02-13T14:56:24Z

tests/jobscript/11-shell-ksh/suite.rc

+        graph = foo
+[runtime]
+    [[foo]]
+        script = test "${CYLC_TASK_SUBMIT_NUMBER}" -ge 2


Could potentially use some ksh only syntax e.g. print.

hjoliver · 2017-02-13T22:05:15Z

doc/cug.tex

+
+When a task is ready cylc generates a {\em task job script} that encapsulates
+the runtime settings (environment and scripts) defined for it in the suite.rc
+file. The job script is submitted to run by the {\em job submission method}


This old terminology should be updated ("job submission method" --> "batch system")

hjoliver · 2017-02-13T22:07:08Z

doc/cug.tex

+
+Therefore, commands in user scripts should abort with non-zero exit status if a
+fatal error occurs (this is just good coding practice anyway). To avoid a
+command that expects a non-zero return code from triggering a trap, you can use


Probably should to "to prevent a command..."

hjoliver · 2017-02-13T22:07:51Z

doc/cug.tex

-Like other runtime properties, you can set a suite default job submission
-method and override it for specific tasks or families:
+When a task is ready cylc generates a job script (see~\ref{JobScripts}). The
+job script is submitted to run by the {\em job submission method} chosen for


"batch system"

hjoliver · 2017-02-13T22:09:45Z

doc/suiterc.tex


 \begin{myitemize}
 \item {\em type:} string
 \item {\em default:} (none)
 \item {\em example:} \lstinline@env-script = "echo Hello World"@
 \end{myitemize}

+\paragraph[err-script]{[runtime] \textrightarrow [[\_\_NAME\_\_]] \textrightarrow err-script}
+
+This is any custom script to be invoked at the end the error trap.


"... at the end of the error trap, if the task fails."?

hjoliver · 2017-02-13T22:13:20Z

doc/suiterc.tex

+location.
+{\em Note: It has no bearing on any sub-shells that may be called by the job script.}
+
+Setting this to the path of a ksh93 interpretter is supported but not recommended.


One 't' in "interpreter".

Shall we give a deprecation warning for ksh93 (here, and print one if detected on file parsing) - so we can drop support sometime in the future?

hjoliver · 2017-02-13T22:16:23Z

lib/cylc/job.sh

+    export CYLC_TASK_CYCLE_POINT="$(cut -d '/' -f 1 <<<"${CYLC_TASK_JOB}")"
+    export CYLC_TASK_NAME="$(cut -d '/' -f 2 <<<"${CYLC_TASK_JOB}")"
+    # The "10#" part ensures that the submit number is interpretted in base 10.
+    # Otherwise, a zero padded number will be interpretted as an octal.


One 't' in 'interpreted'.

hjoliver · 2017-02-13T22:17:03Z

lib/cylc/job.sh

+}
+
+###############################################################################
+# Run a function in the task job instant file, if possible.


"instance"?

hjoliver · 2017-02-13T22:29:11Z

A few minor doc and comment typos, otherwise this is all good from my perspective.

matthewrmshin · 2017-02-17T12:22:40Z

I have reinstated CYLC_SUITE_HOST and CYLC_SUITE_OWNER, but only if the contact file is available. These variables are used by various suites in our repositories.

hjoliver · 2017-02-17T22:00:24Z

@matthewrmshin -

... but only if the contact file is available

would you mind explaining that (so I don't have to figure it out)?

Also, could you list in one place any other of the old env vars removed?

hjoliver · 2017-02-17T22:08:29Z

A minimal local test suite on this branch now has task messaging failures:

ERROR: task messaging failure.
Traceback (most recent call last):
  File "/home/vagrant/cylc/bin/cylc-message", line 65, in main
    task_message.send(args)
  File "/home/vagrant/cylc/lib/cylc/task_message.py", line 79, in send
    SuiteSrvFilesManager().load_contact_file(self.suite))
  File "/home/vagrant/cylc/lib/cylc/suite_srv_files_mgr.py", line 359, in load_contact_file
    self.FILE_BASE_CONTACT, reg, owner, host, content=True)
  File "/home/vagrant/cylc/lib/cylc/suite_srv_files_mgr.py", line 230, in get_auth_item
    path = os.path.join(os.environ[key], self.DIR_BASE_SRV)
  File "/usr/lib/python2.7/UserDict.py", line 23, in __getitem__
    raise KeyError(key)
KeyError: 'CYLC_SUITE_RUN_DIR_ON_SUITE_HOST'

hjoliver · 2017-02-17T22:35:06Z

Here's a diff of job.out for same task on master vs this branch, the task has script = printenv | grep CYLC.

Questions:

are you sure that some of the removed "Suite and Task Identity" info isn't useful sometimes?
ditto env vars (especially the one mentioned just above!)

1,11c1,3
< JOB SCRIPT STARTING
< cylc Suite and Task Identity:
<   Suite Name  : foo
<   Suite Host  : localhost
<   Suite Port  : 43076
<   Suite Owner : vagrant
<   Task ID     : foo.1
<   Task Host   : vagrant-ubuntu-trusty-64
<   Task Owner  : vagrant
<   Task Submit No.: 1
<   Task Try No.: 1
---
> Suite    : foo
> Task Job : 1/foo/01 (try 1)
> User@Host: vagrant@vagrant-ubuntu-trusty-64
14,17c6
< cylc (scheduler - 2017-02-17T22:18:45Z): started at 2017-02-17T22:18:45Z
< CYLC_SUITE_RUN_DIR_ON_SUITE_HOST=/home/vagrant/cylc-run/foo
< CYLC_TASK_SSH_LOGIN_SHELL=True
< CYLC_DEBUG=False
---
> 2017-02-17T22:17:44Z NORMAL - started
21d9
< CYLC_TASK_MSG_TIMEOUT=30.000000
29d16
< CYLC_MODE=scheduler
40d26
< CYLC_TASK_MSG_RETRY_INTVL=5.000000
43d28
< CYLC_JOB_PID=9475
45,46d29
< CYLC_DIR_ON_SUITE_HOST=/home/vagrant/cylc
< CYLC_TASK_MSG_MAX_TRIES=7
49,50d31
< CYLC_SUITE_PORT=43076
< CYLC_TASK_COMMS_METHOD=default
56,57c37,38
< cylc (scheduler - 2017-02-17T22:18:49Z): succeeded at 2017-02-17T22:18:49Z
< JOB SCRIPT EXITING (TASK SUCCEEDED)
---
> CYLC_TASK_JOB=1/foo/01
> 2017-02-17T22:17:54Z NORMAL - succeeded

hjoliver · 2017-02-17T22:41:30Z

To answer my own question, I think it is clear that none of those variables would be useful in user scripting (although perhaps occasionally useful diagnostic info?). So just internal use of CYLC_SUITE_RUN_DIR_ON_SUITE_HOST is the problem.

matthewrmshin · 2017-02-18T11:23:05Z

Sorry. I'll try to fix the issue on Monday. I don't think things like port number needs to be in the job environment. They are now in the contact file. Other variables such as job pid are now in the job status file.

matthewrmshin · 2017-02-18T11:27:13Z

The contact file is now not written to the remote job host if communication method is "poll". It does not make sense to have a contact file when backward communication is not possible.

matthewrmshin · 2017-02-18T21:02:55Z

(Travis CI is happy again.)

hjoliver

Thanks, all good now.

trwhitcomb · 2017-02-20T17:42:05Z

Hi guys - sorry for the delay coming back around on this. Was all set then we lost power for a few days!

I'm OK with this now that I've had a chance to look at what we're doing. For posterity, here are the answers to questions from @matthewrmshin and why I think we're fine now:

Reason for using Korn shell for scripts is worse than habit - it's the policy decided on a long time ago. Given the choices, I'm thankful they didn't go with c shell. So, all our task scripts are written in Korn shell. I suspect that about 99% of them will run just fine as Bash.
The "-S" I refer to is actually the PBS directive, telling it to interpret the task script as Korn shell instead.

I don't think we'll be able to convert all our scripts over to Bash due to the aforementioned policy. However, almost all our tasks have a "script" command that just points out to run another script, so you'd just be in the situation where the Bash script written out by cylc will call a Korn shell script written by us. Any inline scripting we have looks to be bash and Korn compatible.

Nice work, guys!

matthewrmshin · 2017-02-20T19:40:36Z

@trwhitcomb you should now be okay as we have brought back ksh support (ksh93 only, however). I can understand policy like what you have described. It hasn't been long since we moved away, and there are still many users uncomfortable with the idea of moving away from ksh.

matthewrmshin added this to the soon milestone Feb 2, 2017

matthewrmshin self-assigned this Feb 2, 2017

matthewrmshin force-pushed the job-file branch 2 times, most recently from cc80989 to b76c9de Compare February 3, 2017 16:07

matthewrmshin force-pushed the job-file branch from b76c9de to 186609a Compare February 6, 2017 12:29

matthewrmshin requested review from hjoliver and oliver-sanders February 6, 2017 15:56

matthewrmshin assigned hjoliver and oliver-sanders Feb 6, 2017

oliver-sanders reviewed Feb 7, 2017

View reviewed changes

matthewrmshin force-pushed the job-file branch from 7850a68 to 5f60d7b Compare February 7, 2017 19:56

hjoliver reviewed Feb 8, 2017

View reviewed changes

matthewrmshin force-pushed the job-file branch from 5f60d7b to e6b784a Compare February 8, 2017 09:18

hjoliver reviewed Feb 8, 2017

View reviewed changes

matthewrmshin mentioned this pull request Feb 9, 2017

Dump environment to stderr on job fail? #2150

Closed

matthewrmshin added 2 commits February 10, 2017 14:57

job file: separate static/generated logic

d8067bc

Static part of the job script now pull out, and loaded into the job script in runtime. This allows the generated part of the job script to be reasonably small.

Jobscript: err-script, and bring back ksh support

b2ce97b

oliver-sanders reviewed Feb 13, 2017

View reviewed changes

Fix doc + more ksh jobscript tests

05560de

hjoliver requested changes Feb 13, 2017

View reviewed changes

Update doc + add deprecation for ksh usage

c48623a

matthewrmshin modified the milestones: next release, soon Feb 15, 2017

This was referenced Feb 16, 2017

Fix tests metomi/rose#2055

Merged

Strftime Jinja2 filter #2162

Merged

Reinstate CYLC_SUITE_HOST and CYLC_SUITE_OWNER

60e7115

Fix usage of CYLC_SUITE_RUN_DIR_ON_SUITE_HOST

fc54349

hjoliver approved these changes Feb 18, 2017

View reviewed changes

oliver-sanders approved these changes Feb 20, 2017

View reviewed changes

oliver-sanders merged commit 1f5d9d2 into cylc:master Feb 20, 2017

matthewrmshin mentioned this pull request Mar 27, 2017

client commands in tasks: detect ssh comms #2214

Closed

matthewrmshin deleted the job-file branch May 11, 2017 14:16

dpmatthews mentioned this pull request Jan 31, 2023

job: bash syntax check #5331

Open

		@@ -89,21 +84,32 @@ def write(self, local_job_file_path, job_conf):
		os.chmod(local_job_file_path, mode)

		@classmethod

job file: improve style #2141

job file: improve style #2141

Conversation

matthewrmshin commented Feb 2, 2017 • edited Loading

matthewrmshin commented Feb 2, 2017 • edited Loading

hjoliver commented Feb 5, 2017

matthewrmshin commented Feb 6, 2017 • edited Loading

matthewrmshin commented Feb 6, 2017 • edited Loading

oliver-sanders left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hjoliver left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

matthewrmshin commented Feb 8, 2017

hjoliver commented Feb 8, 2017

hjoliver commented Feb 8, 2017

hjoliver commented Feb 8, 2017 • edited Loading

hjoliver commented Feb 8, 2017

hjoliver left a comment

Choose a reason for hiding this comment

matthewrmshin commented Feb 9, 2017

trwhitcomb commented Feb 9, 2017 • edited Loading

hjoliver commented Feb 10, 2017

matthewrmshin commented Feb 10, 2017

matthewrmshin commented Feb 13, 2017

matthewrmshin commented Feb 13, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hjoliver commented Feb 13, 2017 • edited Loading

matthewrmshin commented Feb 17, 2017

hjoliver commented Feb 17, 2017

hjoliver commented Feb 17, 2017

hjoliver commented Feb 17, 2017

hjoliver commented Feb 17, 2017

matthewrmshin commented Feb 18, 2017

matthewrmshin commented Feb 18, 2017

matthewrmshin commented Feb 18, 2017

hjoliver left a comment

Choose a reason for hiding this comment

trwhitcomb commented Feb 20, 2017

matthewrmshin commented Feb 20, 2017

matthewrmshin commented Feb 2, 2017 •

edited

Loading

matthewrmshin commented Feb 2, 2017 •

edited

Loading

matthewrmshin commented Feb 6, 2017 •

edited

Loading

matthewrmshin commented Feb 6, 2017 •

edited

Loading

hjoliver left a comment •

edited

Loading

hjoliver commented Feb 8, 2017 •

edited

Loading

trwhitcomb commented Feb 9, 2017 •

edited

Loading

hjoliver commented Feb 13, 2017 •

edited

Loading