Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job file: improve style #2141

Merged
merged 14 commits into from
Feb 20, 2017
Merged

job file: improve style #2141

merged 14 commits into from
Feb 20, 2017

Conversation

matthewrmshin
Copy link
Contributor

@matthewrmshin matthewrmshin commented Feb 2, 2017

Separate the job script into static part and generated part.

  • Static part now lives in lib/cylc/job.sh, which will be loaded by the job file in run time.
  • Generated part will continue to be generated by cylc.job_file. It is now quite small and contains:
    • The header, unchanged.
    • Essential environment variables to provide an initial environment, at the top.
    • Logic fragments in shell functions for cylc environment, user environment, user script, etc.
    • A line to load the static job script library, and a line to call its main function.
    • The footer, unchanged.

Reduce number of environment variables in job file.

  • Mark old variables as duplicates.
  • Move communication related variables to contact file.
  • Remove variables that should not be needed by anything.

Improve diagnostics.

  • Job script and cylc message no longer repeat one another.
  • Report only relevant items for jobs. E.g. don't report suite host, port, etc.
  • New err-script allows user defined scripts to run at the end of the error trap. Close Dump environment to stderr on job fail? #2150.

Quote and brace shell and environment variable values.

  • Safer - in case there are spaces and other special characters in file system paths.
  • Logic easier to read.

Remove space from keys in "job_conf".

(Rationale behind this change is that I want to review the logic for job script generation before I move on to look at #2043.)

@matthewrmshin matthewrmshin added this to the soon milestone Feb 2, 2017
@matthewrmshin matthewrmshin self-assigned this Feb 2, 2017
@matthewrmshin
Copy link
Contributor Author

matthewrmshin commented Feb 2, 2017

(I'd also quite like to be able to separate the job script into little functions. Static parts can then reside somewhere in CYLC_DIR and sourced in by the generated job script at run time. The generated part of the job script should then become quite small, with only the essential parts visible to users.)

@hjoliver
Copy link
Member

hjoliver commented Feb 5, 2017

Brilliant 💥

@matthewrmshin
Copy link
Contributor Author

matthewrmshin commented Feb 6, 2017

Please feel free to raise any concerns, objections, etc that you may feel with this change. In particular:

  • Shell is now restricted to bash only.
  • STDOUT for a job will look a bit different.
  • Task will now fail if an existing suite has pre-script, script and post-script fragments that only work if they are concatenated together. E.g.: if we have the start of a block in pre-script which only ends in post-script then the new logic will fail.
  • Is there anything else that we ought to change or discard?
  • Is it worth adding some syntax checking of the generated script (with bash -n)?

@matthewrmshin
Copy link
Contributor Author

matthewrmshin commented Feb 6, 2017

Having another thought about this.

E.g.: if we have the start of a block in pre-script which only ends in post-script then the new logic will fail.

I have not heard of a use case for this, but real users often do the unexpected.

I have separated pre-script, script and post-script in 3 functions. There is no reason for doing this, except for making them stand alone as logic units. I can combine the 3 functions into one, so exotic logic like the above can continue to work.

Is it worth adding bash syntax checking of the generated script (with bash -n)?

This is now essential because I have positioned error trapping in the static main function. This means bash syntax error in user scripts will not be trapped. A quick bash -n command on the job file after it is generated should not cost too much, but can be quite nice - so the job does not have to enter a batch system queue and then fail due to a bash syntax error. Alternatively, if we are happy to keep pre-script, script and post-script separated, then we can in theory run bash -n on these settings at validation - which is more efficient for the suite - as we only have to do it once at start up.

Please comment on preference.

Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I think that syntax checking is a good idea. Will approve once finalised.

@@ -89,21 +84,32 @@ def write(self, local_job_file_path, job_conf):
os.chmod(local_job_file_path, mode)

@classmethod
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These class methods could be static methods

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good, but I'm not quite done yet, sorry (including thinking about your preference question above...)

lib/cylc/job.sh Outdated
if declare -F "cylc::job::inst::${NAME}" 1>'/dev/null' 2>&1; then
"cylc::job::inst::${NAME}"
fi
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Insert blank line, for consistency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@matthewrmshin
Copy link
Contributor Author

(I have now added bash -n syntax checking at the point when the job file is generated.)

@hjoliver
Copy link
Member

hjoliver commented Feb 8, 2017

Shell is now restricted to bash only.

I can still set [[[job]]]shell=/bin/ksh, so I guess that item (and its documentation) needs to be removed. (BTW I've seen a lot of ksh scripts from the MO over the years - do your users not set that in their suites?)

@hjoliver
Copy link
Member

hjoliver commented Feb 8, 2017

E.g.: if we have the start of a block in pre-script which only ends in post-script then the new logic will fail.

I have not heard of a use case for this, but real users often do the unexpected.

I can't think of a use case either, so I'm fine with this so long as we document that the script items have to be valid on their own.

@hjoliver
Copy link
Member

hjoliver commented Feb 8, 2017

Alternatively, if we are happy to keep pre-script, script and post-script separated, then we can in theory run bash -n on these settings at validation - which is more efficient for the suite - as we only have to do it once at start up.

I like this idea, but perhaps only as a non-default option - it might really slow down validation of very large suites?

Actually we might want run time checking anyway, to handle changes via broadcast or edit runs.

@hjoliver
Copy link
Member

hjoliver commented Feb 8, 2017

This means bash syntax error in user scripts will not be trapped.

I was concerned about this, but it seems you do mean literally just syntax errors, which are now handled by bash -n; any command with non-zero exit status is still trapped as normal.

Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the ksh question above, otherwise all good here.

@matthewrmshin
Copy link
Contributor Author

On ksh, we have advertised to our users that cylc job script should all be written in bash, so I hope we don't have any suites doing [[[job]]]shell=/bin/ksh. I'll remove reference to ksh from the docs.

On trapping syntax error... Yes, only bash syntax error is essential until the error trap is switched on - because everything is wrapped in functions, nothing actually gets executed until the main function is called. However, if bash fails to parse the script, it will not get to the main function.

@trwhitcomb
Copy link
Collaborator

trwhitcomb commented Feb 9, 2017

ACK! We still are using Korn shell, although most of the scripts should validate as bash scripts as we're not doing anything really fancy. For the sake of our operational partners, we also have the -S set in the job submission tasks.

Apologies that this is late to the game - I've been behind and decided to drop in and see what was new with pull requests!

@hjoliver
Copy link
Member

@matthewrmshin - looks like @trwhitcomb has thrown a spanner in the works! Our own fault for support ksh in the first place I suppose.

Tim, if you look at a job script generated by a suite running with this branch, it's a beautiful thing - all the boiler-plate code is gone, leaving just what users want to see. But as the job file becomes more sophisticated it gets harder to support both shells because they have less in common in terms of function syntax etc.

If your users don't use screeds of inlined shell scripting (?do they?) it's very likely their tasks will still work as-is under bash. What do you think - is this a justifiable change to force on users?

@matthewrmshin
Copy link
Contributor Author

@trwhitcomb Thanks for the info. Some quick questions:

  • (As @hjoliver already asked...) Is the usage of ksh in the job script just out of habit (like some of our users)? Or are they using some advanced ksh93-only syntax? If the former, the script is likely sh, ksh88 and bash compatible and should just port to bash with no issue. The latter is trickier, but it should still be possible to make some simple change to port the logic into bash. Alternatively, one can always move the logic out into its own command in the suite's bin/ directory. What do you think?
  • What does set -S do? It does not appear to be documented in man ksh.

Reduce number of environment variables in job file.
* Mark old variables as duplicates.
* Move communication related variables to contact file.
* Remove variables that are pointless.
Improve diagnostics.
* Job script and `cylc message` no longer repeat one another.
* Report only relevant items for jobs. E.g. don't report suite host, port, etc.
Quote shell and environment variable values.
Remove space from keys in "job_conf".
Static part of the job script now pull out, and loaded into the job
script in runtime.
This allows the generated part of the job script to be reasonably small.
@matthewrmshin
Copy link
Contributor Author

if the worst comes to the worst, would it be hard to make this work under ksh too?

Anything is possible. If we only need to consider ksh93 and not ksh88, I'll only need to change:

  • local to typeset when declaring variables in functions.
  • Function names can no longer use :: as package delimiters. (This is really for readability only, so no big deal.) I'll use double underscore __ instead.
  • I'll leave PS4 prompt as default + in ksh.
  • typeset -f instead of declare -F to check for existence of a function.

@matthewrmshin
Copy link
Contributor Author

I have now brought back support for ksh (but only ksh93).

I have also introduced err-script to allow users to configure custom logic to call at the end of the error trap.

doc/suiterc.tex Outdated
This is any custom script to be invoked at the end the error trap.
The output of this will always be sent to STDERR and \lstinline=$1= is set to
the name of the signal caught by the error trap. The script should fast and
uses very little system resource to ensure that the error trap can return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uses

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

graph = foo
[runtime]
[[foo]]
script = test "${CYLC_TASK_SUBMIT_NUMBER}" -ge 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could potentially use some ksh only syntax e.g. print.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

doc/cug.tex Outdated

When a task is ready cylc generates a {\em task job script} that encapsulates
the runtime settings (environment and scripts) defined for it in the suite.rc
file. The job script is submitted to run by the {\em job submission method}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This old terminology should be updated ("job submission method" --> "batch system")

doc/cug.tex Outdated

Therefore, commands in user scripts should abort with non-zero exit status if a
fatal error occurs (this is just good coding practice anyway). To avoid a
command that expects a non-zero return code from triggering a trap, you can use
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should to "to prevent a command..."

doc/cug.tex Outdated
Like other runtime properties, you can set a suite default job submission
method and override it for specific tasks or families:
When a task is ready cylc generates a job script (see~\ref{JobScripts}). The
job script is submitted to run by the {\em job submission method} chosen for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"batch system"

doc/suiterc.tex Outdated

\begin{myitemize}
\item {\em type:} string
\item {\em default:} (none)
\item {\em example:} \lstinline@env-script = "echo Hello World"@
\end{myitemize}

\paragraph[err-script]{[runtime] \textrightarrow [[\_\_NAME\_\_]] \textrightarrow err-script}

This is any custom script to be invoked at the end the error trap.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"... at the end of the error trap, if the task fails."?

doc/suiterc.tex Outdated
location.
{\em Note: It has no bearing on any sub-shells that may be called by the job script.}

Setting this to the path of a ksh93 interpretter is supported but not recommended.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One 't' in "interpreter".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we give a deprecation warning for ksh93 (here, and print one if detected on file parsing) - so we can drop support sometime in the future?

lib/cylc/job.sh Outdated
export CYLC_TASK_CYCLE_POINT="$(cut -d '/' -f 1 <<<"${CYLC_TASK_JOB}")"
export CYLC_TASK_NAME="$(cut -d '/' -f 2 <<<"${CYLC_TASK_JOB}")"
# The "10#" part ensures that the submit number is interpretted in base 10.
# Otherwise, a zero padded number will be interpretted as an octal.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One 't' in 'interpreted'.

lib/cylc/job.sh Outdated
}

###############################################################################
# Run a function in the task job instant file, if possible.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"instance"?

@hjoliver
Copy link
Member

hjoliver commented Feb 13, 2017

A few minor doc and comment typos, otherwise this is all good from my perspective.

@matthewrmshin matthewrmshin modified the milestones: next release, soon Feb 15, 2017
This was referenced Feb 16, 2017
@matthewrmshin
Copy link
Contributor Author

I have reinstated CYLC_SUITE_HOST and CYLC_SUITE_OWNER, but only if the contact file is available. These variables are used by various suites in our repositories.

@hjoliver
Copy link
Member

@matthewrmshin -

... but only if the contact file is available

would you mind explaining that (so I don't have to figure it out)?

Also, could you list in one place any other of the old env vars removed?

@hjoliver
Copy link
Member

A minimal local test suite on this branch now has task messaging failures:

ERROR: task messaging failure.
Traceback (most recent call last):
  File "/home/vagrant/cylc/bin/cylc-message", line 65, in main
    task_message.send(args)
  File "/home/vagrant/cylc/lib/cylc/task_message.py", line 79, in send
    SuiteSrvFilesManager().load_contact_file(self.suite))
  File "/home/vagrant/cylc/lib/cylc/suite_srv_files_mgr.py", line 359, in load_contact_file
    self.FILE_BASE_CONTACT, reg, owner, host, content=True)
  File "/home/vagrant/cylc/lib/cylc/suite_srv_files_mgr.py", line 230, in get_auth_item
    path = os.path.join(os.environ[key], self.DIR_BASE_SRV)
  File "/usr/lib/python2.7/UserDict.py", line 23, in __getitem__
    raise KeyError(key)
KeyError: 'CYLC_SUITE_RUN_DIR_ON_SUITE_HOST'

@hjoliver
Copy link
Member

Here's a diff of job.out for same task on master vs this branch, the task has script = printenv | grep CYLC.

Questions:

  • are you sure that some of the removed "Suite and Task Identity" info isn't useful sometimes?
  • ditto env vars (especially the one mentioned just above!)
1,11c1,3
< JOB SCRIPT STARTING
< cylc Suite and Task Identity:
<   Suite Name  : foo
<   Suite Host  : localhost
<   Suite Port  : 43076
<   Suite Owner : vagrant
<   Task ID     : foo.1
<   Task Host   : vagrant-ubuntu-trusty-64
<   Task Owner  : vagrant
<   Task Submit No.: 1
<   Task Try No.: 1
---
> Suite    : foo
> Task Job : 1/foo/01 (try 1)
> User@Host: vagrant@vagrant-ubuntu-trusty-64
14,17c6
< cylc (scheduler - 2017-02-17T22:18:45Z): started at 2017-02-17T22:18:45Z
< CYLC_SUITE_RUN_DIR_ON_SUITE_HOST=/home/vagrant/cylc-run/foo
< CYLC_TASK_SSH_LOGIN_SHELL=True
< CYLC_DEBUG=False
---
> 2017-02-17T22:17:44Z NORMAL - started
21d9
< CYLC_TASK_MSG_TIMEOUT=30.000000
29d16
< CYLC_MODE=scheduler
40d26
< CYLC_TASK_MSG_RETRY_INTVL=5.000000
43d28
< CYLC_JOB_PID=9475
45,46d29
< CYLC_DIR_ON_SUITE_HOST=/home/vagrant/cylc
< CYLC_TASK_MSG_MAX_TRIES=7
49,50d31
< CYLC_SUITE_PORT=43076
< CYLC_TASK_COMMS_METHOD=default
56,57c37,38
< cylc (scheduler - 2017-02-17T22:18:49Z): succeeded at 2017-02-17T22:18:49Z
< JOB SCRIPT EXITING (TASK SUCCEEDED)
---
> CYLC_TASK_JOB=1/foo/01
> 2017-02-17T22:17:54Z NORMAL - succeeded

@hjoliver
Copy link
Member

To answer my own question, I think it is clear that none of those variables would be useful in user scripting (although perhaps occasionally useful diagnostic info?). So just internal use of CYLC_SUITE_RUN_DIR_ON_SUITE_HOST is the problem.

@matthewrmshin
Copy link
Contributor Author

Sorry. I'll try to fix the issue on Monday. I don't think things like port number needs to be in the job environment. They are now in the contact file. Other variables such as job pid are now in the job status file.

@matthewrmshin
Copy link
Contributor Author

The contact file is now not written to the remote job host if communication method is "poll". It does not make sense to have a contact file when backward communication is not possible.

@matthewrmshin
Copy link
Contributor Author

(Travis CI is happy again.)

Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, all good now.

@oliver-sanders oliver-sanders merged commit 1f5d9d2 into cylc:master Feb 20, 2017
@trwhitcomb
Copy link
Collaborator

Hi guys - sorry for the delay coming back around on this. Was all set then we lost power for a few days!

I'm OK with this now that I've had a chance to look at what we're doing. For posterity, here are the answers to questions from @matthewrmshin and why I think we're fine now:

  1. Reason for using Korn shell for scripts is worse than habit - it's the policy decided on a long time ago. Given the choices, I'm thankful they didn't go with c shell. So, all our task scripts are written in Korn shell. I suspect that about 99% of them will run just fine as Bash.

  2. The "-S" I refer to is actually the PBS directive, telling it to interpret the task script as Korn shell instead.

I don't think we'll be able to convert all our scripts over to Bash due to the aforementioned policy. However, almost all our tasks have a "script" command that just points out to run another script, so you'd just be in the situation where the Bash script written out by cylc will call a Korn shell script written by us. Any inline scripting we have looks to be bash and Korn compatible.

Nice work, guys!

@matthewrmshin
Copy link
Contributor Author

@trwhitcomb you should now be okay as we have brought back ksh support (ksh93 only, however). I can understand policy like what you have described. It hasn't been long since we moved away, and there are still many users uncomfortable with the idea of moving away from ksh.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants