Add caching support for CWL #5187

stxue1 · 2024-12-17T03:55:37Z

Resolves #4298

Adds support for the cwltool equivalent --cachedir. This should make toil-cwl-runner be cache aware and use previous steps when possible and properly restart when there are new. This is different than the --restart flag. Jobs previously ran with --cachedir can rerun with --cachedir and not with --restart. --restart should be used to run failed jobs that should succeed. If the CWL needs editing, then caching should be used, although this will take significantly more storage space compared to the default behavior. Ideally, this should only be used for development purposes.

Changelog Entry

To be copied to the draft changelog by merger:

Add caching to toil-cwl-runner. Use --cachedir [dir] to enable and rerun previously cached jobs.

Reviewer Checklist

Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
- If it is coming from an external repo, make sure to pull it in for CI with:
```
contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
```
- If there is no associated issue, create one.
Read through the code changes. Make sure that it doesn't have:
- Addition of trailing whitespace.
- New variable or member names in camelCase that want to be in snake_case.
- New functions without type hints.
- New functions or classes without informative docstrings.
- Changes to semantics not reflected in the relevant docstrings.
- New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
- New features without tests.
Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
Finish the review with an overall description of your opinion.

Merger Checklist

Make sure the PR passed tests, including the Gitlab tests, for the most recent commit in its branch.
Make sure the PR has been reviewed. If not, review it. If it has been reviewed and any requested changes seem to have been addressed, proceed.
Merge with the Github "Squash and merge" feature.
- If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
Copy its recommended changelog entry to the Draft Changelog.
Append the issue number in parentheses to the changelog entry.

mr-c

Thank you for this @stxue1 !

I tested this PR using the CWL conformance tests; without --cachedir they still pass.

Alas, if I add --cachedir then 134 of the conformance tests fail (though 242 test DO pass!)

stxue1 · 2024-12-18T00:15:40Z

Thank you for this @stxue1 !

I tested this PR using the CWL conformance tests; without --cachedir they still pass.

Alas, if I add --cachedir then 134 of the conformance tests fail (though 242 test DO pass!)

Thanks, I tested it on my machine and I think I tracked down the bug. I think it has to do with how we move around files from cwltool to Toil. We do some relocation of outputs from when we call cwltool internally which is messing some things up. It seems like setting a cache directory will set the current working directory of the file to be written to.

When we execute the Toil job's side, I think we set the destination directory as the jobstore, resulting in the outputs being relocated eventually without copying/symlinking.

Since caching depends on cwltool behavior, it is probably best to either copy the files into the jobstore (or figure out a way for the jobstore to be cache aware?). I have yet to find an entrypoint to control the relocateOutputs behavior on the Toil side though.

mr-c · 2024-12-18T19:42:53Z

I have yet to find an entrypoint to control the relocateOutputs behavior on the Toil side though.

We can add one on the cwltool side, if needed.

stxue1 · 2024-12-18T22:27:40Z

I have yet to find an entrypoint to control the relocateOutputs behavior on the Toil side though.

We can add one on the cwltool side, if needed.

I believe I was able to find an entrypoint

mr-c · 2024-12-19T19:53:15Z

Alas, if I add --cachedir then 134 of the conformance tests fail (though 242 test DO pass!)

As of 68f88e0, 128 tests fail.

stxue1 · 2024-12-19T20:12:08Z

Alas, if I add --cachedir then 134 of the conformance tests fail (though 242 test DO pass!)

As of 68f88e0, 128 tests fail.

@mr-c Was this ran with a clean cache directory? If the cache directory was populated with the previous broken version of toil-cwl-runner then the runner will look up files from previous cached runs, and they won't exist. I ran some of the tests on my machine and they seem okay so far. Also, how many tests in parallel were run? It might be possible it could be a synchronization issue.

mr-c · 2024-12-19T20:21:39Z

Alas, if I add --cachedir then 134 of the conformance tests fail (though 242 test DO pass!)

As of 68f88e0, 128 tests fail.

@mr-c Was this ran with a clean cache directory? If the cache directory was populated with the previous broken version of toil-cwl-runner then the runner will look up files from previous cached runs, and they won't exist. I ran some of the tests on my machine and they seem okay so far. Also, how many tests in parallel were run? It might be possible it could be a synchronization issue.

Clean directory, -j8. I can try without parallel, and l will confirm that cwltool can run with caching in parallel

mr-c · 2024-12-19T21:53:01Z

I ran some of the tests on my machine and they seem okay so far.

FYI: Here is my invocation run from the root of a checkout of https://github.com/common-workflow-language/cwl-v1.2

./run_test.sh RUNNER=toil-cwl-runner EXTRA="--clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache"

Running not in parallel, I still get 127 failures.

with RUNNER=cwltool I am able to run in parallel (with 6 failures, though I think they are specific to my development environment); And re-running using the previous cache directory repeats the outcome.

stxue1 · 2024-12-20T20:30:07Z

I ran some of the tests on my machine and they seem okay so far.

FYI: Here is my invocation run from the root of a checkout of https://github.com/common-workflow-language/cwl-v1.2

./run_test.sh RUNNER=toil-cwl-runner EXTRA="--clean=always --relax-path-checks --podman --cachedir=/home/michael/cwl-v1.2/toil-cache"

Running not in parallel, I still get 127 failures.

with RUNNER=cwltool I am able to run in parallel (with 6 failures, though I think they are specific to my development environment); And re-running using the previous cache directory repeats the outcome.

Oddly enough, running the same command on my machine (but without --podman) only gives me 2 failing tests:

$ ./run_test.sh RUNNER=toil-cwl-runner EXTRA="--clean=always --relax-path-checks --cachedir=cache"
...

Test [232/378] cwl_requirements_override_static: Test conflicting requirements in input document via EnvVarRequirement
Test 232 failed: /home/heaucques/Documents/toil/venv3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --cachedir=cache --outdir=/tmp/tmpzie43tg5 --quiet tests/env-tool3.cwl tests/env-job4.yaml
Test conflicting requirements in input document via EnvVarRequirement
Compare failure expected: {
    "out": {
        "checksum": "sha1$715e62184492851512a020c36ab7118eca114a59",
        "class": "File",
        "location": "out",
        "size": 23
    }
}
got: {
    "out": {
        "basename": "out",
        "checksum": "sha1$b3ec4ed1749c207e52b3a6d08c59f31d83bff519",
        "class": "File",
        "location": "file:///tmp/tmpzie43tg5/out",
        "nameext": "",
        "nameroot": "out",
        "path": "/tmp/tmpzie43tg5/out",
        "size": 15
    }
}
caused by: failed comparison for key 'out': expected: {
    "checksum": "sha1$715e62184492851512a020c36ab7118eca114a59",
    "class": "File",
    "location": "out",
    "size": 23
}
got: {
    "basename": "out",
    "checksum": "sha1$b3ec4ed1749c207e52b3a6d08c59f31d83bff519",
    "class": "File",
    "location": "file:///tmp/tmpzie43tg5/out",
    "nameext": "",
    "nameroot": "out",
    "path": "/tmp/tmpzie43tg5/out",
    "size": 15
}
caused by: Output file checksums do not match: actual 'sha1$b3ec4ed1749c207e52b3a6d08c59f31d83bff519' is not equal to expected 'sha1$715e62184492851512a020c36ab7118eca114a59'

...

Test [337/378] iwd-container-entryname3: Test input mount locations when container is a hint (should fail)
Test 337 failed: /home/heaucques/Documents/toil/venv3.12/bin/toil-cwl-runner --clean=always --relax-path-checks --cachedir=cache --outdir=/tmp/tmpfeo35a5x --quiet tests/iwd/iwd-container-entryname3.cwl tests/loadContents/input.yml
Test input mount locations when container is a hint (should fail)
Returned zero but it should be non-zero

I'll run it again with podman and see if anything changes.

stxue1 and others added 3 commits December 2, 2024 13:15

Add cachedir argument and make sure directories exist

8f1b3d1

Add a test

e83fe3d

Merge branch 'master' into issues/4298-cwl-job-restart-edit

7d52d00

mr-c mentioned this pull request Dec 17, 2024

CWL: Fix --no-container & prepull #5188

Open

19 tasks

mr-c requested changes Dec 17, 2024

View reviewed changes

stxue1 added 2 commits December 18, 2024 14:28

Do a copy when not bypassing filestore and when caching is enabled

121afa1

Fix test

68f88e0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add caching support for CWL #5187

Add caching support for CWL #5187

stxue1 commented Dec 17, 2024 •

edited

Loading

mr-c left a comment

stxue1 commented Dec 18, 2024

mr-c commented Dec 18, 2024

stxue1 commented Dec 18, 2024

mr-c commented Dec 19, 2024

stxue1 commented Dec 19, 2024 •

edited

Loading

mr-c commented Dec 19, 2024

mr-c commented Dec 19, 2024

stxue1 commented Dec 20, 2024

Add caching support for CWL #5187

Are you sure you want to change the base?

Add caching support for CWL #5187

Conversation

stxue1 commented Dec 17, 2024 • edited Loading

Changelog Entry

Reviewer Checklist

Merger Checklist

mr-c left a comment

Choose a reason for hiding this comment

stxue1 commented Dec 18, 2024

mr-c commented Dec 18, 2024

stxue1 commented Dec 18, 2024

mr-c commented Dec 19, 2024

stxue1 commented Dec 19, 2024 • edited Loading

mr-c commented Dec 19, 2024

mr-c commented Dec 19, 2024

stxue1 commented Dec 20, 2024

stxue1 commented Dec 17, 2024 •

edited

Loading

stxue1 commented Dec 19, 2024 •

edited

Loading