Use sentinel value to deal with optional coerced files #4994
(may want to remove monkeypatching entirely)
library for task outputs
…github.com:DataBiosphere/toil into issues/4988-defer-virtualization-for-coerced-files
This looks pretty good, but I think it needs more documentation for the new parameters it adds. I also am concerned we might be missing a virtualization pass that we still need. (So some high-level documentation in a long comment about what our general plan is with regard to when files are or are not local paths, and when they are or are not null, and when they are or are not the magic sentinel URIs, might be useful.)
src/toil/wdl/wdltoil.py
Outdated
@@ -622,6 +624,8 @@ def evaluate_output_decls(output_decls: List[WDL.Tree.Decl], all_bindings: WDL.E
output_bindings: WDL.Env.Bindings[WDL.Value.Base] = WDL.Env.Bindings()
for output_decl in output_decls:
    output_value = evaluate_decl(output_decl, all_bindings, standard_library)
    drop_if_missing_with_workdir = partial(drop_if_missing, work_dir=getattr(standard_library, "_execution_dir"))
Maybe we want to make `standard_library` a `ToilWDLStdLibBase` so we don't need `getattr`.
Also, it starts with `_` and so maybe isn't really meant to be accessed here; maybe we want to be calling a method that gets this instead? Or maybe we want to rename the attribute if we need it outside the stdlib class itself?
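One way to read this suggestion, as a minimal sketch: annotate the parameter with the subclass and expose the directory through a public accessor. `StdLibBase`, `ToilStdLib`, `execution_dir()`, and `working_directory()` are stand-ins invented for illustration; they are not miniwdl's or Toil's actual classes or methods.

```python
from typing import Optional


class StdLibBase:
    """Stand-in for the upstream base class (hypothetical)."""


class ToilStdLib(StdLibBase):
    """Stand-in for the subclass that knows its execution directory."""

    def __init__(self, execution_dir: Optional[str] = None) -> None:
        self._execution_dir = execution_dir

    def execution_dir(self) -> str:
        # Public accessor, so callers never reach for the private attribute.
        return self._execution_dir if self._execution_dir is not None else "."


def working_directory(standard_library: ToilStdLib) -> str:
    # Typed against the subclass: no getattr() and no string attribute name.
    return standard_library.execution_dir()
```

With the subclass in the signature, a type checker can verify the call without the `getattr` workaround.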
It would probably be better if every `standard_library` were typed as `ToilWDLStdLibBase`; this was mainly done for mypy, as mypy thinks the type is `WDL.StdLib.Base` and not `ToilWDLStdLibBase`.
I can probably turn the `getattr` into a method, although the current logic may not be sufficient. Another issue I found is that this will coerce all nonexistent files to null, even if the specified type is not optional. For example:
...
output {
File f = "nonexistent.txt"
}
...
This wasn't an issue previously as we ensured file existence at type coercion.
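The distinction being described can be sketched as follows. This is only an illustration of the intended behavior, assuming the declared type is reduced to an `optional` flag; `resolve_output_file` is a hypothetical name, not a function from the PR.

```python
import os
from typing import Optional


def resolve_output_file(filename: str, optional: bool, work_dir: str = ".") -> Optional[str]:
    """Return the path if it exists, None for a missing optional file, else raise."""
    # Relative names are resolved against work_dir; absolute names pass through.
    path = os.path.join(work_dir, filename)
    if os.path.exists(path):
        return path
    if optional:
        # Declared as File?: a missing file becomes null.
        return None
    # Declared as File: a missing file is an error, not a silent null.
    raise FileNotFoundError(f"Required output file does not exist: {path}")
```

Under this sketch, `File f = "nonexistent.txt"` would fail, while the same declaration typed `File?` would evaluate to null.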
This is actually issue #5006.
src/toil/wdl/wdltoil.py
Outdated
def __init__(self, file_store: AbstractFileStore, execution_dir: Optional[str] = None, enforce_nonexistence: bool = True):
    """
    Set up the standard library.
What does `enforce_nonexistence` actually do? Does it cause things to not exist? What happens if we don't enforce it?
I find the name a bit confusing so it definitely needs some documentation.
src/toil/wdl/wdltoil.py
Outdated
enforce_nonexistence: bool = True) -> str:
    """
    Download or export a WDL virtualized filename/URL to the given directory.
We would also want to document enforce_nonexistence in this docstring.
src/toil/wdl/wdltoil.py
Outdated
@@ -888,6 +899,7 @@ def devirtualize_to(filename: str, dest_dir: str, file_source: Union[AbstractFil
logger.debug("Virtualized file %s is already a local path", filename)

if not os.path.exists(result):
    # Devirtualizing an unvirtualized file means the file is coerced from a string and never used/virtualized
I think the real point of this block is to catch cases where such a string has made it from one context to another without going through the virtualization/devirtualization that would ensure the file it refers to is actually available too.
Do we now expect to be passing a bunch of local paths to the devirtualize function? If we mostly expect to hit this block when we try to access a string filename that doesn't exist, more than we expect to hit it when a file has escaped its node without being virtualized, then maybe we want to change the resulting error message?
I think this comment was carried through accidentally from a previous implementation before using sentinel values.
src/toil/wdl/wdltoil.py
Outdated
def drop_if_missing(value_type: WDL.Type.Base, filename: str, work_dir: str) -> Optional[str]:
    """
    Return None if a file doesn't exist, or its path if it does.
    """
The docstring here should document `value_type` and `work_dir`. It seems like maybe the filename is assumed to belong to a WDL value of the given type (which might be `File` or `File?`), and that relative paths are interpreted relative to `work_dir`, but that should all be in the docstring.
Also probably `filename` should be the first argument, since that's what we're really operating on, and then `value_type` which is extra information about this file for error reporting, and finally `work_dir` since that's environment-level information that would stay the same over multiple calls (and could even sensibly have a default value of `"."`).
I guess we got this argument order from `map_over_typed_files_in_bindings`, so maybe we just need to leave it. Or we could change the order `map_over_typed_files_in_bindings` passes things in.
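A small sketch of why the type-first order can stay as it is: the mapping helper passes the type and the filename positionally, and `work_dir` is bound once by keyword with `functools.partial`. `map_over_typed_files` and `describe` are toy stand-ins written for this example, and WDL types are reduced to plain strings.

```python
from functools import partial
from typing import Callable, List, Optional, Tuple

# (declared WDL type as a string, filename) pairs stand in for real bindings.
TypedFile = Tuple[str, str]


def map_over_typed_files(files: List[TypedFile],
                         transform: Callable[[str, str], Optional[str]]) -> List[Optional[str]]:
    # The helper passes the type first and the filename second, positionally.
    return [transform(value_type, filename) for value_type, filename in files]


def describe(value_type: str, filename: str, work_dir: str = ".") -> Optional[str]:
    # work_dir comes last and has a default, so it can be bound once up front.
    return f"{value_type} {work_dir}/{filename}"


describe_in_job_dir = partial(describe, work_dir="/job")
results = map_over_typed_files([("File", "a.txt"), ("File?", "b.txt")], describe_in_job_dir)
```

Reordering the first two arguments would mean changing every transform passed to the helper, whereas adding a default for the trailing `work_dir` touches only this one function.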
Changing the argument order of `map_over_typed_files_in_value` may require tracking down all transform functions, so maybe I'll leave the current argument order. Maybe there's a reason those transform functions wanted the type first?
src/toil/wdl/wdltoil.py
Outdated
- run_job = WDLTaskJob(self._task, virtualize_files(bindings, standard_library), virtualize_files(runtime_bindings, standard_library), self._task_id, self._namespace, self._task_path, cores=runtime_cores or self.cores, memory=runtime_memory or self.memory, disk=runtime_disk or self.disk, accelerators=runtime_accelerators or self.accelerators, wdl_options=self._wdl_options)
+ run_job = WDLTaskJob(self._task, bindings, runtime_bindings, self._task_id, self._namespace, self._task_path, cores=runtime_cores or self.cores, memory=runtime_memory or self.memory, disk=runtime_disk or self.disk, accelerators=runtime_accelerators or self.accelerators, wdl_options=self._wdl_options)
Don't we still need to virtualize here? What if one of the task input defaults or body decls is like `File the_file = write_lines(["a"])` and then we want to use its path in the command? It needs to get shipped from the job that evaluated the decls to the job that runs the command.
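The concern can be illustrated with a toy model of two jobs that share only a file store. `ToyFileStore` and `run_two_jobs` are hypothetical and do not reflect Toil's `AbstractFileStore` API; only the `toilfile:` scheme is taken from the diff.

```python
import os
import shutil
import tempfile
import uuid
from typing import Dict

TOIL_URI_SCHEME = "toilfile:"


class ToyFileStore:
    """Hypothetical stand-in for a shared file store."""

    def __init__(self, root: str) -> None:
        self._root = root
        self._names: Dict[str, str] = {}

    def virtualize(self, local_path: str) -> str:
        # Copy the file into shared storage and hand back a portable URI.
        file_id = uuid.uuid4().hex
        shutil.copy(local_path, os.path.join(self._root, file_id))
        self._names[file_id] = os.path.basename(local_path)
        return TOIL_URI_SCHEME + file_id

    def devirtualize(self, uri: str, dest_dir: str) -> str:
        # Materialize the stored file in the consuming job's own directory.
        file_id = uri[len(TOIL_URI_SCHEME):]
        dest = os.path.join(dest_dir, self._names[file_id])
        shutil.copy(os.path.join(self._root, file_id), dest)
        return dest


def run_two_jobs() -> str:
    with tempfile.TemporaryDirectory() as base:
        store_dir, job_a, job_b = (os.path.join(base, n) for n in ("store", "a", "b"))
        for d in (store_dir, job_a, job_b):
            os.mkdir(d)
        store = ToyFileStore(store_dir)
        # Job A evaluates the declaration and writes a file only it can see.
        written = os.path.join(job_a, "lines.txt")
        with open(written, "w") as f:
            f.write("a\n")
        uri = store.virtualize(written)
        # Job B runs the command; it receives the URI, not job A's path.
        with open(store.devirtualize(uri, job_b)) as f:
            return f.read()
```

If job A handed job B the raw local path instead of the URI, job B would have no way to reach the file when the two jobs run on different nodes.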
src/toil/wdl/wdltoil.py
Outdated
# Since we process nonexistent files in WDLTaskWrapperJob as those must be run locally, don't try to devirtualize them
standard_library = ToilWDLStdLibBase(file_store, enforce_nonexistence=False)
This sounds like `enforce_nonexistence` might be the wrong name; it really is enforcing existence or detecting nonexistence as an error.
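A sketch of what a clearer flag could look like, assuming the behavior described here (a missing file is an error when the flag is set). The name `enforce_existence` and the function `devirtualize_local` are hypothetical; the PR does not show which name was eventually chosen.

```python
import os


def devirtualize_local(filename: str, enforce_existence: bool = True) -> str:
    """
    Toy stand-in for a devirtualization step on a plain local path.

    :param filename: local path to check.
    :param enforce_existence: if True, a missing file raises an error;
        if False, the path is returned unchecked for a later step to handle.
    """
    if enforce_existence and not os.path.exists(filename):
        raise FileNotFoundError(f"File does not exist: {filename}")
    return filename
```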
src/toil/wdl/wdltoil.py
Outdated
def monkeypatch_coerce(standard_library: ToilWDLStdLibBase, null_nonexistent_files: bool = False) -> Generator[None, None, None]:
    """
    Monkeypatch miniwdl's WDL.Value.Base.coerce() function to virtualize files when they are represented as Strings.
We probably should document `null_nonexistent_files` here. What's the default behavior for nonexistent files when it isn't set?
I think I forgot to remove this from an earlier implementation.
@@ -574,6 +575,7 @@ def recursive_dependencies(root: WDL.Tree.WorkflowNode) -> Set[str]:
# in the same destination directory, when dealing with basename conflicts.

TOIL_URI_SCHEME = 'toilfile:'
TOIL_NONEXISTENT_URI_SCHEME = 'nonexistent:'
We should explain our clever plan of using a whole bunch of different `nonexistent:blah` URIs to represent distinct nonexistent files somewhere, including some kind of reference back to the WDL spec for why we need it. Maybe here?
Otherwise people will see this and scratch their heads and be tempted to simplify it down to just `None` or a single nonexistent sentinel value, since they won't know why that can't work.
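A rough sketch of the idea, under two assumptions that the diff does not confirm: that the original path is carried inside the sentinel, and that the decision between null and error is deferred until the declared type is known. The helper names and the percent-encoding are illustrative choices; only the `nonexistent:` scheme is taken from the diff.

```python
from typing import Optional
from urllib.parse import quote, unquote

TOIL_NONEXISTENT_URI_SCHEME = "nonexistent:"


def mark_nonexistent(path: str) -> str:
    # Keep the original path inside the sentinel so distinct missing files stay distinct.
    return TOIL_NONEXISTENT_URI_SCHEME + quote(path, safe="/")


def is_nonexistent(uri: str) -> bool:
    return uri.startswith(TOIL_NONEXISTENT_URI_SCHEME)


def resolve(uri: str, optional: bool) -> Optional[str]:
    # Decide only once the declared type (File vs File?) is known.
    if not is_nonexistent(uri):
        return uri
    if optional:
        return None
    missing = unquote(uri[len(TOIL_NONEXISTENT_URI_SCHEME):])
    raise FileNotFoundError(f"File does not exist: {missing}")
```

In this sketch, collapsing every missing file to `None` or to one shared sentinel would lose the path needed for the error message and make two different missing files indistinguishable.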
…-for-coerced-files' into issues/4988-defer-virtualization-for-coerced-files
I think my comments have been addressed.
filename represents a URI or file name belonging to a WDL value of type value_type. work_dir represents
the current working directory of the job and is where all relative paths will be interpreted from
These could be `:param:` directives.
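For illustration, the same information written as `:param:` fields might look like the sketch below. The function body is a simplified assumption that handles only local paths (the real function also deals with URIs), and `value_type` is a plain string here instead of a `WDL.Type.Base`.

```python
import os
from typing import Optional


def drop_if_missing(value_type: str, filename: str, work_dir: str) -> Optional[str]:
    """
    Return None if a file doesn't exist, or its path if it does.

    :param value_type: declared WDL type of the value that owns the file
        (a plain string in this sketch).
    :param filename: URI or file name belonging to a value of that type.
    :param work_dir: working directory of the job; relative paths are
        interpreted from here.
    """
    path = os.path.join(work_dir, filename)
    return path if os.path.exists(path) else None
```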
map_over_files_in_bindings(environment, lambda x: paths.append(x))

def append_to_paths(path: str) -> Optional[str]:
    # Append element and return the element. This is to avoid a logger warning inside map_over_typed_files_in_value()
Maybe we should just stop issuing that warning if we're having to return values we don't care about just to avoid it?
This closes #4988
Changelog Entry
To be copied to the draft changelog by merger:
`File?` type for string to file coercion is now supported (will be nullified)