How to pass deps, products, other objects to pytask-{r|julia|stata} #203

tobiasraabe · 2022-01-24T14:51:01Z

tobiasraabe
Jan 24, 2022
Maintainer

Problem

pytask-r, pytask-julia, pytask-stata, and similar plugins all wrap an executable that runs a script written in a language other than Python. That script contains the actual logic of the task.

The task function in Python is only used to register the task with pytask. The function body is empty since it is replaced with an internal implementation - a wrapper around subprocess.run.

Currently, options to the executable and arguments to the script are passed via the specific decorator of the method. See the readme of pytask-julia for an example. (It is also the only package which differentiates between options for the executable and arguments to the script.)

Passing information like paths to dependencies or products to the script via the command line has one huge problem: Command line arguments are accessed in the script via positional indexing although label-based indexing proved to be one of pytask's strengths.

Another requirement might be to be able to debug the scripts in another language. Unfortunately, most languages do not have a post-mortem debugger like Python which would enable interfaces around pytask --pdb. Instead, users could be given a way to run a script independently of pytask to get feedback more quickly. If dependencies and products are usually dynamically generated and passed to the script by pytask, they need to be persisted such that users can run the scripts on their own.

Answers do not have to provide a full solution, but can also tackle sub-problems. Feel free to also provide feedback, dissatisfaction with the current approach and your use-case.

Answered by tobiasraabe

Jan 25, 2022

tobiasraabe · 2022-01-25T00:06:55Z

tobiasraabe
Jan 25, 2022
Maintainer Author

Exporting inputs to the script to JSON, yaml, etc.

This idea was raised by @hmgaudecker and has two elements.

The decorator

The decorator of the specific packages, @pytask.mark.r or @pytask.mark.julia, only accepts options to the executable.

In addition to that, the decorator accepts two arguments. The first is a function for converting the file arguments to some format like yaml. The second provides the appropriate file ending.

Thus, a decorator has the following signature.

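from __future__ import annotations

from pathlib import Path
from typing import Any, Callable, Iterable


def julia(
    *,
    script: str | Path,
    options: Iterable[str],
    # Name of a builtin converter ("yaml", "json", "toml") or a custom function.
    converter: str | Callable[[dict[str, Any]], str] | None = None,
    file_suffix: str | None = None,
):
    ...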

There will be builtin converters for yaml, json, and toml which can be easily selected by passing a string to converter. To use a custom converter, pass a function to converter and also pass a file_suffix. A converter needs to have the following signature.

import json
from typing import Any


def json_converter(info: dict[str, Any]) -> str:
    # Serialize the nested mapping of script arguments to a JSON string.
    return json.dumps(info)
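
One detail the signature leaves open is how values that JSON cannot represent are handled; json.dumps raises a TypeError for pathlib.Path objects, for example. The following is only a sketch of a custom converter, assuming the info dictionary carries Path values. The function name path_aware_json_converter and the keys depends_on and produces in the sample data are illustrative and not taken from any plugin.

```python
import json
from pathlib import Path
from typing import Any


def path_aware_json_converter(info: dict[str, Any]) -> str:
    """Serialize ``info`` to JSON, turning Path objects into POSIX strings."""

    def _default(obj: Any) -> str:
        # json.dumps calls this hook only for objects it cannot serialize itself.
        if isinstance(obj, Path):
            return obj.as_posix()
        raise TypeError(f"Cannot serialize {type(obj).__name__}")

    return json.dumps(info, default=_default, indent=2)


# Hypothetical input: nested dependencies and products keyed by label.
example = {
    "depends_on": {"data": Path("src/data.csv")},
    "produces": [Path("bld/figure.png")],
}
serialized = path_aware_json_converter(example)
```

Under the proposal above, such a function would be passed as converter together with a file_suffix like ".json".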

The auxiliary file

Arguments to the script are exported to some auxiliary file format like JSON or yaml.

The path to the auxiliary file is passed as the first argument to the script and can be parsed inside the script keeping the nested structure of dependencies and products.

The auxiliary file is stored in some folder inside the build folder. It must be ensured that the filename is unique and persists over multiple pytask runs. When pytask is executed, the file is generated.

Users who want to run a failing task can see the path to the file using pytask collect or by looking at the error report when the task is run.
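
To make the flow concrete, here is a rough sketch of how a wrapper could write the auxiliary file and assemble the command line. The helper names write_aux_file and build_command, the directory layout, and the sample values are all assumptions made for illustration; the plugins may do this differently. The only parts taken from the description above are that the converter produces the file content and that the path to the file is the first argument to the script.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Any, Callable


def write_aux_file(
    info: dict[str, Any],
    converter: Callable[[dict[str, Any]], str],
    directory: Path,
    stem: str,
    file_suffix: str,
) -> Path:
    """Serialize ``info`` with ``converter`` and write it into ``directory``."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{stem}{file_suffix}"
    path.write_text(converter(info))
    return path


def build_command(
    executable: str, options: list[str], script: Path, aux_file: Path
) -> list[str]:
    """Put executable options first and the auxiliary file as first script argument."""
    return [executable, *options, str(script), str(aux_file)]


# Hypothetical values for illustration only.
info = {"depends_on": {"data": "data.csv"}, "produces": {"plot": "plot.png"}}
with tempfile.TemporaryDirectory() as tmp:
    aux = write_aux_file(info, json.dumps, Path(tmp), "task_example", ".json")
    command = build_command("julia", ["--threads", "2"], Path("script.jl"), aux)
    round_trip = json.loads(aux.read_text())
```

The resulting list could be handed to subprocess.run, and a user who wants to run the script by hand only needs the path stored in the last element.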

Uniqueness

Uniqueness can be ensured by using the short name of a task and converting all forward slashes, square brackets, dots and colons to underscores.

task_example.py::task_example[depends_on0-produces0]

to

task_example_py_task_example_depends_on0-produces0_.xxx
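
A small sketch of this conversion, assuming that consecutive special characters collapse into a single underscore (this is inferred from the example, where "::" becomes one "_"). The function name aux_file_stem is hypothetical, and the file suffix is left out because it depends on the converter.

```python
import re


def aux_file_stem(short_name: str) -> str:
    """Replace runs of '/', '[', ']', '.' and ':' with a single underscore."""
    return re.sub(r"[/\[\].:]+", "_", short_name)


stem = aux_file_stem("task_example.py::task_example[depends_on0-produces0]")
```

Note that a mapping like this is not strictly injective: two short names that differ only in which of these characters they use would map to the same stem, so a real implementation might need an additional check.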

4 replies

tobiasraabe Feb 14, 2022
Maintainer Author

I changed the signature of the julia decorator. It now accepts a path-like object pointing to the script. I think this is way better than the current approach, which relies on an obscure key in the dependencies.

Parametrizations are also easier to specify since usually the julia decorator is constant.

Wdyt?

hmgaudecker Feb 15, 2022

Looks great!!!

(I assume you did not change the section "The auxiliary file"? Or did you revert to putting the file in build instead of a subfolder .pytask or so?)

tobiasraabe Feb 15, 2022
Maintainer Author

Just laziness. Never updated it. The files will be in src.

tobiasraabe Apr 19, 2022
Maintainer Author

This approach is now implemented in pytask-r and pytask-julia. Read the readme for an introduction. https://github.com/pytask-dev/pytask-julia

hmgaudecker · 2022-01-25T08:15:51Z

hmgaudecker
Jan 25, 2022

Thanks for summarising (and extending!!!) what I thought of.

A couple of reactions if we go down that route:

My view is that, in order to generate meaningful adoption outside of pure Python, we really need to make it as natural as possible to develop in the usual environment for a given language. The focus on post-mortem debugging above thus is a little bit off -- I do not simply want to check stuff once it goes wrong (and not on the command line!), but during development I want to see the implementation, set breakpoints in my IDE, etc. If there is one thing I learned in the past 10-15 years, it is that we will not fundamentally change the way 98% of part-time software developers (econ PhD students, data scientists, etc.) go about programming...
We should, thus, at least contemplate putting hidden files into the script's directory as opposed to the usual thing in the build folder. Maybe have some machinery for checking whether a file is run from pytask and then one could do:
```
# Pseudocode: run_from_pytask, args, and json.read are placeholders.
if run_from_pytask:
    model_spec = json.read(args[1])
else:
    model_spec = json.read(".task_example_py_task_example_depends_on0-produces0_.xxx")
```
Yes, I know it is not too hard to paste a different path there... Still, little things in the way are a real show-stopper for adoption.

Maybe a middle ground would be to go for the build directory, but provide an option in pytask to display paths relative to the directory where the {R, Julia, Stata}-script lives?
For Stata, we will likely need to write a custom exporter converting the structure to do-files etc. -- there is no way to read JSON & friends into its locals/globals structure (here is an ancient example for a similar use case).
Only sort of related, I probably said that before: If there was some way of making the parametrization work using keywords instead of locations, that would be a real boon. I find them extremely hard to explain as is, let alone see through them in reasonably complex situations.
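
The middle ground mentioned above, displaying the auxiliary file's path relative to the script's directory, could look roughly like the following sketch. The function name display_path and the sample paths are made up for illustration, and posixpath is used only to keep the output platform-independent.

```python
import posixpath


def display_path(aux_file: str, script: str) -> str:
    """Return ``aux_file`` relative to the directory containing ``script``."""
    return posixpath.relpath(aux_file, start=posixpath.dirname(script))


shown = display_path(
    "/project/bld/aux/task_example.json", "/project/src/analysis/script.jl"
)
```

A user working inside the script's directory could then paste the displayed path directly into the script.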

0 replies

janosg · 2022-01-25T09:28:12Z

janosg
Jan 25, 2022

Switching perspectives: Do we need to get paths in or out?

First of all, I like the solution presented above. Just want to present this alternative to have another option to discuss. This solution does not talk about options to the executable, just arguments to the script.

The above discussion assumes that pytask knows the dependencies and targets from the decorator and the problem is to pass that information to a script.

I think for non-Python users it would be more intuitive to do as much as possible in their language and keep the task files minimal. Thus it could be preferable to define the paths of dependencies and targets inside Stata, Matlab, ... as long as there is a way for pytask to infer them.

A general syntax parser that would infer the paths is of course out of scope. But what about a simple pragma? If this was used to run a Python script instead of a task function, it could look like this:

# @pytask-dependency
in_path = "a/b/c.csv"
# do stuff
# @pytask-target
out_path = "d/e/f.png"
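
A scanner for such pragmas could be quite small. The sketch below is only an illustration under strong assumptions: the comment prefix is "#", the pragma sits on its own line, and the next line assigns a plain string literal. The names PRAGMA, ASSIGNMENT, and scan_pragmas are invented here; other languages (Stata, for instance) would need a different comment prefix.

```python
from __future__ import annotations

import re

PRAGMA = re.compile(r"^\s*#\s*@pytask-(dependency|target)\s*$")
ASSIGNMENT = re.compile(r"""^\s*\w+\s*=\s*(["'])(?P<path>.+?)\1\s*$""")


def scan_pragmas(source: str) -> dict[str, list[str]]:
    """Collect string paths assigned on the line directly after a pragma."""
    found: dict[str, list[str]] = {"dependency": [], "target": []}
    kind = None
    for line in source.splitlines():
        if kind is not None:
            # Only the line right after a pragma is inspected.
            match = ASSIGNMENT.match(line)
            if match:
                found[kind].append(match.group("path"))
            kind = None
            continue
        pragma = PRAGMA.match(line)
        if pragma:
            kind = pragma.group(1)
    return found


script = '''\
# @pytask-dependency
in_path = "a/b/c.csv"
# do stuff
# @pytask-target
out_path = "d/e/f.png"
'''
paths = scan_pragmas(script)
```

Anything beyond a literal string, such as a path built from variables, is silently ignored by this sketch, which already hints at the limits of the approach.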

3 replies

hmgaudecker Jan 25, 2022

Interesting!

My initial reaction: Could you describe how this might scale to more complex situations? Specifically:

one benefit of pytask is that it describes many things in a central location and allows sharing them across modules (think SRC, BLD, models, ...).
would parametrizations be possible?

janosg Jan 25, 2022

I think it does not scale very well to complex workflows. I just tried to think of a solution that the typical Stata user would like. As soon as the path is not just a simple string, the parser becomes more complex and probably too complex to be worth it.

hmgaudecker Jan 25, 2022

Always a fine line to walk between making it easy for newcomers and still allowing for complex features. If in doubt, I would err on the former.

This, in general, seems like a good selling point for pytask. It scales much better than anything else I know.
