Implicit depends (data flow) does not re-execute when there is a script change #1186

Closed
gaow opened this issue Jan 25, 2019 · 17 comments

Comments

gaow commented Jan 25, 2019

For the example below:

[test]
output: '1.txt'
print(99)
_output.touch()


[default]
input: '1.txt'
output: '2.txt'
_output.touch()

If I run the example above, then change print(99) to print(999) to introduce a change in the script, I would expect the test step to be re-executed, but it is not. What I think is reasonable behavior: if SoS finds that 1.txt is the output of another step, then any change in that step should trigger a rerun of that step even if 1.txt already exists. If 1.txt is not the output of any other step, then nothing changes.

This is one reason I am a big fan of sos_step because for the workflow below:

[test]
output: '1.txt'
print(99)
_output.touch()


[default]
depends: sos_step('test')
output: '2.txt'
_output.touch()

if I change test, it will be rerun, which is very nice. Previously we used explicit provides, but now we seem to be promoting the use of data flow, and this issue is more problematic in that context.

BoPeng commented Jan 25, 2019

Isn't this the same behavior for provides as well? As far as I remember, we used to have the "check signature if an input file was generated by another step" behavior, and the conclusion was that checking the "source" of existing files is too costly, so we simply use existing files if they exist.

Perhaps we should check what Makefile and snakemake do in this case.

BoPeng commented Jan 25, 2019

Here is the example for make. When I change a.txt or the makefile between make runs, make does not care:

A : a.txt
	echo I am 'A'

a.txt:
	echo making a.txt
	touch a.txt
(sos) [bpeng1@BCBMF5KWK0F8F69:~/sos/sos (master *)]$ make A
echo making a.txt
making a.txt
touch a.txt
echo I am 'A'
I am A
(sos) [bpeng1@BCBMF5KWK0F8F69:~/sos/sos (master *)]$ make A
echo I am 'A'
I am A
(sos) [bpeng1@BCBMF5KWK0F8F69:~/sos/sos (master *)]$ echo 'a' > a.txt
(sos) [bpeng1@BCBMF5KWK0F8F69:~/sos/sos (master *)]$ make A
echo I am 'A'
I am A
(sos) [bpeng1@BCBMF5KWK0F8F69:~/sos/sos (master *)]$ mvim makefile . # change the make file
(sos) [bpeng1@BCBMF5KWK0F8F69:~/sos/sos (master *)]$ make A
echo I am 'A'
I am A

The solution for make is to have an explicit clean target (written by the user).

gaow commented Jan 25, 2019

Isn't this the same behavior for provides as well?

I think so. My "workaround" is to use sos_step() instead.

Yes, I was aware of the make behavior, but I always thought Snakemake was different (more advanced). It turns out it is not:

rule test1:
  output: '1.txt'
  shell: "echo 998 && touch 1.txt"

rule test2:
  input: '1.txt'
  output: '2.txt'
  shell: "touch 2.txt"
 
rule all:
  input: '2.txt'

save it as Snakefile, then snakemake:

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	test1
	1

rule test1:
    output: 1.txt
    jobid: 0

999
Finished job 0.
1 of 1 steps (100%) done

change the echo part to a different number, then snakemake

Nothing to be done.

Maybe it is fair enough not to check it, at least by default? But what if I want to check it explicitly -- what can I do other than sos_step?

gaow commented Jan 25, 2019

Hmm I think the workaround now is simple!

[test]
output: '1.txt'
print(9977)
_output.touch()


[default]
input: output_from('test')
output: '2.txt'
_output.touch()

This is very nice!

BoPeng commented Jan 25, 2019

The problem here is that it is difficult to know which files have been generated before. We used to have single-file signatures (a bunch of .sig files, one per output) so we could check if any input file had been touched. The problems were too many .sig files and poor performance, so we removed this feature when we consolidated the signature handling.

gaow commented Jan 25, 2019

But when I try to "abuse" it a bit,

[test]
output: '1.txt'
print(997)
_output.touch()


[default]
depends: output_from('test')
output: '2.txt'
_output.touch()

it gives me the error:

ERROR: Function output_from can only be used in input statements

which is completely understandable. When depends and input are the same, we can always make the dependency an input. But there are cases when they are not, for example:

depends: output_from('some_step')
input: for_each = 'some_var'

then I guess we have to resort to sos_step?
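For example, a rough sketch of that sos_step fallback (the step name 'some_step' and the values of some_var are placeholders for illustration):

[global]
# placeholder variable, only for illustration
some_var = ['a', 'b', 'c']

[some_step]
output: '1.txt'
_output.touch()

[default]
depends: sos_step('some_step')
input: for_each = 'some_var'
# each substep sees one element of some_var as _some_var
print(_some_var)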

The problem here is that it is difficult to know what file has been generated before.

Sure, we should not check it if it is not straightforward. I'm just trying to find a way to achieve it when there is a need. The boundary between depends and input can be vague here.

BoPeng commented Jan 25, 2019

We can certainly allow output_from in depends but in your case what you actually need is depends: sos_step, right?

gaow commented Jan 25, 2019

Well, maybe it is not clear from the MWE above, but in practice it is more like depends: output_from -- because I cannot think of a better syntax for it ... For example in this notebook:

https://github.com/zouyuxin/GTEx/blob/master/pipeline/mashr_flashr_workflow.ipynb

please jump to the bottom of it. You see in Cell 11,

depends: vhat_data
input: posterior_input, group_by = 1

because I am not sure how else to mix grouped input with something shared by all groups. But if there were no grouping involved, I would really like to use output_from(f'vhat_{vhat}') like I did in previous steps; for example, see Cell 10.

Currently it works for Cell 11, except that it does not track the changes. I could change it to sos_step(), I guess, but it reads a bit inconsistently considering the rest of my code is data oriented ...

BoPeng commented Jan 25, 2019

I meant that when you need output_from without using the output, what you need is the step information, that is, sos_step. In your example, it seems like you need something like

depends: 'a.txt'

but it does not work because it only checks if a.txt exists. Then you need

depends: named_output("a.txt")  # wrong here

to actually refer to the step that produces the output.

So perhaps we can implement

depends: 'a.txt'

as the unnamed version of named_output() that actually depends on the step?
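For reference, named_output() is currently used on the input side, along the lines of this rough sketch (the step, label and file names are made up):

[simulate]
# the keyword 'res' gives this output a name that other steps can refer to
output: res='a.txt'
_output.touch()

[default]
input: named_output('res')
output: 'b.txt'
_output.touch()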

gaow commented Jan 25, 2019

as the unnamed version of named_output() that actually depends on the step?

So by doing this, at least when we explicitly use depends, we will try to track and check changes in its source? The performance should be acceptable since we do not do it often? What are the possible cons of this change?

BoPeng commented Jan 25, 2019

We advertise depends mostly as a statement to build step dependencies, and sos_variables and sos_step are used in this fashion. It would not be too much of a stretch to say that named_output() and file_target() also try to build step dependencies. Right now named_output() is not allowed in depends but is conceptually agreeable; file_target is a bit more complicated because we certainly should allow depends: 'system_file', but we can say that a file_target in depends will first try to find the dependency in the workflow, then check if it is on disk, and only then raise an error.

So in the end we can provide makefile-style data flow with

input: 'a.txt'

and process-style data flow (meaning all steps will be executed) with

depends: 'a.txt'
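For concreteness, a rough sketch of the two styles side by side, reusing the pattern of the first example in this thread (step and file names are illustrative):

[gen_a]
output: 'a.txt'
print(99)
_output.touch()

# makefile-style: gen_a runs only if a.txt is missing
[make_style]
input: 'a.txt'
output: 'b.txt'
_output.touch()

# process-style: gen_a is added to the DAG even if a.txt exists,
# so edits to gen_a can trigger its re-execution
[process_style]
depends: 'a.txt'
output: 'c.txt'
_output.touch()

With the change discussed here, running the process_style step should pick up edits to gen_a even when a.txt is already on disk, which is the behavior requested at the top of this issue.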

gaow commented Jan 25, 2019

Sounds good! I think our interpretations of process- vs outcome-oriented workflows are also converging now.

BoPeng commented Jan 26, 2019

OK, pending further testing, this is what I have done:

  1. Dependent targets will try to add their upstream steps even if the targets already exist.
  2. Existing input targets will not try to add their upstream steps.
  3. Targets passed with -t TARGETS are handled as depends. Previously nothing was done if the targets existed; now SoS will run the steps that generate them, but it will do nothing if a target exists and there is no step to generate it.

This, however, only happens during DAG building, when depends can be resolved statically. That is to say, while

[10]
depends: '1.txt'
_output.touch()

works,

[10]
input: 'file.bam'
depends: _input.with_suffix('.bam.bai')

will only work in the makefile style. Because this depends cannot be resolved statically, the step will be executed: it proceeds if file.bam.bai exists and tries to resolve file.bam.bai if it does not. Trying to resolve file.bam.bai after entering step 10 is technically, and even logically, difficult (because we would need to keep track of who generates each target to avoid infinite loops).

BoPeng pushed a commit to vatlab/sos-docs that referenced this issue Jan 26, 2019
BoPeng commented Jan 26, 2019

Updated https://vatlab.github.io/sos-docs/doc/user_guide/step_dependencies.html because this sounds like a conceptually important addition to the dependency-building mechanism of SoS. I also added a table to summarize the syntax: https://vatlab.github.io/sos-docs/doc/user_guide/step_dependencies.html#Summary

BoPeng closed this as completed Jan 26, 2019
gaow commented Jan 26, 2019

Great! It looks like the code change was drastic ... sorry for the trouble, but I believe the behavior is more intuitive and powerful than before. I think users will find the usage much the same, thanks to the summary table.

Because this depends cannot be resolved statically, the step will be executed: it proceeds if file.bam.bai exists and tries to resolve file.bam.bai if it does not.

This is an understandable limitation, and I never use depends after input anyway. So far I have not found it limiting.

BoPeng commented Jan 26, 2019

As usual, the limitation could be resolved, and doing so would make things conceptually cleaner, but there will be a cost. I will create a new ticket for it.

BoPeng commented Jan 27, 2019

A lot of changes were introduced to handle the case of dynamic depends. That is to say,

[BAI: provides='{filename}.bam.bai']
_output.touch()

[BAM]
output: 'a.bam'
_output.touch()

[default]
input: 'a.bam'
depends: _input.with_suffix('.bam.bai')

will trigger the rerun of BAI even if a.bam.bai already exists. The code basically

  1. detects if the step has dynamic depends;
  2. if so, sends the evaluated depends targets to the master workflow;
  3. the master workflow then tries to resolve these targets;
  4. if more nodes are added, the current step is put on hold and waits for the completion of the upstream steps;
  5. otherwise the master tells the step to go ahead.

I somewhat doubt whether we have overdone this feature, because neither GNU make nor snakemake does this, and we are potentially making SoS run slower with these changes. However, the code only applies to dynamic depends, so it should affect only a very small proportion of use cases, and this last patch completes our claimed feature on depends.
