Implicit depends (data flow) does not re-execute when there is a script change #1186

Closed
gaow opened this issue Jan 25, 2019 · 17 comments

Comments

gaow commented Jan 25, 2019

For the example below:

[test]
output: '1.txt'
print(99)
_output.touch()


[default]
input: '1.txt'
output: '2.txt'
_output.touch()

If I run the example above, then change print(99) to print(999) to introduce a change in the script, I would expect the test step to be re-executed, but it is not. What I think is reasonable behavior: if SoS finds that 1.txt is the output of another step, then any change in that step should trigger a rerun of that step even if 1.txt already exists. If 1.txt is not the output of any other step, then nothing changes.

This is one reason I am a big fan of sos_step because for the workflow below:

[test]
output: '1.txt'
print(99)
_output.touch()


[default]
depends: sos_step('test')
output: '2.txt'
_output.touch()

if I change test, it will be rerun, which is very nice. Previously we used explicit provides, but now we seem to be promoting the use of data flow, and this issue is more problematic in that context.

BoPeng commented Jan 25, 2019

Isn't this the same behavior for provides as well? As far as I remember, we used to have the "check signature if an input file was generated by another step" behavior, and the conclusion was that checking the "source" of existing files is too costly, so we simply use existing files if they exist.

Perhaps we should check what Makefile and snakemake do in this case.

BoPeng commented Jan 25, 2019

Here is the example for make. When I change a.txt or the makefile between make runs, make does not care:

A : a.txt
	echo I am 'A'

a.txt:
	echo making a.txt
	touch a.txt
(sos) [bpeng1@BCBMF5KWK0F8F69:~/sos/sos (master *)]$ make A
echo making a.txt
making a.txt
touch a.txt
echo I am 'A'
I am A
(sos) [bpeng1@BCBMF5KWK0F8F69:~/sos/sos (master *)]$ make A
echo I am 'A'
I am A
(sos) [bpeng1@BCBMF5KWK0F8F69:~/sos/sos (master *)]$ echo 'a' > a.txt
(sos) [bpeng1@BCBMF5KWK0F8F69:~/sos/sos (master *)]$ make A
echo I am 'A'
I am A
(sos) [bpeng1@BCBMF5KWK0F8F69:~/sos/sos (master *)]$ mvim makefile . # change the make file
(sos) [bpeng1@BCBMF5KWK0F8F69:~/sos/sos (master *)]$ make A
echo I am 'A'
I am A

The solution for make is to have an explicit clean target (written by the user).

gaow commented Jan 25, 2019

Isn't this the same behavior for provides as well?

I think so. My "workaround" is to use sos_step() instead.

Yes, I was aware of the make behavior, but I always thought Snakemake was different (more advanced). It turns out it is not:

rule test1:
  output: '1.txt'
  shell: "echo 998 && touch 1.txt"

rule test2:
  input: '1.txt'
  output: '2.txt'
  shell: "touch 2.txt"
 
rule all:
  input: '2.txt'

save it as Snakefile, then snakemake:

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	test1
	1

rule test1:
    output: 1.txt
    jobid: 0

999
Finished job 0.
1 of 1 steps (100%) done

change the echo part to a different number, then snakemake

Nothing to be done.

Maybe it is fair enough not to check it, at least by default? But what if I want to check it explicitly -- what can I do other than sos_step?

gaow commented Jan 25, 2019

Hmm I think the workaround now is simple!

[test]
output: '1.txt'
print(9977)
_output.touch()


[default]
input: output_from('test')
output: '2.txt'
_output.touch()

This is very nice!

BoPeng commented Jan 25, 2019

The problem here is that it is difficult to know which files have been generated before. We used to have single-file signatures (a bunch of .sig files, one per output) so we could check if any input file had been touched. The problems were too many .sig files and poor performance, so we removed this feature when we consolidated the signature handling.

gaow commented Jan 25, 2019

But when I try to "abuse" it a bit,

[test]
output: '1.txt'
print(997)
_output.touch()


[default]
depends: output_from('test')
output: '2.txt'
_output.touch()

it gives me the error:

ERROR: Function output_from can only be used in input statements

which is completely understandable. When depends and input are the same, we can always make the dependency an input. But there are cases when they are not, for example:

depends: output_from('some_step')
input: for_each = 'some_var'

then I guess we have to resort to sos_step?
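For example, a rough sketch of that sos_step fallback (the step name 'some_step' and the values of some_var are placeholders for illustration):

[global]
# placeholder variable, only for illustration
some_var = ['a', 'b', 'c']

[some_step]
output: '1.txt'
_output.touch()

[default]
depends: sos_step('some_step')
input: for_each = 'some_var'
# each substep sees one element of some_var as _some_var
print(_some_var)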

The problem here is that it is difficult to know what file has been generated before.

Sure, we should not check it if it is not straightforward. I'm just trying to find a way to achieve it when there is a need. The boundary between depends and input can be vague here.

BoPeng commented Jan 25, 2019

We can certainly allow output_from in depends but in your case what you actually need is depends: sos_step, right?

gaow commented Jan 25, 2019

Well, maybe it is not clear from the MWE above, but in practice it is more like depends: output_from -- because I cannot think of a better syntax for it ... For example in this notebook:

https://github.com/zouyuxin/GTEx/blob/master/pipeline/mashr_flashr_workflow.ipynb

please jump to the bottom of it. You see in Cell 11,

depends: vhat_data
input: posterior_input, group_by = 1

because I am not sure how else to mix grouped input with something shared by all groups. But if there were no grouping involved, I would really like to use output_from(f'vhat_{vhat}') like I did in previous steps; for example, see Cell 10.

Currently it works for Cell 11, except that it does not track the changes. I could change it to sos_step(), I guess, but it reads a bit inconsistently considering the rest of my code is data oriented ...

BoPeng commented Jan 25, 2019

I meant that when you need output_from without using the output, what you need is the step information, that is, sos_step. In your example, it seems like you need something like

depends: 'a.txt'

but it does not work because it only checks if a.txt exists. Then you need

depends: named_output("a.txt")  # wrong here

to actually refer to the step that produces the output.

So perhaps we can implement

depends: 'a.txt'

as the unnamed version of named_output() that actually depends on the step?
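For reference, named_output() is currently used on the input side, along the lines of this rough sketch (the step, label and file names are made up):

[simulate]
# the keyword 'res' gives this output a name that other steps can refer to
output: res='a.txt'
_output.touch()

[default]
input: named_output('res')
output: 'b.txt'
_output.touch()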

gaow commented Jan 25, 2019

as the unnamed version of named_output() that actually depends on the step?

So by doing this, at least when we explicitly use depends, we will try to track and check changes in its source? The performance should be acceptable since we do not do it often? What are the possible cons of this change?

BoPeng commented Jan 25, 2019

We advertise depends mostly as a statement to build step dependencies, and sos_variables and sos_step are used in this fashion. It would not be too much of a stretch to say that named_output() and file_target() also try to build step dependencies. Right now named_output() is not allowed in depends but is conceptually agreeable; file_target is a bit more complicated because we certainly should allow depends: 'system_file', but we can say that a file_target in depends will first try to find the dependency in the workflow, then check if it is on disk, and only then raise an error.

So in the end we can provide makefile-style data flow with

input: 'a.txt'

and process-style data flow (meaning all steps will be executed) with

depends: 'a.txt'
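For concreteness, a rough sketch of the two styles side by side, reusing the pattern of the first example in this thread (step and file names are illustrative):

[gen_a]
output: 'a.txt'
print(99)
_output.touch()

# makefile-style: gen_a runs only if a.txt is missing
[make_style]
input: 'a.txt'
output: 'b.txt'
_output.touch()

# process-style: gen_a is added to the DAG even if a.txt exists,
# so edits to gen_a can trigger its re-execution
[process_style]
depends: 'a.txt'
output: 'c.txt'
_output.touch()

With the change discussed here, running the process_style step should pick up edits to gen_a even when a.txt is already on disk, which is the behavior requested at the top of this issue.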

gaow commented Jan 25, 2019

Sounds good! I think our interpretations of process- vs outcome-oriented workflows are also converging now.

BoPeng commented Jan 26, 2019

OK, pending further testing, this is what I have done:

  1. Dependent targets will try to add their upstream steps even if the targets already exist.
  2. Existing input targets will not try to add their upstream steps.
  3. Targets passed with -t TARGETS are handled as depends. Previously nothing was done if the targets existed; now SoS will run the steps that generate them, but it will do nothing if a target exists and there is no step to generate it.

This, however, only happens during DAG building, when depends can be resolved statically. That is to say, while

[10]
depends: '1.txt'
_output.touch()

works,

[10]
input: 'file.bam'
depends: _input.with_suffix('.bam.bai')

will only work in the makefile style. Because this depends cannot be resolved statically, the step will be executed: it proceeds if file.bam.bai exists and tries to resolve file.bam.bai if it does not. Trying to resolve file.bam.bai after entering step 10 is technically, and even logically, difficult (because we would need to keep track of who generates each target to avoid infinite loops).

BoPeng pushed a commit to vatlab/sos-docs that referenced this issue Jan 26, 2019
BoPeng commented Jan 26, 2019

Updated https://vatlab.github.io/sos-docs/doc/user_guide/step_dependencies.html because this sounds like a conceptually important addition to the dependency-building mechanism of SoS. I also added a table to summarize the syntax: https://vatlab.github.io/sos-docs/doc/user_guide/step_dependencies.html#Summary

BoPeng closed this as completed Jan 26, 2019
gaow commented Jan 26, 2019

Great! It looks like the code change was drastic ... sorry for the trouble, but I believe the behavior is more intuitive and powerful than before. I think users will find the usage much the same, thanks to the summary table.

Because this depends cannot be resolved statically, the step will be executed: it proceeds if file.bam.bai exists and tries to resolve file.bam.bai if it does not.

This is an understandable limitation, and I never use depends after input anyway. So far I have not found it limiting.

BoPeng commented Jan 26, 2019

As usual, the limitation could be resolved, and doing so would make things conceptually cleaner, but there will be a cost. I will create a new ticket for it.

BoPeng commented Jan 27, 2019

A lot of changes were introduced to handle the case of dynamic depends. That is to say,

[BAI: provides='{filename}.bam.bai']
_output.touch()

[BAM]
output: 'a.bam'
_output.touch()

[default]
input: 'a.bam'
depends: _input.with_suffix('.bam.bai')

will trigger the rerun of BAI even if a.bam.bai already exists. The code basically

  1. detects if the step has dynamic depends;
  2. if so, sends the evaluated depends targets to the master workflow;
  3. the master workflow then tries to resolve these targets;
  4. if more nodes are added, the current step is put on hold and waits for the completion of the upstream steps;
  5. otherwise the master tells the step to go ahead.

I somewhat doubt whether we have overdone this feature, because neither GNU make nor snakemake does this, and we are potentially making SoS run slower with these changes. However, the code only applies to dynamic depends, so it should affect only a very small proportion of use cases, and this last patch completes our claimed feature on depends.
