
Aggregating the output artifacts of parallel steps (fan-in) #934

Closed
jessesuen opened this issue Aug 3, 2018 · 30 comments · Fixed by #4618 or #4948
Labels
type/feature Feature request

Comments

@jessesuen
Member

Is this a BUG REPORT or FEATURE REQUEST?: FEATURE REQUEST

What happened:

Separating this issue from #861 to handle aggregation of output artifacts. Similar to output parameters from steps that have been expanded using loops, we need some mechanism to aggregate artifacts from parallel steps. With parameters, the solution was to introduce a new variable, steps.XXXX.outputs.parameters, exposed as a JSON list. For artifacts we need something similar. The trick is how we would place the aggregated artifacts into a subsequent pod.
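For reference, a minimal sketch of that existing parameter aggregation (the gen/collect template names are illustrative, not from this issue, and both templates are assumed to exist):

    - name: fan-out-fan-in
      steps:
      - - name: gen
          template: gen
          withItems: [a, b, c]
      - - name: collect
          template: collect
          arguments:
            parameters:
            - name: all-results
              # JSON list aggregated from every "gen" iteration
              value: "{{steps.gen.outputs.parameters}}"

This issue is about providing the artifact counterpart of that value.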

@edlee2121
Contributor

edlee2121 commented Aug 30, 2018

When aggregating and importing the artifacts generated by steps in a loop, I think the most useful option may be to allow subsequent steps to reference a composite artifact that merges the contents of all the output artifacts.

Each instance of the loop could use the iteration index or other parameter passed in to appropriately name the output file/directory so that the merge does not overwrite content. This gives the user control over how to name/structure the layout of the output artifact.

Example:
If the output artifact directory in each step is output-artifact-dir, each step would generate a file or artifact named something like output-artifact-dir/<input-param>-artifact.

The composite artifact could be referenced as steps.my-loop-step.outputs.artifacts.name1, where name1 is the name of the artifact generated by the loop's steps. This would create a composite artifact by merging all output artifacts with the name ‘name1’. Note that one could create multiple output artifacts in a loop step using different names.
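A minimal sketch of that per-iteration naming, with illustrative template, parameter, and image names (the composite steps.my-loop-step.outputs.artifacts.name1 reference itself is the proposed part and does not exist yet):

    - name: loop-step
      inputs:
        parameters:
        - name: input-param
      container:
        image: alpine:3
        command: [sh, -c]
        # each iteration writes a uniquely named file into the shared output directory,
        # so merging the per-iteration artifacts would not overwrite content
        args: ["mkdir -p /output-artifact-dir && echo 'result for {{inputs.parameters.input-param}}' > /output-artifact-dir/{{inputs.parameters.input-param}}-artifact"]
      outputs:
        artifacts:
        - name: name1
          path: /output-artifact-dir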

What do you think?

@jessesuen
Member Author

> Each instance of the loop could use the iteration index or other parameter passed in to appropriately name the output file/directory so that the merge does not overwrite content. This gives the user control over how to name/structure the layout of the output artifact.

What if the user does not have control over the location of the output artifact? For example, what if the container being run is some off-the-shelf docker image, and the user wants something like /var/run/mysql.db to be the output artifact?

@jessesuen jessesuen modified the milestones: v2.2, V2.3 Aug 30, 2018
@edlee2121
Contributor

Even with an off-the-shelf image, I think the user could override the command parameter to customize the name of the output file/directory.

If desired, we could support a second form for accessing the composite output artifact. E.g. if the input artifact is specified as steps.my-loop-step.output.artifacts.*, the unpacked input artifacts could be placed at input-artifact-dir/{0,1,2,3,...}/. That is, each input artifact is unpacked into a subdirectory corresponding to an index number.

One drawback with this approach is that it only works well with loops. I think the first approach is more general.

@alexmt alexmt modified the milestones: v2.3, v2.4 Jan 25, 2019
@mamoit

mamoit commented Apr 12, 2019

This seems to be a fairly stale issue, but it would be extremely handy for easily parallelizable tasks such as scraping multiple sources and generating large datasets that need to be merged in the end and analysed as a whole.
Just spent the past couple of hours trying to achieve this 😄

To be more specific, my use case is something like this:

  1. Get a list of names
  2. For each name generate a dataset (this may take some time and hence it would be ideal to have it parallelized)
  3. Gather the datasets that were created in parallel in the previous step and create a unified one.

This last step is exactly the use case for the feature described in this issue; everything else I was able to do.
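A rough skeleton of steps 1 and 2, assuming the name list is emitted as a JSON array and using illustrative template names; the commented-out final step is exactly the artifact fan-in this issue asks for:

    - name: main
      steps:
      - - name: get-names
          template: get-names                 # assumed to print a JSON array such as ["alice","bob"]
      - - name: generate-dataset
          template: generate-dataset          # assumed to produce one dataset per name
          withParam: "{{steps.get-names.outputs.result}}"
          arguments:
            parameters:
            - name: name
              value: "{{item}}"
      # - - name: merge-datasets ...          # the artifact fan-in step this issue asks for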

@jessesuen
Member Author

Since this bug was filed, we now support artifacts with archive: none: {}, and a pattern has developed where parallel steps output to a common s3 "directory". The following step (which needs the aggregated outputs) then downloads the s3 "directory" as an artifact.

@jessesuen jessesuen removed this from the v2.4 milestone Apr 19, 2019
@fj-sanchez

@jessesuen can you point me to that pattern where multiple steps can output to a common s3 directory?

@jessesuen
Member Author

jessesuen commented Apr 19, 2019

Need to write a proper example, but the idea is that you would disable .tgz archiving on a file like so:
https://github.com/argoproj/argo/blob/master/examples/artifact-disable-archive.yaml#L38

And then the subsequent step would recursively download the parent s3 key as a "directory." The enhancement that was made in v2.2 is that if the S3 location appears to be a "directory" instead of a file, it performs a recursive download of all of the contents of that "directory." Directory is in quotes because S3 is really just a key/value store.
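A rough sketch of that pattern, assuming an S3 artifact repository; the endpoint, bucket, keys, and images are illustrative and credential/secret references are omitted:

    - name: produce
      inputs:
        parameters:
        - name: i
      container:
        image: alpine:3
        command: [sh, -c]
        args: ["echo part-{{inputs.parameters.i}} > /tmp/part.txt"]
      outputs:
        artifacts:
        - name: part
          path: /tmp/part.txt
          archive:
            none: {}                     # keep the raw file instead of a .tgz
          s3:
            endpoint: s3.amazonaws.com
            bucket: my-bucket
            key: "{{workflow.name}}/parts/part-{{inputs.parameters.i}}.txt"

    - name: aggregate
      inputs:
        artifacts:
        - name: parts
          path: /tmp/parts               # the key prefix is downloaded recursively as a "directory"
          s3:
            endpoint: s3.amazonaws.com
            bucket: my-bucket
            key: "{{workflow.name}}/parts"
      container:
        image: alpine:3
        command: [sh, -c]
        args: ["cat /tmp/parts/*"]

Because each iteration writes under the same key prefix, the aggregate step only needs the prefix, not the individual iteration keys.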

@fj-sanchez

So, does your container job need to be aware of its "index", or is the template using it to save the artifacts with unique names (e.g. using the item index)?

@mostaphaRoudsari
Contributor

@jessesuen, did you ever get a chance to add an example which uses archive: none: {} to pass outputs to a common s3 directory for parallel tasks?

@laubosslink

laubosslink commented Jul 15, 2019

@jessesuen This would also be extremely interesting for running experiments, FYI (https://www.ovh.com/blog/simplify-your-research-experiments-with-kubernetes/)

@Downchuck

This seems potentially broken for GCS per #1351

@Downchuck

Downchuck commented Aug 1, 2019

@jessesuen: Is there a particular concern in trying to support tasks.X.outputs.artifacts when using loops, and simply extracting all matches? It's up to the workflow designer to ensure the overlay is correct.

@edlee2121 @fj-sanchez I'm suggesting we skip any complexity around indexes and just use unique names.

For example, this is working just fine in a loop to get the artifact onto storage:

    outputs:
      artifacts:
      - name: split
        path: "/tmp/split-{{inputs.parameters.offset}}"

All I'd need in order to pull them out of storage is to loop through the list, much as parameter aggregation does, and unpack the archives into the target folder.

Unfortunately, due to the GCS folders bug (#1351), I can't use the folder-clone technique to just download everything under the {{workspace.name}} key. I did go ahead and write a template that just uses the cloud SDK and gsutil cp to do that work, as mentioned in my comment on that report.

Parameter aggregation is a little funky: {{pod.name}} is evaluated prior to the loop and so is not the same pod that actually writes out the artifact. This PR may address that: #1336

@sebinsua

sebinsua commented Nov 12, 2019

Does anybody have a working example of how to do this?

If not, I'll work something out myself.


Edit 1: One thing I tried was using {inputs,outputs}.artifacts.s3 to access the same S3 bucket and key from different pods; however, I would need them to use a key which is specific to the current workspace.name, and currently it doesn't seem that {{workspace.name}} is interpolated within inputs.artifacts.s3.key.


Edit 2: I got something to work! Here is my attempt: sebinsua/k8s-argo-parallel-aggregate-workflow

@TekTimmy

TekTimmy commented Feb 11, 2020

> Does anybody have a working example of how to do this?
>
> If not, I'll work something out myself.
>
> Edit 1: One thing I tried was using {inputs,outputs}.artifacts.s3 to access the same S3 bucket and key from different pods; however, I would need them to use a key which is specific to the current workspace.name, and currently it doesn't seem that {{workspace.name}} is interpolated within inputs.artifacts.s3.key.
>
> Edit 2: I got something to work! Here is my attempt: sebinsua/k8s-argo-parallel-aggregate-workflow

I initially used that approach as well; it works as long as all steps are running on the same node or the volume runs in ReadWriteMany mode. Minikube provides volumes in "ReadWriteMany" mode, but EKS (AWS Kubernetes) EBS volumes do not support being mounted on several nodes at once.
To implement this feature with ReadWriteOnce volumes, the parallel withParam steps must each mount their own volume, and when they finish, a "pre-aggregation" step has to mount all those volumes and copy the data into a single volume. Then the "aggregation" step can mount that single volume containing all the results.

@yoshua0x

yoshua0x commented Mar 29, 2020

@sebinsua This example rocks, thanks for sharing!

I was able to get this running with ReadWriteMany volumes using nfs-server-provisioner on multiple cloud providers backed with volumes that are usually constrained to ReadWriteOnce.

@TekTimmy if you are still blocked on this, I provided a link to the helm chart below.

https://github.com/helm/charts/tree/master/stable/nfs-server-provisioner

@alexec
Contributor

alexec commented Mar 30, 2020

Document the pattern. #2549

@alexec alexec added this to the v2.8 milestone Mar 30, 2020
@alexec alexec added the docs label Mar 30, 2020
@alexec alexec modified the milestones: v2.8, v2.9 Apr 27, 2020
@foobarbecue
Contributor

foobarbecue commented May 20, 2020

I'm having a little trouble figuring out the state of this. Looks like there's a solution that works on a volume mount but not on an artifact store?

EDIT: I see now, you can use the "hard-wired" s3 approach, specifying endpoint, bucket, etc. along with a directory key. It doesn't quite work for me because transferring the whole directory to the container is too much. I need to be able to use a parameter as part of the key name, or something like a withArtifact that would work similarly to withParam.

EDIT 2: I just searched for withArtifact and found this: #2758

@alexec alexec removed this from the v2.9 milestone Jul 24, 2020
@stale stale bot closed this as completed Aug 3, 2020
@alexec alexec removed the wontfix label Sep 30, 2020
@alexec
Contributor

alexec commented Sep 30, 2020

Hmm. Should not have been closed.

@alexec alexec reopened this Sep 30, 2020
@alexec alexec added the type/feature Feature request label Sep 30, 2020
@alexec
Contributor

alexec commented Sep 30, 2020

I've created an example of a map-reduce job in Argo Workflows that aggregates outputs. Please take a look

#4175

@Ark-kun
Member

Ark-kun commented Oct 26, 2020

> If desired, we could support a second form for accessing the composite output artifact. E.g. if the input artifact is specified as, for example, steps.my-loop-step.output.artifacts.* the unpacked input artifacts could be placed at input-artifact-dir/{0,1,2,3,...}/. That is, each input artifact is unpacked into a subdirectory corresponding to an index number.

I like this form, but I think there needs to be a way to access artifacts per-output, for example tasks.my-loop.output.artifacts.something. Then the artifacts are downloaded into input-artifact-dir/{0,1,2,3,...}/ subdirectories as you've proposed.

@Ark-kun
Member

Ark-kun commented Nov 1, 2020

> I've created an example of a map-reduce job in Argo Workflows that aggregates outputs. Please take a look
> #4175

Interesting example.
The problem I see is that it requires explicit manipulation of the artifact location, which makes the workflows less reusable.

@Ark-kun
Member

Ark-kun commented Nov 1, 2020

The map-reduce example gave me an idea of how Argo could solve artifact aggregation with minimal effort.
But then I realized that this is a bad idea and won't support DAGs or artifacts with custom URIs.

<bad_idea>
Suppose there is a normal task. It outputs all its artifacts to <run-id>/<task-id>/<output-name>.
Now we make this task a loop by adding withItems, withParam or withSequence.
What if the resulting sub-nodes output their artifacts to <run-id>/<task-id>/<output-name>/<loop-idx>?
In this case the loop node's output artifact URI is still <run-id>/<task-id>/<output-name>, and when downloaded by a downstream component it will automatically contain the artifacts from all loop iterations.
</bad_idea>

@Ark-kun
Member

Ark-kun commented Nov 1, 2020

Perhaps we could implement artifact aggregation the same way as for parameters - the loop node should collect per-output artifact lists and the init executor should be able to download multiple artifacts under a single path.

We could make it possible to consume a list of artifacts (just for illustration - most users won't use this directly - only the init executor would see this):

    name: aggregate
    inputs:
      artifacts:
      - name: in-art-1
        path: /tmp/inputs/in-art-1/  # Artifacts end up in /tmp/inputs/in-art-1/{1,2,3,4}
        uris:
        - s3: ...
        - s3: ...

The artifact lists can be produced by loop nodes and passed to the aggregators:

    name: dag-aggregate-task
    template: aggregate
    arguments:
      artifacts:
      - name: in-art-1
        from: "{{tasks.loop-task.outputs.artifacts.out1}}"

@Downchuck

It's simply that map-reduce patterns need some enthusiastic support.

I found the easiest way to explore the space was to actually mock out the YAML flows as objects. In one experiment, I just used JavaScript with a DAG library.

Looking forward to seeing what the community comes up with (my experiment was nuked by the client); I'm just saying: yes, map-reduce needs to happen, but you can experiment with it just by thinking about an in-memory DAG.

@alexec
Contributor

alexec commented Dec 3, 2020

This can be achieved for bucket-based artifacts (S3/GCS/OSS) using key-only artifacts. See #4618.
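A minimal sketch of that key-only approach, assuming a default S3 (or GCS/OSS) artifact repository is configured; template names, keys, and images are illustrative:

    - name: map
      inputs:
        parameters:
        - name: i
      container:
        image: alpine:3
        command: [sh, -c]
        args: ["echo {{inputs.parameters.i}} > /tmp/part"]
      outputs:
        artifacts:
        - name: part
          path: /tmp/part
          archive:
            none: {}
          s3:
            key: "{{workflow.name}}/parts/{{inputs.parameters.i}}"   # key only; bucket comes from the configured repository

    - name: reduce
      inputs:
        artifacts:
        - name: parts
          path: /tmp/parts               # everything under the key prefix is downloaded here
          s3:
            key: "{{workflow.name}}/parts"
      container:
        image: alpine:3
        command: [sh, -c]
        args: ["ls /tmp/parts && cat /tmp/parts/*"]

Since only keys appear in the templates, the bucket and credentials stay in the artifact repository configuration and the workflow itself remains reusable.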

@alexec alexec linked a pull request Jan 19, 2021 that will close this issue
@alexec alexec changed the title Aggregating the output artifacts of parallel steps Aggregating the output artifacts of parallel steps (fan-in) Jan 19, 2021
@alexec alexec added this to the v3.1 milestone Jan 19, 2021
@alexec alexec linked a pull request Jan 26, 2021 that will close this issue
@sarabala1979 sarabala1979 self-assigned this Jan 26, 2021
@sarabala1979 sarabala1979 reopened this Jan 26, 2021
@sarabala1979 sarabala1979 removed their assignment Jan 27, 2021
@alexec alexec closed this as completed Feb 27, 2021