Supporting Step/Task output caching and referencing in other workflows #2157

Open · sarabala1979 opened this issue Feb 4, 2020 · 1 comment

sarabala1979 (Member) commented Feb 4, 2020

Summary

Cached step/task outputs (parameters or artifacts) could be referenced in another workflow to avoid executing the same step again, which would save time and resources.

Similar issue #944

Motivation

In ETL and ML use cases, some steps/tasks produce the same output in every workflow when given the same input. If Argo had the ability to cache the output of those steps, the cached output could be referenced from another workflow: execution of the cached step/task would be skipped and the cached output used instead.

Proposal

  1. The template will have a flag for cachable:
  - name: gen-number-list
    cachable: true
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import json
        import sys
        json.dump([i for i in range(20, 31)], sys.stdout)
  2. Create a new CRD which will hold the node status of the latest succeeded template:
apiVersion: argoproj.io/v1alpha1
kind: CachedNodeStatus
metadata:
  name: retry-to-completion
  namespace: argo
  labels:
    lastExecution: "2020-02-04_19.30"
spec:
  boundaryID: steps-6c4tm
  displayName: hello1
  finishedAt: "2020-02-04T06:22:28Z"
  id: steps-6c4tm-1651667224
  inputs:
    parameters:
    - name: message
      value: hello1
  message: 'failed to save outputs: Failed to establish pod watch: unknown (get pods)'
  name: steps-6c4tm[0].hello1
  phase: Error
  startedAt: "2020-02-04T06:22:09Z"
  templateName: whalesay
  type: Pod
  3. Cache reference: the consuming template sets a flag to fetch the cached output instead of re-running the step (an end-to-end sketch follows this list).
  - name: gen-number-list
    fetchFromCache: true
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import json
        import sys
        json.dump([i for i in range(20, 31)], sys.stdout)
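
Below is an illustrative end-to-end sketch of how the two proposed flags might be used together across workflows. The cachable/fetchFromCache fields and the lookup of cached output by template name are part of this proposal, not existing Argo API, and the workflow names are placeholders:

# Producer workflow: marks the template as cachable so that, on success, its
# node status and outputs would be recorded (e.g. in the proposed
# CachedNodeStatus CRD).
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: producer-
spec:
  entrypoint: gen-number-list
  templates:
  - name: gen-number-list
    cachable: true                 # proposed flag
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import json
        import sys
        json.dump([i for i in range(20, 31)], sys.stdout)
---
# Consumer workflow: asks the controller to reuse the cached output recorded
# for a template with the same name instead of running the step again.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: consumer-
spec:
  entrypoint: gen-number-list
  templates:
  - name: gen-number-list
    fetchFromCache: true           # proposed flag
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import json
        import sys
        json.dump([i for i in range(20, 31)], sys.stdout)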
sarabala1979 added the type/feature (Feature request) label on Feb 4, 2020

TekTimmy commented Feb 4, 2020

I agree it's not a bad idea, but with Argo you are responsible for the data flow yourself! That means you copy the results of a step into S3 and let all dependent steps copy the data back. I use Amazon EKS, and it is clearly restricted to "ReadWriteOnce" volumes (EBS), which means a volume can be mounted on one node only.

What could be technically possible is to have separation and aggregation of artifacts. Separation would mean I copy data from one volume to many other volumes, and aggregation means I copy from many volumes to one. This would allow a single step to produce results that are processed in parallel by the next step without needing S3 buckets in between. With EKS there is still the restriction that one EBS volume is bound to its Availability Zone (AZ), which is a problem for aggregation when the volumes have been created in different AZs.
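
For reference, a minimal sketch of the S3 copy pattern described above: the first step uploads its result as an output artifact and the dependent step downloads it back before running. It assumes an S3-compatible artifact store; the endpoint, bucket, key, and secret names are placeholders:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-passing-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: produce
        template: produce
    - - name: consume
        template: consume
        arguments:
          artifacts:
          - name: result
            from: "{{steps.produce.outputs.artifacts.result}}"

  # Writes its result to a local file; Argo uploads it to the S3 bucket.
  - name: produce
    container:
      image: alpine:3.11
      command: [sh, -c]
      args: ["echo hello > /tmp/result.txt"]
    outputs:
      artifacts:
      - name: result
        path: /tmp/result.txt
        s3:
          endpoint: my-minio:9000                              # placeholder
          bucket: my-bucket                                    # placeholder
          key: shared/result.txt
          accessKeySecret: {name: my-s3-credentials, key: accessKey}
          secretKeySecret: {name: my-s3-credentials, key: secretKey}

  # Downloads the artifact from S3 to its own filesystem before the command runs.
  - name: consume
    inputs:
      artifacts:
      - name: result
        path: /tmp/result.txt
    container:
      image: alpine:3.11
      command: [sh, -c]
      args: ["cat /tmp/result.txt"]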
