Volume artifacts #1227

Closed · wants to merge 9 commits into from

Conversation

@dtaniwaki (Member) commented Feb 22, 2019

I know this is not a great practice in ML on Kubernetes, but saving/loading files to/from NFS is very useful for ML researchers who depend heavily on NFS in their daily tasks. To provide NFS access, we restrict users to running their containers under their own UID, because a container running as root can access any file in the mounted NFS volume. So I made the following changes in this PR:

  • Implemented volume artifacts
  • Added a security context config for executors (a rough sketch of the idea is below)
  • Ensured write permission on the /argo directory in executors
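
To illustrate the UID restriction mentioned above, here is a minimal sketch (my illustration, not taken from this PR's diff) of pinning a task container to a fixed non-root UID with a standard Kubernetes securityContext; the UID, volume name, and mount path are placeholders:

  - name: whalesay
    container:
      image: docker/whalesay:latest
      securityContext:
        runAsUser: 1000     # run as the researcher's own UID instead of root
        runAsNonRoot: true
      volumeMounts:
      - name: nfs-vol       # placeholder for an NFS-backed volume
        mountPath: /data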

The reason I don't just use the mounted NFS volume in the main container is the performance overhead. In legacy deep learning jobs we interact with millions of files, but we want to avoid making too many requests over the network, so loading gzipped files from NFS to local storage and saving gzipped files from local storage back to NFS makes sense. Honestly speaking, this way of handling files in deep learning jobs is still not elegant, and there are newer methods such as MXNet's RecordIO, but many researchers are not used to them yet.

Do you think this artifact is useful for others too?

@dtaniwaki (Member, Author)

I think the idle timeout of make lint is too short; I sometimes see it cause timeout errors in CI.

@Ark-kun (Member) commented Mar 12, 2019

I'm not sure I understand what value this provides over a volumeMount:

Current:

  - name: whalesay
    outputs:
      artifacts:
      - name: hello-art
        path: /argo_outputs/hello-art.txt
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["cowsay hello world | tee /argo_outputs/hello-art.txt"]
      volumeMounts:
      - name: vol01
        mountPath: /argo_outputs

Proposed:

  - name: whalesay
    outputs:
      artifacts:
      - name: hello-art
        path: /argo_outputs/hello-art.txt
        volume:
          name: vol01
          path: /argo_outputs
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["cowsay hello world | tee /argo_outputs/hello-art.txt"]

If someone wants to store many files as a single zip, then they can just work with files in a temp directory and then zip it to the mounted volume. With this strategy the user has the flexibility to do what they want:

  • Write the files directly to the output volume vs. work in a temp dir and then copy final results
  • Pack (any format) vs. not pack
  • Any directory structure.

I'm all for a volume-based artifact-passing solution, but I do not see this PR adding much value over volumeMounts.

Compare it with a system like this:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: volume-artifact-passing-
spec:
  entrypoint: volume-artifact-example
  artifactStorage:
    volume: #Will automatically add volume mounts to all task pods
      persistentVolumeClaim:
        claimName: vol01
      subPath: /artifact_storage/{{workflow.uid}}/{{pod.name}}/{{io.name}}/
      inputMountPath: /argo/inputs/{{io.name}}/
      outputMountPath: /argo/outputs/{{io.name}}/
 
  templates:

  - name: producer
    outputs:
      artifacts:
      - name: out-art
        generatePath: True #The local output path will be auto-generated following the "outputMountPath: /argo/outputs/{{io.name}}/data" template. We can make this the default when the `path` is missing
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["cowsay hello world | tee {{outputs.artifacts.out-art.path}}"]
      #{{outputs.artifacts.out-art.path}} will resolve to the path prepared for the out-art output. E.g. /argo/outputs/out-art/data

  - name: consumer
    inputs:
      artifacts:
      - name: in-art
        generatePath: True #The local input path will be auto-generated following the "inputMountPath: /argo/inputs/{{io.name}}/data" template. We can make this the default when the `path` is missing
    container:
      image: busybox
      command: [sh, -c]
      args: ["cat {{inputs.artifacts.in-art.path}}"]
      #{{inputs.artifacts.in-art.path}} will resolve to the path prepared for the in-art input. E.g. /argo/inputs/in-art/data

This system would set up volume-based artifact passing for you and relieve the user from manually managing the volumeMounts and paths.
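
As a rough illustration on top of that sketch (mine, not part of the proposal above), the producer and consumer templates could be wired together with standard Argo DAG syntax, so the artifact would flow through the volume-backed storage transparently; the volume-artifact-example entrypoint below is assumed:

  - name: volume-artifact-example
    dag:
      tasks:
      - name: produce
        template: producer
      - name: consume
        template: consumer
        dependencies: [produce]
        arguments:
          artifacts:
          - name: in-art                                        # consumer's input artifact
            from: "{{tasks.produce.outputs.artifacts.out-art}}" # producer's output artifact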

@dtaniwaki (Member, Author)

I understand your solution also works.

My point is that if users have to write the exporting logic themselves, the usability of artifacts changes. In other words, they would need to change their workflow significantly if they later want to switch the artifacts to S3.
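
For context, here is roughly what the same output would look like when pointed at S3 with Argo's built-in S3 artifact type (my illustration, not from this PR; the endpoint, bucket, key, and secret names are placeholders):

  - name: whalesay
    outputs:
      artifacts:
      - name: hello-art
        path: /tmp/hello-art.txt
        s3:
          endpoint: s3.amazonaws.com      # placeholder endpoint
          bucket: my-bucket               # placeholder bucket
          key: hello-art.txt
          accessKeySecret:
            name: my-s3-credentials       # placeholder secret
            key: accessKey
          secretKeySecret:
            name: my-s3-credentials
            key: secretKey
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["cowsay hello world | tee /tmp/hello-art.txt"]

The step's command stays the same; only the artifact destination changes.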

@Ark-kun (Member) commented Mar 14, 2019

Just trying to understand the difference between the proposed approach and the current approach (volumeMount):

Current:

  - name: whalesay
    outputs:
      artifacts:
      - name: hello-art
        path: /argo_outputs/hello-art.txt
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["cowsay hello world | tee /argo_outputs/hello-art.txt"]
      volumeMounts:
      - name: vol01
        mountPath: /argo_outputs

Proposed:

  - name: whalesay
    outputs:
      artifacts:
      - name: hello-art
        path: /argo_outputs/hello-art.txt
        volume:
          name: vol01
          path: /argo_outputs
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["cowsay hello world | tee /argo_outputs/hello-art.txt"]

It looks like it just moves the volumeMounts section around a bit.

@dtaniwaki (Member, Author) commented Mar 15, 2019

As I wrote in the PR,

The reason I don't just use the mounted NFS volume in the main container is the performance overhead. In legacy deep learning jobs we interact with millions of files, but we want to avoid making too many requests over the network, so loading gzipped files from NFS to local storage and saving gzipped files from local storage back to NFS makes sense. Honestly speaking, this way of handling files in deep learning jobs is still not elegant, and there are newer methods such as MXNet's RecordIO, but many researchers are not used to them yet.

We don't want to touch the NFS volume too many times.

So your example above is not wrong, but it doesn't show the advantage of this feature.

The ideal comparison is...

Current (bad way):

  - name: whalesay
    outputs:
      artifacts:
      - name: hello-art
        path: /argo_outputs/tmp.tar.gz
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["for i in {1..10000}; do dd if=/dev/urandom of=/argo_outputs/sample-$i.txt bs=1K count=1; sleep 10; done; tar cvzf /argo_outputs/tmp.tar.gz /argo_outputs/sample-*.txt"]
      volumeMounts:
      - name: vol01
        mountPath: /argo_outputs

Current (better way):

  - name: whalesay
    outputs:
      artifacts:
      - name: hello-art
        path: /argo_outputs/tmp.tar.gz
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["for i in {1..10000}; do dd if=/dev/urandom of=/tmp/sample-$i.txt bs=1K count=1; sleep 10; done; tar cvzf /argo_outputs/tmp.tar.gz /tmp"]
      volumeMounts:
      - name: vol01
        mountPath: /argo_outputs

Proposed:

  - name: whalesay
    outputs:
      artifacts:
      - name: hello-art
        path: /tmp
        volume:
          name: vol01
          path: /argo_outputs/tmp.tar.gz
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["for i in {1..10000}; do dd if=/dev/urandom of=/tmp/sample-$i.txt bs=1K count=1; sleep 10; done"]

@Ark-kun (Member) commented Apr 19, 2019

From my perspective, the "Current (better way)" is a sweet spot.
It requires just a few extra characters in the code (tar cvzf ...) and is very flexible.

AFAIK, in the "Proposed" variant there is some implicit archiving configuration that cannot be changed. And if we allow changing it, then the implementation becomes too complex for the number of characters saved (it's already quite complex).

@fj-sanchez

I'm all for a volume-based artifact-passing solution, but I do not see this PR adding much value over volumeMounts.

@Ark-kun I think it would be awesome to have something like what you described implemented. Is there another issue where this is captured? I don't think this should be lost in the comments...

@Ark-kun (Member) commented Apr 30, 2019

I'm all for a volume-based artifact-passing solution, but I do not see this PR adding much value over volumeMounts.

@Ark-kun I think it would be awesome to have something like what you described implemented. Is there another issue where this is captured? I don't think this should be lost in the comments...

I've created issue #1349. I'm working on implementing this.

@dtaniwaki (Member, Author)

@fj-sanchez Is it possible to achieve it by combining the change in this PR with ArchiveLocation?
https://github.com/argoproj/argo/blob/4e37a444bde2a034885d0db35f7b38684505063e/pkg/apis/workflow/v1alpha1/types.go#L226
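
For reference, a template-level archiveLocation would look roughly like this (my sketch, not from this PR; the S3 endpoint, bucket, and key format are placeholders):

  - name: whalesay
    archiveLocation:
      archiveLogs: true
      s3:
        endpoint: s3.amazonaws.com            # placeholder endpoint
        bucket: my-bucket                     # placeholder bucket
        key: "{{workflow.name}}/{{pod.name}}" # placeholder key format
    container:
      image: docker/whalesay:latest
      command: [cowsay, "hello world"]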

@Ark-kun (Member) commented Oct 16, 2019

Here is a draft version of my volume-based data passing rewriter script: https://github.com/Ark-kun/pipelines/blob/SDK---Compiler---Added-support-for-volume-based-data-passing/sdk/python/kfp/compiler/_data_passing_using_volume.py
It can be run as a command-line program to rewrite an Argo Workflow from artifact-based to volume-based data passing.

What does everyone think? Please comment at #1349

@dtaniwaki (Member, Author)

I'm closing this PR as I no longer need it and it doesn't seem to be desired by others either.

@dtaniwaki closed this Nov 22, 2019
@dtaniwaki deleted the volume-artifacts branch on Nov 22, 2019