Official artifact passing example fails with PNS executor #4257
Comments
@juliusvonkohout could you please try switching the line to containerRuntimeExecutor: pns and re-try? |
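For reference, a minimal sketch of the suggested change, assuming the default KFP layout where the Argo executor is configured in the workflow-controller-configmap ConfigMap in the kubeflow namespace (in some manifest versions the setting sits inside the config: | block rather than as a top-level data key):

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: kubeflow
data:
  # switch the Argo executor from docker to pns
  containerRuntimeExecutor: pns

Depending on the controller version, the workflow-controller pod may need to be restarted before it picks up the new value.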
I changed it with that setting and deleted the pod. It still fails at chroot. |
@alfsuse I tried it from scratch with k3s (containerd) and pns and it worked. Interestingly, it fails with minikube, which uses Docker internally. |
Hmm, by chance did you have any PSP in the previous deployment? I noticed that PSPs without SYS_PTRACE throw that error. |
It might be useful to also report this to Argo. Although, AFAIK, this is a known issue. See argoproj/argo-workflows#1256 (comment) Maybe with argoproj/argo-workflows#2679 fixed, PNS can work more reliably.
Can you give more details about the infeasibility? |
This is my current status: the only cluster not working is on Azure. I will have to check back with the maintainer of the Azure cluster and report back here.
"Humm by chance did you have any psp in the previous deployment? I noticed that psp without SYS_PTRACE throw that error"
Maybe it helps to add CAP_SYS_CHROOT ? Maye then allowPrivilegeEscalation: true becomes unnecessary "Can you give more details about the infeasibility?" Well hostpath and docker.sock access is a security issue. You cannot expect anyone to manage your cluster with that security hole. |
@juliusvonkohout one thing for sure is that with that PSP you should have issues with Jupyter notebook spawning; if I remember correctly, you should also have NET_RAW and NET_ADMIN.
Double-check on Azure if there are also PSPs there that could affect you in the same way. |
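As a sketch of the kind of PodSecurityPolicy being discussed here (the policy name is a placeholder and the capability list is only an assumption based on this thread; the rest of the spec is kept deliberately permissive):

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: kubeflow-pipelines-psp     # placeholder name
spec:
  privileged: false
  allowPrivilegeEscalation: true   # possibly avoidable, see the SYS_CHROOT discussion above
  allowedCapabilities:
    - SYS_PTRACE    # required by the PNS executor
    - SYS_CHROOT    # lets the wait container chroot into the main container
    - NET_RAW       # reportedly needed for notebook spawning
    - NET_ADMIN     # reportedly needed for notebook spawning
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - '*'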
@alfsuse I am using the pipelines only at the moment. I still get the same error. This also happens with longer-running main containers (30+ seconds). Here is the output from the minikube cluster: |
I managed to reproduce "Failed to wait for container id 'b76016a16aa672b5aae36b3eb1ca214983410d14c6b1a1e6a4cd9ad68190448f': Failed to determine pid for containerID b76016a16aa672b5aae36b3eb1ca214983410d14c6b1a1e6a4cd9ad68190448f: container may have exited too quickly" error once with Docker Desktop + pns executor, but it fails randomly instead of always failing with this. |
Can you try mounting an emptyDir? |
I already tried this before, but just to be sure I ran it again. Logs on the Azure cluster: |
I'm also experiencing this issue. I'm running K3OS, which comes with the CRI-O runtime. The pns executor was the only one that I got working at all. |
I got feedback from the Azure cluster maintainer and he acknowledged that the cluster is partially broken. I also tested PNS successfully on GCP with PodSecurityPolicies. I will report back here if I get access to the new Azure cluster. @ggogel I had success with minikube and CRI-O. Maybe try that first. |
@juliusvonkohout I somewhat got it working with the k8sapi executor, but when I submit the workflow it fails and only succeeds after a retry. I switched to a VolumeClaim and it works fine. Artifact passing on executors other than the docker executor seems to be immature. |
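For anyone trying the same workaround, a minimal sketch of such a claim (name, namespace, size and access mode are placeholders); it can then be mounted into every step that produces or consumes the data, so nothing has to go through artifact passing:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pipeline-data      # placeholder name
  namespace: kubeflow
spec:
  accessModes:
    - ReadWriteOnce        # ReadWriteMany is needed if steps can land on different nodes
  resources:
    requests:
      storage: 1Gi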
I enabled debugging and noticed in the logs that there is a "docker-" prefix for container IDs on that specific cluster; a bit later it fails with the same error as before. |
@juliusvonkohout Can you please report this issue to Argo? They're pretty responsive now and have first-hand experience with different executors. Please link the issue here so that we can upvote it. |
See argoproj/argo-workflows#4186 (comment), so rootless operation (just the PTRACE and CHROOT capabilities) is possible. |
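In other words, the wait container only needs a securityContext along these lines rather than root or a mounted Docker socket (a hedged sketch; whether and where it can be applied to the executor depends on the Argo version and configuration):

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    add:
      - SYS_PTRACE   # lets the PNS executor find the main container's PID
      - SYS_CHROOT   # lets the wait container chroot into the main container's filesystem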
What steps did you take:
I am using the official example https://github.com/argoproj/argo/blob/master/examples/artifact-passing.yaml, which runs fine out of the box with the Argo docker executor.
Then I changed the executor to pns.
What happened:
Every pipeline that passes outputs (including the official example) is now failing.
The problem seems to be that the main container exits properly and the wait container cannot chroot into it anymore:
The docker executor works around this by abusing docker.sock to copy the outputs from the terminated main container, which is obviously completely infeasible in production.
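For context, that workaround relies on the wait container getting a hostPath mount of the Docker socket, roughly along these lines (a simplified sketch, not the exact pod spec Argo generates):

volumes:
  - name: docker-sock
    hostPath:
      path: /var/run/docker.sock
      type: Socket
containers:
  - name: wait
    image: argoproj/argoexec:v2.9.3   # example executor image
    volumeMounts:
      - name: docker-sock
        mountPath: /var/run/docker.sock
        readOnly: true

Anything that can talk to that socket effectively has root on the node, which is why it is infeasible in production.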
The funny thing is that you can manually mount an emptydir under /tmp/outputs and add the proper output path (e.g. tmp/outputs/numbers/data) to op.output_artifact_paths.
Then the output file (tmp/outputs/numbers/data) is successfully extracted via the mirrored mounts functionality, but extracting the same file with chroot fails.
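At the Argo template level, that workaround looks roughly like the sketch below (step name, image and command are placeholders; the paths follow the /tmp/outputs/numbers/data example above, and the exact spec KFP generates will differ):

spec:
  volumes:
    - name: outputs
      emptyDir: {}               # mounted over the output location
  templates:
    - name: produce-numbers      # placeholder step name
      container:
        image: python:3.7        # example image
        command: [sh, -c]
        args: ["mkdir -p /tmp/outputs/numbers && echo 1,2,3 > /tmp/outputs/numbers/data"]
        volumeMounts:
          - name: outputs
            mountPath: /tmp/outputs
      outputs:
        artifacts:
          - name: numbers
            path: /tmp/outputs/numbers/data   # picked up via the mirrored mount instead of chroot

With the volume mounted like this, the wait container sees the same emptyDir (the mirrored mounts behaviour mentioned above), so it can still read the artifact after the main container has exited.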
What did you expect to happen:
I expect PNS to extract the output successfully.
Environment:
I just tried Kubeflow Pipelines on a Kubernetes 1.17 (Azure) and 1.18 (minikube) cluster with Docker as the container engine.
How did you deploy Kubeflow Pipelines (KFP)?
Download and extract https://github.com/kubeflow/pipelines/archive/1.0.0.zip
Install with
kubectl apply -k '/home/julius/Schreibtisch/kubeflow/pipelines-1.0.0/manifests/kustomize/cluster-scoped-resources'
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k '/home/julius/Schreibtisch/kubeflow/pipelines-1.0.0/manifests/kustomize/env/dev'
KFP version:
I am using the 1.0.0 release https://github.com/kubeflow/pipelines/releases/tag/1.0.0.
KFP SDK version:
[julius@julius-asus ~]$ pip list | grep kfp
kfp 1.0.0
kfp-server-api 1.0.0
Anything else you would like to add:
I also experimented with op.file_outputs without success. I also experimented with emptyDir and the k8sapi executor without success.
I tried newer Argo workflow and exec images (2.8.3 and 2.9.3 in deployment/workflow-controller) without success.
So I am wondering why PNS is working for others.
Besides the official examples, I am also using some very simple pipelines which work perfectly fine with the docker executor and fail miserably with PNS.
See also #1654 (comment)
/kind bug