Ensure kfp-ui can show logs from Argo #582

Closed
kimwnasptd opened this issue Nov 13, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@kimwnasptd
Contributor

Context

This is in order to resolve canonical/bundle-kubeflow#1120

The KFP frontend has an environment variable, ARGO_ARCHIVE_LOGS, which the frontend uses to decide whether it should proxy logs from MinIO. More on this can be found in canonical/bundle-kubeflow#1120 (comment).

We'll need to introduce a new config option that sets this env var to true by default, to ensure the UI fetches logs from Argo by default.

As an extra step, we'll also need one more config option for disabling the GKE metadata, which was making the upstream container constantly restart: canonical/bundle-kubeflow#1120 (comment)
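As a rough sketch of what this could look like in the charm's config.yaml (the option names below are hypothetical, purely for illustration, and not necessarily what the eventual PR uses):

options:
  argo-archive-logs:
    type: boolean
    default: true
    description: |
      Hypothetical option: sets the ARGO_ARCHIVE_LOGS env var of the
      ml-pipeline-ui container, so the UI proxies archived logs from MinIO.
  disable-gke-metadata:
    type: boolean
    default: true
    description: |
      Hypothetical option: sets the DISABLE_GKE_METADATA env var, so the
      container does not probe the GKE metadata endpoint and restart.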

What needs to get done

  1. Ensure we can set the ARGO_ARCHIVE_LOGS env var in kfp-ui
  2. Ensure we can set the DISABLE_GKE_METADATA env var in kfp-ui

Definition of Done

  1. The kfp-ui can fetch logs from MinIO, after applying configurations if needed
@kimwnasptd kimwnasptd added the enhancement New feature or request label Nov 13, 2024

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6542.

This message was autogenerated

@NohaIhab
Contributor

Reproduce the error

I was able to reproduce the error by the following steps:

  1. Deploy kfp bundle latest/edge + kubeflow dashboard + dex-auth + oidc
  2. Create an experiment and run from the example pipeline Data passing pipeline
  3. Wait for the run to finish and view the logs from the ui -> logs are there
  4. Delete the Argo workflow from the user namespace, as shown in the sketch after this list (this simulates the workflow being GCed; the same can be achieved by setting the TTL_SECONDS_AFTER_WORKFLOW_FINISH env of kfp-persistence to a short time, e.g. 60), then view the logs -> logs cannot be viewed, with error message:
    Screenshot from 2024-11-20 14-43-01
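
For reference, deleting the workflow in step 4 can be done with kubectl along these lines (namespace and workflow name are placeholders for the actual user namespace and run):

kubectl get workflows -n <user-namespace>
kubectl delete workflow <workflow-name> -n <user-namespace>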

Test the fix

To test the fix suggested in canonical/bundle-kubeflow#1120, I:

  1. Deployed kfp from feat: add envs to ensure kfp-ui can show logs #605 rebased on the branch from chore: Upgrade manifests to 2.3.0 #583. This way the bundle has the following changes:
     • kfp manifests and images upgraded to 2.3.0
     • new configs introduced to kfp-ui that control the following envs:
       • ARGO_ARCHIVE_LOGS defaulting to true
       • DISABLE_GKE_METADATA defaulting to true
  2. Followed steps 2-4 from the above section

Results

logs cannot be viewed, with a different error message this time:
Screenshot from 2024-11-20 15-09-46

The error is Could not get main container logs: S3Error: The specified key does not exist.

Debugging

From the error above, Could not get main container logs: S3Error: The specified key does not exist, it looks like there is an issue getting the persisted logs from the S3 object storage, i.e. MinIO.

Looking at the logs from kfp-ui pod, the ml-pipeline-ui container has the following log:

2024-11-20T15:27:01.852Z [ml-pipeline-ui] Getting logs for pod, tutorial-data-passing-tcvbx-system-container-impl-1663429819, from mlpipeline/artifacts/tutorial-data-passing-tcvbx/2024/11/20/tutorial-data-passing-tcvbx-system-container-impl-1663429819/main.log.

We can see that the request to MinIO is trying to fetch from the path mlpipeline/artifacts/tutorial-data-passing-tcvbx/2024/11/20/tutorial-data-passing-tcvbx-system-container-impl-1663429819/main.log

Now, let's get inside the minio container to see if we can find the persisted data, and if it's at the expected path in the bucket:

kubectl exec -it minio-0 -n kubeflow -- /bin/bash
Defaulted container "minio" out of: minio, juju-pod-init (init)
[root@minio-0 /]# ls data/
mlpipeline
[root@minio-0 /]# ls data/mlpipeline/
pipelines  tutorial-data-passing-qw857	tutorial-data-passing-skw8p  tutorial-data-passing-tcvbx  v2

We can observe that the persisted data is indeed there, but it's not at the expected path.
The logs from kfp-ui suggest it should be at mlpipeline/artifacts/tutorial-data-passing-tcvbx, while the data is actually located at mlpipeline/tutorial-data-passing-tcvbx. Additionally, the file structure under tutorial-data-passing-tcvbx is different.

Looking at the upstream changes from kfp 2.2 to 2.3, they have added a new env in the frontend's config.ts: ARGO_KEYFORMAT. It is set to 'artifacts/{{workflow.name}}/{{workflow.creationTimestamp.Y}}/{{workflow.creationTimestamp.m}}/{{workflow.creationTimestamp.d}}/{{pod.name}}', which is the new structure we see in the minio bucket.

The ARGO_KEYFORMAT env tells the kfp frontend the format that artifacts are stored with; the UI then builds the request to MinIO based on this format.
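
For the example run above, the template expands to exactly the path the UI requested in the kfp-ui log earlier in this comment:

# {{workflow.name}}                -> tutorial-data-passing-tcvbx
# {{workflow.creationTimestamp.Y}} -> 2024
# {{workflow.creationTimestamp.m}} -> 11
# {{workflow.creationTimestamp.d}} -> 20
# {{pod.name}}                     -> tutorial-data-passing-tcvbx-system-container-impl-1663429819
# expanded key: artifacts/tutorial-data-passing-tcvbx/2024/11/20/tutorial-data-passing-tcvbx-system-container-impl-1663429819
# (the UI then requests main.log under this key)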

Also seen upstream: in this comment it is mentioned that the value of the ARGO_KEYFORMAT env must match the value of keyFormat specified in the Argo workflow-controller-configmap ConfigMap. We can see it was modified in the pipelines manifests for Argo to match the env.

In our CKF, this ConfigMap is created by the argo-controller charm using this template. It does not set the keyFormat field, which causes it to fall back to the default. The default keyFormat is documented in the upstream argo-workflows repo as:

{{workflow.name}}/{{pod.name}}

so this is the format that is used by our argo-controller charm to organize the pipeline logs in the S3 storage.

And due to the change upstream, the kfp-ui 2.3 is now configured to fetch the logs with the new format, causing a mismatch with the argo-controller configuration in CKF.
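
A rough sketch of how a matching keyFormat could look in the workflow-controller-configmap (layout follows the upstream Argo artifact repository config; this is not the charm's actual template, whose change lands via canonical/argo-operators#208):

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: kubeflow
data:
  artifactRepository: |
    s3:
      bucket: mlpipeline
      keyFormat: "artifacts/{{workflow.name}}/{{workflow.creationTimestamp.Y}}/{{workflow.creationTimestamp.m}}/{{workflow.creationTimestamp.d}}/{{pod.name}}"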

@NohaIhab
Contributor

NohaIhab commented Nov 26, 2024

Testing the fixes

Fixes implemented in #605 and canonical/argo-operators#208 must be tested simultaneously.

  1. Deploy the kfp charms + dex + oidc (for dashboard access) from latest/edge in a kubeflow model; you can use this bundle to deploy:
bundle: kubernetes
name: kubeflow
docs: https://discourse.charmhub.io/t/3749
applications:
  admission-webhook:
    charm: admission-webhook
    channel: latest/edge
    trust: true
    scale: 1
    _github_repo_name: admission-webhook-operator
    _github_repo_branch: main
  argo-controller:
    charm: argo-controller
    channel: latest/edge
    trust: true
    scale: 1
    _github_repo_name: argo-operators
    _github_repo_branch: main
  dex-auth:
    charm: dex-auth
    channel: latest/edge
    scale: 1
    trust: true
    _github_repo_name: dex-auth-operator
    _github_repo_branch: main
  envoy:
    charm: envoy
    channel: latest/edge
    scale: 1
    trust: true
    _github_repo_name: envoy-operator
    _github_repo_branch: main
  istio-ingressgateway:
    charm: istio-gateway
    channel: latest/edge
    scale: 1
    trust: true
    _github_repo_name: istio-operators
    _github_repo_branch: main
    options:
      kind: ingress
  istio-pilot:
    charm: istio-pilot
    channel: latest/edge
    scale: 1
    trust: true
    _github_repo_name: istio-operators
    _github_repo_branch: main
    options:
      default-gateway: kubeflow-gateway
  kfp-api:
    charm: kfp-api
    channel: latest/edge
    scale: 1
    trust: true
    _github_repo_name: kfp-operators
    _github_repo_branch: main
  kfp-db:
    charm: mysql-k8s
    channel: 8.0/edge
    scale: 1
    trust: true
    constraints: mem=2G
    _github_dependency_repo_name: mysql-k8s-operator
    _github_dependency_repo_branch: main
  kfp-metadata-writer:
    charm: kfp-metadata-writer
    channel: latest/edge
    scale: 1
    trust: true
    _github_repo_name: kfp-operators
    _github_repo_branch: main
  kfp-persistence:
    charm: kfp-persistence
    channel: latest/edge
    scale: 1
    trust: true
    _github_repo_name: kfp-operators
    _github_repo_branch: main
  kfp-profile-controller:
    charm: kfp-profile-controller
    channel: latest/edge
    scale: 1
    trust: true
    _github_repo_name: kfp-operators
    _github_repo_branch: main
  kfp-schedwf:
    charm: kfp-schedwf
    channel: latest/edge
    scale: 1
    trust: true
    _github_repo_name: kfp-operators
    _github_repo_branch: main
  kfp-ui:
    charm: kfp-ui
    channel: latest/edge
    scale: 1
    trust: true
    _github_repo_name: kfp-operators
    _github_repo_branch: main
  kfp-viewer:
    charm: kfp-viewer
    channel: latest/edge
    scale: 1
    trust: true
    _github_repo_name: kfp-operators
    _github_repo_branch: main
  kfp-viz:
    charm: kfp-viz
    channel: latest/edge
    scale: 1
    trust: true
    _github_repo_name: kfp-operators
    _github_repo_branch: main
  kubeflow-dashboard:
    charm: kubeflow-dashboard
    channel: latest/edge
    scale: 1
    trust: true
    _github_repo_name: kubeflow-dashboard-operator
    _github_repo_branch: main
  kubeflow-profiles:
    charm: kubeflow-profiles
    channel: latest/edge
    scale: 1
    trust: true
    _github_repo_name: kubeflow-profiles-operator
    _github_repo_branch: main
  kubeflow-roles:
    charm: kubeflow-roles
    channel: latest/edge
    scale: 1
    trust: true
    _github_repo_name: kubeflow-roles-operator
    _github_repo_branch: main
  metacontroller-operator:
    charm: metacontroller-operator
    channel: latest/edge
    scale: 1
    trust: true
    _github_repo_name: metacontroller-operator
    _github_repo_branch: main
  mlmd:
    charm: mlmd
    channel: latest/edge
    scale: 1
    trust: true
    _github_repo_name: mlmd-operator
    _github_repo_branch: main
  minio:
    charm: minio
    channel: latest/edge
    scale: 1
    _github_repo_name: minio-operator
    _github_repo_branch: main
  oidc-gatekeeper:
    charm: oidc-gatekeeper
    channel: latest/edge
    scale: 1
    trust: true
    _github_repo_name: oidc-gatekeeper-operator
    _github_repo_branch: main
relations:
  - [argo-controller, minio]
  - [dex-auth:dex-oidc-config, oidc-gatekeeper:dex-oidc-config]
  - [dex-auth:oidc-client, oidc-gatekeeper:oidc-client]
  - [istio-pilot:ingress, dex-auth:ingress]
  - [istio-pilot:ingress, kfp-ui:ingress]
  - [istio-pilot:ingress, kubeflow-dashboard:ingress]
  - [istio-pilot:ingress, kubeflow-volumes:ingress]
  - [istio-pilot:ingress, oidc-gatekeeper:ingress]
  - [istio-pilot:ingress, envoy:ingress]
  - [istio-pilot:ingress-auth, oidc-gatekeeper:ingress-auth]
  - [istio-pilot:istio-pilot, istio-ingressgateway:istio-pilot]
  - [kfp-api:relational-db, kfp-db:database]
  - [kfp-api:kfp-api, kfp-persistence:kfp-api]
  - [kfp-api:kfp-api, kfp-ui:kfp-api]
  - [kfp-api:kfp-viz, kfp-viz:kfp-viz]
  - [kfp-api:object-storage, minio:object-storage]
  - [kfp-profile-controller:object-storage, minio:object-storage]
  - [kfp-ui:object-storage, minio:object-storage]
  - [kubeflow-profiles, kubeflow-dashboard]
  - [kubeflow-dashboard:links, kfp-ui:dashboard-links]
  - [mlmd:grpc, envoy:grpc]
  - [mlmd:grpc, kfp-metadata-writer:grpc]
  2. Refresh the kfp-ui charm to the PR version
juju refresh kfp-ui --channel=latest/edge/pr-605
  3. Refresh argo-controller to the PR version
juju refresh argo-controller --channel=latest/edge/pr-208
  4. Wait until all charms are active/idle
  5. Set the dex-auth username and password to be able to log in
juju config dex-auth static-username=admin
juju config dex-auth static-password=admin
  6. Log in to the dashboard, create an experiment from an example v2 pipeline, and create a run
  7. From the UI of the pipeline run, view the logs of any pipeline step once it is completed. Logs should be visible in the logs tab.
  8. Delete the argo workflow after it has completed, to simulate the workflow being GCed
  9. From the UI, try to view the logs of any step again.

Expected result: logs are still visible even when the workflow is deleted:
image
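
As an optional sanity check (a sketch: the kfp-ui-0 pod name is assumed from the usual sidecar-charm naming, while the ml-pipeline-ui container name comes from the logs earlier in this thread), the new envs can be confirmed on the workload after the refresh:

kubectl -n kubeflow exec kfp-ui-0 -c ml-pipeline-ui -- printenv | grep -E 'ARGO_ARCHIVE_LOGS|DISABLE_GKE_METADATA|ARGO_KEYFORMAT'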

v1 Pipelines logs

The logs of v1 pipelines can be viewed from the Output artifacts section of a pipeline step as follows:
image

They are not affected by the workflow being deleted, since they directly link to the logs in the S3 storage. To test it, you can use this v1 pipeline yaml from our integration tests.

@kimwnasptd
Contributor Author

kimwnasptd commented Nov 29, 2024

A note on why V1 Pipelines persist the logs, even if the keyFormat of Argo changes (which will happen in our upgrade):

In MySQL (kfp-db) there's the table mlpipeline > run_details, which holds the information about a KFP Run. Specifically, it holds both the UID of the KFP Run and the whole spec of the completed Argo Workflow.

For V1 Pipelines, we see in the above table that the status of the Argo Workflow contains the logs as an output artifact, along with the key of this artifact:

status: 
  phase: "Succeeded"
  nodes:
    execution-order-pipeline-6bbzm-1432297067: 
      id: "execution-order-pipeline-6bbzm-1432297067"
      name: "execution-order-pipeline-6bbzm.echo1-op"
      displayName: "echo1-op"
      phase: "Succeeded"
      ...
      inputs: 
        parameters: 
        - name: "text1"
          value: "message 1"
      outputs: 
        artifacts: 
        - name: "main-logs"
          s3: 
            key: "execution-order-pipeline-6bbzm/execution-order-pipeline-6bbzm-echo1-op-1432297067/main.log"
        exitCode: "0"
      children: 
      - "execution-order-pipeline-6bbzm-3473380184"

Summary

  1. For V1, as described above, the backend explicitly puts the logs as an output artifact
  2. The yaml/json of the completed Argo Workflow is stored in MySQL (kfp-db) by kfp-persistence when the workflow succeeds
  3. The yaml of the completed Argo Workflow contains, in its .status, the S3 path from which to fetch the artifact (see the sketch after this list)
  4. Even if in the future the keyFormat in Argo changes, the correct path is stored in MySQL
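
As a quick illustration of point 3 (a sketch using the example workflow name from the status snippet above; it only works before the workflow is GCed), the recorded S3 key can also be seen on the live Workflow object:

kubectl -n <user-namespace> get workflow execution-order-pipeline-6bbzm -o yaml | grep -B1 -A3 'main-logs'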

@NohaIhab
Contributor

NohaIhab commented Dec 2, 2024

Closed by #605 and canonical/argo-operators#208
both now in latest/edge
