Pipeline logs are disappearing after 24h #1120

Closed
AxoyTO opened this issue Oct 18, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@AxoyTO

AxoyTO commented Oct 18, 2024

Bug Description

After 24 hours of creation, all logs belonging to a pipeline run disappear from the Charmed Kubeflow UI, despite the logs still being present in MinIO/mlpipeline (AWS S3). This leads to difficulty in troubleshooting and tracking the progress or failures of pipeline runs after the 24-hour period.

[Screenshot: Kubeflow UI showing the missing logs of a pipeline run]

!aws --endpoint-url $MINIO_ENDPOINT_URL s3 ls s3://mlpipeline
                           [...]
                           PRE addition-pipeline-4g94d/
                           PRE addition-pipeline-4qwt4/
                           PRE download-preprocess-train-deploy-pipeline-8wjv9/
                           PRE mnist-pipeline-fcmgr/
                           [...]
!aws --endpoint-url $MINIO_ENDPOINT_URL s3 ls s3://mlpipeline/download-preprocess-train-deploy-pipeline-8wjv9/download-preprocess-train-deploy-pipeline-8wjv9-system-container-impl-1190848556/
2024-10-15 15:27:49      10796 main.log

To Reproduce

  1. Deploy Charmed Kubeflow 1.9 using Juju.
  2. Create a pipeline and run it.
  3. After the run completes, observe that logs are available in the Kubeflow UI.
  4. Wait for 24 hours after the pipeline run completes.
  5. Attempt to view the pipeline logs in the UI again.
    Expected: Logs should still be accessible.
    Actual: Logs are no longer visible in the UI, but are still present in the underlying MinIO/mlpipeline (AWS S3).

Environment

CKF: 1.9/stable
minio: ckf-1.9/stable
argo-controller: 3.4/stable
Juju: 3.5.4
See the full bundle on: https://paste.ubuntu.com/p/NXXFhDqmVn/

Relevant Log Output

<none>

Additional Context

Notebook used to create the pipeline, run on a notebook server with a GPU:

import kfp
from kfp import dsl, kubernetes

@dsl.component(
    base_image="tensorflow/tensorflow:latest-gpu",
    # packages_to_install=["tensorflow"]
)
def foo():
    '''Prints the number of GPUs visible to TensorFlow.'''
    print("GPU Test")
    import tensorflow as tf
    print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
    print("GPU Test")


@dsl.pipeline(
    name='Addition pipeline',
    description='An example pipeline that performs addition calculations.')
def foo_pipeline():
    task = (foo()
            .set_cpu_request(str(2))
            .set_cpu_limit(str(4))
            .set_memory_request("2Gi")
            .set_memory_limit("4Gi")
            .set_gpu_limit("1")
            .set_accelerator_type("nvidia.com/gpu")
           )

    task = kubernetes.set_image_pull_policy(task=task, policy="Always")

    task = kubernetes.add_toleration(
        task=task,
        key="sku",
        operator="Equal",
        value="gpu",
        effect="NoSchedule",
    )

namespace = "admin"

client = kfp.Client()

run = client.create_run_from_pipeline_func(
    run_name="gpu_test",
    pipeline_func=foo_pipeline,
    namespace=namespace,
    experiment_name="gpu-foo")

Could be related to upstream: kubeflow/pipelines#7617

AxoyTO added the bug label Oct 18, 2024

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6494.

This message was autogenerated

@kimwnasptd
Contributor

First of all, I managed to reproduce this by:

  1. Creating a run
  2. Waiting for about a day

The underlying Argo Workflow was fully deleted, along with all the pipeline Pods in the user namespace. Then in the UI I would see the error shown above.

@kimwnasptd
Contributor

kimwnasptd commented Oct 30, 2024

Looking at the kfp-persistence logs I saw the following:

2024-10-29T08:51:37.111Z [persistenceagent] time="2024-10-29T08:51:37Z" level=info msg="Syncing Workflow (tutorial-data-passing-6nvfj): success, processing complete." Workflow=tutorial-data-passing-6nvfj
2024-10-29T08:51:37.111Z [persistenceagent] time="2024-10-29T08:51:37Z" level=info msg="Success while syncing resource (admin/tutorial-data-passing-6nvfj)"
2024-10-29T08:51:40.103Z [persistenceagent] time="2024-10-29T08:51:40Z" level=info msg="Syncing Workflow (tutorial-data-passing-6nvfj): success, processing complete." Workflow=tutorial-data-passing-6nvfj
2024-10-29T08:51:40.103Z [persistenceagent] time="2024-10-29T08:51:40Z" level=info msg="Success while syncing resource (admin/tutorial-data-passing-6nvfj)"
2024-10-29T08:51:55.600Z [persistenceagent] time="2024-10-29T08:51:55Z" level=info msg="Syncing Workflow (tutorial-data-passing-6nvfj): success, processing complete." Workflow=tutorial-data-passing-6nvfj

But in the logs from the upstream ml-pipeline-persistenceagent pod I see the following:

time="2024-10-30T14:33:37Z" level=info msg="Wait for shut down"
time="2024-10-30T14:33:38Z" level=info msg="Syncing Workflow (tutorial-data-passing-xgj6p): success, processing complete." Workflow=tutorial-data-passing-xgj6p
time="2024-10-30T14:33:38Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"
time="2024-10-30T14:33:40Z" level=info msg="Skip syncing Workflow (tutorial-data-passing-xgj6p): workflow marked as persisted."
time="2024-10-30T14:33:40Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"
time="2024-10-30T14:34:08Z" level=info msg="Skip syncing Workflow (tutorial-data-passing-xgj6p): workflow marked as persisted."
time="2024-10-30T14:34:08Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"
time="2024-10-30T14:34:38Z" level=info msg="Skip syncing Workflow (tutorial-data-passing-xgj6p): workflow marked as persisted."
time="2024-10-30T14:34:38Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"
time="2024-10-30T14:35:08Z" level=info msg="Skip syncing Workflow (tutorial-data-passing-xgj6p): workflow marked as persisted."
time="2024-10-30T14:35:08Z" level=info msg="Success while syncing resource (kubeflow-user-example-com/tutorial-data-passing-xgj6p)"

What is weird is that in the Charmed pod we see the "Syncing Workflow..." message multiple times, while in the upstream component we only see it once. But I am not entirely sure whether this is related to the issue.

@kimwnasptd
Contributor

Also, after a quick search in the upstream project, I see a similar issue was raised:
kubeflow/pipelines#8935

But that one seems to be resolved, and the PR with the fix was merged in KFP 2.2.0, which is used by Kubeflow 1.9.

@kimwnasptd
Contributor

I've managed to reproduce this with upstream Kubeflow 1.9 and 1.9.1. We can track its progress in kubeflow/pipelines#11357

A potential temporary workaround would be to adjust the TTL_SECONDS_AFTER_WORKFLOW_FINISH env var of the persistence agent (we'll need to create a charm config option for this) so that the GC deadline is longer than a day.
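For reference, a minimal sketch of what such a patch could look like with the Kubernetes Python client; the deployment and container names below are assumptions based on upstream-style manifests, and the Charmed deployment may differ:

# Sketch only: extend the persistence agent's workflow TTL so finished Argo
# Workflows (and the logs the UI reads from them) are kept for 7 days
# instead of roughly 1 day. Names are assumptions, not the charm's API.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "ml-pipeline-persistenceagent",
                        "env": [
                            {
                                "name": "TTL_SECONDS_AFTER_WORKFLOW_FINISH",
                                "value": "604800",  # 7 days
                            }
                        ],
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(
    name="ml-pipeline-persistenceagent",
    namespace="kubeflow",
    body=patch,
)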

@kimwnasptd
Contributor

kimwnasptd commented Nov 8, 2024

Looking a bit more into the source code of KFP, I found this ARGO_ARCHIVE_LOGS env var, which is set to false by default (!):
https://github.com/kubeflow/pipelines/blob/2.2.0/frontend/server/configs.ts#L89

Enabling this env var results in the log requests now failing with a different error:

Error response: Could not get main container logs: S3Error: The specified key does not exist.

This means the ml-pipeline-ui Pod (kubeflow namespace) is now also trying to reach MinIO. That makes sense: after inspecting the source code further, it looks like when this setting is enabled the logs handler tries to talk to MinIO directly:

  1. https://github.com/kubeflow/pipelines/blob/2.3.0/frontend/server/handlers/pod-logs.ts#L61
  2. https://github.com/kubeflow/pipelines/blob/2.3.0/frontend/server/app.ts#L202

(From the above we can also deduce that ml-pipeline-ui does not proxy these requests via the ml-pipeline-ui-artifacts deployment in the user namespace, as it does for a pipeline's artifacts.)

@kimwnasptd
Contributor

After looking at the logs from ml-pipeline-ui I see the following:

Getting logs for pod:tutorial-data-passing-8dntv-system-container-impl-2572032214 from mlpipeline/logs/tutorial-data-passing-8dntv-system-container-impl/tutorial-data-passing-8dntv-system-container-impl-2572032214/main.log

But this is wrong, since the path where the logs actually live in MinIO is the following:

mlpipeline/artifacts/tutorial-data-passing-8dntv/2024/11/08/tutorial-data-passing-8dntv-system-container-impl-2572032214/main.log

So kfp-ui is not calculating the path for the logs correctly in KFP 2.2.0.
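A quick way to confirm the mismatch is to list both prefixes and check which one actually contains main.log. A sketch with boto3, assuming MINIO_ENDPOINT_URL and the MinIO credentials are exported in the environment, as in the notebook listing above:

# Sketch: only the artifacts/ prefix should return main.log, which is why
# the UI's logs/ lookup fails with "The specified key does not exist".
import os
import boto3

s3 = boto3.client("s3", endpoint_url=os.environ["MINIO_ENDPOINT_URL"])

prefixes = [
    # Path the UI queries (wrong in KFP 2.2.0)
    "logs/tutorial-data-passing-8dntv-system-container-impl/"
    "tutorial-data-passing-8dntv-system-container-impl-2572032214/",
    # Path where the log was actually archived
    "artifacts/tutorial-data-passing-8dntv/2024/11/08/"
    "tutorial-data-passing-8dntv-system-container-impl-2572032214/",
]

for prefix in prefixes:
    resp = s3.list_objects_v2(Bucket="mlpipeline", Prefix=prefix)
    keys = [obj["Key"] for obj in resp.get("Contents", [])]
    print(prefix, "->", keys or "no objects")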

@kimwnasptd
Contributor

kimwnasptd commented Nov 8, 2024

After looking closer at the source code, it looks like this was a bug in V2 that exists all the way up to KFP v2.2.0. It is, however, resolved upstream in kubeflow/pipelines#11010.

However, after trying the v2.3.0 image, the UI pod's server process kept restarting and I was seeing a lot of errors like the following:

/server/node_modules/node-fetch/lib/index.js:1491
                        reject(new FetchError(`request to ${request.url} failed, reason: ${err.message}`, 'system', err));
                               ^
FetchError: request to http://metadata/computeMetadata/v1/instance/attributes/cluster-name failed, reason: getaddrinfo EAI_AGAIN metadata
    at ClientRequest.<anonymous> (/server/node_modules/node-fetch/lib/index.js:1491:11)
    at ClientRequest.emit (node:events:517:28)
    at Socket.socketErrorListener (node:_http_client:501:9)
    at Socket.emit (node:events:517:28)
    at emitErrorNT (node:internal/streams/destroy:151:8)
    at emitErrorCloseNT (node:internal/streams/destroy:116:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  type: 'system',
  errno: 'EAI_AGAIN',
  code: 'EAI_AGAIN'
}

And in the UI I see

Error response: upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 111

This looks like it comes from a GKE-specific handler of KFP, caused by
https://github.com/kubeflow/pipelines/blob/2.3.0/frontend/server/app.ts#L237-L238, and reported upstream in kubeflow/pipelines#11247

EDIT: We also have a dedicated bug in canonical/kfp-operators#584

@kimwnasptd
Contributor

kimwnasptd commented Nov 8, 2024

In the end I managed to get the logs working by:

  1. Updating the image of the UI to gcr.io/ml-pipeline/frontend:2.3.0
  2. Adding the env var ARGO_ARCHIVE_LOGS: "true"
  3. Adding the env var DISABLE_GKE_METADATA: "true"

To resolve the above issue for CKF, we'll need to bump KFP to 2.3.0 and also allow the kfp-ui charm to set those env vars dynamically.
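Until those charm changes land, a rough sketch of the equivalent manual patch with the Kubernetes Python client, again assuming upstream-style naming (an ml-pipeline-ui deployment and container in the kubeflow namespace):

# Sketch only: bump the frontend image to 2.3.0 and set the two env vars
# needed for the UI to fetch archived logs from MinIO.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "ml-pipeline-ui",
                        "image": "gcr.io/ml-pipeline/frontend:2.3.0",
                        "env": [
                            {"name": "ARGO_ARCHIVE_LOGS", "value": "true"},
                            {"name": "DISABLE_GKE_METADATA", "value": "true"},
                        ],
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(
    name="ml-pipeline-ui", namespace="kubeflow", body=patch
)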

@NohaIhab
Contributor

NohaIhab commented Dec 3, 2024

Closed by canonical/kfp-operators#605 and canonical/argo-operators#208
The fix is now released in the 1.9/stable bundle; check out the announcement.

@NohaIhab NohaIhab closed this as completed Dec 3, 2024