Pipeline logs are disappearing after 24h #1120
Thank you for reporting your feedback! The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6494.
First of all, I managed to reproduce this: the underlying Argo Workflow was fully deleted, alongside all the Pipeline Pods in the user namespace. Then in the UI I would see the error above.
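For anyone trying the same reproduction, here is a minimal sketch of how one might confirm that the Workflow and its pods are gone, using the `kubernetes` Python client. The namespace is a placeholder and the label selector is the standard upstream Argo Workflows pod label, not values taken from this deployment:

```python
# Hedged sketch: check whether the Argo Workflow CR and its pipeline pods still exist.
from kubernetes import client, config

config.load_kube_config()
namespace = "admin"  # hypothetical user namespace

# Argo Workflows are custom resources in the argoproj.io/v1alpha1 API group.
workflows = client.CustomObjectsApi().list_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1", namespace=namespace, plural="workflows"
)
print("workflows:", [w["metadata"]["name"] for w in workflows["items"]])

# Pipeline pods created by Argo carry the workflow name as a label.
pods = client.CoreV1Api().list_namespaced_pod(
    namespace, label_selector="workflows.argoproj.io/workflow"
)
print("pods:", [p.metadata.name for p in pods.items])
```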
Looking at the
But in upstream logs from
What is weird is that in the Charmed pod we see multiple times the
Also, with a quick search in the upstream project, I see a similar issue raised. But that one seems to be resolved, and the PR with the fix is merged in KFP 2.2.0, which is used by Kubeflow 1.9.
I've managed to reproduce this with upstream Kubeflow 1.9 and 1.9.1. We can track its progress in kubeflow/pipelines#11357. A potential temporary workaround would be to play with the
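The comment above cuts off before naming the setting. Assuming it refers to Argo's default Workflow TTL (which controls how long completed Workflows, and therefore their archived state, are kept around), a minimal sketch of extending it via the upstream `workflow-controller-configmap` could look like the following; the ConfigMap name and namespace are upstream defaults and may well differ in a Charmed deployment:

```python
# Minimal sketch, ASSUMING the setting hinted at above is Argo's default Workflow TTL.
from kubernetes import client, config

config.load_kube_config()

# Default Workflow spec applied by the workflow-controller to every Workflow;
# here it keeps completed Workflows for 7 days instead of roughly 24h.
workflow_defaults = """\
spec:
  ttlStrategy:
    secondsAfterCompletion: 604800
"""

client.CoreV1Api().patch_namespaced_config_map(
    name="workflow-controller-configmap",
    namespace="kubeflow",
    body={"data": {"workflowDefaults": workflow_defaults}},
)
```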
Looking a bit more into the source code of KFP I found this. Enabling this env var results in the logs now failing with a different error.
Which means now the
(From the above we also deduce that the
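The env var itself is not named in the comment above. Upstream, the flag that the KFP UI server reads to switch log retrieval to the artifact store is `ARGO_ARCHIVE_LOGS`, so that name is assumed in this hedged sketch; the deployment/container name and namespace are upstream defaults, not the CKF charm layout, so this is only illustrative:

```python
# Hedged sketch: enable the (assumed) ARGO_ARCHIVE_LOGS flag on a vanilla ml-pipeline-ui Deployment.
from kubernetes import client, config

config.load_kube_config()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "ml-pipeline-ui",
                        "env": [{"name": "ARGO_ARCHIVE_LOGS", "value": "true"}],
                    }
                ]
            }
        }
    }
}

client.AppsV1Api().patch_namespaced_deployment(
    name="ml-pipeline-ui", namespace="kubeflow", body=patch
)
```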
After looking at the logs from
But this is wrong, since the path under which the logs are exposed in MinIO is the following:
So the kfp-ui is not calculating the path for the logs correctly in KFP 2.2.0.
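To double-check where Argo actually archived the logs versus the path the UI requests, a small sketch using the `minio` Python client can help; the endpoint, credentials, bucket and prefix below are typical KFP defaults, not values from this deployment:

```python
# Sketch for listing the archived log objects in MinIO and comparing them
# with the path the kfp-ui tries to fetch.
from minio import Minio

mc = Minio(
    "minio.kubeflow.svc.cluster.local:9000",
    access_key="minio",
    secret_key="minio123",
    secure=False,
)

# Argo archives each step's container logs as a main.log object.
for obj in mc.list_objects("mlpipeline", prefix="artifacts/", recursive=True):
    if obj.object_name.endswith("main.log"):
        print(obj.object_name)
```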
After looking closer at the source code, it looks like this was a bug in V2 that exists all the way up to KFP v2.2.0. It is resolved upstream in kubeflow/pipelines#11010. However, after trying the v2.3.0 image, the UI pod's server process kept restarting and I was seeing a lot of errors like the following:
```
/server/node_modules/node-fetch/lib/index.js:1491
    reject(new FetchError(`request to ${request.url} failed, reason: ${err.message}`, 'system', err));
           ^
FetchError: request to http://metadata/computeMetadata/v1/instance/attributes/cluster-name failed, reason: getaddrinfo EAI_AGAIN metadata
    at ClientRequest.<anonymous> (/server/node_modules/node-fetch/lib/index.js:1491:11)
    at ClientRequest.emit (node:events:517:28)
    at Socket.socketErrorListener (node:_http_client:501:9)
    at Socket.emit (node:events:517:28)
    at emitErrorNT (node:internal/streams/destroy:151:8)
    at emitErrorCloseNT (node:internal/streams/destroy:116:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  type: 'system',
  errno: 'EAI_AGAIN',
  code: 'EAI_AGAIN'
}
```
And in the UI I see
This looks to be from a specific GKE handler of KFP, caused from
EDIT: We also have a dedicated bug for this in canonical/kfp-operators#584
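To illustrate why that GKE-specific code path fails on a non-GKE cluster: the `metadata` hostname only resolves on GCE/GKE nodes, so the same request made from anywhere else cannot resolve the host at all, which is the Python equivalent of the `EAI_AGAIN` in the stack trace above:

```python
# Illustration only: reproduce the metadata-server lookup the UI attempts on GKE.
import requests

try:
    r = requests.get(
        "http://metadata/computeMetadata/v1/instance/attributes/cluster-name",
        headers={"Metadata-Flavor": "Google"},
        timeout=3,
    )
    print(r.text)
except requests.exceptions.ConnectionError as err:
    # On a non-GKE cluster this fails at name resolution.
    print("metadata server not reachable:", err)
```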
In the end I managed to get the logs working by:
In order to resolve the above issue for CKF, we'll need to bump KFP to 2.3.0 and also allow the
Closed by canonical/kfp-operators#605 and canonical/argo-operators#208
Bug Description
24 hours after creation, all logs belonging to a pipeline run disappear from the Charmed Kubeflow UI, despite the logs still being present in MinIO/mlpipeline (AWS S3). This makes it difficult to troubleshoot and track the progress or failures of pipeline runs after the 24-hour period.
To Reproduce
Expected: Logs should still be accessible.
Actual: Logs are no longer visible in the UI, but are still present in the underlying MinIO/mlpipeline (AWS S3).
Environment
CKF: 1.9/stable
minio: ckf-1.9/stable
argo-controller: 3.4/stable
Juju: 3.5.4
See the full bundle on: https://paste.ubuntu.com/p/NXXFhDqmVn/
Relevant Log Output
Additional Context
Notebook that was used to create a pipeline, run on a notebook server with a GPU:
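The notebook itself is not attached here; as a stand-in, this is an illustrative, minimal KFP v2 pipeline of the same shape, submitted from an in-cluster notebook server. All names and arguments are hypothetical:

```python
# Illustrative sketch of a minimal KFP v2 pipeline run whose logs would be affected.
from kfp import dsl
from kfp.client import Client


@dsl.component
def say_hello(name: str) -> str:
    message = f"Hello, {name}!"
    print(message)  # the kind of log line that later disappears from the UI
    return message


@dsl.pipeline(name="hello-pipeline")
def hello_pipeline(recipient: str = "world"):
    say_hello(name=recipient)


# Inside a Kubeflow notebook server the client can pick up credentials automatically.
client = Client()
client.create_run_from_pipeline_func(hello_pipeline, arguments={"recipient": "CKF"})
```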
Could be related to upstream: kubeflow/pipelines#7617