Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

k8s logtail #3758

Merged
merged 13 commits into from
Aug 24, 2024
Merged

k8s logtail #3758

merged 13 commits into from
Aug 24, 2024

Conversation

asaiacai
Copy link
Contributor

@asaiacai asaiacai commented Jul 17, 2024

Closes #3429

This swaps the container sleep with a logtailing process so that pod logs now show task logs. These are retrievable via kubectl logs <POD_NAME> and also written to /var/log/pods/*/*/*.log on the node disk, so it should integrate with most k8s logging solutions out of the box (OpenTelemetry, Fluentd, Filebeats, etc.)

Implemented as a watcher process that does tail -f if there's an open writing file handle on the task log by checking lsof and closes the tail -f program if there are no such writers.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_container_logs_multinode_kubernetes
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_container_logs_two_jobs_kubernetes
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_container_logs_two_simultaneous_jobs_kubernetes

@asaiacai asaiacai marked this pull request as ready for review July 18, 2024 23:58
@asaiacai
Copy link
Contributor Author

@romilbhardwaj I finished running the smoke tests on my end if you have a spare cycle to take a look

Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @asaiacai!

sky/templates/kubernetes-ray.yml.j2 Outdated Show resolved Hide resolved
sky/templates/kubernetes-ray.yml.j2 Outdated Show resolved Hide resolved
@asaiacai
Copy link
Contributor Author

@romilbhardwaj reran the smoke tests and manually checked kubectl logs again but let me know and I can draft up some additional tests.

@romilbhardwaj
Copy link
Collaborator

Thanks @asaiacai. Pushed some changes to reduce sleep (which was causing delays in pod termination). lmk what you think.

BTW, tests seem to be failing for me:

$ pytest tests/test_smoke.py::test_container_logs_two_simultaneous_jobs_kubernetes --kubernetes

Logs: https://gist.github.com/romilbhardwaj/4065c409f089760fb11415f0bf186262

Looks like logs are correctly tailed on the first sky launch, but logs from any subsequent sky exec are not tailed.

This was on a GKE cluster. Interestingly, the manual repro seems to work on my local sky local up cluster, but the test still fails at kubectl get pods | grep t-container-logs-two-5n-e5 | grep head | awk '{print $1}' | xargs -I {} kubectl logs {} | wc -l | grep 18.

@asaiacai
Copy link
Contributor Author

asaiacai commented Aug 22, 2024

hey @romilbhardwaj thanks for the edits! seems like there was variable delay between the task stdout/stderr being written to /var/log/pods which was occasionally longer than 5 seconds. Increasing the sleep time seemed to reliably get rid of it and i was able to run it 19 times in a row without fail on GKE. Let me know if you also see this reflected on your end.

Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @asaiacai! Tested it, works nicely. LGTM.

@romilbhardwaj romilbhardwaj added this pull request to the merge queue Aug 24, 2024
Merged via the queue into skypilot-org:master with commit 4e7478b Aug 24, 2024
20 checks passed
@asaiacai asaiacai deleted the k8s_container_logs branch September 6, 2024 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[k8s] Log SkyPilot output to container stdout
2 participants