Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: increase wait time for ray head #790

Closed
wants to merge 1 commit into from
Closed

Conversation

genlu2011
Copy link
Collaborator

@genlu2011 genlu2011 commented Aug 29, 2024

[skip ci]
During investigate the flakiness of our CI, I found the root cause for majority of ray application's flakiness is the readinessProbe.
Today in rag application test:

  1. We use kubectl wait --all pods -n ml-$SHORT_SHA-$_BUILD_ID-ray --for=condition=Ready --timeout=1200s to wait for the Ray head to be ready. When ray head is ready, it serves the traffic on address: 127.0.0.1:6379
  2. However, the readinessProbe is defined as:
    readinessProbe:
    exec:
    command:
    - bash
    - -c
    - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep
    success && wget -T 2 -q -O- http://localhost:8265/api/gcs_healthz | grep
    success
  3. The prober only probes the raylet and gcs, but not the ray service on port 6379

Therefore, in our CI tests, we see that when a job is submitted in ray application test, the connection is broken due to the server is not ready.

Change-Id: Iedd59e8a78dd1fc5a53b1e3f72e0774e9004cde0
@genlu2011 genlu2011 requested a review from andrewsykim August 29, 2024 18:42
@genlu2011
Copy link
Collaborator Author

[skip ci]

@spencer-p
Copy link
Collaborator

Could we change the readiness probe to require :6379 to be live instead of adding a 40s sleep?

@genlu2011
Copy link
Collaborator Author

Could we change the readiness probe to require :6379 to be live instead of adding a 40s sleep?

I have created a ticket to track that. The module used to create ray cluster is owned by another repo.

@genlu2011 genlu2011 closed this Aug 29, 2024
@genlu2011
Copy link
Collaborator Author

Added in #781, which also touched cloudbuild.yaml.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants