Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Defer docker image URL accessibility check to srun when not caching locally #164

Merged
merged 1 commit into from
Aug 2, 2024

Conversation

TaekyungHeo
Copy link
Member

@TaekyungHeo TaekyungHeo commented Jul 30, 2024

Summary

Defer docker image URL accessibility check to srun when not caching locally. This PR is created to fix a bug reported by @srivatsankrishnan. CloudAI has an install mode to install test templates. In the install stage, docker images are cached locally when a user sets cache_docker_images_locally to true. Even if cache_docker_images_locally is set to false, the accessibility of the given docker image URL is tested. The problem arises when the head node does not have enroot. In install mode, CloudAI calls DockerImageCacheManager's check_docker_image_exists. check_docker_image_exists uses enroot, and when the head node does not have enroot, it fails to check and then reports an unknown error as shown below.

[ERROR] Accessibility check failed for Docker image URL: .... Error: Failed to access Docker image URL: .... Unknown error.

While DockerImageCacheManager has _check_prerequisites to check prerequisites like enroot and srun, it is only called in cache_docker_image and not in check_docker_image_exists. Therefore, if check_docker_image_exists is called directly, _check_prerequisites is not called, and it fails to check the accessibility of the URL. This PR fixes the error by always returning True in check_docker_image_exists. This is the rationale: if a docker image URL is not accessible, it will be identified by the actual sbatch or srun command. Therefore, DockerImageCacheManager does not need to check it.

Test Plan

@TaekyungHeo TaekyungHeo marked this pull request as ready for review July 30, 2024 19:06
@TaekyungHeo TaekyungHeo added the bug Something isn't working label Jul 30, 2024
@amaslenn
Copy link
Contributor

It seems we have a mess of several features:

  1. Checking docker image availability
  2. Caching docker image locally
  3. Using head node's enroot or not

IMO we should control all of them separately:

  1. SystemTOML.cache_docker_image_locally - whether to cache docker image locally
  2. SystemTOML.use_head_node_enroot - whether to use head node's enroot
  3. TestScenarioTOML.check_docker_image_availability - whether to check docker image availability before running the test

Scenarios:

  1. If (1) is true, it automatically implies (3) is true as well. We can raise a warning, if (3) is false and (1) is true.
  2. If (3) is true, but (1) is false, we check the availability of the docker image, but do not cache it locally.

(2) is independent of the other two. This is about the way we check availability and/or cache the docker image. When (1) or (3) is set to true, we check this flag to decide whether to use head node's enroot or use srun for that.

Note, that if we can check availability without enroot or srun, that could simplify the approach as we would not need (2) at all.

@TaekyungHeo
Copy link
Member Author

@amaslenn, yes, your point makes sense. My concern is that your proposal may present too many options to users, which could be confusing. Perhaps we should discuss this in our design meeting, but the issue is that this is quite urgent. Do you have any suggestions, @srinivas212?

@srinivas212 srinivas212 merged commit 1951994 into NVIDIA:main Aug 2, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants