Defer docker image URL accessibility check to srun when not caching locally #164

TaekyungHeo · 2024-07-30T18:44:22Z

Summary

Defer docker image URL accessibility check to srun when not caching locally. This PR is created to fix a bug reported by @srivatsankrishnan. CloudAI has an install mode to install test templates. In the install stage, docker images are cached locally when a user sets cache_docker_images_locally to true. Even if cache_docker_images_locally is set to false, the accessibility of the given docker image URL is tested. The problem arises when the head node does not have enroot. In install mode, CloudAI calls DockerImageCacheManager's check_docker_image_exists. check_docker_image_exists uses enroot, and when the head node does not have enroot, it fails to check and then reports an unknown error as shown below.

[ERROR] Accessibility check failed for Docker image URL: .... Error: Failed to access Docker image URL: .... Unknown error.

While DockerImageCacheManager has _check_prerequisites to check prerequisites like enroot and srun, it is only called in cache_docker_image and not in check_docker_image_exists. Therefore, if check_docker_image_exists is called directly, _check_prerequisites is not called, and it fails to check the accessibility of the URL. This PR fixes the error by always returning True in check_docker_image_exists. This is the rationale: if a docker image URL is not accessible, it will be identified by the actual sbatch or srun command. Therefore, DockerImageCacheManager does not need to check it.

Test Plan

…ocally

amaslenn · 2024-07-31T09:06:34Z

It seems we have a mess of several features:

Checking docker image availability
Caching docker image locally
Using head node's enroot or not

IMO we should control all of them separately:

SystemTOML.cache_docker_image_locally - whether to cache docker image locally
SystemTOML.use_head_node_enroot - whether to use head node's enroot
TestScenarioTOML.check_docker_image_availability - whether to check docker image availability before running the test

Scenarios:

If (1) is true, it automatically implies (3) is true as well. We can raise a warning, if (3) is false and (1) is true.
If (3) is true, but (1) is false, we check the availability of the docker image, but do not cache it locally.

(2) is independent of the other two. This is about the way we check availability and/or cache the docker image. When (1) or (3) is set to true, we check this flag to decide whether to use head node's enroot or use srun for that.

Note, that if we can check availability without enroot or srun, that could simplify the approach as we would not need (2) at all.

TaekyungHeo · 2024-07-31T11:04:15Z

@amaslenn, yes, your point makes sense. My concern is that your proposal may present too many options to users, which could be confusing. Perhaps we should discuss this in our design meeting, but the issue is that this is quite urgent. Do you have any suggestions, @srinivas212?

Defer docker image URL accessibility check to srun when not caching l…

aa79e3e

…ocally

TaekyungHeo marked this pull request as ready for review July 30, 2024 19:06

TaekyungHeo added the bug Something isn't working label Jul 30, 2024

srinivas212 approved these changes Aug 2, 2024

View reviewed changes

srinivas212 merged commit 1951994 into NVIDIA:main Aug 2, 2024
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Defer docker image URL accessibility check to srun when not caching locally #164

Defer docker image URL accessibility check to srun when not caching locally #164

TaekyungHeo commented Jul 30, 2024 •

edited

Loading

amaslenn commented Jul 31, 2024

TaekyungHeo commented Jul 31, 2024

Defer docker image URL accessibility check to srun when not caching locally #164

Defer docker image URL accessibility check to srun when not caching locally #164

Conversation

TaekyungHeo commented Jul 30, 2024 • edited Loading

Summary

Test Plan

amaslenn commented Jul 31, 2024

TaekyungHeo commented Jul 31, 2024

TaekyungHeo commented Jul 30, 2024 •

edited

Loading