Installation retry #240

Open. Wants to merge 4 commits into base: main

almusil (Contributor) commented Sep 2, 2024

Installation in AI sometimes fails for various reasons; the most common one was that the registry was not accessible, so the agent could not download from it. Add a retry mechanism to the worker installation, which helps greatly with that. This code has been running in our lab for more than a week without any issues.

@almusil almusil force-pushed the install_retry branch 2 times, most recently from b9c6192 to bfac27f Compare September 2, 2024 06:19
almusil (Contributor, Author) commented Sep 6, 2024

@bn222 It is updated as we agreed; please take a look when you have some time.

@@ -99,7 +99,7 @@ def download_iso_with_retry(self, infra_env: str, path: str = os.getcwd()) -> No
             except Exception:
                 time.sleep(timeout)
         else:
-            logger.error(f"Failed to download the ISO after {retries} attempts")
+            logger.error_and_exit(f"Failed to download the ISO after {retries} attempts")
Owner

nice catch

bn222 (Owner) commented Oct 7, 2024

@almusil thanks for the rebase. We are waiting for a green CI run due to other changes. We will merge this as soon as we know it doesn't break the rest.

almusil (Contributor, Author) commented Oct 7, 2024

@bn222 Sounds good, thanks!

@almusil almusil force-pushed the install_retry branch 2 times, most recently from 2ca8c81 to 906d423 Compare October 17, 2024 08:31
Slightly simplify the deployment process by moving/creating
some additional helper functions.

Signed-off-by: Ales Musil <amusil@redhat.com>
Instead of waiting for installation steps in bulk, wait for every
step in series per node. To achieve that, make the installation
process more serial: instead of waiting for all nodes to do
something, run the steps in sequence and parallelize the
sequence. This also serves as preparation for installation
retry.

Signed-off-by: Ales Musil <amusil@redhat.com>
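
The idea in the commit above (one ordered sequence per node, with the sequences running side by side) can be sketched roughly as follows. This is only an illustration, not the code from clusterDeployer.py: the step names, `install_node`, and `install_all` are hypothetical, and a thread pool is just one possible way to parallelize.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical step names, used only for illustration.
STEPS = ("boot_iso", "wait_for_discovery", "wait_for_install")


def install_node(node: str) -> list[str]:
    """Run every step for a single node strictly in order."""
    log = []
    for step in STEPS:
        # A real step would block here until the node reaches the state.
        log.append(f"{node}:{step}")
    return log


def install_all(nodes: list[str]) -> dict[str, list[str]]:
    """Run the per-node sequences concurrently, one thread per node."""
    if not nodes:
        return {}
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        # map() preserves input order, so results line up with nodes.
        return dict(zip(nodes, pool.map(install_node, nodes)))
```

Because each node owns its whole sequence, a failure can be handled (for example retried) for that node alone without holding up the others.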
If the worker installation ends with an error, retry the installation
until it succeeds. This should mainly help with VM workers that are
deployed in larger batches.

Signed-off-by: Ales Musil <amusil@redhat.com>
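
A minimal sketch of a retry-until-success loop like the one this commit describes. The names `install_with_retry` and `install_once` are hypothetical, and the optional `max_attempts` cap is an assumption added so the loop can be bounded in tests; the commit message itself describes retrying until the installation succeeds.

```python
import time
from typing import Callable, Optional


def install_with_retry(
    install_once: Callable[[], bool],
    delay: float = 0.0,
    max_attempts: Optional[int] = None,
) -> int:
    """Call install_once() until it returns True; return the attempt count.

    With max_attempts=None the loop retries indefinitely.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        if install_once():
            return attempt
        # Back off briefly before the next attempt.
        time.sleep(delay)
    raise RuntimeError(f"installation failed after {attempt} attempts")
```

For example, a worker whose first two attempts fail would return 3 from `install_with_retry`.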
When the ISO doesn't come up in time, fail the whole deployment;
otherwise the old ISO might be used and the deployment gets stuck
waiting for known status. The nodes will never register with the
old ISO because the infraenv id doesn't match.

Signed-off-by: Ales Musil <amusil@redhat.com>
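
The diff hunk in the conversation relies on Python's `for`/`else`: the `else` branch runs only when the loop finishes without a `break`, that is, when every attempt failed. The sketch below shows that pattern in isolation. The function name and the `download` callable are hypothetical, and raising `SystemExit` stands in for the project's `logger.error_and_exit`, whose exact behavior is assumed here.

```python
import time
from typing import Callable


def download_iso_with_retry_sketch(
    download: Callable[[], None], retries: int = 3, timeout: float = 0.0
) -> None:
    """Try the download up to `retries` times, aborting if all attempts fail."""
    for _ in range(retries):
        try:
            download()
            break  # success: skip the else branch below
        except Exception:
            time.sleep(timeout)
    else:
        # Reached only when no attempt succeeded. Stopping here prevents
        # the deployment from continuing with a stale ISO.
        raise SystemExit(f"Failed to download the ISO after {retries} attempts")
```

Logging the error without exiting would let execution fall through to the next deployment step, which is the failure mode the commit message describes.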