add retry to avoid ECR get auth token failed for some transient issues #1886

wwvela · 2024-07-16T21:24:07Z

Issue #, if available:

The ECR get auth token API call might fail when a transient issue occurs, such as a brief network disconnect.
Add a retry here to prevent the node from failing in the config stage because of a transient failure.
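
As a rough illustration of the idea (not the code in this PR), a bounded retry around a token fetch could look like the sketch below. The names `withRetry`, `errTransient`, and `fakeFetch` are hypothetical, and the attempt count and delay are arbitrary values chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a short-lived failure such as a brief network drop.
var errTransient = errors.New("transient network error")

// withRetry calls fetch up to attempts times (attempts should be >= 1),
// sleeping for delay between failed tries.
func withRetry(attempts int, delay time.Duration, fetch func() (string, error)) (string, error) {
	var lastErr error
	for i := 1; i <= attempts; i++ {
		token, err := fetch()
		if err == nil {
			return token, nil
		}
		lastErr = err
		if i < attempts {
			time.Sleep(delay)
		}
	}
	return "", fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

func main() {
	calls := 0
	// fakeFetch is a hypothetical stand-in for the ECR auth token call:
	// it fails twice, then succeeds.
	fakeFetch := func() (string, error) {
		calls++
		if calls < 3 {
			return "", errTransient
		}
		return "example-token", nil
	}

	token, err := withRetry(3, 10*time.Millisecond, fakeFetch)
	fmt.Println(token, err, calls)
}
```

Bounding the number of attempts keeps a persistent failure from blocking the config stage indefinitely, while the short delay gives a brief network interruption time to clear.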

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Testing Done
make

nodeadm/internal/aws/ecr/ecr.go

ndbaker1

Trying to understand the issue more deeply.

So there's a failure case when the ECR client cannot resolve EC2 roles via IMDS: it fails to make the IMDSv2 call due to intermittent network issues and then falls back to IMDSv1, even when IMDSv1 is disabled for the instance.
This then causes the ECR request to fail without any retries, because credential resolution is part of the request finalizers and not the request itself.
Finally, since we can't control how the IMDS client is configured within the ECR client, we're just using a retry around the entire call.

Does that sound right?

wwvela · 2024-07-17T06:38:45Z

> Trying to understand the issue more deeply.
>
> 1. So there's a failure case when the ECR client [cannot resolve EC2 roles via IMDS](https://github.com/aws/aws-sdk-go-v2/blob/03768e0d0276b360a6abaa4d30318d4aedc44995/credentials/ec2rolecreds/provider.go#L181): it fails to make the IMDSv2 call due to intermittent network issues and then falls back to IMDSv1, even when IMDSv1 is disabled for the instance.
>
> 2. This then causes the ECR request to fail without any retries, because [this is part of the request finalizers](https://github.com/aws/aws-sdk-go-v2/blob/03768e0d0276b360a6abaa4d30318d4aedc44995/service/ecr/auth.go#L216-L231) and not the request itself.
>
> 3. Finally, since we can't control how the IMDS client is configured within the ECR client, we're just using a retry around the entire call.
>
> Does that sound right?

Yes, that is the whole picture. I've also attached a link showing that AL2023 has disabled IMDSv1: https://docs.aws.amazon.com/linux/al2023/ug/deprecated-al2023.html
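
To make the three points above concrete, here is a toy model of the behavior described in this thread. It is a sketch only: `client`, `resolveCreds`, `send`, and `callWithOuterRetry` are hypothetical names, not aws-sdk-go-v2 types, and the model assumes (as the comments above describe) that credential resolution runs once per call, outside the client's internal retry loop.

```go
package main

import (
	"errors"
	"fmt"
)

var errCreds = errors.New("failed to resolve credentials")

// client is a hypothetical model of an SDK client, not a real SDK type.
type client struct {
	resolveCreds func() error // runs once per call, outside the send retry loop
	send         func() error // retried internally up to sendAttempts times
	sendAttempts int
}

// call models one API invocation: credentials are resolved once, then send is retried.
func (c *client) call() error {
	if err := c.resolveCreds(); err != nil {
		return err // returns before the internal retry loop is reached
	}
	var err error
	for i := 0; i < c.sendAttempts; i++ {
		if err = c.send(); err == nil {
			return nil
		}
	}
	return err
}

// callWithOuterRetry retries the whole call, so credential failures are retried too.
func callWithOuterRetry(c *client, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = c.call(); err == nil {
			return nil
		}
	}
	return err
}

// newFlakyClient returns a client whose credential lookup fails the given
// number of times and then succeeds.
func newFlakyClient(failures int) *client {
	return &client{
		resolveCreds: func() error {
			if failures > 0 {
				failures--
				return errCreds
			}
			return nil
		},
		send:         func() error { return nil },
		sendAttempts: 3,
	}
}

func main() {
	// A single call fails even though the client retries sends internally.
	fmt.Println(newFlakyClient(1).call())
	// Retrying the entire call recovers from the one-off credential failure.
	fmt.Println(callWithOuterRetry(newFlakyClient(1), 3))
}
```

In this model the internal loop never sees the credential error, which is why the retry has to wrap the entire call rather than rely on the client's own retry settings.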

nodeadm/internal/aws/ecr/ecr.go

awslabs#1886)
* add retry to avoid fetching token failed with some transient issues
* add retry to avoid ECR get auth token failed for some transient issues
* add retry to avoid ECR get auth token failed for some transient issues

add retry to avoid fetching token failed with some transient issues

782f15b

wwvela requested a review from cartermckinnon July 16, 2024 21:24

ndbaker1 reviewed Jul 16, 2024

nodeadm/internal/aws/ecr/ecr.go (outdated, resolved)

ndbaker1 reviewed Jul 16, 2024

wwvela and others added 2 commits July 18, 2024 20:11

add retry to avoid ECR get auth token failed for some transient issues

b82d0a7

Merge branch 'awslabs:main' into main

1dc7f3c

ndbaker1 requested changes Jul 19, 2024

nodeadm/internal/aws/ecr/ecr.go (outdated, resolved)

add retry to avoid ECR get auth token failed for some transient issues

34d9939

ndbaker1 approved these changes Jul 19, 2024

wwvela merged commit a5fd2de into awslabs:main Jul 19, 2024
10 checks passed

wwvela deleted the main branch July 19, 2024 04:33
