add retry to avoid ECR get auth token failed for some transient issues #1886

Merged
merged 4 commits into from
Jul 19, 2024


wwvela
Contributor

@wwvela wwvela commented Jul 16, 2024

Issue #, if available:

  • The ECR get-auth-token API call (`GetAuthorizationToken`) may fail on a transient issue such as a brief network disconnect.
  • Add a retry here so that a transient failure does not cause the node to fail in the config stage.

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Testing Done
make

Member

@ndbaker1 ndbaker1 left a comment

trying to understand the issue more deeply.

  1. so there's a failure case when the ecr client cannot resolve EC2 roles via IMDS because it fails to make the IMDSv2 call due to intermittent network issues and then falls back to IMDSv1 even when it is disabled for the instance.
  2. This then causes the ecr request to fail without even doing retries, because this is part of the request finalizers and not the request itself.
  3. Finally, since we can't control how the imds client is configured within the ecr client, we're just using a retry around the entire call itself.

Does that sound right?

@wwvela
Contributor Author

wwvela commented Jul 17, 2024

> trying to understand the issue more deeply.
>
> 1. so there's a failure case when the ecr client [cannot resolve EC2 roles via IMDS](https://github.com/aws/aws-sdk-go-v2/blob/03768e0d0276b360a6abaa4d30318d4aedc44995/credentials/ec2rolecreds/provider.go#L181) because it fails to make the IMDSv2 call due to intermittent network issues and then falls back to IMDSv1 even when it is disabled for the instance.
>
> 2. This then causes the ecr request to fail without even doing retries, because [this is part of the request finalizers](https://github.com/aws/aws-sdk-go-v2/blob/03768e0d0276b360a6abaa4d30318d4aedc44995/service/ecr/auth.go#L216-L231) and not the request itself.
>
> 3. Finally, since we can't control how the imds client is configured within the ecr client, we're just using a retry around the entire call itself.
>
> Does that sound right?

Yes, that is the whole picture. I've also attached a link showing that AL2023 has disabled IMDSv1: https://docs.aws.amazon.com/linux/al2023/ug/deprecated-al2023.html

nodeadm/internal/aws/ecr/ecr.go (review thread: outdated, resolved)
@wwvela wwvela merged commit a5fd2de into awslabs:main Jul 19, 2024
10 checks passed
@wwvela wwvela deleted the main branch July 19, 2024 04:33
mebays pushed a commit to mebays/amazon-eks-ami that referenced this pull request Jul 26, 2024
awslabs#1886)

* add retry to avoid fetching token failed with some transient issues

* add retry to avoid ECR get auth token failed for some transient issues

* add retry to avoid ECR get auth token failed for some transient issues