AWS AssumeRole credentials don't work in latest 0.4.1-dev #826

Closed
jaltavilla opened this issue Jan 10, 2024 · 4 comments

@jaltavilla

We updated to the latest (at this time) nightly build (hashicorppreview/nomad-autoscaler:0.4.1-dev-50d5105101b410b344a8457580da735f019c2033) to get the changes in #807. When we did, the asg plugin was no longer able to retrieve aws credentials. It would give the following error message:

[WARN]  policy_manager.policy_handler: failed to get target status:
policy_id=5e025564-89be-c136-a619-64c752b6fad6 error="failed to describe
 AWS Autoscaling Group: operation error Auto Scaling:
DescribeAutoScalingGroups, get identity: get credentials: failed to
refresh cached credentials, operation error STS: AssumeRole, get
identity: get credentials: failed to refresh cached credentials, no EC2
IMDS role found, not found, Signing"

I spent most of a day investigating this. The only thing I could find that potentially explained this was the update of the aws sdk version. I went to a slightly older nightly build (hashicorppreview/nomad-autoscaler:0.4.1-dev-ac515347b40ae10823106b7d844078fbc42a1971) and credentials started working again.

I noticed that the commit in the nightly build only updated some aws packages and there are subsequent commits updating more packages that didn't result in a build. So I'm not sure if the problem is mismatched package versions or an actual bug in the aws sdk. (If it's in aws I suspect aws/aws-sdk-go-v2#2438, but I couldn't see it).

Thanks!


AWS Configuration

The aws plugin is configured as follows in the hcl config:

target "aws-asg" {
  driver = "aws-asg"
  config = {
    aws_region            = "us-east-1"

    # Leaving the following values either unset or blank should lead to using the standard credential chain
    # As a result this should allow assuming a role to perform aws actions via the ~/.aws/config
    # aws_access_key_id: ""
    # aws_secret_access_key: ""
    # aws_credential_provider: ""
  }
}

Then in the container user's home dir there is the aws config (/home/nomad-autoscaler/.aws/config):

[default]
credential_source = Ec2InstanceMetadata
session_name = "mgp-dev-nomad-autoscaler"
region = us-east-1
role_arn = arn:aws:iam::<redacted>:role/mgp-dev-role-nomad-autoscaler-AutoscalerRole
external_id = <redacted>

The autoscaler role has the required aws permissions from the autoscaler docs, along with a trust relationship allowing our ec2 role to assume it with the proper external id. The ec2 instance role has permission to use 'sts:AssumeRole'.
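To make the expected credential flow concrete, here is a minimal sketch that uses only the Go standard library. It reads a profile like the one above and reports the two-step chain it implies: source credentials from EC2 instance metadata, then `sts:AssumeRole` into `role_arn`. The `parseProfile` and `describeChain` helpers are hypothetical and simplified for illustration; they are not how the AWS SDK actually loads the file.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseProfile collects key = value pairs from one profile section of an
// AWS shared config file. Hypothetical, simplified parser: it ignores
// nested properties and line continuations.
func parseProfile(content, profile string) map[string]string {
	out := map[string]string{}
	inProfile := false
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, ";") {
			continue
		}
		if strings.HasPrefix(line, "[") && strings.HasSuffix(line, "]") {
			name := strings.TrimSpace(line[1 : len(line)-1])
			name = strings.TrimSpace(strings.TrimPrefix(name, "profile "))
			inProfile = name == profile
			continue
		}
		if !inProfile {
			continue
		}
		key, val, ok := strings.Cut(line, "=")
		if !ok {
			continue
		}
		out[strings.TrimSpace(key)] = strings.Trim(strings.TrimSpace(val), `"`)
	}
	return out
}

// describeChain summarizes which credential steps the profile implies.
func describeChain(p map[string]string) string {
	role, src := p["role_arn"], p["credential_source"]
	switch {
	case role != "" && src != "":
		return fmt.Sprintf("%s -> sts:AssumeRole(%s)", src, role)
	case role != "":
		return "role_arn set without credential_source"
	default:
		return "no role assumption configured"
	}
}

func main() {
	// Same shape as the config above; the redacted values stay redacted.
	cfg := `[default]
credential_source = Ec2InstanceMetadata
region = us-east-1
role_arn = arn:aws:iam::<redacted>:role/mgp-dev-role-nomad-autoscaler-AutoscalerRole
external_id = <redacted>
`
	fmt.Println(describeChain(parseProfile(cfg, "default")))
}
```

The error message matches a failure in the first step of this chain: the STS AssumeRole call could not be signed because no instance role credentials were found via IMDS.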

@lgfa29
Contributor

lgfa29 commented Jan 12, 2024

Hi @jaltavilla 👋

Thank you for testing the nightly build and catching this bug!

Looking at the commit history, I also can't see anything other than the SDK upgrade that could've caused this regression. Checking the go.mod diff, we also didn't bump the versions very far:

❯ git diff v0.4.0 HEAD -- go.mod
diff --git a/go.mod b/go.mod
index 7d3d524..6581553 100644
--- a/go.mod
+++ b/go.mod
@@ -10,12 +10,12 @@ require (
        github.com/DataDog/datadog-api-client-go v1.16.0
        github.com/armon/go-metrics v0.4.1
        github.com/aws/aws-sdk-go-v2 v1.24.0
-       github.com/aws/aws-sdk-go-v2/config v1.26.1
-       github.com/aws/aws-sdk-go-v2/credentials v1.16.12
-       github.com/aws/aws-sdk-go-v2/service/autoscaling v1.36.5
+       github.com/aws/aws-sdk-go-v2/config v1.26.2
+       github.com/aws/aws-sdk-go-v2/credentials v1.16.13
+       github.com/aws/aws-sdk-go-v2/service/autoscaling v1.36.6
        github.com/golang/protobuf v1.5.3
        github.com/google/go-cmp v0.6.0
-       github.com/hashicorp/go-hclog v1.6.1
+       github.com/hashicorp/go-hclog v1.6.2
        github.com/hashicorp/go-msgpack v1.1.5
        github.com/hashicorp/go-multierror v1.1.1
        github.com/hashicorp/go-plugin v1.6.0
@@ -24,14 +24,14 @@ require (
        github.com/mitchellh/cli v1.1.5
        github.com/mitchellh/copystructure v1.2.0
        github.com/mitchellh/go-homedir v1.1.0
-       github.com/prometheus/client_golang v1.17.0
+       github.com/prometheus/client_golang v1.18.0
        github.com/prometheus/common v0.45.0
        github.com/shoenig/test v1.7.0
        github.com/stretchr/testify v1.8.4
        golang.org/x/text v0.14.0
-       google.golang.org/api v0.153.0
-       google.golang.org/grpc v1.59.0
-       google.golang.org/protobuf v1.31.0
+       google.golang.org/api v0.154.0
+       google.golang.org/grpc v1.60.1
+       google.golang.org/protobuf v1.32.0
 )

 require (
@@ -61,7 +61,7 @@ require (
        github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.10.9 // indirect
        github.com/aws/aws-sdk-go-v2/service/sso v1.18.5 // indirect
        github.com/aws/aws-sdk-go-v2/service/ssooidc v1.21.5 // indirect
-       github.com/aws/aws-sdk-go-v2/service/sts v1.26.5 // indirect
+       github.com/aws/aws-sdk-go-v2/service/sts v1.26.6 // indirect
        github.com/aws/smithy-go v1.19.0 // indirect
        github.com/beorn7/perks v1.0.1 // indirect
        github.com/bgentry/speakeasy v0.1.0 // indirect
@@ -71,6 +71,9 @@ require (
        github.com/davecgh/go-spew v1.1.1 // indirect
        github.com/dimchansky/utfbom v1.1.1 // indirect
        github.com/fatih/color v1.15.0 // indirect
+       github.com/felixge/httpsnoop v1.0.4 // indirect
+       github.com/go-logr/logr v1.3.0 // indirect
+       github.com/go-logr/stdr v1.2.2 // indirect
        github.com/golang-jwt/jwt/v4 v4.5.0 // indirect
        github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da // indirect
        github.com/google/s2a-go v0.1.7 // indirect
@@ -104,19 +107,23 @@ require (
        github.com/pkg/errors v0.9.1 // indirect
        github.com/pmezard/go-difflib v1.0.0 // indirect
        github.com/posener/complete v1.2.3 // indirect
-       github.com/prometheus/client_model v0.4.1-0.20230718164431-9a2bf3000d16 // indirect
-       github.com/prometheus/procfs v0.11.1 // indirect
+       github.com/prometheus/client_model v0.5.0 // indirect
+       github.com/prometheus/procfs v0.12.0 // indirect
        github.com/shopspring/decimal v1.3.1 // indirect
        github.com/spf13/cast v1.5.0 // indirect
        github.com/tv42/httpunix v0.0.0-20150427012821-b75d8614f926 // indirect
        github.com/zclconf/go-cty v1.13.0 // indirect
        go.opencensus.io v0.24.0 // indirect
+       go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.46.1 // indirect
+       go.opentelemetry.io/otel v1.21.0 // indirect
+       go.opentelemetry.io/otel/metric v1.21.0 // indirect
+       go.opentelemetry.io/otel/trace v1.21.0 // indirect
        golang.org/x/crypto v0.17.0 // indirect
        golang.org/x/exp v0.0.0-20230728194245-b0cb94b80691 // indirect
-       golang.org/x/net v0.18.0 // indirect
-       golang.org/x/oauth2 v0.14.0 // indirect
+       golang.org/x/net v0.19.0 // indirect
+       golang.org/x/oauth2 v0.15.0 // indirect
        golang.org/x/sys v0.15.0 // indirect
-       google.golang.org/appengine v1.6.7 // indirect
-       google.golang.org/genproto/googleapis/rpc v0.0.0-20231120223509-83a465c0220f // indirect
+       google.golang.org/appengine v1.6.8 // indirect
+       google.golang.org/genproto/googleapis/rpc v0.0.0-20231127180814-3a041ad873d4 // indirect
        gopkg.in/yaml.v3 v3.0.1 // indirect
 )

The changelogs for these upgrades also don't have anything that looks particularly relevant.

Looking through the recent open issues this one may be related: aws/aws-sdk-go-v2#2449

Since I'm planning to cut a release soon-ish I think I will just revert the upgrade for now.

Thank you for flagging this!

@lgfa29
Contributor

lgfa29 commented Jan 15, 2024

Hi @jaltavilla 👋

When you get a chance, would you be able to try the latest nightly release, which includes the reverted AWS SDK upgrade?

Thanks in advance!

@jaltavilla
Author

Hi @lgfa29

I tested hashicorppreview/nomad-autoscaler:0.4.1-dev-2b4553a164f3bf556fd72193ef65c3c809423733. The aws plugin did not have any permissions issues. It was able to change the scale of the asg (and honored the scale protection flag).

Thanks for fixing this!

@lgfa29
Contributor

lgfa29 commented Jan 17, 2024

Thank you for the update. Reading the AWS issue I mentioned, it seems like the fix is to upgrade all AWS modules at once, including indirect dependencies. In the diff I see that some were not upgraded (like github.com/aws/aws-sdk-go-v2/service/sso), so this may be the problem.

But I will leave things as they are for now until we have time to test the upgrade ourselves.

Thank you for the help in validating the fix!
