fix(deadline): configure identity registration settings using RenderQueue backend security group #633

Merged
merged 1 commit into from
Nov 18, 2021

@jusiskin jusiskin commented Nov 15, 2021

Fixes #632

Problem

See #632.

The root cause of the problem is that the internal DeploymentInstance created by the RenderQueue to configure Deadline Secrets Management identity registration settings performs a number of actions that require network access to AWS service API endpoints and to PyPI (to fetch the boto3 package).

When this instance is deployed in an isolated subnet (no internet access), PyPI cannot be reached. Likewise, when the DeploymentInstance cannot reach the VPC endpoints, the user-data commands fail. In either case, the user-data is unable to send any signal to CloudFormation, and the signal timeout rolls back the CloudFormation deployment.

Solution

VPC Endpoint Reachability

To deploy the RenderQueue into isolated subnets, the VPC supplied to the RenderQueue must be configured with sufficient VPC gateway/interface endpoints. The subnets specified by the vpcSubnets prop supplied to the RenderQueue must be configured with routes to the endpoints.

In the case of VPC interface endpoints, each VPC interface endpoint will have an associated security group. Those security groups must allow ingress from the security groups of the DeploymentInstance. In RFDK 0.38.0, the DeploymentInstance created by the RenderQueue was created with its own Security Group, and there was no public API for RFDK users to access it. To overcome this, the following changes were made:

  1. Modified the DeploymentInstance internal construct to accept a securityGroup property and forward it to the AutoScalingGroup it creates
  2. Modified the RenderQueue to construct its DeploymentInstance using its existing backend security group (used by the RenderQueue's Auto-Scaling Group)

By having the DeploymentInstance share a security group with the other backend infrastructure of the RenderQueue (currently just the Auto-Scaling Group providing ECS capacity), users can then use the RenderQueue.backendConnections API to permit network traffic from the DeploymentInstance to the security group associated with the required VPC interface endpoints.

RFDK users who deployed the RenderQueue into isolated subnets prior to RFDK 0.38.0 and used the RenderQueue.backendConnections API to permit traffic to their VPC interface endpoints will require no further changes. Similarly, users who instead used RenderQueue.asg.connections to permit access to their VPC interface endpoints will require no change to their code.
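
As a rough illustration of the backendConnections usage described above, the following Python sketch permits traffic from the RenderQueue backend to a set of VPC interface endpoints. It is only a sketch under assumptions: the helper name allow_backend_to_endpoints is hypothetical, backend_connections is assumed to be the Python-binding spelling of RenderQueue.backendConnections, and each endpoint is assumed to be a CDK connectable object with a default port. The function is duck-typed so it does not import the CDK itself.

```python
def allow_backend_to_endpoints(render_queue, interface_endpoints):
    """Permit traffic from the RenderQueue backend to each interface endpoint.

    Assumptions (not taken from this PR's code):
      * render_queue.backend_connections is the Python-binding name for
        RenderQueue.backendConnections (a CDK Connections object).
      * each endpoint is connectable and defines a default port.
    """
    for endpoint in interface_endpoints:
        # Adds a rule allowing the backend security group to reach the
        # endpoint's security group on the endpoint's default port.
        render_queue.backend_connections.allow_to_default_port(
            endpoint,
            "RenderQueue backend to VPC interface endpoint",
        )
```

Because the DeploymentInstance now shares the backend security group, a single rule per endpoint should cover both the ECS capacity and the DeploymentInstance.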

PyPI Reachability

To allow the RenderQueue to be deployed into private subnets with no route to an internet gateway, the logic in configure_identity_registration_settings.py was changed from using boto3 to launching a sub-process that invokes the equivalent AWS CLI commands.
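
The general shape of swapping a boto3 call for an AWS CLI sub-process might look like the sketch below. It is not the code from this PR: the function names are hypothetical, and reading a secret with `aws secretsmanager get-secret-value` is an assumed example of an "equivalent AWS CLI command". It also assumes the AWS CLI is already installed on the instance.

```python
import json
import subprocess


def build_get_secret_command(secret_arn, region):
    """Build the AWS CLI argument list for reading one secret (hypothetical helper)."""
    return [
        "aws", "secretsmanager", "get-secret-value",
        "--secret-id", secret_arn,
        "--region", region,
        "--output", "json",
    ]


def fetch_secret_string(secret_arn, region, run=subprocess.run):
    """Read a secret's string value through the AWS CLI instead of boto3.

    `run` is injectable so the function can be exercised without the CLI.
    """
    completed = run(
        build_get_secret_command(secret_arn, region),
        check=True,           # raise CalledProcessError on a non-zero exit
        capture_output=True,  # collect stdout/stderr rather than inheriting them
        text=True,            # decode output as str
    )
    return json.loads(completed.stdout)["SecretString"]
```

Using only the standard library here means nothing has to be downloaded from PyPI when the user-data runs.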

Testing

Reproduced the issue with the following CDK app: minimal-rfdk-isolated-rq-deadline-sm-reproducer.tar.gz

To reproduce the problem, you must extract the archive, then from the extracted directory, run the following commands:

npm install
npm run stage
npm run build
npx cdk deploy "*"

To test the fix, I built and packaged the RFDK using this PR branch. Once built and packaged, I then installed the package using these instructions. Finally, I re-deployed with:

npx cdk deploy "*"

and confirmed that the deployment succeeds and that the identity registration settings get applied correctly.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@jusiskin jusiskin added bug This issue is a bug. contribution/core This is a PR that came from AWS. labels Nov 15, 2021
@jusiskin jusiskin force-pushed the rq_isolated_subnets_sm_deploy_error branch from e55002a to a8ea4e6 Compare November 17, 2021 16:48
@jusiskin jusiskin marked this pull request as ready for review November 17, 2021 18:59
@jericht jericht self-requested a review November 17, 2021 22:27
@jusiskin jusiskin force-pushed the rq_isolated_subnets_sm_deploy_error branch from a8ea4e6 to 6ca9dac Compare November 18, 2021 15:59

@ddneilson ddneilson left a comment

LGTM

@jusiskin jusiskin merged commit 35bb326 into aws:mainline Nov 18, 2021
@jusiskin jusiskin deleted the rq_isolated_subnets_sm_deploy_error branch November 18, 2021 18:54
Development

Successfully merging this pull request may close these issues.

Deploying RenderQueue into isolated subnets and Secrets Management enabled fails