fix(deadline): Windows WorkerInstanceFleets with Deadline 10.1.14 installed fail deployment #354

Merged 1 commit into aws:mainline on Mar 22, 2021

Conversation

jusiskin
Contributor

Problem

Primarily #353, but also fixes #312.

The WorkerInstanceFleet construct fails to deploy its ASG instances when configured to deploy Windows AMIs with Deadline 10.1.14.x.

The root cause is a Deadline bug: stopping or restarting the launcher Windows service while it is still starting up can hang. The Windows service gets stuck in a stopping state, and the only way to recover is to kill the process.

RFDK restarts the launcher twice in relatively quick succession when configuring workers, so there is a race condition between the first restart and the second. Once Deadline 10.1.14 was available to RFDK, that race became more likely to be lost. As a result, the user data that RFDK generates to configure Windows workers never runs to completion, the CloudFormation signal is never sent, and deployments fail.

Solution

Modified the RenderQueueConnection subclasses to accept an optional restartLauncher property. It defaults to true to preserve backwards-compatible behavior when this API is called directly. WorkerInstanceConfiguration sets the flag to false, because it has to restart the Deadline Launcher again later anyway, after configuring the remote command listener ports. Reducing the number of launcher service restarts to one avoids the race condition.
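
For readers skimming the description, the shape of the change can be sketched roughly as follows. This is an illustrative sketch only: `ConnectionOptions`, `renderConnectionCommands`, `renderWorkerCommands`, and the command strings are hypothetical placeholders, not the actual RFDK API or the real user data script. Only the `restartLauncher` name and its default of true come from this PR.

```typescript
// Hypothetical sketch; names other than `restartLauncher` are assumptions.
interface ConnectionOptions {
  /** Restart the Deadline Launcher after configuring the connection. Defaults to true. */
  readonly restartLauncher?: boolean;
}

// Placeholder strings standing in for the real script lines.
const CONFIGURE_CONNECTION = 'configure-render-queue-connection';
const CONFIGURE_LISTENER_PORT = 'configure-listener-port';
const RESTART_LAUNCHER = 'restart-launcher';

// Direct callers that omit the flag keep the old behavior (one restart).
function renderConnectionCommands(options: ConnectionOptions = {}): string[] {
  const restartLauncher = options.restartLauncher ?? true;
  const commands = [CONFIGURE_CONNECTION];
  if (restartLauncher) {
    commands.push(RESTART_LAUNCHER);
  }
  return commands;
}

// The worker configuration opts out of the first restart because it
// restarts the launcher itself after setting the listener port.
function renderWorkerCommands(): string[] {
  return [
    ...renderConnectionCommands({ restartLauncher: false }),
    CONFIGURE_LISTENER_PORT,
    RESTART_LAUNCHER,
  ];
}
```

In this sketch, calling `renderConnectionCommands()` with no arguments still emits a restart, while `renderWorkerCommands()` emits exactly one restart, as its final step.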

Testing

Ran the integration tests using this branch and they now all pass.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@jusiskin jusiskin added the contribution/core This is a PR that came from AWS. label Mar 21, 2021
@jusiskin jusiskin changed the title fix(deadline): Windows Workers fail to deploy waiting for Deadline la… fix(deadline): Windows WorkerInstanceFleets with Deadline 10.1.14 installed fail deployment Mar 21, 2021
@jusiskin jusiskin linked an issue Mar 21, 2021 that may be closed by this pull request
@ddneilson ddneilson self-requested a review March 22, 2021 14:09
' Restart-Service "deadline10launcherservice"',
'} Else {',
' & "$DEADLINE_PATH/deadlinelauncher.exe" -shutdownall 2>&1',
' & "$DEADLINE_PATH/deadlinelauncher.exe" 2>&1',
Contributor

I see below that you added the missing -nogui options. Add it here as well?

Contributor Author

Good catch - fixed

@jusiskin jusiskin force-pushed the fix_windows_workers_service branch from 8b3018f to a4409c9 Compare March 22, 2021 15:27
horsmand
horsmand previously approved these changes Mar 22, 2021
ddneilson
ddneilson previously approved these changes Mar 22, 2021
@jusiskin jusiskin dismissed stale reviews from ddneilson and horsmand via 9486877 March 22, 2021 18:03
@jusiskin jusiskin force-pushed the fix_windows_workers_service branch 2 times, most recently from 9486877 to 7e1872e Compare March 22, 2021 18:06
@horsmand horsmand merged commit a508ebb into aws:mainline Mar 22, 2021