Service disappeared while waiting for it to reach a steady state #179

Closed
joeduffy opened this issue Mar 21, 2018 · 9 comments
Labels
area/providers kind/bug Some behavior is incorrect or out of spec

Comments

@joeduffy
Member

joeduffy commented Mar 21, 2018

We saw an intermittent failure when updating PPCs just now:

Plan apply failed: creating urn:pulumi:learningmachine-ppc-prod::pulumi-service-ppc::cloud:service:Service$aws:ecs/service:Service::learningmachine-ppc-prod-updates: Service arn:aws:ecs:us-east-1:396386917111:service/learningmachine-ppc-prod-updates-99c2e88 disappeared while waiting for it to reach a steady state

The full log is available at https://api.travis-ci.com/v3/job/115796809/log.txt?log.token=DGdrKxgiMyHZ73itqw0oyw.

It seems likely this is related to the changes pertaining to pulumi/pulumi-cloud#312.

@joeduffy added the kind/bug (Some behavior is incorrect or out of spec) and area/providers labels Mar 21, 2018
@joeduffy modified the milestones: 0.14, 0.12 Mar 21, 2018
@mmdriley
Contributor

The only difference between our behavior and aws ecs wait services-stable is that we don't look at the failures attribute of the output. It's possible that this API request failed to return the service for a reason other than the service being missing.
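To make that difference concrete, here is a minimal Python sketch of the check a waiter could perform. It assumes a response dict shaped like the JSON printed by aws ecs describe-services: a services list whose entries carry serviceArn, and a failures list whose entries carry arn and reason, with MISSING as the reason ECS gives for a nonexistent service. The function name classify_describe_services is hypothetical; this is not the provider's actual code.

```python
def classify_describe_services(response: dict, service_arn: str) -> str:
    """Return "found", "missing", or "error" for one requested service.

    Assumes `response` mirrors the DescribeServices JSON shape:
    {"services": [{"serviceArn": ...}], "failures": [{"arn": ..., "reason": ...}]}.
    """
    for svc in response.get("services", []):
        if svc.get("serviceArn") == service_arn:
            return "found"
    for failure in response.get("failures", []):
        if failure.get("arn") == service_arn:
            # Only reason MISSING means "no such service"; any other
            # reason says the lookup itself failed for some other cause.
            return "missing" if failure.get("reason") == "MISSING" else "error"
    # Neither returned nor reported as a failure: an unexplained result,
    # which should not be read as "the service disappeared".
    return "error"
```

A waiter built on this would keep polling on "error" instead of immediately reporting that the service disappeared.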

@lukehoban
Contributor

Possibly an eventual consistency issue where the service was created but the API didn't return it?

Note that we'll also need pulumi/pulumi#992 to be able to recover from this sort of failure without leaking resources. We need to be able to record that the Create did actually partially create the resource, even though it "failed".

@mmdriley
Contributor

The logs are a bit easier to read at the Travis job page: https://travis-ci.com/pulumi/pulumi-service/jobs/115796809

@mmdriley
Contributor

Unfortunately, there is evidence upstream that ECS can return "not found" for newly created resources: hashicorp/terraform-provider-aws@7551f92
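One general way to tolerate this kind of eventual consistency is to accept "not found" for a short grace period after creation before treating it as a real disappearance. The Python sketch below assumes a hypothetical check() callback that returns "stable", "pending", or "missing"; the grace period, timeout, and poll interval are illustrative values, and this is not a copy of the upstream Terraform change.

```python
import time
from typing import Callable


def wait_for_steady_state(
    check: Callable[[], str],
    timeout: float = 600.0,
    not_found_grace: float = 60.0,
    poll_interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll check() until it reports "stable".

    check() is assumed to return "stable", "pending", or "missing".
    "missing" is tolerated during the first `not_found_grace` seconds,
    since a freshly created service may not be visible yet; after that
    it is treated as a real disappearance.
    """
    start = clock()
    while True:
        status = check()
        elapsed = clock() - start
        if status == "stable":
            return
        if status == "missing" and elapsed >= not_found_grace:
            raise RuntimeError("service disappeared while waiting for steady state")
        if elapsed >= timeout:
            raise TimeoutError("timed out waiting for steady state")
        sleep(poll_interval)
```

The clock and sleep parameters exist only so the loop can be driven deterministically in tests; a real waiter would use the defaults.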

@lukehoban
Contributor

Sounds like we need the same logic as in https://github.com/terraform-providers/terraform-provider-aws/pull/3485/files?

BTW, should we add support for Partial to that PR as well? We don't support it yet in Pulumi (see pulumi/pulumi#992), but once we do, we'll want cases like this to be able to record in the checkpoint that the AWS resource was created, even if the Create fails after that.

@mmdriley
Contributor

This change didn't add any use of Partial, in part because I'm not sure it's applicable here -- in this case, all of the output properties we set (in particular Id) are valid, so we're okay having them all written out.

I think Terraform responds better to this than we do because they Refresh.

@lukehoban
Contributor

> This change didn't add any use of Partial, in part because I'm not sure it's applicable here -- in this case, all of the output properties we set (in particular Id) are valid, so we're okay having them all written out.

I'm not sure those are the TF semantics for Partial. I believe that if you do not enable "Partial" mode and there is a failure, no state will be written back, because they cannot be sure whether it is safe to do so. My understanding (though documentation is sparse and implementations are inconsistent) is that you must set "Partial" in order for any state to be written back when there is a failure.

Would love to learn that I'm wrong though.

> I think Terraform responds better to this than we do because they Refresh.

That helps only if you've written the "ID" back into the state even when the create fails. Otherwise you still lose track of the fact that you created the resource, and it leaks.
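The point about writing the ID back can be sketched as a create step that returns both an ID and an error, plus a caller that persists the ID whenever one exists. Checkpoint, create_and_record, and the (id, err) return convention are hypothetical stand-ins assumed for illustration; they are not Pulumi's or Terraform's actual interfaces, and pulumi/pulumi#992 tracks the real engine support.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Checkpoint:
    """Hypothetical stand-in for the engine's record of created resources."""
    resources: Dict[str, str] = field(default_factory=dict)  # URN -> physical ID


def create_and_record(
    checkpoint: Checkpoint,
    urn: str,
    create: Callable[[], Tuple[Optional[str], Optional[Exception]]],
) -> Optional[Exception]:
    """Run a create and persist its ID even if the create reports an error.

    A non-None ID together with a non-None error means the cloud resource
    exists but a later step (e.g. waiting for steady state) failed.
    Recording the ID lets a later refresh or delete find the resource
    instead of leaking it.
    """
    resource_id, err = create()
    if resource_id is not None:
        checkpoint.resources[urn] = resource_id
    return err
```

With this shape, a refresh that reads the resource back by its recorded ID can decide whether it still exists, which is the behavior the Refresh comment relies on.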

@mmdriley
Contributor

Strongly agree that Partial/SetPartial are poorly documented and very inconsistently used. I think my description above is about right, but only because I got lucky.

To confirm, I looked at ResourceData in Terraform to understand what it does with the values of partial and partialMap as set by Partial(bool) and SetPartial(string). It seems like the logic can roughly be described as:

If Partial(true) has been called, then attributes for which SetPartial("attribute") was called will be read from the new state, and other attributes will be read from the old state.

However, Id is special and Partial seems to have no effect on how it's written.
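The rule above can be modeled in a few lines of Python. This is a toy model of the behavior as described in this comment, not Terraform's ResourceData (which is Go). PartialState and its method names are hypothetical, and the handling of an attribute that is marked partial but absent from the new state is an assumption.

```python
from typing import Dict, Set


class PartialState:
    """Toy model of the Partial/SetPartial rule described above."""

    def __init__(self, old: Dict[str, str], new: Dict[str, str]):
        self.old = dict(old)
        self.new = dict(new)
        self.partial = False
        self.partial_keys: Set[str] = set()

    def set_partial_mode(self, on: bool) -> None:
        # Stands in for Partial(bool).
        self.partial = on

    def set_partial(self, key: str) -> None:
        # Stands in for SetPartial(string).
        self.partial_keys.add(key)

    def state_to_persist(self) -> Dict[str, str]:
        if self.partial:
            # Start from the old state; only marked attributes take new values.
            merged = dict(self.old)
            for key in self.partial_keys:
                if key in self.new:
                    merged[key] = self.new[key]
                else:
                    # Assumption: a marked key missing from the new state is dropped.
                    merged.pop(key, None)
        else:
            merged = dict(self.new)
        # Per the comment, Partial has no effect on the ID: take it from the new state.
        if "id" in self.new:
            merged["id"] = self.new["id"]
        return merged
```

In an Update, a provider following this model could enable partial mode and call set_partial after each attribute is successfully applied, so a failure partway through persists only the changes that actually happened.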

@mmdriley
Contributor

> If Partial(true) has been called, then attributes for which SetPartial("attribute") was called will be read from the new state, and other attributes will be read from the old state.

I think we can read some intent here. Partial(true) means: these are the only attributes I care about right now, and all the others should stay the same. That seems especially useful in Update operations, where attributes that weren't updated don't have to be re-read.

3 participants