resource with create_before_destroy is both created and destroyed before updating dependent resource when dependent resource is in a parent module #17735
Comments
Hi @samjgalbraith, thanks for the complete configuration example to replicate the issue here! If you run this and watch the logs, you can see that the order is correct and the new role is created before the old one is destroyed. This is an eventual consistency issue with the IAM role itself, and a workaround to help with it has already been merged into the aws provider. Thanks!
Thanks for looking at that @jbardin. It does seem like you're right about the order when looking at the logs. One thing that still bothers me about this is that it definitely destroys the old role even though the update fails. Surely the destruction of the dependency should only happen after the references to it have been successfully updated.
I'd have to think about whether it would be possible to move the destroy node for the deposed resource later in the graph without causing cycles in all cases. The problem, however, is that once one of the old dependencies fails and terraform errors out, it no longer references the old destroy node, and there would be no way to remove it in the same order in a subsequent run, so the benefit is minimal. Also, removing it later on wouldn't change the result here at all, because it's not the removal of the old resource that is causing the failure, it's the fact that the new one hasn't fully propagated across the remote API endpoint.
You're right that the IAM race condition is what's causing the Terraform apply failure here. That's not the problem I care about in this issue; it probably got lost in the wall of text, but I mentioned it above the terraform output block. I understand the main point of create_before_destroy is to transform infrastructure such that at every step your infrastructure is always valid and available. I regret using this example because that's a bit of a red herring. It's just an example of a failure which errors the apply partway through, revealing that the plan allows the infrastructure to be broken - and I mean the infrastructure is actually broken, such that the Lambda function won't execute anymore until you intervene and re-apply, not just that the apply failed to complete. This is a problem bigger than the IAM race condition. The plan is bad.

I've just output the graphs for the plan and they're definitely consistent with my complaint:

When the role is created in the same module as the Lambda, the role destruction waits for a successful Lambda update, exactly as I expect for zero downtime (even if a step fails) - there's a single branch through the plan graph that goes role create | lambda update | role destroy, and no failure in any step leaves the infrastructure in a broken state. Even though the apply still fails in this case because of the IAM race condition, crucially, the infrastructure is never broken at any point.

When the role is declared in a child module and its ARN given as a module output, destroying the old role does not block on a successful update of the Lambda - the role destroy is in a plan branch parallel to the lambda update. There's something about being in a different module that causes the plan graph to be different.

Everything in one module - works as expected, with role destroy waiting for successful lambda update:

provider "aws" {
  region = "ap-southeast-2"
}

data "archive_file" "lambda_content" {
  source_file = "${path.module}/my_test_function.py"
  output_path = "${path.module}/my_test_function.zip"
  type        = "zip"
}

resource "aws_lambda_function" "my_test_function" {
  function_name    = "my_test_function"
  handler          = "my_test_function.lambda_handler"
  role             = "${aws_iam_role.role.arn}"
  runtime          = "python2.7"
  filename         = "${data.archive_file.lambda_content.output_path}"
  source_code_hash = "${data.archive_file.lambda_content.output_base64sha256}"
  publish          = true
}

data "aws_iam_policy_document" "trust_relationship_policy" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "role" {
  assume_role_policy = "${data.aws_iam_policy_document.trust_relationship_policy.json}"
  name               = "test1"

  lifecycle {
    create_before_destroy = true
  }
}

Plan for name change of role - Labels modified for readability
Using the original HCL at the top of this thread, a role name change gives this plan, which does NOT make destruction of the role wait for successful Lambda modification:
Thanks for the excellent analysis! I see exactly what you mean now, and since in most cases the effect is the same, it's gone unnoticed for quite a while. Just recording a minimal example here for reference:
After the following commands, there should still be a deposed resource instance left in the state.
Since the CBD graph does not take transitive dependencies through outputs and locals into account, the dependency inversion that CBD creates isn't extended up through modules.
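The minimal example itself isn't captured here. Purely as an illustration (the module path, variable name, and use of null_resource are assumed, not taken from the original example), a provider-free configuration that exercises the same code path could look like this:

# ./cbd_module/main.tf
variable "name" {}

# Replaced whenever var.name changes; with create_before_destroy the old
# instance becomes deposed and is destroyed after the replacement exists.
resource "null_resource" "a" {
  triggers = {
    name = "${var.name}"
  }

  lifecycle {
    create_before_destroy = true
  }
}

output "id" {
  value = "${null_resource.a.id}"
}

# ./main.tf
module "a" {
  source = "./cbd_module"
  name   = "test1"
}

# Depends on the module's resource only through the module output.
resource "null_resource" "b" {
  triggers = {
    upstream_id = "${module.a.id}"
  }
}

If the CBD inversion were carried through the module output, the deposed null_resource.a would only be destroyed after null_resource.b had been replaced to reference the new id; per the analysis above, the destroy is instead scheduled independently of b.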
@jbardin, any thoughts on this? Maybe something that'll be fixed in 0.12? This is the root cause of hashicorp/terraform-provider-google#1448. Happy to provide debug logs and the like, though it seems like the problem is pretty well-defined at this point. cc @brandoconnor since we were debugging this earlier today :)
Hi @danawillow, Do I have thoughts!
Are there any plans to fix this? Looks like it still exists in 0.12.
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Consider the below configuration. When running terraform apply on a variant of this configuration that requires the IAM role to be replaced (a name change will suffice), it performs IAM role create, IAM role destroy, then Lambda update. However, if the below configuration is changed so that the IAM role is declared within the same module as the Lambda function, the actions carried out are IAM role creation, Lambda function update, then IAM role deletion, as expected. This leaves a period in which the role reference from the Lambda to IAM is dangling. This is made more obvious by the race conditions in IAM that can lead to the Lambda update failing, only to succeed later on retry. In the intervening period, the Lambda is defunct, as it still points to a role that has already been destroyed.
Terraform Version
Terraform v0.11.5
Terraform Configuration Files
/main.tf
/role_module/main.tf
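The contents of these two files aren't captured here. As an illustrative sketch only, reconstructed from the single-module example earlier in the thread (the module name, variable name, and output name are assumed), the split layout might look something like this:

# /main.tf (sketch)
provider "aws" {
  region = "ap-southeast-2"
}

data "archive_file" "lambda_content" {
  source_file = "${path.module}/my_test_function.py"
  output_path = "${path.module}/my_test_function.zip"
  type        = "zip"
}

module "role" {
  source    = "./role_module"
  role_name = "test2"
}

resource "aws_lambda_function" "my_test_function" {
  function_name    = "my_test_function"
  handler          = "my_test_function.lambda_handler"
  role             = "${module.role.role_arn}"
  runtime          = "python2.7"
  filename         = "${data.archive_file.lambda_content.output_path}"
  source_code_hash = "${data.archive_file.lambda_content.output_base64sha256}"
  publish          = true
}

# /role_module/main.tf (sketch)
variable "role_name" {}

data "aws_iam_policy_document" "trust_relationship_policy" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "role" {
  assume_role_policy = "${data.aws_iam_policy_document.trust_relationship_policy.json}"
  name               = "${var.role_name}"

  lifecycle {
    create_before_destroy = true
  }
}

output "role_arn" {
  value = "${aws_iam_role.role.arn}"
}

Changing role_name (for example "test2" -> "test1") then forces replacement of the role behind the module output that the Lambda references.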
Terraform Apply Output - For role name change "test2" -> "test1"
Please note that the IAM race condition which causes the failure of the Lambda update is not the subject of this issue (I've raised it in #3972). It's only an example of an intermediate failure that halts the infrastructure modification at the point of the dangling dependency.
Crash Output
Expected Behavior
Using create_before_destroy = true on a dependency of a resource causes a new copy of the dependency to be created before the old one is destroyed. It's expected that the dependent resource's reference to it is updated between re-creation and destruction of the dependency.
Actual Behavior
The new dependency is created and the old one destroyed, both before the dependent resource's reference to it is updated.
Steps to Reproduce