
Unable to destroy VM during first terraform destroy run: unexpected state 'RUNNING', wanted target 'DONE' #132

Closed
meise opened this issue Jul 15, 2021 · 5 comments · Fixed by #302



meise commented Jul 15, 2021

Error Description

In my Terraform/OpenNebula combination it is not possible to destroy an opennebula_virtual_machine during the first terraform destroy run. Terraform always fails with the error message: Error: Error waiting for virtual machine (546) to be in state DONE: unexpected state 'RUNNING', wanted target 'DONE'. last error: %!s(<nil>) (state: ACTIVE, lcmState: RUNNING). I have to run terraform destroy multiple times to delete all VM resources.

Code snippets

As far as I can understand the code, after the 10s destroy wait delay is reached, waitForVMState is expected to retry.

But in my case, vmState == vm.Active && vmLcmState == vm.EpilogFailure is always false.

if vmState == vm.Active && vmLcmState == vm.EpilogFailure {

Versions

OpenNebula: 5.12.0.3
opennebula-module: 0.3.0
terraform: 1.0.1

Resource

resource "opennebula_virtual_machine" "primary" {
  count       = var.primary_nodes
  name        = "primary-${random_string.primary_node_name[count.index].result}.${var.cluster_fqdn}"
  cpu         = 1
  vcpu        = 2
  memory      = 2048

  context = {
    NETWORK = "YES"
    SET_HOSTNAME = "$NAME"
  }

  graphics {
    type   = "VNC"
    listen = "0.0.0.0"
    keymap = "de"
  }

  os {
    arch = "x86_64"
    boot = "disk0"
  }

  disk {
    image_id = var.image_id
    size     = 10000
    target   = "vda"
    driver   = "raw"
  }

  nic {
    network_id      = var.network_id
    security_groups = [opennebula_security_group.k8s-primary.id]
  }

  tags = {
    role         = "primary_node"
    node_type_id = count.index
    environment  = "dev"
  }
}

Error message

Error: Error waiting for virtual machine (546) to be in state DONE: unexpected state 'RUNNING', wanted target 'DONE'. last error: %!s(<nil>) (state: ACTIVE, lcmState: RUNNING)

Log

destroying_vm.log

meise (Author) commented Jul 15, 2021

But in my case, vmState == vm.Active && vmLcmState == vm.EpilogFailure is always false.

I added some more debug output to the module. The following values are evaluated:
vmState: '3' == vm.Active: '3' && vmLcmState: '3' == vm.EpilogFailure: '40'
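
For reference, here is a minimal, standalone sketch that reproduces this comparison (assuming the goca schemas/vm import path and the constant names used by the provider; this is not the provider's actual debug code):

package main

import (
	"fmt"

	vm "github.com/OpenNebula/one/src/oca/go/src/goca/schemas/vm"
)

func main() {
	// Values observed in the debug output above.
	vmState := vm.Active     // State 3 (ACTIVE)
	vmLcmState := vm.Running // LCMState 3 (RUNNING)

	// The check quoted from the provider: it is only true when the VM failed
	// during EPILOG, never while the VM is still RUNNING.
	if vmState == vm.Active && vmLcmState == vm.EpilogFailure {
		fmt.Println("EPILOG_FAILURE branch taken")
	} else {
		fmt.Printf("branch skipped: state=%d, lcmState=%d (EpilogFailure=%d)\n",
			vmState, vmLcmState, vm.EpilogFailure)
	}
}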

jaypif (Collaborator) commented Jul 15, 2021

Hi @meise ,

Thank you for your issue.

In the attached log I cannot see all the retry attempts, even though the default timeout is 3 minutes.
Is the destroy failing after 3 minutes, or directly after the first 10 seconds?

It looks like the VM is still active after the terminate hard command.

Thanks

meise (Author) commented Jul 15, 2021

Thank you @jaypif for your quick response.

The destroy fails after 10 seconds. Maybe our cluster is just slower than the setups of other users of the provider.
Is it possible to increase the wait delay in any way?

@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 5 days

frousselet added this to the v0.5.1 milestone Jun 15, 2022
treywelsh (Collaborator) commented Jun 15, 2022

I reproduced the problem by delaying the terminate action (I put it in a goroutine that starts with a big sleep of at least 10 seconds).
The goal was to keep the VM in the RUNNING state to simulate a slow OpenNebula setup, and to see what happens at the first waitForVMState check.
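
Roughly sketched, the delayed terminate looked like this (an illustrative reconstruction, not the exact code used; it assumes the goca VMController and its TerminateHard method, and the helper name is hypothetical):

import (
	"log"
	"time"

	"github.com/OpenNebula/one/src/oca/go/src/goca"
)

// delayedTerminate fires the hard terminate only after a sleep longer than the
// 10s Delay of the destroy wait, so the VM is still RUNNING when waitForVMState
// runs its first check.
func delayedTerminate(vmc *goca.VMController) {
	go func() {
		time.Sleep(15 * time.Second)
		if err := vmc.TerminateHard(); err != nil {
			log.Printf("[WARN] delayed terminate failed: %s", err)
		}
	}()
}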

At first glance, there are two interesting fields in the implementation of waitForVMState:

func waitForVMState(vmc *goca.VMController, timeout int, states ...string) (interface{}, error) {

	stateConf := &resource.StateChangeConf{
		Pending: []string{"anythingelse"},
		...
		Delay:      10 * time.Second,
		...
	}

	return stateConf.WaitForState()
}

From the doc:
Delay: Wait this time before starting checks
Pending: States that are "allowed" and will continue trying

From here, I tried adding RUNNING to the Pending field, and this seems to work.

However, it's a quick fix, so I still need to investigate more to propose something better.
There are a lot of VM states and transitions, and I want to find a compromise: something that fixes this problem in a simple way, i.e. without enumerating a bunch of states (see the states reference for ONE 6.4: https://docs.opennebula.io/6.4/integration_and_development/references/vm_states.html#vm-states).
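
For reference, a minimal sketch of that quick fix, assuming the terraform-plugin-sdk v2 resource helper; the function name and simplified signature here are hypothetical, and the Refresh function (the one already used by waitForVMState) is passed in rather than shown:

import (
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
)

// waitForVMStateSketch tolerates RUNNING while waiting for the target state,
// so a VM that is still being terminated keeps the wait loop retrying until
// Timeout instead of failing at the first check after the 10s Delay.
func waitForVMStateSketch(refresh resource.StateRefreshFunc, timeout time.Duration, states ...string) (interface{}, error) {
	stateConf := &resource.StateChangeConf{
		// "RUNNING" added to Pending: it is now an allowed intermediate state.
		Pending:    []string{"anythingelse", "RUNNING"},
		Target:     states,
		Refresh:    refresh,
		Timeout:    timeout,
		Delay:      10 * time.Second,
		MinTimeout: 3 * time.Second,
	}

	return stateConf.WaitForState()
}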

treywelsh mentioned this issue Jun 24, 2022
treywelsh added a commit to treywelsh/terraform-provider-opennebula that referenced this issue Jun 24, 2022
treywelsh added a commit that referenced this issue Jun 29, 2022
frousselet pushed a commit that referenced this issue Jun 30, 2022