Very frequent freezing of FreeBSD VM during teardown #61

Closed
kobalicek opened this issue Aug 6, 2023 · 36 comments

Comments

@kobalicek

I have been experiencing very frequent freezing during teardown lately with FreeBSD VMs.

I was using xhyve with FreeBSD 13.2.

For example these two consecutive runs failed every time for the same reason:

I'm not sure what to do, because my builds are basically failing due to these issues. Temporarily I switched to QEMU virtualization and that seems to be more stable in my case.

@jacob-carlborg
Contributor

Does this seem to be the same issue as #29?

@kobalicek
Author

kobalicek commented Aug 6, 2023

Hard to say - but for me it was rather deterministic - basically it was a miracle if the build finished successfully.

I think the running script either doesn't know the VM got killed, or the kill itself is blocking for some reason. I had to manually kill these builds after 5 hours.

I was also wondering if there is rsync that syncs back files from VM to the host - cannot this be possibly turned off as well? I don't need that syncing - maybe this is what blocks forever?

@jacob-carlborg
Contributor

jacob-carlborg commented Aug 19, 2023

I forked Blend2D to be able to debug the CI workflow more easily. I disabled all matrix entries except for FreeBSD and switched to a macOS runner. So far I have not been able to reproduce the issue. Here's an example of a CI run [1]. As you can see, I've run it five times. Did I do something wrong?

I was also wondering if there is rsync that syncs back files from VM to the host

Yes, the action syncs back files to the host.

cannot this be possibly turned off as well

I guess I can add an option for that. Could you please create a separate issue for this?

maybe this is what blocks forever

In your first example, that is what seems to be happening. In your second example it gets stuck after syncing; it's the final forced shutdown of the VM that times out. As an extra precaution, the action kills the VM in case it fails to shut down using the regular shutdown command. You can easily see this in the CI log by enabling timestamps.

[1] https://github.com/cross-platform-actions/blend2d/actions/runs/5899978065

@jacob-carlborg
Contributor

BTW, I see that you run all the *BSD workflows on Linux. I recommend running on macOS instead because it supports hardware accelerated nested virtualization, which the Linux runners don't. You can force using QEMU as the hypervisor on macOS using the hypervisor input [1].

[1] https://github.com/cross-platform-actions/action#inputs
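
For illustration, a minimal workflow sketch of that setup: a FreeBSD job on a macOS runner, with QEMU forced through the `hypervisor` input. The job name, the `macos-latest` runner label, the checkout step, the version pin and the `run` command are assumptions made for this example and are not taken from the thread; the inputs documentation linked above is the authority on accepted values.

```yaml
# Hypothetical job: names, runner label and version pins are placeholders.
jobs:
  freebsd:
    runs-on: macos-latest            # assumed runner label
    steps:
      - uses: actions/checkout@v4
      - name: Build inside a FreeBSD VM
        uses: cross-platform-actions/action@v0.19.1   # a version mentioned in this thread
        with:
          operating_system: freebsd
          version: '13.2'
          hypervisor: qemu           # force QEMU instead of the default hypervisor on macOS
          run: |
            uname -a
```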

@kobalicek
Author

I have changed the runners to use Linux and QEMU as that seemed to be more stable in my case.

@kobalicek
Author

BTW I have changed the runners to use macOS, but that doesn't solve the issue. It seems this doesn't really matter at all. I get very frequent build failures on FreeBSD because of this teardown issue.

BTW I know that there is a sync process to sync files back from VM after the run, cannot this be the source of the problem? Can this be possibly disabled by an option to avoid syncing back if I don't need that functionality?

@jacob-carlborg
Contributor

BTW I know that there is a sync process to sync files back from VM after the run, cannot this be the source of the problem?

I guess that's likely if it fails when syncing back the files.

Can this be possibly disabled by an option to avoid syncing back if I don't need that functionality?

Yeah, I guess so.

@chipsenkbeil

chipsenkbeil commented Oct 17, 2023

@jacob-carlborg I've also noticed freezing on 0.19.1. When it doesn't freeze, I get an error related to syncing files back. Is there a way to disable syncing back to the host or ignore if the teardown fails?

https://github.com/chipsenkbeil/service-manager-rs/actions/runs/6553345333/job/17798792946

image

@jacob-carlborg
Contributor

@chipsenkbeil in your case there's a clear error message. There are some files it doesn't have access to read.

@jacob-carlborg
Contributor

@kobalicek could you please try to enable debug output by setting the following variables: ACTIONS_RUNNER_DEBUG and ACTIONS_STEP_DEBUG. You can set them in the repository settings -> Security -> Secrets and variables -> Actions. Set the value to true. This will add the verbose flag to rsync, which might show some more information.

@chipsenkbeil

@chipsenkbeil in your case there's a clear error message. There are some files it doesn't have access to read.

Yes, and it's interesting because these are files generated by a compiler when running a build command. As a user, I wasn't expecting to encounter an error like this. I suppose my only option is to delete them before finishing because otherwise this fails to sync.

The freezing happens the majority of the time, which is why I flagged it here as a potential, reproducible situation. Will try to delete before teardown and see if that helps.

@kobalicek
Author

Today's failure looks like this:

Downloading disk image: https://github.com/cross-platform-actions/freebsd-builder/releases/download/v0.5.0/freebsd-13.2-x86-64.qcow2
  Downloading hypervisor: https://github.com/cross-platform-actions/resources/releases/download/v0.9.1/xhyve-macos.tar
  Downloading resources: https://github.com/cross-platform-actions/resources/releases/download/v0.9.1/resources-macos.tar
  /usr/bin/ssh-keygen -t ed25519 -f /tmp/resourcesaj07Fg/id_ed25519 -q -N 
  /usr/sbin/mkfile -n 40m /tmp/resourcesaj07Fg/res.raw
  Downloaded file: /Users/runner/work/_temp/b6800e99-af4a-4aa0-a1df-b796dbb9cdc2
  /usr/sbin/diskutil partitionDisk /dev/disk2 1 GPT fat32 RES 100%
  Started partitioning on disk2
  Unmounting disk
  Creating the partition map
  Downloaded file: /Users/runner/work/_temp/44cc5c40-40cd-4230-84e6-97ff3dc92e3c
  Waiting for partitions to activate
  Formatting disk2s1 as MS-DOS (FAT32) with name RES
  512 bytes per physical sector
  /dev/rdisk2s1: 76594 sectors in 76594 FAT32 clusters (512 bytes/cluster)
  bps=512 spc=1 res=32 nft=2 mid=0xf8 spt=32 hds=16 hid=2048 drv=0x80 bsec=77824 bspf=599 rdcl=2 infs=1 bkbs=6
  Mounting disk
  Finished partitioning on disk2
  /usr/bin/sudo umount /Volumes/RES
  /usr/bin/hdiutil detach /dev/disk2
  hdiutil: couldn't eject "disk2" - Resource busy
  
  /Users/runner/work/_actions/cross-platform-actions/action/master/webpack:/cross-platform-action/node_modules/@actions/exec/lib/toolrunner.js:574
                  error = new Error(`The process '${this.toolPath}' failed with exit code ${this.processExitCode}`);
  ^
  Error: The process '/usr/bin/hdiutil' failed with exit code 16
      at ExecState._setResult (/Users/runner/work/_actions/cross-platform-actions/action/master/webpack:/cross-platform-action/node_modules/@actions/exec/lib/toolrunner.js:574:1)
      at ExecState.CheckComplete (/Users/runner/work/_actions/cross-platform-actions/action/master/webpack:/cross-platform-action/node_modules/@actions/exec/lib/toolrunner.js:557:1)
      at ChildProcess.<anonymous> (/Users/runner/work/_actions/cross-platform-actions/action/master/webpack:/cross-platform-action/node_modules/@actions/exec/lib/toolrunner.js:451:1)
      at ChildProcess.emit (node:events:513:28)
      at maybeClose (node:internal/child_process:1100:16)
      at Socket.<anonymous> (node:internal/child_process:458:11)
      at Socket.emit (node:events:513:28)
      at Pipe.<anonymous> (node:net:301:12)

I think in the end this must be related to syncing the files.

@kobalicek
Author

Which is related to #64

@jacob-carlborg
Contributor

I think in the end this must be related to syncing the files.

@kobalicek if you get the error "hdiutil: couldn't eject "disk2" - Resource busy", it's not related to syncing files. It occurs before the VM has even been started. The action creates a secondary hard drive with an SSH key on it. For some reason it fails to eject that hard drive before starting the VM.

@jacob-carlborg
Contributor

@chipsenkbeil, @kobalicek I've created a new release which adds support for disabling file syncing: https://github.com/cross-platform-actions/action/releases/tag/v0.20.0.
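
A sketch of what disabling the sync back might look like in a workflow step. The `sync_files` input and its `runner-to-vm` value are the ones a participant uses further down in this thread; the step name, OS version and command are placeholders, and the release notes linked above are the authority on accepted values.

```yaml
# Hypothetical step: only sync_files and its value come from this thread.
- name: Test on FreeBSD
  uses: cross-platform-actions/action@v0.20.0
  with:
    operating_system: freebsd
    version: '13.2'
    sync_files: runner-to-vm   # copy files into the VM only; skip the sync back to the host
    run: |
      ./run-tests.sh           # placeholder command
```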

@chipsenkbeil

@jacob-carlborg fantastic! Thanks for rolling this out so quickly 😄

@kobalicek
Author

I would close this one - I don't have this problem at the moment, I would open a new issue if I face a similar issue in the future.

@manxorist

I am also seeing this issue, I think. It happens most often with FreeBSD 12.4 for me.
See https://github.com/OpenMPT/openmpt/actions/runs/6665550333/job/18142362950 or https://github.com/OpenMPT/openmpt/actions/runs/6676795368/job/18146210087 for 2 failing runs.

@jacob-carlborg
Contributor

@kobalicek have you disabled file syncing? Ideally I would like to solve the issue without having to rely on disabling file syncing.

@jacob-carlborg
Contributor

jacob-carlborg commented Nov 1, 2023

@kobalicek @chipsenkbeil @manxorist I wonder if this could be related to how much memory the VM is using. I got another report that there might not be enough memory for the host (#68). Could you please try reducing the memory to see if there's a difference?

@manxorist

@jacob-carlborg
Using

        memory: 4G
        sync_files: runner-to-vm

I am still seeing hangs: https://github.com/OpenMPT/openmpt/actions/runs/6731368321/job/18295891777

@jacob-carlborg
Contributor

I am still seeing hangs: OpenMPT/openmpt/actions/runs/6731368321/job/18295891777

@manxorist that's disappointing. In this case it's hanging when shutting down the VM.

manxorist added a commit to OpenMPT/openmpt that referenced this issue Nov 4, 2023
Merged revision(s) 19903-19905 from trunk/OpenMPT:
[Mod] build: CI: GitHub: Try reducing CPA VM size to 4GB. See <cross-platform-actions/action#61 (comment)>.
........
[Mod] build: CI: GitHub: Try disabling syncing back files, which we do not need, for CPA builds. See <cross-platform-actions/action#65>.
........
[Mod] build: CI: GitHub: Update CPA to v0.21.1.
........
........


git-svn-id: https://source.openmpt.org/svn/openmpt/branches/OpenMPT-1.30@19907 56274372-70c3-4bfc-bfc3-4c3a0b034d27
manxorist added a commit to OpenMPT/openmpt that referenced this issue Nov 4, 2023
[Mod] build: CI: GitHub: Try reducing CPA VM size to 4GB. See <cross-platform-actions/action#61 (comment)>.
........
[Mod] build: CI: GitHub: Try disabling syncing back files, which we do not need, for CPA builds. See <cross-platform-actions/action#65>.
........
[Mod] build: CI: GitHub: Update CPA to v0.21.1.
........


git-svn-id: https://source.openmpt.org/svn/openmpt/branches/OpenMPT-1.31@19906 56274372-70c3-4bfc-bfc3-4c3a0b034d27
@manxorist

manxorist commented Nov 8, 2023

I switched FreeBSD to QEMU on macOS and the first 4 runs went without any problem so far. I will continue monitoring and report back if it indeed fixes the FreeBSD issue for me.

I also tried switching OpenBSD to QEMU on macOS, and I am seeing VM startup issues there. See #73.

jacob-carlborg added a commit that referenced this issue Nov 20, 2023
This will only mitigate the issue and doesn't fix the root cause. The
action doesn't shutdown the VM anymore. Since the action is run
inside a VM itself, everything will be cleaned up automatically.
Hopefully this will make the issue less likely to occur.
@jacob-carlborg
Contributor

@kobalicek @manxorist @chipsenkbeil I've created a branch that skips shutting down the VM and just lets the action exit: https://github.com/cross-platform-actions/action/tree/no-vm-shutdown. It would be great if anyone could give it a try to see if it helps. Unfortunately I haven't been able to find the root cause but this might mitigate some of the problem.

manxorist added a commit to OpenMPT/openmpt that referenced this issue Nov 21, 2023
manxorist added a commit to OpenMPT/openmpt that referenced this issue Nov 21, 2023
[Mod] build: CI: GitHub: Switch FreeBSD to experimental CPA no-vm-shutdown branch. See <cross-platform-actions/action#61 (comment)>.
........


git-svn-id: https://source.openmpt.org/svn/openmpt/branches/OpenMPT-1.31@19928 56274372-70c3-4bfc-bfc3-4c3a0b034d27
manxorist added a commit to OpenMPT/openmpt that referenced this issue Nov 21, 2023
[Mod] build: CI: GitHub: Switch FreeBSD to experimental CPA no-vm-shutdown branch. See <cross-platform-actions/action#61 (comment)>.
........


git-svn-id: https://source.openmpt.org/svn/openmpt/branches/OpenMPT-1.30@19929 56274372-70c3-4bfc-bfc3-4c3a0b034d27
@manxorist

manxorist commented Nov 21, 2023

@jacob-carlborg

I've created a branch that skips shutting down the VM and just lets the action exit: https://github.com/cross-platform-actions/action/tree/no-vm-shutdown.

I had 6 runs (3 times 13.2, 3 times 12.4) for now, all successful. So it appears to be a viable work-around.

@manxorist

manxorist commented Nov 21, 2023

Well, ignore the last comment. I got confused about the various configurations and tested macOS/QEMU instead of macOS/xhyve.

I will re-test.

manxorist added a commit to OpenMPT/openmpt that referenced this issue Nov 21, 2023
@chipsenkbeil

chipsenkbeil commented Nov 21, 2023

@kobalicek @manxorist @chipsenkbeil I've created a branch that skips shutting down the VM and just lets the action exit: https://github.com/cross-platform-actions/action/tree/no-vm-shutdown. It would be great if anyone could give it a try to see if it helps. Unfortunately I haven't been able to find the root cause but this might mitigate some of the problem.

I'll give it a try. Even with skipping the copying back of files, it was still hanging at times. What do I need to set after switching to this branch? Any specific flag?

@manxorist

2 times 13.2 and 2 times 12.4 for now, all successful.

@jacob-carlborg
Contributor

@chipsenkbeil no flags, it's automatic. If you look at the output you can verify whether it shuts down the VM or not. Here's an example where it doesn't shut down the VM [1]. In the next example [2], it does shut down the VM; you can see the output: Executing command inside VM: sudo shutdown -p now.

[1] https://github.com/cross-platform-actions/action/actions/runs/6928693321/job/18844968427#step:3:2046
[2] https://github.com/cross-platform-actions/action/actions/runs/6875647797/job/18699661036#step:3:2063

@chipsenkbeil

@jacob-carlborg switched over to the branch. Only one run thus far and it worked fine. Will jump in if it hangs again, but the repo using it has low volume of updates, so it may be a little while.

@manxorist

@kobalicek @manxorist @chipsenkbeil I've created a branch that skips shutting down the VM and just lets the action exit

As we already established in #67, the VMs are for the majority (or all) use cases non-persistent and throw-away anyway, so is there a reason for properly shutting them down in the first place?

I think for testability and correctness' sake, there should always be a mode available with proper file syncing barriers and proper shutdown in place, but in the default case, nobody cares what happens with the VM after the build files have (optionally) been synced back.

@jacob-carlborg
Contributor

is there a reason for properly shutting them down in the first place?

I was going to say "no, there's no reason", and I was planning to merge this branch regardless of whether it helps with this issue, because it would be a good change anyway: fewer things for the action to do means the job finishes sooner. But now I started thinking, what if a job performs some additional major steps after the VM step, then the VM will unnecessarily occupy resources like CPU and memory.

@manxorist

is there a reason for properly shutting them down in the first place?

But now I started thinking, what if a job performs some additional major steps after the VM step, then the VM will unnecessarily occupy resources like CPU and memory.

I guess that's a fair point that I did not consider. Still, for users who just care to run something like a test suite (my use case), it really does not matter what happens with the VM, and the whole action does nothing else after running things inside the VM. So a general option would probably be a good idea to have.

@jacob-carlborg
Contributor

So a general option would probably be a good idea to have.

Yes, I agree. Perhaps default to not shutting down the VM? I think the only step I have after the VM step is one that uploads binaries to a GitHub release.

@manxorist

So a general option would probably be a good idea to have.

Yes, I agree.

Perhaps default to not shutting down the VM?

Well, I think resource consumption for following steps is a valid concern and the default should be to properly shut down the VM, and skipping proper shutdown should only be optional.

@jacob-carlborg
Contributor

Well, I think resource consumption for following steps is a valid concern and the default should be to properly shut down the VM, and skipping proper shutdown should only be optional.

Hmm, I'm also thinking ahead to this feature request: #26. I'm trying to figure out what the API should look like. What you're suggesting would be the safest alternative, with no risk of breaking anything. But it would be more verbose if one were to use the action in multiple steps. I don't know how common that would be, or what to optimize for in the API.

jacob-carlborg added a commit that referenced this issue Dec 18, 2023
This is more efficient and this will hopefully mitigate very frequent
freezing of VM during teardown
(#61, #72).
korli added a commit to korli/action that referenced this issue Mar 15, 2024
Added
- Added support for using the action in multiple steps in the same job (cross-platform-actions#26).
    All the inputs need to be the same for all steps, except for the following
    inputs: `sync_files`, `shutdown_vm` and `run`.

- Added support for specifying that the VM should not shut down after the action
    has run. This adds a new input parameter: `shutdown_vm`. When set to `false`,
    this will hopefully mitigate very frequent freezing of VM during teardown (cross-platform-actions#61, cross-platform-actions#72).

Changed
- Always terminate VM instead of shutting down. This is more efficient and this
    will hopefully mitigate very frequent freezing of VM during teardown
    (cross-platform-actions#61, cross-platform-actions#72).

- Use `unsafe` as the cache mode for QEMU disks. This should improve performance (cross-platform-actions#67).
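
As a rough sketch of how the two added inputs might combine, assuming a release that includes them: two steps share the same VM inputs and differ only in `run`, `sync_files` and `shutdown_vm`, as the changelog entry describes. The `<version>` placeholder, the step names and the commands are hypothetical and must be replaced with real values; the `runner-to-vm` value is the one used earlier in this thread.

```yaml
# Hypothetical steps: <version> is a placeholder for a release that has shutdown_vm.
steps:
  - uses: actions/checkout@v4
  - name: Build in VM
    uses: cross-platform-actions/action@<version>
    with:
      operating_system: freebsd
      version: '13.2'
      shutdown_vm: false         # leave the VM running so the next step can reuse it
      run: make                  # placeholder command
  - name: Test in the same VM
    uses: cross-platform-actions/action@<version>
    with:
      operating_system: freebsd  # shared inputs must match the previous step
      version: '13.2'
      sync_files: runner-to-vm   # skip syncing files back to the host
      shutdown_vm: false         # skip the shutdown, as a mitigation for the teardown hang
      run: make test             # placeholder command
```

Leaving `shutdown_vm` at its default in the last step would instead free the VM's CPU and memory for any later steps, which is the trade-off discussed in the comments above.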