
Headscale service restart timing out #1574

Closed
lgrn opened this issue Oct 12, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@lgrn
Contributor

lgrn commented Oct 12, 2023

Bug description

After running systemctl restart headscale:

headscale[7497]: 2023-10-12T09:05:33Z INF Received signal to stop, shutting down gracefully signal=terminated
headscale[7497]: 2023-10-12T09:05:33Z INF The long-poll handler is shutting down handler=PollNetMapStream (...)
headscale[7497]: 2023-10-12T09:05:33Z INF The long-poll handler is shutting down handler=PollNetMapStream (...)
headscale[7497]: 2023-10-12T09:05:33Z INF The long-poll handler is shutting down handler=PollNetMapStream (...)
headscale[7497]: 2023-10-12T09:05:33Z INF Headscale stopped
systemd[1]: Stopping headscale coordination server for Tailscale...

These types of logs show up immediately afterward:

headscale[7497]: 2023-10-12T09:05:33Z ERR Failed to fetch machine from the database with node key: nodekey: (...)
(...)
headscale[7497]: 2023-10-12T09:05:36Z ERR error getting routes error="sql: database is closed"
headscale[7497]: 2023-10-12T09:05:36Z ERR Error listing users error="sql: database is closed"

This finally ends with:

Oct 12 09:07:03 (...) systemd[1]: headscale.service: State 'stop-sigterm' timed out. Killing.
Oct 12 09:07:03 (...) systemd[1]: headscale.service: Killing process 7497 (headscale) with signal SIGKILL.
Oct 12 09:07:03 (...) systemd[1]: headscale.service: Killing process 7499 (headscale) with signal SIGKILL.
(...)
Oct 12 09:07:03 (...) systemd[1]: headscale.service: Main process exited, code=killed, status=9/KILL
Oct 12 09:07:03 (...) systemd[1]: headscale.service: Failed with result 'timeout'.
Oct 12 09:07:03 (...) systemd[1]: Stopped headscale coordination server for Tailscale.
(...)

So for some reason SIGKILL is finally required and the restart takes about 90 seconds.
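For reference, ~90 seconds matches systemd's default stop timeout (DefaultTimeoutStopSec=90s), so it looks like the service simply runs into that default. Assuming nothing overrides it, the effective value can be checked like this (the second line is the expected output for the 90s default):

systemctl show headscale.service --property=TimeoutStopUSec
TimeoutStopUSec=1min 30s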

Is this a known issue?

Environment

# headscale version
v0.22.3
# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"
# uname -a
Linux (...) 5.15.0-86-generic #96-Ubuntu SMP Wed Sep 20 08:23:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Headscale is behind a (reverse) proxy
  • Headscale runs in a container

To Reproduce

N/A

@lgrn added the bug label Oct 12, 2023
@harvey-git

Judging from the error log, the main process cannot fetch machine information from the database. Since headscale uses an SQLite database, this error suggests three possibilities: 1. the database is corrupted; 2. another process is holding a lock on the headscale database; 3. the database file permissions have changed. If headscale starts normally after the restart, you can rule out 1 and 3, which leaves the headscale process failing to shut down and exit cleanly when it is restarted.
Is your server hardware configuration very low-spec?
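A quick way to check possibility 2 is to look at whether any other process has the database file open. This assumes the default database path /var/lib/headscale/db.sqlite; adjust to whatever db_path is set in your config:

lsof /var/lib/headscale/db.sqlite
fuser -v /var/lib/headscale/db.sqlite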

@lgrn
Contributor Author

lgrn commented Oct 12, 2023

Hi @harvey-git ,

Headscale restarts just fine, and it seems to work. My interpretation of the logs is that the headscale process for some reason hangs around after receiving SIGTERM from systemd and attempts to access the database, even though the database is closed (presumably because it's closed as a part of shutting down headscale?) -- until systemd loses patience and kills it.
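One diagnostic idea (just a suggestion, not something from the headscale docs): headscale is a Go program, so sending the hung process SIGQUIT makes the Go runtime dump all goroutine stack traces before exiting, which should show what the shutdown is actually blocked on. Note that this does terminate the process; journald captures the dump:

kill -QUIT $(systemctl show headscale --property=MainPID --value)
journalctl -u headscale -n 200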

I'm not sure if it's related, but I've also noticed that operations like deleting a route (headscale route delete -r N) also take a pretty long time, maybe 5-10 seconds. That feels more like database performance, though, whereas this issue is about the process not actually terminating and continuing to access the (already closed) database.

Regarding the specs, this is a VM running in OpenStack with 2 cores, 4G of memory and nvme backed storage.

In terms of a workaround, I'm not sure about the impact (i.e. whether it's safe to kill it sooner) but restarts can be done faster by setting TimeoutStopSec:

systemctl edit headscale

Add:

[Service]
TimeoutStopSec=5

Save and do systemctl daemon-reload.
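systemctl edit saves that snippet as a drop-in at /etc/systemd/system/headscale.service.d/override.conf, and you can confirm it took effect with:

systemctl cat headscale
systemctl show headscale --property=TimeoutStopUSec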

Having to resort to SIGKILL at all doesn't feel great, but I'm not sure it matters much whether it happens after a 5s or a 90s timeout.

Tangentially I also backed up the .sqlite database, opened it up and ran:

sqlite> VACUUM;
sqlite> PRAGMA integrity_check;
ok

This seems to have made commands like headscale users list a lot faster, but I can't say for sure. It's odd that it would, because there isn't a lot of data there (about 60K).
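In case anyone wants to try the same thing, roughly the sequence I mean, with paths assuming the default /var/lib/headscale/db.sqlite and headscale stopped first so nothing writes to the database while it's vacuumed:

systemctl stop headscale
sqlite3 /var/lib/headscale/db.sqlite ".backup '/var/lib/headscale/db.sqlite.bak'"
sqlite3 /var/lib/headscale/db.sqlite "VACUUM;"
sqlite3 /var/lib/headscale/db.sqlite "PRAGMA integrity_check;"
systemctl start headscale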

@bryanjhv

Same as #1461 I think, should be fixed by #1480 in v0.23.0-alpha1.

@lgrn
Contributor Author

lgrn commented Oct 16, 2023

I can confirm that headscale.service seems to restart instantly on v0.23.0-alpha1 -- thanks!

@lgrn closed this as completed Oct 16, 2023
@joshuataylor
Contributor

Wanted to chip in to say that a smaller timeout is acceptable in this case. As noted, systemd waits a long time (120s) for shutdown of this and other units as well; this can be changed globally or per unit (do it per unit on a server, especially if you run other things that should NOT be killed after 5 seconds :)).

The KDE Wiki also recommends this (for desktops, not servers -- you probably do NOT want this globally on a server).
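For completeness: the per-unit approach is the systemctl edit override shown earlier in the thread, while the global default lives in /etc/systemd/system.conf (or a drop-in under /etc/systemd/system.conf.d/). A sketch of the global variant, which is the one you generally do NOT want on a server:

# /etc/systemd/system.conf
[Manager]
DefaultTimeoutStopSec=5s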

@MulverineX

I'm still experiencing this issue in the 0.23 alpha

@lgrn
Contributor Author

lgrn commented Jul 22, 2024

> I'm still experiencing this issue in the 0.23 alpha

It looks like the beta was released today, try that first.

@SuperSandro2000
Contributor

I am on 0.23.0-beta1 and also still experience this issue.
