
Headscale service restart timing out #1574

Closed
lgrn opened this issue Oct 12, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@lgrn
Contributor

lgrn commented Oct 12, 2023

Bug description

After running systemctl restart headscale:

headscale[7497]: 2023-10-12T09:05:33Z INF Received signal to stop, shutting down gracefully signal=terminated
headscale[7497]: 2023-10-12T09:05:33Z INF The long-poll handler is shutting down handler=PollNetMapStream (...)
headscale[7497]: 2023-10-12T09:05:33Z INF The long-poll handler is shutting down handler=PollNetMapStream (...)
headscale[7497]: 2023-10-12T09:05:33Z INF The long-poll handler is shutting down handler=PollNetMapStream (...)
headscale[7497]: 2023-10-12T09:05:33Z INF Headscale stopped
systemd[1]: Stopping headscale coordination server for Tailscale...

These types of logs show up immediately afterward:

headscale[7497]: 2023-10-12T09:05:33Z ERR Failed to fetch machine from the database with node key: nodekey: (...)
(...)
headscale[7497]: 2023-10-12T09:05:36Z ERR error getting routes error="sql: database is closed"
headscale[7497]: 2023-10-12T09:05:36Z ERR Error listing users error="sql: database is closed"

This finally ends with:

Oct 12 09:07:03 (...) systemd[1]: headscale.service: State 'stop-sigterm' timed out. Killing.
Oct 12 09:07:03 (...) systemd[1]: headscale.service: Killing process 7497 (headscale) with signal SIGKILL.
Oct 12 09:07:03 (...) systemd[1]: headscale.service: Killing process 7499 (headscale) with signal SIGKILL.
(...)
Oct 12 09:07:03 (...) systemd[1]: headscale.service: Main process exited, code=killed, status=9/KILL
Oct 12 09:07:03 (...) systemd[1]: headscale.service: Failed with result 'timeout'.
Oct 12 09:07:03 (...) systemd[1]: Stopped headscale coordination server for Tailscale.
(...)

So for some reason SIGKILL is finally required and the restart takes about 90 seconds.
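For reference, ~90 seconds matches systemd's default stop timeout (DefaultTimeoutStopSec=90s), so it looks like the service simply runs into that default. Assuming nothing overrides it, the effective value can be checked like this (the second line is the expected output for the 90s default):

systemctl show headscale.service --property=TimeoutStopUSec
TimeoutStopUSec=1min 30s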

Is this a known issue?

Environment

# headscale version
v0.22.3
# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"
# uname -a
Linux (...) 5.15.0-86-generic #96-Ubuntu SMP Wed Sep 20 08:23:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Headscale is behind a (reverse) proxy
  • Headscale runs in a container

To Reproduce

N/A

@lgrn added the bug label Oct 12, 2023
@harvey-git

Judging from the error log, the main process cannot fetch machine information from the database. Since headscale uses an SQLite database, this error suggests three possibilities: 1. the database is corrupted; 2. another process is holding a lock on the headscale database; 3. the database file permissions have changed. If headscale starts normally after the restart, you can rule out 1 and 3, which leaves the headscale process failing to shut down and exit cleanly when it is restarted.
Is your server hardware configuration very low-spec?
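A quick way to check possibility 2 is to look at whether any other process has the database file open. This assumes the default database path /var/lib/headscale/db.sqlite; adjust to whatever db_path is set in your config:

lsof /var/lib/headscale/db.sqlite
fuser -v /var/lib/headscale/db.sqlite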

@lgrn
Contributor Author

lgrn commented Oct 12, 2023

Hi @harvey-git ,

Headscale restarts just fine, and it seems to work. My interpretation of the logs is that the headscale process for some reason hangs around after receiving SIGTERM from systemd and attempts to access the database, even though the database is closed (presumably because it's closed as a part of shutting down headscale?) -- until systemd loses patience and kills it.
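One diagnostic idea (just a suggestion, not something from the headscale docs): headscale is a Go program, so sending the hung process SIGQUIT makes the Go runtime dump all goroutine stack traces before exiting, which should show what the shutdown is actually blocked on. Note that this does terminate the process; journald captures the dump:

kill -QUIT $(systemctl show headscale --property=MainPID --value)
journalctl -u headscale -n 200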

I'm not sure if it's related, but I've also noticed that operations like deleting a route (headscale route delete -r N) also take a pretty long time, maybe 5-10 seconds. That feels more like database performance, though, whereas this issue is about the process not actually terminating and continuing to access the (already closed) database.

Regarding the specs, this is a VM running in OpenStack with 2 cores, 4G of memory and nvme backed storage.

In terms of a workaround, I'm not sure about the impact (i.e. whether it's safe to kill it sooner) but restarts can be done faster by setting TimeoutStopSec:

systemctl edit headscale

Add:

[Service]
TimeoutStopSec=5

Save and do systemctl daemon-reload.
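systemctl edit saves that snippet as a drop-in at /etc/systemd/system/headscale.service.d/override.conf, and you can confirm it took effect with:

systemctl cat headscale
systemctl show headscale --property=TimeoutStopUSec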

Having to resort to SIGKILL at all doesn't feel great, but I'm not sure it matters much whether it happens after a 5s or a 90s timeout.

Tangentially I also backed up the .sqlite database, opened it up and ran:

sqlite> VACUUM;
sqlite> PRAGMA integrity_check;
ok

This seems to have made commands like headscale users list a lot faster, but I can't say for sure. It's odd that it would, because there isn't a lot of data there (about 60K).
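In case anyone wants to try the same thing, roughly the sequence I mean, with paths assuming the default /var/lib/headscale/db.sqlite and headscale stopped first so nothing writes to the database while it's vacuumed:

systemctl stop headscale
sqlite3 /var/lib/headscale/db.sqlite ".backup '/var/lib/headscale/db.sqlite.bak'"
sqlite3 /var/lib/headscale/db.sqlite "VACUUM;"
sqlite3 /var/lib/headscale/db.sqlite "PRAGMA integrity_check;"
systemctl start headscale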

@bryanjhv

Same as #1461 I think, should be fixed by #1480 in v0.23.0-alpha1.

@lgrn
Contributor Author

lgrn commented Oct 16, 2023

I can confirm that headscale.service seems to restart instantly on v0.23.0-alpha1 -- thanks!

@lgrn closed this as completed Oct 16, 2023
@joshuataylor
Contributor

Wanted to chip in to say that a smaller timeout is acceptable in this case. As noted, systemd waits a long time (120s) for shutdown of this and other units as well; this can be changed globally or per unit (do it per unit on a server, especially if you run other things that should NOT be killed after 5 seconds :)).

The KDE Wiki also recommends this (for desktops, not servers -- you probably do NOT want this globally on a server).
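For completeness: the per-unit approach is the systemctl edit override shown earlier in the thread, while the global default lives in /etc/systemd/system.conf (or a drop-in under /etc/systemd/system.conf.d/). A sketch of the global variant, which is the one you generally do NOT want on a server:

# /etc/systemd/system.conf
[Manager]
DefaultTimeoutStopSec=5s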

@MulverineX

I'm still experiencing this issue in the 0.23 alpha

@lgrn
Contributor Author

lgrn commented Jul 22, 2024

> I'm still experiencing this issue in the 0.23 alpha

It looks like the beta was released today, try that first.

@SuperSandro2000
Contributor

I am on 0.23.0-beta1 and also still experience this issue.
