Skip to content
This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

server: fix panic on graceful shutdown #1496

Merged
merged 10 commits into from
Mar 10, 2016

Conversation

antrik
Copy link
Contributor

@antrik antrik commented Mar 10, 2016

Retry of #1452 , because Semaphore.

@antrik antrik force-pushed the antrik/fix-shutdown-rebased branch from 4271638 to 6649c99 Compare March 10, 2016 15:31
@antrik
Copy link
Contributor Author

antrik commented Mar 10, 2016

Weird: Semaphore did check the latest commit, but it doesn't show up in the status box...

jonboulle and others added 10 commits March 10, 2016 16:40
The Server has a global stop channel which is used both internally (by
Monitor) and externally (by the Stop method) to shut down the server.
This is bad; invoking the method simultaneously by multiple goroutines
is not safe (and as seen in coreos#1044 can cause panics due to a
doubly-closed channel).

This change centralises the shutdown procedure through the Monitor, so
that when an external user of the Server wants to shut it down, it
triggers an error to propagate up from the monitor. Hence there is only
a single path in which the stopchannel (which terminates all other
Server goroutines) can be called.
There are three different paths in the main fleetd goroutine that can
access the global `srv` Server - reconfigurations, shutdowns and
statedumps. Right now there's nothing preventing racy access to this
instance, so introduce a mutex to protect it.

One potential issue with this is that it means that a reconfigure or
state dump can "block" a shutdown, but IMHO if this occurs it will
expose behaviour that is broken and needs to be fixed anyway.
- add all background server components to a WaitGroup
- when shutting down the server, wait on this group or until a timeout
  (defaulting to one minute) before restarting or exiting.
- if timeout occurs, shut down hard and let a
- move Monitor into server package
- Server.Monitor -> Server.Supervise to remove ambiguity/duplication
Channels that are just used to "broadcast" messages (e.g. they are only
ever closed) do not need a type; it is better to be more explicit about
this by using a struct{}.

Similarly, the channels can be receive-only.
To make things a little clearer for ol' man Crawford, rename the "Stop"
function to "Kill" to align better with the channel names and be a
little more explicit that it is invoked in response to a kill signal.
Before the monitor/shutdown handling rework, fleetd would panic with a
channel double-close if it was shut down while a restart (triggered by
the monitor on a failed health check) was is progress.

Added a functional test for this situation -- and also for regular
shutdown as a baseline.
The default agent_ttl setting of 30s makes TestMonitorVsShutdown run
pretty long. (And probably also other tests related to connectivity
loss, that will be added in the future...)

With no actual network delays involved, making the timeout much shorter
shouldn't cause any trouble -- so we can simply set it globally, rather
than trying to override it only for the specific test in question.
The function moved to a different package, but otherwise can be used as
before -- I presume inlining it was some kind of accident...
@antrik antrik force-pushed the antrik/fix-shutdown-rebased branch from 6649c99 to 40691b5 Compare March 10, 2016 15:41
@tixxdz
Copy link
Contributor

tixxdz commented Mar 10, 2016

Lgtm for the last 3 patches. I would perhaps argue that we increase the timeout a little bit, but @antrik if you are sure that all tests did run correctly and it does not have any other side effect, then it's probably ok.

@jonboulle I'm a bit busy with the other replace issue, but if you think that we should take a closer look again, ok. Thank you!

@jonboulle jonboulle changed the title Shutdown fixes/improvements, take 2 server: fix panic on graceful shutdown Mar 10, 2016
jonboulle added a commit that referenced this pull request Mar 10, 2016
@jonboulle jonboulle merged commit 83c4099 into coreos:master Mar 10, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants