Add the ability to wait for a period of time after SIGTERM (grafana#3298)

* Add the ability to wait for a period of time after SIGTERM

Adds the ability to supply a "shutdown delay" to Mimir components such
that they will disable HTTP keep-alives and mark themselves as not
ready (via HTTP /ready or gRPC) when receiving SIGTERM or SIGINT, but
wait a configurable amount of time before actually stopping.

Fixes an issue during rollouts on Kubernetes where the Grafana Cloud
Gateway holds on to connections to query-frontends even when they
shut down, resulting in user-facing read errors. By closing connections
during shutdown, marking the component as "not ready", and still
continuing to serve requests, we ensure that:

* Users don't see any disruption
* Connections to the stopping component are not pooled
* Kubernetes service endpoints are removed before the pod stops

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Code review changes.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Handle shutdown inline in `Run` goroutine.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Phrasing.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters authored and mason committed Nov 4, 2022
1 parent 06775ea commit 6bae854
Showing 9 changed files with 92 additions and 31 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -18,6 +18,7 @@
 * [ENHANCEMENT] Ingester: reduced the memory footprint of active series custom trackers. #2568
 * [ENHANCEMENT] Distributor: Include `X-Scope-OrgId` header in requests forwarded to configured forwarding endpoint. #3283
 * [ENHANCEMENT] Alertmanager: reduced memory utilization in Mimir clusters with a large number of tenants. #3309
+* [ENHANCEMENT] Add experimental flag `-shutdown-delay` to allow components to wait after receiving SIGTERM and before stopping. In this time the component returns 503 from /ready endpoint. #3298
 * [BUGFIX] Flusher: Add `Overrides` as a dependency to prevent panics when starting with `-target=flusher`. #3151
 * [BUGFIX] Updated `golang.org/x/text` dependency to fix CVE-2022-32149. #3285
 * [BUGFIX] Query-frontend: properly close gRPC streams to the query-scheduler to stop memory and goroutines leak. #3302
11 changes: 11 additions & 0 deletions cmd/mimir/config-descriptor.json
@@ -35,6 +35,17 @@
   "fieldType": "string",
   "fieldCategory": "advanced"
 },
+{
+  "kind": "field",
+  "name": "shutdown_delay",
+  "required": false,
+  "desc": "How long to wait between SIGTERM and shutdown. After receiving SIGTERM, Mimir will report not-ready status via /ready endpoint.",
+  "fieldValue": null,
+  "fieldDefaultValue": 0,
+  "fieldFlag": "shutdown-delay",
+  "fieldType": "duration",
+  "fieldCategory": "experimental"
+},
 {
   "kind": "block",
   "name": "api",
2 changes: 2 additions & 0 deletions cmd/mimir/help-all.txt.tmpl
@@ -1871,6 +1871,8 @@ Usage of ./cmd/mimir/mimir:
     	Comma-separated list of cipher suites to use. If blank, the default Go cipher suites is used.
   -server.tls-min-version string
     	Minimum TLS version to use. Allowed values: VersionTLS10, VersionTLS11, VersionTLS12, VersionTLS13. If blank, the Go TLS minimum version is used.
+  -shutdown-delay duration
+    	[experimental] How long to wait between SIGTERM and shutdown. After receiving SIGTERM, Mimir will report not-ready status via /ready endpoint.
   -store-gateway.sharding-ring.consul.acl-token string
     	ACL Token used to interact with Consul.
   -store-gateway.sharding-ring.consul.cas-retry-delay duration
@@ -112,6 +112,11 @@ where `default_value` is the value to use if the environment variable is undefin
 # CLI flag: -auth.no-auth-tenant
 [no_auth_tenant: <string> | default = "anonymous"]

+# (experimental) How long to wait between SIGTERM and shutdown. After receiving
+# SIGTERM, Mimir will report not-ready status via /ready endpoint.
+# CLI flag: -shutdown-delay
+[shutdown_delay: <duration> | default = 0s]
+
 api:
   # (advanced) Allows to skip label name validation via
   # X-Mimir-SkipLabelNameValidation header on the http write path. Use with
2 changes: 1 addition & 1 deletion go.mod
@@ -16,7 +16,7 @@ require (
 	github.com/golang/snappy v0.0.4
 	github.com/google/gopacket v1.1.19
 	github.com/gorilla/mux v1.8.0
-	github.com/grafana/dskit v0.0.0-20221018134951-0d3fc3d6c266
+	github.com/grafana/dskit v0.0.0-20221026142359-210cad87c563
 	github.com/grafana/e2e v0.1.1-0.20221018202458-cffd2bb71c7b
 	github.com/hashicorp/golang-lru v0.5.4
 	github.com/json-iterator/go v1.1.12
4 changes: 2 additions & 2 deletions go.sum
@@ -480,8 +480,8 @@ github.com/gosimple/slug v1.1.1 h1:fRu/digW+NMwBIP+RmviTK97Ho/bEj/C9swrCspN3D4=
 github.com/gosimple/slug v1.1.1/go.mod h1:ER78kgg1Mv0NQGlXiDe57DpCyfbNywXXZ9mIorhxAf0=
 github.com/grafana-tools/sdk v0.0.0-20211220201350-966b3088eec9 h1:LQAhgcUPnzdjU/OjCJaLlPQI7NmQCRlfjMPSA1VegvA=
 github.com/grafana-tools/sdk v0.0.0-20211220201350-966b3088eec9/go.mod h1:AHHlOEv1+GGQ3ktHMlhuTUwo3zljV3QJbC0+8o2kn+4=
-github.com/grafana/dskit v0.0.0-20221018134951-0d3fc3d6c266 h1:NHL4FGXvIBZ23BVYCeWOHj0TONbgqIwQ/aawRBam0J0=
-github.com/grafana/dskit v0.0.0-20221018134951-0d3fc3d6c266/go.mod h1:NTfOwhBMmR7TyG4E3RB4F1qhvk+cawoXacyN30yipVY=
+github.com/grafana/dskit v0.0.0-20221026142359-210cad87c563 h1:CRkbcjG9nFSc42p1uNvAnkCCsiLb4reNL2r93tn+zzI=
+github.com/grafana/dskit v0.0.0-20221026142359-210cad87c563/go.mod h1:NTfOwhBMmR7TyG4E3RB4F1qhvk+cawoXacyN30yipVY=
 github.com/grafana/e2e v0.1.1-0.20221018202458-cffd2bb71c7b h1:Ha+kSIoTutf4ytlVw/SaEclDUloYx0+FXDKJWKhNbE4=
 github.com/grafana/e2e v0.1.1-0.20221018202458-cffd2bb71c7b/go.mod h1:3UsooRp7yW5/NJQBlXcTsAHOoykEhNUYXkQ3r6ehEEY=
 github.com/grafana/gomemcache v0.0.0-20220812141943-44b6cde200bb h1:CqfZjjd8iK3G1TV8Wf0u7WTY+0RxIEbmcgxftt9qVtw=
34 changes: 28 additions & 6 deletions pkg/mimir/mimir.go
@@ -14,6 +14,7 @@ import (
 	"path/filepath"
 	"strconv"
 	"strings"
+	"time"

 	"github.com/go-kit/log"
 	"github.com/go-kit/log/level"
@@ -33,6 +34,7 @@ import (
 	"github.com/weaveworks/common/server"
 	"github.com/weaveworks/common/signals"
 	"go.opentelemetry.io/otel"
+	"go.uber.org/atomic"
 	"google.golang.org/grpc/health/grpc_health_v1"
 	"gopkg.in/yaml.v3"

@@ -95,6 +97,7 @@ type Config struct {
 	Target              flagext.StringSliceCSV `yaml:"target"`
 	MultitenancyEnabled bool                   `yaml:"multitenancy_enabled"`
 	NoAuthTenant        string                 `yaml:"no_auth_tenant" category:"advanced"`
+	ShutdownDelay       time.Duration          `yaml:"shutdown_delay" category:"experimental"`
 	PrintConfig         bool                   `yaml:"-"`
 	ApplicationName     string                 `yaml:"-"`

@@ -143,6 +146,7 @@ func (c *Config) RegisterFlags(f *flag.FlagSet, logger log.Logger) {
 	f.BoolVar(&c.MultitenancyEnabled, "auth.multitenancy-enabled", true, "When set to true, incoming HTTP requests must specify tenant ID in HTTP X-Scope-OrgId header. When set to false, tenant ID from -auth.no-auth-tenant is used instead.")
 	f.StringVar(&c.NoAuthTenant, "auth.no-auth-tenant", "anonymous", "Tenant ID to use when multitenancy is disabled.")
 	f.BoolVar(&c.PrintConfig, "print.config", false, "Print the config and exit.")
+	f.DurationVar(&c.ShutdownDelay, "shutdown-delay", 0, "How long to wait between SIGTERM and shutdown. After receiving SIGTERM, Mimir will report not-ready status via /ready endpoint.")

 	c.API.RegisterFlags(f)
 	c.registerServerFlagsWithChangedDefaultValues(f)
@@ -756,10 +760,15 @@ func (t *Mimir) Run() error {
 		return err
 	}

+	// Used to delay shutdown but return "not ready" during this delay.
+	shutdownRequested := atomic.NewBool(false)
+
 	// before starting servers, register /ready handler and gRPC health check service.
 	// It should reflect entire Mimir.
-	t.Server.HTTP.Path("/ready").Handler(t.readyHandler(sm))
-	grpc_health_v1.RegisterHealthServer(t.Server.GRPC, grpcutil.NewHealthCheck(sm))
+	t.Server.HTTP.Path("/ready").Handler(t.readyHandler(sm, shutdownRequested))
+	grpc_health_v1.RegisterHealthServer(t.Server.GRPC, grpcutil.NewHealthCheckFrom(
+		grpcutil.WithShutdownRequested(shutdownRequested),
+		grpcutil.WithManager(sm),
+	))

 	// Let's listen for events from this manager, and log them.
 	healthy := func() { level.Info(util_log.Logger).Log("msg", "Application started") }
@@ -785,10 +794,18 @@ func (t *Mimir) Run() error {

 	sm.AddListener(services.NewManagerListener(healthy, stopped, serviceFailed))

-	// Setup signal handler. If signal arrives, we stop the manager, which stops all the services.
+	// Setup signal handler to gracefully shutdown in response to SIGTERM or SIGINT
 	handler := signals.NewHandler(t.Server.Log)
 	go func() {
 		handler.Loop()
+
+		shutdownRequested.Store(true)
+		t.Server.HTTPServer.SetKeepAlivesEnabled(false)
+
+		if t.Cfg.ShutdownDelay > 0 {
+			time.Sleep(t.Cfg.ShutdownDelay)
+		}
+
 		sm.StopAsync()
 	}()

@@ -818,8 +835,14 @@ func (t *Mimir) Run() error {
 	return err
 }

-func (t *Mimir) readyHandler(sm *services.Manager) http.HandlerFunc {
+func (t *Mimir) readyHandler(sm *services.Manager, shutdownRequested *atomic.Bool) http.HandlerFunc {
 	return func(w http.ResponseWriter, r *http.Request) {
+		if shutdownRequested.Load() {
+			level.Debug(util_log.Logger).Log("msg", "application is stopping")
+			http.Error(w, "Application is stopping", http.StatusServiceUnavailable)
+			return
+		}
+
 		if !sm.IsHealthy() {
 			var serviceNamesStates []string
 			for name, s := range t.ServiceMap {
@@ -829,7 +852,6 @@ func (t *Mimir) readyHandler(sm *services.Manager) http.HandlerFunc {
 			}

 			level.Debug(util_log.Logger).Log("msg", "some services are not Running", "services", serviceNamesStates)
-
 			httpResponse := "Some services are not Running:\n" + strings.Join(serviceNamesStates, "\n")
 			http.Error(w, httpResponse, http.StatusServiceUnavailable)
 			return
62 changes: 41 additions & 21 deletions vendor/github.com/grafana/dskit/grpcutil/health_check.go

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion vendor/modules.txt

Some generated files are not rendered by default.
