Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the ability to wait for a period of time after SIGTERM #3298

Merged
merged 4 commits into from
Oct 26, 2022

Conversation

56quarters
Copy link
Contributor

@56quarters 56quarters commented Oct 24, 2022

What this PR does

Adds the ability to supply a "shutdown delay" to Mimir components such that they will disable HTTP keep-alives and mark themselves as not ready when receiving SIGTERM or SIGINT (via HTTP /ready or gRPC) but wait a configurable amount of time before actually stopping.

Fixes an issue during rollouts on Kubernetes where the Grafana Cloud Gateway holds on to connections to query-frontends even when they shutdown resulting in user-facing read errors. By closing connections during shutdown, marking the component as "not ready", and still continuing to serve requests we ensure that:

  • Users don't see any disruption
  • Connections to the stopping component are not pooled
  • Kubernetes service endpoints are removed before the pod

Signed-off-by: Nick Pillitteri nick.pillitteri@grafana.com

Which issue(s) this PR fixes or relates to

N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@56quarters 56quarters force-pushed the 56quarters/shutdown-delay branch 2 times, most recently from d62015b to b0df93b Compare October 25, 2022 22:22
Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes in logic look good and may be useful. I think this PR suffers from too many comments. There is no need to write comments that describe exact logic in the code which is quite trivial (eg. comment for shutdownSignalReceiver type in signals.go)

pkg/mimir/signals.go Outdated Show resolved Hide resolved
pkg/mimir/mimir.go Outdated Show resolved Hide resolved
pkg/mimir/mimir.go Outdated Show resolved Hide resolved
pkg/mimir/mimir.go Outdated Show resolved Hide resolved
Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks a nice feature, I'm up to it! I left few minor comments.

pkg/mimir/mimir.go Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
pkg/mimir/mimir.go Outdated Show resolved Hide resolved
pkg/mimir/mimir.go Outdated Show resolved Hide resolved
pkg/mimir/mimir.go Outdated Show resolved Hide resolved
pkg/mimir/grpc_health_check.go Outdated Show resolved Hide resolved
pkg/mimir/signals.go Outdated Show resolved Hide resolved
56quarters added a commit to grafana/dskit that referenced this pull request Oct 26, 2022
Extend the gRPC HealthCheck struct to allow it to make use of multiple
types of health checks such as a services.Manager or atomic.Bool shutdown
request.

See grafana/mimir#3298

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@56quarters
Copy link
Contributor Author

Thank you for the feedback! I've opened grafana/dskit#227 to move the gRPC health checks into dskit and I'll work through the rest of the changes.

Adds the ability to supply a "shutdown delay" to Mimir components such
that they will disable HTTP keep-alives and mark themselves as not
ready when receiving SIGTERM or SIGINT (via HTTP /ready or gRPC) but
wait a configurable amount of time before actually stopping.

Fixes an issue during rollouts on Kubernetes where the Grafana Cloud
Gateway holds on to connections to query-frontends even when they
shutdown resulting in user-facing read errors. By closing connections
during shutdown, marking the component as "not ready", and still
continuing to serve requests we ensure that:

* Users don't see any disruption
* Connections to the stopping component are not pooled
* Kubernetes service endpoints are removed before the pod

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit to grafana/dskit that referenced this pull request Oct 26, 2022
Extend the gRPC HealthCheck struct to allow it to make use of multiple
types of health checks such as a services.Manager or atomic.Bool shutdown
request.

See grafana/mimir#3298

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@56quarters 56quarters marked this pull request as ready for review October 26, 2022 14:31
@56quarters 56quarters requested review from osg-grafana and a team as code owners October 26, 2022 14:31
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you!

pkg/mimir/mimir.go Outdated Show resolved Hide resolved
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Also the code now is even more clean 👏

@pracucci pracucci enabled auto-merge (squash) October 26, 2022 16:08
@pracucci pracucci merged commit cca6e11 into main Oct 26, 2022
@pracucci pracucci deleted the 56quarters/shutdown-delay branch October 26, 2022 16:18
masonmei pushed a commit to udmire/mimir that referenced this pull request Nov 4, 2022
)

* Add the ability to wait for a period of time after SIGTERM

Adds the ability to supply a "shutdown delay" to Mimir components such
that they will disable HTTP keep-alives and mark themselves as not
ready when receiving SIGTERM or SIGINT (via HTTP /ready or gRPC) but
wait a configurable amount of time before actually stopping.

Fixes an issue during rollouts on Kubernetes where the Grafana Cloud
Gateway holds on to connections to query-frontends even when they
shutdown resulting in user-facing read errors. By closing connections
during shutdown, marking the component as "not ready", and still
continuing to serve requests we ensure that:

* Users don't see any disruption
* Connections to the stopping component are not pooled
* Kubernetes service endpoints are removed before the pod

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Code review changes.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Handle shutdown inline in `Run` goroutine.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Phrasing.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
masonmei pushed a commit to udmire/mimir that referenced this pull request Nov 4, 2022
)

* Add the ability to wait for a period of time after SIGTERM

Adds the ability to supply a "shutdown delay" to Mimir components such
that they will disable HTTP keep-alives and mark themselves as not
ready when receiving SIGTERM or SIGINT (via HTTP /ready or gRPC) but
wait a configurable amount of time before actually stopping.

Fixes an issue during rollouts on Kubernetes where the Grafana Cloud
Gateway holds on to connections to query-frontends even when they
shutdown resulting in user-facing read errors. By closing connections
during shutdown, marking the component as "not ready", and still
continuing to serve requests we ensure that:

* Users don't see any disruption
* Connections to the stopping component are not pooled
* Kubernetes service endpoints are removed before the pod

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Code review changes.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Handle shutdown inline in `Run` goroutine.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

* Phrasing.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit that referenced this pull request Nov 10, 2022
Deprecate ingester ring specific option for delaying shutdown after a
SIGTERM. Instead use the `shutdown-delay` option which can be added to
any component (it is not ingester or ring specific).

See #3298

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants