Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add prepare-shutdown endpoint to ingesters for down scaling #4718

Merged
merged 5 commits into from
Apr 14, 2023

Conversation

56quarters
Copy link
Contributor

What this PR does

This change adds a new HTTP endpoint to ingesters that changes their in-memory configuration (with an on-disk backup) such that they will:

  • Unregister from the ring when they stop
  • Flush all in-memory data to object storage when they stop

This differs from the shutdown endpoint because it does not actually stop the ingesters, just modifies their configuration in preparation for being permanently stopped. This is a requirement of using the rollout-operator for gracefully scaling down ingesters in Kubernetes.

Which issue(s) this PR fixes or relates to

See grafana/rollout-operator#47

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

This change adds a new HTTP endpoint to ingesters that changes their
in-memory configuration (with an on-disk backup) such that they will:
* Unregister from the ring when they stop
* Flush all in-memory data to object storage when they stop

This differs from the shutdown endpoint because it does not actually
stop the ingesters, just modifies their configuration in preparation
for being permanently stopped. This is a requirement of using the
rollout-operator for gracefully scaling down ingesters in Kubernetes.

See grafana/rollout-operator#47

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@56quarters 56quarters marked this pull request as ready for review April 12, 2023 22:58
@56quarters 56quarters requested review from a team as code owners April 12, 2023 22:58
pkg/ingester/ingester.go Outdated Show resolved Hide resolved
pkg/ingester/ingester_test.go Show resolved Hide resolved
Copy link
Contributor

@colega colega left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ShutdownMarker should export a metric, so we can add an alert if for some reason there's an ingester continuously running with a shutdown marker (because as I understand, the marker is only removed when we remove the PVC, so if the sts isn't shut down after all (because we upscaled quickly?) it will continue shutting down between restarts.

Which makes me think: do we need a revert mechanism?

pkg/ingester/ingester.go Outdated Show resolved Hide resolved
@56quarters
Copy link
Contributor Author

I think ShutdownMarker should export a metric, so we can add an alert if for some reason there's an ingester continuously running with a shutdown marker (because as I understand, the marker is only removed when we remove the PVC, so if the sts isn't shut down after all (because we upscaled quickly?) it will continue shutting down between restarts.

Which makes me think: do we need a revert mechanism?

I'll add a metric. We could build a "revert" mechanism but we'd have to add that to the rollout operator as well. Let's avoid that until we need it.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@56quarters
Copy link
Contributor Author

56quarters commented Apr 13, 2023

I think ShutdownMarker should export a metric, so we can add an alert if for some reason there's an ingester continuously running with a shutdown marker (because as I understand, the marker is only removed when we remove the PVC, so if the sts isn't shut down after all (because we upscaled quickly?) it will continue shutting down between restarts.

Which makes me think: do we need a revert mechanism?

I'll plan on doing this as a follow-up to the open rollout-operator PR and in Mimir. I'd rather get something out to test before handling reverting of scale downs.

Done while adding support for GET (added DELETE).

docs/sources/mimir/references/http-api/index.md Outdated Show resolved Hide resolved
pkg/ingester/ingester_test.go Outdated Show resolved Hide resolved
Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I think the direction here is fine, but I'd like to see the PR simplified before merging.

pkg/ingester/shutdown_marker.go Outdated Show resolved Hide resolved
pkg/ingester/shutdown_marker.go Outdated Show resolved Hide resolved
pkg/ingester/metrics.go Outdated Show resolved Hide resolved
pkg/ingester/ingester.go Outdated Show resolved Hide resolved
pkg/ingester/ingester.go Outdated Show resolved Hide resolved
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
### Prepare for Shutdown

```
GET,POST /ingester/prepare-shutdown
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's mention DELETE too and explain what different methods do on this endpoint.

Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing my feedback!

pkg/ingester/ingester.go Outdated Show resolved Hide resolved
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@56quarters 56quarters merged commit 804c1dc into main Apr 14, 2023
@56quarters 56quarters deleted the 56quarters/prepare-shutdown branch April 14, 2023 20:40
@56quarters 56quarters restored the 56quarters/prepare-shutdown branch April 17, 2023 14:49
56quarters added a commit that referenced this pull request Apr 17, 2023
Forgot to update the list of verbs supported in #4718 after review feedback.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@56quarters 56quarters deleted the 56quarters/prepare-shutdown branch April 17, 2023 14:49
56quarters added a commit that referenced this pull request Apr 17, 2023
Forgot to update the list of verbs supported in #4718 after review feedback.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! 👏

MichelHollands added a commit to grafana/loki that referenced this pull request Apr 19, 2023
**What this PR does / why we need it**:

This updates the PrepareShutdown method so it supports GET and DELETE
methods as well. This makes it similar to Mimir:
grafana/mimir#4718.
The status is now stored in a local file. A new config setting had to be
added for this file as there is no obvious place to store it.


**Checklist**
- [X] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [X] Documentation added
- [X] Tests updated
- [x] `CHANGELOG.md` updated
- [x] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`

---------

Signed-off-by: Michel Hollands <michel.hollands@grafana.com>
Co-authored-by: Dylan Guedes <djmgguedes@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants