-
Notifications
You must be signed in to change notification settings - Fork 543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Limits and errors for ephemeral storage #4004
Conversation
This is not really related to this PR, more to how the ephemeral TSDB is implemented: I realized that we currently always create a persistent TSDB, even if a user would want to run an ingester with ephemeral-only data (for example in the ephemeral matchers they could specify |
I'm thinking about whether it makes sense to add new metrics like for example I think both ways can be considered a breaking change. The current solution with the new metrics would be a breaking change if a user currently has an alert on failures to ingest samples, using the metric OTH if we'd add the I think it's safe to assume that the majority of users won't use the ephemeral store, so probably the safer choice is to just not touch the existing metrics. That way only users who want to start using the ephemeral store, so the minority of users, will have to experience a breaking change which requires them to potentially update their alerts/dashboards. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job! I had only very few comments on this huge PR.
I'll approve to unblock, but before merging I think at least the question about 0
to disable the limit should be addressed.
(btw the linter is failing)
cbd3aeb
to
9661bcf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job, LGTM! I left few minor comments. My main comment which I would love to see addressed is the runbooks doc, cause I think could be easily done with a copy-paste (still better than "TBD" 😉 ).
@@ -1361,6 +1361,10 @@ How to **fix** it: | |||
|
|||
- See [`MimirIngesterReachingSeriesLimit`](#MimirIngesterReachingSeriesLimit) runbook. | |||
|
|||
### err-mimir-ingester-max-ephemeral-series | |||
|
|||
This error refers to experimental ephemeral storage in ingesters. Documentation TBD. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of "Documentation TBD", what if you copy-paste the content of the persistent counterpart runbook and just replace the config names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done here: 083b57f
pkg/ingester/ingester.go
Outdated
updateFirstPartial(func() error { return makeLimitError(perUserSeriesLimit, i.limiter.FormatError(userID, cause)) }) | ||
updateFirstPartial(func() error { | ||
if ephemeral { | ||
return makeLimitError(ephemeralDiscardPrefix+perUserSeriesLimit, i.limiter.FormatError(userID, cause)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] What if we define a const ephemeralPerUserSeriesLimit
instead of ephemeralDiscardPrefix+perUserSeriesLimit
? I've the feeling may be easier in the future to grep for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is what you had in mind: 8a6cda5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Based on your feedback I've updated runbooks, but they are not always just copy&pasted versions. I was hoping to move that work to different PR to avoid lengthy docs review loop here (my english language skills are generally worse than Go skills, although some people may disagree). But let's give this a shot. |
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
6cc08f8
to
e9ecb39
Compare
* Update test * Add missing changelog entries for commits since Mimir 2.5 (#4006) All other commits weren't user-facing or were helm-chart specific. See #3979 Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com> * Add --concurrency support to 'mimirtool rules sync' command (#3996) * Add --concurrency support to 'mimirtool rules sync' command Signed-off-by: Marco Pracucci <marco@pracucci.com> * Update pkg/mimirtool/commands/rules.go Co-authored-by: Patrick Oyarzun <patrick.oyarzun@grafana.com> Signed-off-by: Marco Pracucci <marco@pracucci.com> Co-authored-by: Patrick Oyarzun <patrick.oyarzun@grafana.com> * store-gateway: ExpandedPostings shortcut: avoid LabelValues unless necessary (#3872) * Return `Canceled` rather than `Aborted` when a `Series` request to a store-gateway is cancelled by the calling querier. (#4007) * Return Canceled rather than Aborted when a Series request to a store-gateway is cancelled. * Add changelog entry. * Update mimir-prometheus, add support for align_evaluation_time_on_interval. (#4013) Signed-off-by: Peter Štibraný <pstibrany@gmail.com> * Fix title of guide in link text; reword phrase. (#4008) * Fix ExampleInitLogger to work in UTC (#4016) The test didn't pass in my time zone (tm). --- FAIL: ExampleInitLogger (0.00s) got: ts=1970-01-01T01:00:00+01:00 caller=log_test.go:31 level=info test=1 ts=1970-01-01T01:00:00+01:00 caller=log_test.go:33 level=info msg="test 3" want: ts=1970-01-01T00:00:00Z caller=log_test.go:31 level=info test=1 ts=1970-01-01T00:00:00Z caller=log_test.go:33 level=info msg="test 3" Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com> Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com> * Create outline of Mimir 2.6 release notes (#4002) Includes notable features and bugfixes based on the CHANGELOG. Helm changes to be filled out later by product and engineering. See #3979 Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com> * Fix post-merge comments from PR #4013. (#4014) Signed-off-by: Peter Štibraný <pstibrany@gmail.com> * Update CODEOWNERS to include mimir-ruler-and-alertmanager-maintainers (#4019) For those who only want notifications re the ruler or Alertmanager. * Remove internal use of store.max-query-length (#4017) Make deprecation of the option more obvious and attempt to remove any use of store.max-query-length in our documentation, jsonnet, helm, and integration tests. See #2793 See #3825 Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com> Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com> * [otlp] Update OTel Collector to latest release (#3852) * [otlp] Update otel collector dependecy to latest * Update code to deal with deprecated functions Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com> * [otlp] Docs: Highlight common issues with OTLP --> Prometheus (#3629) Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com> Co-authored-by: Ursula Kallio <ursula.kallio@grafana.com> * make it possible to inject memberlist kv codecs (#4018) * make it possible to inject memberlist kv codecs Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com> * add comment Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com> * improve comment wording Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com> Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com> * Limits and errors for ephemeral storage (#4004) * Add limits for ephemeral storage. * Add new reason when ingestion of ephemeral metrics fails. * Add tests for max ephemeral series limit. * Introduce new discard reasons when ingesting ephemeral series. Signed-off-by: Peter Štibraný <pstibrany@gmail.com> * Reduce maintainership and step down as team member. (#4023) * Reduce maintainership and step down as team member. My future priorities will be on the alerting aspects of Mimir, so I think it is right to reduce my maintainership accordingly and allow others to take my place. Similarly, remove myself as a team member. * Sort previous team members. Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com> Signed-off-by: Marco Pracucci <marco@pracucci.com> Signed-off-by: Peter Štibraný <pstibrany@gmail.com> Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com> Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com> Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com> Co-authored-by: Nick Pillitteri <56quarters@users.noreply.github.com> Co-authored-by: Marco Pracucci <marco@pracucci.com> Co-authored-by: Patrick Oyarzun <patrick.oyarzun@grafana.com> Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com> Co-authored-by: Charles Korn <charleskorn@users.noreply.github.com> Co-authored-by: Peter Štibraný <pstibrany@gmail.com> Co-authored-by: Ursula Kallio <ursula.kallio@grafana.com> Co-authored-by: Oleg Zaytsev <mail@olegzaytsev.com> Co-authored-by: gotjosh <josue.abreu@gmail.com> Co-authored-by: Goutham Veeramachaneni <gouthamve@gmail.com> Co-authored-by: Mauro Stettler <mauro.stettler@gmail.com> Co-authored-by: Steve Simpson <steve.simpson@grafana.com>
What this PR does
This PR
-ingester.instance-limits.max-ephemeral-series
)-ingester.max-ephemeral-series-per-user
)err-mimir-max-ephemeral-series-per-user
err-mimir-ingester-max-ephemeral-series
err-mimir-ephemeral-sample-timestamp-too-old
err-mimir-ephemeral-sample-out-of-order
err-mimir-ephemeral-sample-duplicate-timestamp
err-mimir-ephemeral-storage-not-enabled-for-user
This PR does not update changelog, nor does it add documentation for newly introduced errors in the error catalog. This will be handled by subsequent PRs.
(If needed, I'm happy to split the PR into smaller PRs)
Which issue(s) this PR fixes or relates to
Part of #3884
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]