Limits and errors for ephemeral storage (#4004)
* Add limits for ephemeral storage.
* Add new reason when ingestion of ephemeral metrics fails.
* Add tests for max ephemeral series limit.
* Introduce new discard reasons when ingesting ephemeral series.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
pstibrany authored Jan 20, 2023
1 parent 6ecbf84 commit 0e281ac
Showing 13 changed files with 656 additions and 148 deletions.
22 changes: 22 additions & 0 deletions cmd/mimir/config-descriptor.json
@@ -2780,6 +2780,17 @@
"fieldType": "int",
"fieldCategory": "advanced"
},
{
"kind": "field",
"name": "max_ephemeral_series",
"required": false,
"desc": "Max ephemeral series that this ingester can hold (across all tenants). Requests to create additional ephemeral series will be rejected. 0 = unlimited.",
"fieldValue": null,
"fieldDefaultValue": 0,
"fieldFlag": "ingester.instance-limits.max-ephemeral-series",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "max_inflight_push_requests",
@@ -3032,6 +3043,17 @@
"fieldFlag": "ingester.max-global-series-per-metric",
"fieldType": "int"
},
{
"kind": "field",
"name": "max_ephemeral_series_per_user",
"required": false,
"desc": "The maximum number of in-memory ephemeral series per tenant, across the cluster before replication. 0 to disable ephemeral storage.",
"fieldValue": null,
"fieldDefaultValue": 0,
"fieldFlag": "ingester.max-ephemeral-series-per-user",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "max_global_metadata_per_user",
4 changes: 4 additions & 0 deletions cmd/mimir/help-all.txt.tmpl
@@ -1025,6 +1025,8 @@ Usage of ./cmd/mimir/mimir:
Override the expected name on the server certificate.
-ingester.ignore-series-limit-for-metric-names string
Comma-separated list of metric names, for which the -ingester.max-global-series-per-metric limit will be ignored. Does not affect the -ingester.max-global-series-per-user limit.
-ingester.instance-limits.max-ephemeral-series int
[experimental] Max ephemeral series that this ingester can hold (across all tenants). Requests to create additional ephemeral series will be rejected. 0 = unlimited.
-ingester.instance-limits.max-inflight-push-requests int
Max inflight push requests that this ingester can handle (across all tenants). Additional requests will be rejected. 0 = unlimited. (default 30000)
-ingester.instance-limits.max-ingestion-rate float
@@ -1033,6 +1035,8 @@ Usage of ./cmd/mimir/mimir:
Max series that this ingester can hold (across all tenants). Requests to create additional series will be rejected. 0 = unlimited.
-ingester.instance-limits.max-tenants int
Max tenants that this ingester can hold. Requests from additional tenants will be rejected. 0 = unlimited.
-ingester.max-ephemeral-series-per-user int
[experimental] The maximum number of in-memory ephemeral series per tenant, across the cluster before replication. 0 to disable ephemeral storage.
-ingester.max-global-exemplars-per-user int
[experimental] The maximum number of exemplars in memory, across the cluster. 0 to disable exemplars ingestion.
-ingester.max-global-metadata-per-metric int
64 changes: 64 additions & 0 deletions docs/sources/mimir/operators-guide/mimir-runbooks/_index.md
@@ -1365,6 +1365,21 @@ How to **fix** it:
- See [`MimirIngesterReachingSeriesLimit`](#MimirIngesterReachingSeriesLimit) runbook.
### err-mimir-ingester-max-ephemeral-series
This critical error occurs when an ingester rejects a write request because it reached the maximum number of ephemeral series.
How it **works**:
- The ingester keeps all ephemeral series in memory.
- The ingester has a per-instance limit on the number of ephemeral series, used to protect the ingester from overloading in case of high traffic.
- When the limit on the number of ephemeral series is reached, new ephemeral series are rejected, while samples can still be appended to existing ones.
- To configure the limit, set the `-ingester.instance-limits.max-ephemeral-series` option (or `max_ephemeral_series` in the runtime config).
How to **fix** it:
- Increase the limit, or reshard the tenants between ingesters. See the [`MimirIngesterReachingSeriesLimit`](#MimirIngesterReachingSeriesLimit) runbook for more details; it describes persistent storage, but the same principles apply to ephemeral storage.
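As a minimal sketch of the option described above, the per-instance limit could be set in the main configuration file as follows. The `max_ephemeral_series` key under `instance_limits` comes from the configuration reference; the value `500000` is purely illustrative, and the enclosing `ingester` block is an assumption based on the `-ingester.` flag prefix.

```yaml
# Illustrative value only; choose a limit based on the memory available to each ingester.
ingester:  # assumed parent block, inferred from the flag prefix
  instance_limits:
    # Equivalent to -ingester.instance-limits.max-ephemeral-series=500000.
    # 0 (the default) means unlimited.
    max_ephemeral_series: 500000
```

Once the limit is reached, only the creation of new ephemeral series is rejected; samples for existing ephemeral series can still be appended.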
### err-mimir-ingester-max-inflight-push-requests
This error occurs when an ingester rejects a write request because the maximum in-flight requests limit has been reached.
@@ -1393,6 +1408,18 @@ How to **fix** it:
- Ensure the actual number of series written by the affected tenant is legit.
- Consider increasing the per-tenant limit by using the `-ingester.max-global-series-per-user` option (or `max_global_series_per_user` in the runtime configuration).
### err-mimir-max-ephemeral-series-per-user
This error occurs when the number of ephemeral series for a given tenant exceeds the configured limit.
The limit is used to protect ingesters from overloading in case a tenant writes a high number of ephemeral series, as well as to protect the whole system’s stability from potential abuse or mistakes.
To configure the limit on a per-tenant basis, use the `-ingester.max-ephemeral-series-per-user` option (or `max_ephemeral_series_per_user` in the runtime configuration).
How to **fix** it:
- Ensure the actual number of ephemeral series written by the affected tenant is legit.
- Consider increasing the per-tenant limit by using the `-ingester.max-ephemeral-series-per-user` option (or `max_ephemeral_series_per_user` in the runtime configuration).
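A hedged sketch of a per-tenant override in the runtime configuration file is shown below. The tenant ID `tenant-a` and the value are hypothetical, and the top-level `overrides` key is assumed to follow the usual layout of Mimir runtime configuration files.

```yaml
overrides:
  tenant-a:  # hypothetical tenant ID
    # Raises the ephemeral series limit for this tenant only.
    # Illustrative value; size it to the tenant's expected ephemeral series count.
    max_ephemeral_series_per_user: 200000
```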
### err-mimir-max-series-per-metric
This error occurs when the number of in-memory series for a given tenant and metric name exceeds the configured limit.
@@ -1558,6 +1585,14 @@ How it **works**:
> **Note**: If the out-of-order sample ingestion is enabled, then this error is similar to `err-mimir-sample-out-of-order` below with a difference that the sample is older than the out-of-order time window as it relates to the latest sample for that particular time series or the TSDB.
### err-mimir-ephemeral-sample-timestamp-too-old
This error occurs when the ingester rejects a sample because its timestamp is older than the configured retention of ephemeral storage.
How it **works**:
- Ephemeral storage in ingesters can only hold samples that are not older than the `-blocks-storage.ephemeral-tsdb.retention-period` value. If the incoming timestamp is older than "now - retention", the sample is rejected.
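The "now - retention" check can be illustrated with a small worked example. The YAML key path below is an assumption inferred from the flag name and may differ from the actual configuration; the `10m` value and the timestamps are illustrative.

```yaml
# Assumed YAML form of -blocks-storage.ephemeral-tsdb.retention-period;
# the key path is inferred from the flag name, not confirmed by this change.
blocks_storage:
  ephemeral_tsdb:
    retention_period: 10m  # illustrative value

# With this setting, at 12:00:00 the cutoff is 12:00:00 - 10m = 11:50:00:
#   sample with timestamp 11:55:00 -> passes the retention check
#   sample with timestamp 11:45:00 -> rejected with
#                                     err-mimir-ephemeral-sample-timestamp-too-old
```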
### err-mimir-sample-out-of-order
This error occurs when the ingester rejects a sample because another sample with a more recent timestamp has already been ingested.
@@ -1577,6 +1612,14 @@ Common **causes**:
> **Note**: You can learn more about out of order samples in Prometheus, in the blog post [Debugging out of order samples](https://www.robustperception.io/debugging-out-of-order-samples/).
### err-mimir-ephemeral-sample-out-of-order
This error occurs when the ingester rejects a sample because another sample with a more recent timestamp has already been ingested for the same series in the ephemeral storage.
Please refer to [err-mimir-sample-out-of-order](#err-mimir-sample-out-of-order) for possible reasons.
> **Note**: It is not possible to enable out-of-order sample ingestion for ephemeral storage.
### err-mimir-sample-duplicate-timestamp
This error occurs when the ingester rejects a sample because it is a duplicate of a previously received sample with the same timestamp but different value in the same time series.
@@ -1586,6 +1629,15 @@ Common **causes**:
- Multiple endpoints are exporting the same metrics, or multiple Prometheus instances are scraping different metrics with identical labels.
- Prometheus relabelling has been configured and it causes series to clash after the relabelling. Check the error message for information about which series has received a duplicate sample.
### err-mimir-ephemeral-sample-duplicate-timestamp
This error occurs when the ingester rejects a sample because it is a duplicate of a previously received sample with the same timestamp but different value for the same ephemeral series.
Common **causes**:
- Multiple endpoints are exporting the same metrics, or multiple Prometheus instances are scraping different metrics with identical labels.
- Prometheus relabelling has been configured and it causes series to clash after the relabelling. Check the error message for information about which series has received a duplicate sample.
### err-mimir-exemplar-series-missing
This error occurs when the ingester rejects an exemplar because its related series has not been ingested yet.
@@ -1641,6 +1693,18 @@ How to **fix** it:
- Increase the allowed limit by using the `-distributor.max-recv-msg-size` option.
### err-mimir-ephemeral-storage-not-enabled-for-user
The ingester returns this error when a write request contains ephemeral series, but ephemeral storage is disabled for the tenant.
Ephemeral storage is disabled when `-ingester.max-ephemeral-series-per-user` (or the corresponding `max_ephemeral_series_per_user` limit in the runtime configuration) is set to 0 for the given tenant.
How to **fix** it:
- Disable support for ephemeral series in the distributor by setting `-distributor.ephemeral-series-enabled` to `false`.
- Remove the rules that mark incoming series as ephemeral for the given tenant by removing `-distributor.ephemeral-series-matchers` (or `ephemeral_series_matchers` in the runtime configuration).
- Enable ephemeral storage for the tenant by setting `-ingester.max-ephemeral-series-per-user` (or the corresponding `max_ephemeral_series_per_user` limit in the runtime configuration) to a positive number.
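To illustrate the last option, the following sketch keeps ephemeral storage disabled by default and enables it for a single tenant. The `limits` block and the `max_ephemeral_series_per_user` key come from the configuration reference; the tenant ID `tenant-a`, the value, and the `overrides` layout of the runtime configuration file are assumptions.

```yaml
# Main configuration file: default for all tenants.
limits:
  max_ephemeral_series_per_user: 0  # 0 disables ephemeral storage
```

```yaml
# Runtime configuration file (assumed layout).
overrides:
  tenant-a:  # hypothetical tenant ID
    # Any positive number enables ephemeral storage for this tenant.
    max_ephemeral_series_per_user: 10000
```

With this setup, write requests containing ephemeral series would be expected to succeed for `tenant-a` and to fail with `err-mimir-ephemeral-storage-not-enabled-for-user` for every other tenant.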
## Mimir routes by path
**Write path**:
11 changes: 11 additions & 0 deletions docs/sources/mimir/reference-configuration-parameters/index.md
@@ -934,6 +934,12 @@ instance_limits:
# CLI flag: -ingester.instance-limits.max-series
[max_series: <int> | default = 0]
# (experimental) Max ephemeral series that this ingester can hold (across all
# tenants). Requests to create additional ephemeral series will be rejected. 0
# = unlimited.
# CLI flag: -ingester.instance-limits.max-ephemeral-series
[max_ephemeral_series: <int> | default = 0]
# (advanced) Max inflight push requests that this ingester can handle (across
# all tenants). Additional requests will be rejected. 0 = unlimited.
# CLI flag: -ingester.instance-limits.max-inflight-push-requests
@@ -2539,6 +2545,11 @@ The `limits` block configures default and per-tenant limits imposed by component
# CLI flag: -ingester.max-global-series-per-metric
[max_global_series_per_metric: <int> | default = 0]

# (experimental) The maximum number of in-memory ephemeral series per tenant,
# across the cluster before replication. 0 to disable ephemeral storage.
# CLI flag: -ingester.max-ephemeral-series-per-user
[max_ephemeral_series_per_user: <int> | default = 0]

# The maximum number of in-memory metrics with metadata per tenant, across the
# cluster. 0 to disable.
# CLI flag: -ingester.max-global-metadata-per-user