-
Notifications
You must be signed in to change notification settings - Fork 543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add metric to track out-of-space errors #8237
Conversation
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (assuming you've tested the error is actually captured when disk is out of space... which may not be super difficult to test in the local dev env using a very small volume for the compactor). Also remember to add a CHANGELOG entry.
@@ -665,6 +674,9 @@ func (c *MultitenantCompactor) compactUsers(ctx context.Context) { | |||
// We don't want to count shutdowns as failed compactions because we will pick up with the rest of the compaction after the restart. | |||
level.Info(c.logger).Log("msg", "compaction for user was interrupted by a shutdown", "user", userID) | |||
return | |||
case errors.Is(err, syscall.ENOSPC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you done any manual test to ensure the error is correctly captured, given it's something we can't easily / reliably assert with a unit test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment documenting it
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
* add metric for out-of-space errors Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com> * syntax Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com> * better comment Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com> * PR feedback Co-authored-by: Marco Pracucci <marco@pracucci.com> * add CHANGELOG entry Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com> --------- Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com> Co-authored-by: Marco Pracucci <marco@pracucci.com>
This adds a separate metric to count how many times a compaction failed due to an out of space error.
We need to be able to alert on out of space conditions with higher criticality than we alert on compaction failures because generic compaction failures can happen due to various transient issues such as network lag and we wouldn't want to get alerted on every network lag, an out of space condition is usually not transient and it usually requires an operator to resolve the problem so it makes sense to separate that metric from other compaction failures.