-
Notifications
You must be signed in to change notification settings - Fork 532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enforce -validation.create-grace-period in the ingester too #5712
Enforce -validation.create-grace-period in the ingester too #5712
Conversation
@@ -907,6 +912,8 @@ func (i *Ingester) pushSamplesToAppender(userID string, timeseries []mimirpb.Pre | |||
|
|||
// Return true if handled as soft error, and we can ingest more series. | |||
handleAppendError := func(err error, timestamp int64, labels []mimirpb.LabelAdapter) bool { | |||
stats.failedSamplesCount++ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to reviewers: I moved it here because it was always incremented right before calling handleAppendError()
.
@@ -101,6 +102,11 @@ func (id ID) LabelValue() string { | |||
return strings.ReplaceAll(string(id), "-", "_") | |||
} | |||
|
|||
// Error implements error. | |||
func (id ID) Error() string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to reviewers: made it implement error
interface so that I can use it in the ingester's handleAppendError()
.
pkg/util/validation/limits.go
Outdated
@@ -199,7 +199,7 @@ func (l *Limits) RegisterFlags(f *flag.FlagSet) { | |||
f.IntVar(&l.MaxMetadataLength, maxMetadataLengthFlag, 1024, "Maximum length accepted for metric metadata. Metadata refers to Metric Name, HELP and UNIT. Longer metadata is dropped except for HELP which is truncated.") | |||
f.IntVar(&l.MaxNativeHistogramBuckets, maxNativeHistogramBucketsFlag, 0, "Maximum number of buckets per native histogram sample. 0 to disable the limit.") | |||
_ = l.CreationGracePeriod.Set("10m") | |||
f.Var(&l.CreationGracePeriod, creationGracePeriodFlag, "Controls how far into the future incoming samples are accepted compared to the wall clock. Any sample with timestamp `t` will be rejected if `t > (now + validation.create-grace-period)`. Also used by query-frontend to avoid querying too far into the future. 0 to disable.") | |||
f.Var(&l.CreationGracePeriod, creationGracePeriodFlag, fmt.Sprintf("Controls how far into the future incoming samples are accepted compared to the wall clock. Any sample will be rejected if its timestamp is greater than '(now + -%s)'. This configuration is enforced in the distributor, ingester and query-frontend (to avoid querying too far into the future).", creationGracePeriodFlag)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to reviewers: removed the backticks usage because they're not rendered as intended in our doc. The entire auto-generated doc is within triple-backtick quotes, so the single backtick is just rendered as a backtick.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change looks good, but I wonder if this is the right approach. Bit flips may happen for other parts of the request (labels, sample value), and we don't protect against those.
I am also curious where exactly do we think that "bit flip" happened? At TCP level packets are checksummed, so I guess we can rule that out. Is it in distributor, after passing the checks? Is it in ingester after receiving the request? How can we protect against that?
One more note: query-frontend still checks if grace period is greater than zero. Shall we remove that check and always adjust "end" time? mimir/pkg/frontend/querymiddleware/limits.go Lines 157 to 171 in ff4a68a
|
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
…eadUpUntilNowMinusActiveSeriesMetricsIdleTimeout Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Definitely. This is just an easy (and inexpensive) validation we can do in ingesters, to avoid screwing TSDB with a sample very far in the future. It's not intended to prevent any invalid series / sample case.
Despite several engineers spent days on it, we still don't have any clue on the actual root cause, but just theories. So I can't answer these questions. The new check introduced in this PR should also help to understand if the corruption happens before or after appending samples to TSDB. |
I will double check why we did the check in the first place (I'm wondering if was done to offer a smoother config transition) and eventually open a PR. |
Signed-off-by: Marco Pracucci <marco@pracucci.com>
9151b3e
to
4655e73
Compare
See: #5829 |
What this PR does
In this PR I propose to enforce the configured
-validation.create-grace-period
in the ingester too. The reason is that we've experienced an issue where a sample with a timestamp very far in the future has reached the ingester, and so screwed up the TSDB. We're still investigating the root cause (apparently looks like caused by few bit flipped) but in the meanwhile I would like to enforce the-validation.create-grace-period
in the ingester too.Note to reviewers:
-validation.create-grace-period
CLI flag description because it wasn't true. 0 didn't disabled it. It's always enforced.-validation.create-grace-period
is not checked against exemplar timestamp in the distributor. I will do it in a follow up PR.load-generator
to generate samples in the future and commenting the sample validation code run in the distributor. The samples with timestamp in the future reached ingesters and ingesters rejected them.Which issue(s) this PR fixes or relates to
N/A
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]