Improvements around testing Chunking behavior #1006
Conversation
-assert.Equal(t, sinks.MetricFlushResult{MetricsFlushed: 4, MetricsDropped: 3, MetricsSkipped: 0}, flushResult)
+assert.Equal(t, sinks.MetricFlushResult{MetricsFlushed: 3, MetricsDropped: 9, MetricsSkipped: 0}, flushResult)

// we're cancelling after 2 so we should only see 2 chunks written
assert.Equal(t, 2, len(server.History()))
assert.Equal(t, 3, len(server.History()[0].data.GetTimeseries()))
assert.Equal(t, 3, len(server.History()[1].data.GetTimeseries()))
Wanted to verify expected behavior here for "TestChunkedWriteRespectsContextCancellation":
There are 12 total samples in the chunked_input.json. The write_batch_size is 3. The context expires on the second batch, AFTER the server has received the data. We have the following 'events' (a rough accounting sketch follows this list):
- We write 3 successfully.
- We write 3 more, but the context ends up getting canceled. In a real-world scenario, I think there's some inherent ambiguity about whether the write finished or was interrupted. We will consider these 3 as dropped. (Q: Are we ok with this behavior? These 3 will be counted as dropped, despite possibly having already been forwarded and processed. In the test case, they DO get forwarded successfully, but the context expires before the client gets a response and can do its bookkeeping properly.)
- The last 6 are not written at all. I've added these under dropped. (Q: Should these be considered dropped? Or skipped?)
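For concreteness, here is a minimal sketch of the accounting I have in mind. The names (flushChunked, writeBatch) are made up for illustration and this is not the sink's actual Flush code: confirmed writes count as flushed, anything unconfirmed or never attempted counts as dropped.

```go
package sketch

import "context"

// flushChunked is a sketch only (not CortexMetricSink.Flush): writeBatch
// stands in for the real remote-write call, and metrics are plain strings.
// Confirmed writes count as flushed; anything unconfirmed or never attempted
// counts as dropped.
func flushChunked(ctx context.Context, metrics []string, batchSize int,
	writeBatch func(context.Context, []string) error) (flushed, dropped int) {
	for start := 0; start < len(metrics); start += batchSize {
		if ctx.Err() != nil {
			// Context already expired: the remaining metrics are never sent.
			dropped += len(metrics) - start
			return
		}
		end := start + batchSize
		if end > len(metrics) {
			end = len(metrics)
		}
		batch := metrics[start:end]
		if err := writeBatch(ctx, batch); err != nil {
			// The server may have processed this batch, but without a
			// confirmed response we count it as dropped.
			dropped += len(batch)
			continue
		}
		flushed += len(batch)
	}
	return
}
```

With 12 metrics, a batch size of 3, and the context canceled while the second batch is in flight, this yields flushed = 3 and dropped = 9, which lines up with the MetricsFlushed: 3, MetricsDropped: 9 assertion above.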
For Qs:
- Anytime we can't get a hard confirmation that metrics have been written, we should consider them dropped. If/when a retry mechanism is implemented, sending these "dropped" metrics again would do nothing; they'd just be rejected for already having been written.
- Dropped; we didn't intentionally skip over the metrics (like, say, we filtered out metrics that we can't send for whatever reason) but dropped them on the floor because we took too long. All skipped metrics are dropped metrics, but I consider 'dropped metrics' the result of an error, whereas skipped metrics are intentional (and I'd call any metrics not submitted after the context deadline an error scenario).
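To make that distinction concrete, here is a tiny sketch with made-up names (outcome, classify, nothing from the sink itself): skipped is a decision made before we try to send, dropped is any failed or unconfirmed attempt.

```go
package sketch

// Sketch only, hypothetical names: "skipped" means we intentionally decided
// not to send a metric before trying; "dropped" means we tried (or ran out of
// time) and cannot confirm delivery.
type outcome int

const (
	flushed outcome = iota // confirmed written by the remote endpoint
	skipped                // intentionally filtered out before sending
	dropped                // errored, timed out, or otherwise unconfirmed
)

func classify(filteredOut bool, sendErr error) outcome {
	switch {
	case filteredOut:
		return skipped
	case sendErr != nil:
		return dropped
	default:
		return flushed
	}
}
```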
We will consider these 3 as dropped. (Q: Are we ok with this behavior? These 3 will be considered dropped, despite them possibly having already gotten forwarded and processed. In the test case, they DO get forwarded successfully, but context is expired before the client gets a response and can bookkeep properly.)
Correct, these should be considered dropped. We cannot confirm success, so they are dropped.
The last 6 are not written at all. I've added these under dropped. (Q: Should these be considered dropped? Or skipped?)
Dropped should be appropriate here, for the same reason.
@@ -244,8 +244,7 @@ func (s *CortexMetricSink) Flush(ctx context.Context, metrics []samplers.InterMe
	})

	if err != nil {
		s.logger.Error(err)
I think removing this would take away some visibility we're reliant on at the moment
Added this back, though I'm not sure why the git diff isn't picking it up: 3b1e56d
	err := doIfNotDone(func() error {
		batch = append(batch, metric)
-		if i > 0 && i%s.batchWriteSize == 0 {
+		if len(batch)%s.batchWriteSize == 0 {
For a flush with a single batch, won't this logic change cause a duplicate flush due to this subsequent logic:
Lines 252 to 264 in 5ba47fc
	var err error
	if len(batch) > 0 {
		err = doIfNotDone(func() error {
			return s.writeMetrics(ctx, batch)
		})
		if err == nil {
			flushedMetrics += len(batch)
		} else {
			s.logger.Error(err)
			droppedMetrics += len(batch)
		}
	}
This logic handles a "leftover" batch. The main loop is meant to consume as many "batchWriteSize"-sized batches as possible, with this block picking up the remainder, so in the single-batch case it should not write anything extra. I added a test case to assert that this is the behavior when "batchWriteSize == numMetrics" (metrics are only written once).
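For reference, a minimal sketch of the loop shape as I understand it, with placeholder names (writeChunked, write) rather than the sink's actual code:

```go
package sketch

import "context"

// Sketch only of the loop shape being discussed (not the sink's real Flush):
// the main loop writes and resets the batch each time it reaches
// batchWriteSize, so the trailing block only ever sees a non-empty remainder.
func writeChunked(ctx context.Context, metrics []string, batchWriteSize int,
	write func(context.Context, []string) error) error {
	var batch []string
	for _, m := range metrics {
		batch = append(batch, m)
		if len(batch)%batchWriteSize == 0 {
			if err := write(ctx, batch); err != nil {
				return err
			}
			batch = nil // reset, so the leftover block below sees only a true remainder
		}
	}
	// Leftover batch: non-empty only when len(metrics) is not a multiple of
	// batchWriteSize, so a flush with exactly one full batch is not written twice.
	if len(batch) > 0 {
		return write(ctx, batch)
	}
	return nil
}
```

With batchWriteSize == len(metrics), the loop writes one full batch and resets it, so the trailing block is a no-op, which is what the new test case asserts.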
Summary
Use len(batch) to determine when we have reached write_batch_size instead of the iteration variable.

Motivation
I was investigating the test cases and noted that the first batch written was always 1 more than the write_batch_size. In particular, I noticed that for TestChunkedWritesRespectContextCancellation the metricsFlushed was 4, despite the batch_size being 3. This seems incorrect / unintended behavior to me.
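To make the off-by-one concrete, here is a small standalone sketch (assuming, as in the diff above, that the metric is appended to the batch before the size check; batchSizes is a made-up helper, not sink code): with 12 metrics and write_batch_size = 3, keying the boundary off the loop index produces batches of 4, 3, 3, 2, while keying off len(batch) produces 3, 3, 3, 3.

```go
package main

import "fmt"

// batchSizes demonstrates why keying the chunk boundary off the loop index
// makes the first batch one element too large, while keying off len(batch)
// does not.
func batchSizes(n, batchWriteSize int, useIndex bool) []int {
	var sizes []int
	var batch []int
	for i := 0; i < n; i++ {
		batch = append(batch, i)
		full := len(batch)%batchWriteSize == 0
		if useIndex {
			full = i > 0 && i%batchWriteSize == 0
		}
		if full {
			sizes = append(sizes, len(batch))
			batch = nil
		}
	}
	if len(batch) > 0 {
		sizes = append(sizes, len(batch)) // leftover batch
	}
	return sizes
}

func main() {
	fmt.Println(batchSizes(12, 3, true))  // old condition: [4 3 3 2]
	fmt.Println(batchSizes(12, 3, false)) // new condition: [3 3 3 3]
}
```

Running it prints [4 3 3 2] for the old condition and [3 3 3 3] for the new one, which is why the first flushed count was 4 rather than 3.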
Test plan
Updated the test cases to assert on the batch size. Just using unit tests.

Rollout/monitoring/revert plan