
Improvements around testing Chunking behavior #1006

Merged — 11 commits into master on Nov 14, 2022

Conversation

@mimran-stripe (Contributor) commented Nov 10, 2022

Summary

  1. Changes the logic to use len(batch) to determine when we have reached write_batch_size instead of the iteration variable.
  2. Adds assertions on the size of each batch that is written.
  3. A few changes to comments.

Motivation

I was investigating the test cases and noticed that the first batch written was always one more than write_batch_size. In particular, for TestChunkedWritesRespectContextCancellation, metricsFlushed was 4 despite the batch_size being 3. This seems like incorrect/unintended behavior to me.

Test plan

Updated the test cases to assert on the batch size. Just using unit tests.

Rollout/monitoring/revert plan

Comment on lines -284 to +297
- assert.Equal(t, sinks.MetricFlushResult{MetricsFlushed: 4, MetricsDropped: 3, MetricsSkipped: 0}, flushResult)
+ assert.Equal(t, sinks.MetricFlushResult{MetricsFlushed: 3, MetricsDropped: 9, MetricsSkipped: 0}, flushResult)

// we're cancelling after 2 so we should only see 2 chunks written
assert.Equal(t, 2, len(server.History()))
assert.Equal(t, 3, len(server.History()[0].data.GetTimeseries()))
assert.Equal(t, 3, len(server.History()[1].data.GetTimeseries()))
@mimran-stripe (Author) commented Nov 10, 2022

Wanted to verify expected behavior here for TestChunkedWriteRespectsContextCancellation:
There are 12 total samples in chunked_input.json. The write_batch_size is 3. The context expires on the second batch, AFTER the data has been received. We have the following 'events':

  1. We write 3 successfully.
  2. We write 3 more, but the context ends up getting canceled. I think there's some inherent ambiguity about whether the write finished or was interrupted in a real-world scenario. We will consider these 3 as dropped. (Q: Are we ok with this behavior? These 3 will be counted as dropped despite possibly having already been forwarded and processed. In the test case, they DO get forwarded successfully, but the context expires before the client gets a response and can bookkeep properly.)
  3. The last 6 are not written at all. I've added these under dropped. (Q: Should these be considered dropped? Or skipped?)

Contributor

For Qs:

  1. Anytime we can't get a hard confirmation that metrics have been written, we should consider them dropped. If/when a retry mechanism is implemented, sending these "dropped" metrics again would do nothing; they'd just be rejected for already being written.

  2. Dropped: we didn't intentionally skip over the metrics (like, say, filtering out metrics we can't send for whatever reason) but dropped them on the floor because we took too long.

All skipped metrics are dropped metrics, but I consider 'dropped' metrics the result of an error, while skipped metrics are intentional (and I'd call any metrics not submitted after the context deadline an error scenario).

Contributor

We will consider these 3 as dropped. (Q: Are we ok with this behavior? These 3 will be considered dropped, despite them possibly having already gotten forwarded and processed. In the test case, they DO get forwarded successfully, but context is expired before the client gets a response and can bookkeep properly.)

Correct, these should be considered dropped. We cannot confirm success therefore they are dropped.

The last 6 are not written at all. I've added these under dropped. (Q: Should these be considered dropped? Or skipped?)

Dropped should be appropriate here, for the same reason.

@mimran-stripe mimran-stripe changed the title Investigating off by one Improvements around testing Bucketing behavior Nov 10, 2022
@@ -244,8 +244,7 @@ func (s *CortexMetricSink) Flush(ctx context.Context, metrics []samplers.InterMe
})

if err != nil {
- s.logger.Error(err)
Contributor

I think removing this would take away some visibility we're reliant on at the moment.

Contributor (Author)

Added this back, though the git diff isn't picking it up for some reason: 3b1e56d

err := doIfNotDone(func() error {
batch = append(batch, metric)
- if i > 0 && i%s.batchWriteSize == 0 {
+ if len(batch)%s.batchWriteSize == 0 {
Contributor

For a flush with a single batch, won't this logic change cause a duplicate flush due to this subsequent logic:

var err error
if len(batch) > 0 {
err = doIfNotDone(func() error {
return s.writeMetrics(ctx, batch)
})
if err == nil {
flushedMetrics += len(batch)
} else {
s.logger.Error(err)
droppedMetrics += len(batch)
}
}

@mimran-stripe (Author) commented Nov 10, 2022

This logic handles a "leftover" batch. The main loop is meant to consume as many batchWriteSize-sized batches as possible, with this block picking up the remainder; when the metric count divides evenly, the leftover batch is empty, so this path is not taken. I added a test case asserting that when batchWriteSize == numMetrics, the metrics are only written once.

@mimran-stripe mimran-stripe changed the title Improvements around testing Bucketing behavior Improvements around testing Chunking behavior Nov 10, 2022
@mimran-stripe mimran-stripe merged commit febb360 into master Nov 14, 2022