Improvements around testing Chunking behavior #1006
Conversation
-assert.Equal(t, sinks.MetricFlushResult{MetricsFlushed: 4, MetricsDropped: 3, MetricsSkipped: 0}, flushResult)
+assert.Equal(t, sinks.MetricFlushResult{MetricsFlushed: 3, MetricsDropped: 9, MetricsSkipped: 0}, flushResult)

// we're cancelling after 2 so we should only see 2 chunks written
assert.Equal(t, 2, len(server.History()))
assert.Equal(t, 3, len(server.History()[0].data.GetTimeseries()))
assert.Equal(t, 3, len(server.History()[1].data.GetTimeseries()))
Wanted to verify expected behavior here for "TestChunkedWriteRespectsContextCancellation":
There are 12 total samples in the chunked_input.json. The write_batch_size is 3. The context expires on the second batch, AFTER the server has received the data. We have the following 'events' (a rough accounting sketch follows this list):
- We write 3 successfully.
- We write 3 more, but the context ends up getting canceled. In a real-world scenario, I think there's some inherent ambiguity about whether the write finished or was interrupted. We will consider these 3 as dropped. (Q: Are we ok with this behavior? These 3 will be counted as dropped, despite possibly having already been forwarded and processed. In the test case, they DO get forwarded successfully, but the context expires before the client gets a response and can do its bookkeeping properly.)
- The last 6 are not written at all. I've added these under dropped. (Q: Should these be considered dropped? Or skipped?)
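For concreteness, here is a minimal sketch of the accounting I have in mind. The names (flushChunked, writeBatch) are made up for illustration and this is not the sink's actual Flush code: confirmed writes count as flushed, anything unconfirmed or never attempted counts as dropped.

```go
package sketch

import "context"

// flushChunked is a sketch only (not CortexMetricSink.Flush): writeBatch
// stands in for the real remote-write call, and metrics are plain strings.
// Confirmed writes count as flushed; anything unconfirmed or never attempted
// counts as dropped.
func flushChunked(ctx context.Context, metrics []string, batchSize int,
	writeBatch func(context.Context, []string) error) (flushed, dropped int) {
	for start := 0; start < len(metrics); start += batchSize {
		if ctx.Err() != nil {
			// Context already expired: the remaining metrics are never sent.
			dropped += len(metrics) - start
			return
		}
		end := start + batchSize
		if end > len(metrics) {
			end = len(metrics)
		}
		batch := metrics[start:end]
		if err := writeBatch(ctx, batch); err != nil {
			// The server may have processed this batch, but without a
			// confirmed response we count it as dropped.
			dropped += len(batch)
			continue
		}
		flushed += len(batch)
	}
	return
}
```

With 12 metrics, a batch size of 3, and the context canceled while the second batch is in flight, this yields flushed = 3 and dropped = 9, which lines up with the MetricsFlushed: 3, MetricsDropped: 9 assertion above.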
For Qs:
- Anytime we can't get a hard confirmation that metrics have been written, we should consider them dropped. If/when a retry mechanism is implemented, sending these "dropped" metrics again would do nothing; they'd just be rejected for already having been written.
- Dropped; we didn't intentionally skip over the metrics (like, say, we filtered out metrics that we can't send for whatever reason) but dropped them on the floor because we took too long. All skipped metrics are dropped metrics, but I consider 'dropped metrics' the result of an error, whereas skipped metrics are intentional (and I'd call any metrics not submitted after the context deadline an error scenario).
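To make that distinction concrete, here is a tiny sketch with made-up names (outcome, classify, nothing from the sink itself): skipped is a decision made before we try to send, dropped is any failed or unconfirmed attempt.

```go
package sketch

// Sketch only, hypothetical names: "skipped" means we intentionally decided
// not to send a metric before trying; "dropped" means we tried (or ran out of
// time) and cannot confirm delivery.
type outcome int

const (
	flushed outcome = iota // confirmed written by the remote endpoint
	skipped                // intentionally filtered out before sending
	dropped                // errored, timed out, or otherwise unconfirmed
)

func classify(filteredOut bool, sendErr error) outcome {
	switch {
	case filteredOut:
		return skipped
	case sendErr != nil:
		return dropped
	default:
		return flushed
	}
}
```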
We will consider these 3 as dropped. (Q: Are we ok with this behavior? These 3 will be considered dropped, despite them possibly having already gotten forwarded and processed. In the test case, they DO get forwarded successfully, but context is expired before the client gets a response and can bookkeep properly.)
Correct, these should be considered dropped. We cannot confirm success, so they are dropped.
The last 6 are not written at all. I've added these under dropped. (Q: Should these be considered dropped? Or skipped?)
Dropped should be appropriate here, for the same reason.
@@ -244,8 +244,7 @@ func (s *CortexMetricSink) Flush(ctx context.Context, metrics []samplers.InterMe
	})

	if err != nil {
		s.logger.Error(err)
I think removing this would take away some visibility we're reliant on at the moment
Added this back, though I'm not sure why the git diff isn't picking it up: 3b1e56d
	err := doIfNotDone(func() error {
		batch = append(batch, metric)
-		if i > 0 && i%s.batchWriteSize == 0 {
+		if len(batch)%s.batchWriteSize == 0 {
For a flush with a single batch, won't this logic change cause a duplicate flush due to this subsequent logic:
Lines 252 to 264 in 5ba47fc
	var err error
	if len(batch) > 0 {
		err = doIfNotDone(func() error {
			return s.writeMetrics(ctx, batch)
		})
		if err == nil {
			flushedMetrics += len(batch)
		} else {
			s.logger.Error(err)
			droppedMetrics += len(batch)
		}
	}
This logic handles a "leftover" batch. The main loop is meant to consume as many "batchWriteSize"-sized batches as possible, with this block picking up the remainder, so in the single-batch case it should not write anything extra. I added a test case to assert that this is the behavior when "batchWriteSize == numMetrics" (metrics are only written once).
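For reference, a minimal sketch of the loop shape as I understand it, with placeholder names (writeChunked, write) rather than the sink's actual code:

```go
package sketch

import "context"

// Sketch only of the loop shape being discussed (not the sink's real Flush):
// the main loop writes and resets the batch each time it reaches
// batchWriteSize, so the trailing block only ever sees a non-empty remainder.
func writeChunked(ctx context.Context, metrics []string, batchWriteSize int,
	write func(context.Context, []string) error) error {
	var batch []string
	for _, m := range metrics {
		batch = append(batch, m)
		if len(batch)%batchWriteSize == 0 {
			if err := write(ctx, batch); err != nil {
				return err
			}
			batch = nil // reset, so the leftover block below sees only a true remainder
		}
	}
	// Leftover batch: non-empty only when len(metrics) is not a multiple of
	// batchWriteSize, so a flush with exactly one full batch is not written twice.
	if len(batch) > 0 {
		return write(ctx, batch)
	}
	return nil
}
```

With batchWriteSize == len(metrics), the loop writes one full batch and resets it, so the trailing block is a no-op, which is what the new test case asserts.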
Summary
Use len(batch) to determine when we have reached write_batch_size instead of the iteration variable.

Motivation
I was investigating the test cases and noted that the first batch written was always 1 more than the write_batch_size. In particular, I noticed that for TestChunkedWritesRespectContextCancellation the metricsFlushed was 4, despite the batch_size being 3. This seems incorrect / unintended behavior to me.
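To make the off-by-one concrete, here is a small standalone sketch (assuming, as in the diff above, that the metric is appended to the batch before the size check; batchSizes is a made-up helper, not sink code): with 12 metrics and write_batch_size = 3, keying the boundary off the loop index produces batches of 4, 3, 3, 2, while keying off len(batch) produces 3, 3, 3, 3.

```go
package main

import "fmt"

// batchSizes demonstrates why keying the chunk boundary off the loop index
// makes the first batch one element too large, while keying off len(batch)
// does not.
func batchSizes(n, batchWriteSize int, useIndex bool) []int {
	var sizes []int
	var batch []int
	for i := 0; i < n; i++ {
		batch = append(batch, i)
		full := len(batch)%batchWriteSize == 0
		if useIndex {
			full = i > 0 && i%batchWriteSize == 0
		}
		if full {
			sizes = append(sizes, len(batch))
			batch = nil
		}
	}
	if len(batch) > 0 {
		sizes = append(sizes, len(batch)) // leftover batch
	}
	return sizes
}

func main() {
	fmt.Println(batchSizes(12, 3, true))  // old condition: [4 3 3 2]
	fmt.Println(batchSizes(12, 3, false)) // new condition: [3 3 3 3]
}
```

Running it prints [4 3 3 2] for the old condition and [3 3 3 3] for the new one, which is why the first flushed count was 4 rather than 3.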
Test plan
Updated the test cases to assert on the batch size. Just using unit tests.

Rollout/monitoring/revert plan