fix Metric livelock by replacing potential infinite loop in MetricValuesBuffer.GetAndResetValue #2612

Merged
merged 10 commits into from
Jun 17, 2022

@TimothyMothra TimothyMothra commented Jun 16, 2022

Fix Issue #1186.

This PR mitigates the potential infinite while loop in MetricValuesBuffer by adding an exit condition.

Changes

  • added an exit condition to MetricValuesBuffer.GetAndResetValue().

Explanation

MetricValuesBuffer has a section that could get stuck in an infinite loop here:

    value = this.GetAndResetValueOnce(this.values, index);
    while (this.IsInvalidValue(value))
    {
        spinWait.SpinOnce();
        if (spinWait.Count % 100 == 0)
        {
            // In tests (including stress tests) we always finished waiting before 100 cycles.
            // However, this is a protection against an extreme case on a slow machine.
            Task.Delay(10).ConfigureAwait(continueOnCapturedContext: false).GetAwaiter().GetResult();
        }
        value = this.GetAndResetValueOnce(this.values, index);
    }

GetAndResetValueOnce reads a value from the buffer at an index and resets that slot to Double.NaN.
The loop has no way to break out if another thread has already reset the value, because every later read returns Double.NaN again.

A stuck loop would also affect the lock in MetricSeriesAggregatorBase, because the buffer is read while that lock is held:

    lock (buffer)
    {
        int maxFlushIndex = Math.Min(buffer.PeekLastWriteIndex(), buffer.Capacity - 1);
        int minFlushIndex = buffer.NextFlushIndex;
        if (minFlushIndex > maxFlushIndex)
        {
            return;
        }
        stage1Result = this.UpdateAggregate_Stage1(buffer, minFlushIndex, maxFlushIndex);
        buffer.NextFlushIndex = maxFlushIndex + 1;
    }

Alternatives considered

My previous PR (#2595) replaced the lock in MetricSeriesAggregatorBase.

However, consider what happens when we break out at each level.
Breaking out of MetricSeriesAggregatorBase may cause the SDK to lose a complete batch of metrics.
Breaking out of MetricValuesBuffer drops only a single metric, so the exit condition belongs there.
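
To illustrate the trade-off, here is a minimal sketch of a bounded wait that gives up on a single slot. It is not the code merged in this PR: the class name BoundedSpinBuffer, the Write and TryGetAndResetValue methods, and the limit of 100 attempts are all hypothetical and chosen only for illustration.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for MetricValuesBuffer. All names and the attempt
// limit are assumptions for this sketch, not the SDK's actual implementation.
public sealed class BoundedSpinBuffer
{
    // Assumed limit. The exit condition in the merged fix may differ.
    private const int MaxAttempts = 100;

    private readonly double[] values;

    public BoundedSpinBuffer(int capacity)
    {
        this.values = new double[capacity];
        for (int i = 0; i < capacity; i++)
        {
            this.values[i] = double.NaN; // NaN marks a slot with no value to read
        }
    }

    public void Write(int index, double value)
    {
        Interlocked.Exchange(ref this.values[index], value);
    }

    // Reads the slot and resets it to NaN in one atomic step.
    private double GetAndResetValueOnce(int index)
    {
        return Interlocked.Exchange(ref this.values[index], double.NaN);
    }

    // Returns true with the value if one is found within the attempt budget.
    // Returns false otherwise, so the caller drops this one slot and moves on
    // instead of spinning forever.
    public bool TryGetAndResetValue(int index, out double value)
    {
        var spinWait = new SpinWait();
        int attempts = 0;

        value = this.GetAndResetValueOnce(index);
        while (double.IsNaN(value))
        {
            attempts++;
            if (attempts >= MaxAttempts)
            {
                return false; // exit condition: give up on this single value
            }

            spinWait.SpinOnce();
            value = this.GetAndResetValueOnce(index);
        }

        return true;
    }
}
```

With this shape, a caller that flushes a range of indices would skip a slot when TryGetAndResetValue returns false and continue with the next index, so at most one value is lost per stuck slot rather than the whole batch.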

Checklist

  • I ran Unit Tests locally.
  • CHANGELOG.md updated with one line description of the fix, and a link to the original issue if available.

For significant contributions please make sure you have completed the following items:

  • Design discussion issue #
  • Changes in public surface reviewed

The PR will trigger build, unit tests, and functional tests automatically. Please follow these instructions to build and test locally.

Notes for authors:

  • FxCop and other analyzers will fail the build. To see these errors yourself, compile locally using the Release configuration.

Notes for reviewers:

  • We support comment build triggers
    • /AzurePipelines run will queue all builds
    • /AzurePipelines run <pipeline-name> will queue a specific build

@TimothyMothra TimothyMothra requested a review from cijothomas June 16, 2022 00:08
@TimothyMothra TimothyMothra marked this pull request as ready for review June 16, 2022 20:40
@TimothyMothra TimothyMothra changed the title from "fix Metric deadlock by replacing potential infinite loop in MetricValuesBuffer.GetAndResetValue" to "fix Metric livelock by replacing potential infinite loop in MetricValuesBuffer.GetAndResetValue" Jun 16, 2022

@cijothomas cijothomas left a comment

LGTM. This needs some validation before it goes into a stable release.

@TimothyMothra TimothyMothra enabled auto-merge (squash) June 17, 2022 18:37
@TimothyMothra TimothyMothra merged commit e9d4974 into main Jun 17, 2022
@TimothyMothra TimothyMothra deleted the tilee/1186_GetAndResetValue branch June 17, 2022 19:03