Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Instrumentation.Process] Implement threadSafe Refresh() #664

Merged
merged 13 commits into from
Oct 10, 2022

Conversation

Yun-Ting
Copy link
Contributor

@Yun-Ting Yun-Ting commented Sep 28, 2022

Changes

Working towards: #447
Background: #335
Following OTel spec.

Assumptions:

  1. For callbacks of the instruments from the same collection of an exporter, the instrument values should come from the same snapshot of the current process. In this PR, this was achieved by calling Process.Refresh() exactly once for each collection and use the idea of a state machine to check whether the snapshot is expired or not.
    Without this guarantee, there will be issues with the following cases:
    If users are trying to use the data from the dashboard to debug OOM of their application, it will be misleading if physical memory usage and virtual memory usage were not taken from a single snapshot.
    (Also, not in the scope of this PR yet, when calculating CPU utilization:
    Process.GetTotalProcessorTime() / elapsedTimeSinceTheProcessStart * (number of processors)
    Process.GetTotalProcessorTime() should be coming from the same snapshot for accuracy.

  2. ProcessMetrics should be threadSafe for multi-reader scenarios.

  3. Since .NET currently does not support multiple-instrument callback System.Diagnostics (Observability) work planned for .NET 7 dotnet/runtime#62027 (comment), this PR uses the state machine workaround to keep track of whether the instrument callbacks were called or not within one collection of a reader.

Attached the console output for 2-collection-cycle from #663 to elaborate the flow.

Console output for 1 reader case:

threadId: 11 invoke GetMemoryUsage() without the latest snapshot.
threadId: 11 Refresh()
threadId: 11 retrieve memory usage from snapshot.
threadId: 11 retrieve virtual memory usage from snapshot.
Resource associated with Metric:
    service.name: unknown_service:runtime-instrumentation

Export process.memory.usage, The amount of physical memory in use., Unit: By, Meter: OpenTelemetry.Instrumentation.Process/0.0.0.0
(2022-09-28T22:21:35.3619754Z, 2022-09-28T22:21:45.3608093Z] DoubleGauge
Value: 29290496

Export process.memory.virtual, The amount of committed virtual memory., Unit: By, Meter: OpenTelemetry.Instrumentation.Process/0.0.0.0
(2022-09-28T22:21:35.3643951Z, 2022-09-28T22:21:45.3608338Z] DoubleGauge
Value: 14835712
threadId: 11 invoke GetMemoryUsage() without the latest snapshot.
threadId: 11 Refresh()
threadId: 11 retrieve memory usage from snapshot.
threadId: 11 retrieve virtual memory usage from snapshot.

Export process.memory.usage, The amount of physical memory in use., Unit: By, Meter: OpenTelemetry.Instrumentation.Process/0.0.0.0
(2022-09-28T22:21:35.3619754Z, 2022-09-28T22:21:55.3418369Z] DoubleGauge
Value: 29736960

Export process.memory.virtual, The amount of committed virtual memory., Unit: By, Meter: OpenTelemetry.Instrumentation.Process/0.0.0.0
(2022-09-28T22:21:35.3643951Z, 2022-09-28T22:21:55.3418375Z] DoubleGauge
Value: 14942208

Console output for 2 readers (I preprocessed the output with the correct time sequence from console for eye-friendly reading experience)

image

@Yun-Ting Yun-Ting requested a review from a team September 28, 2022 00:53
@codecov
Copy link

codecov bot commented Sep 28, 2022

Codecov Report

Merging #664 (ec9a430) into main (46bad33) will increase coverage by 0.06%.
The diff coverage is 100.00%.

❗ Current head ec9a430 differs from pull request most recent head 6a90e75. Consider uploading reports for the commit 6a90e75 to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #664      +/-   ##
==========================================
+ Coverage   77.59%   77.65%   +0.06%     
==========================================
  Files         175      175              
  Lines        5222     5237      +15     
==========================================
+ Hits         4052     4067      +15     
  Misses       1170     1170              
Impacted Files Coverage Δ
...entation.Process/MeterProviderBuilderExtensions.cs 100.00% <100.00%> (ø)
...elemetry.Instrumentation.Process/ProcessMetrics.cs 100.00% <100.00%> (ø)

CurrentProcess.Refresh();
return CurrentProcess.WorkingSet64;
},
() => CurrentThreadInstrumentsValues.Value.GetMemoryUsage(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work with multiple MeterProvider setup?

In the scenario where you have multiple meterProviders and all of them add Process instrumentation, this would create the same set of instruments for each MeterProvider since they are all using the same Meter instance.

Copy link
Contributor Author

@Yun-Ting Yun-Ting Sep 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure for what intent would the user want to do .AddProcessInstrumentation() for multiple meterProviders to a given application/process. Everything in ProcessMetrics should ideally be static in the scope of a Process.

Could you share with me the use case you have in mind that ProcessMetrics was being added to more than one meterProvider?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One use case could be where the user wants to send data from different meters to different exporters along with the process metrics.

Copy link
Contributor Author

@Yun-Ting Yun-Ting Sep 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there will not be a problem if we mark the ctor of ProcessMetrics as static.
I still don't quite get why would the user want to monitor a static object via different exporters?
One case I could think of now is role-based access control - authorized users of the application can see a certain field like for example process.disk.io while non-authorized users cannot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use instance members for meter and InstrumentValues with this commit 13434f5.

One caveat with this implementation is in Snapshot(), threadA calls Process.Refresh(), and before threadA does the assignment to this.memoryUsage, threadB calls Process.Refresh(). The odds are small as assignments should be very fast.
Even if we hit this situation, it should not be a big deal because the time difference between these 2 refreshes is tiny.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One caveat with this implementation is in Snapshot(), threadA calls Process.Refresh(), and before threadA does the assignment to this.memoryUsage, threadB calls Process.Refresh(). The odds are small as assignments should be very fast. Even if we hit this situation, it should not be a big deal because the time difference between these 2 refreshes is tiny.

When would this happen? Is there a concrete example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

The case I have in mind is:
At t3, the value got assigned to this.virtualMemoryUsage for threadA is from this.currentProcessRefresh() from t2 by threadB;
because GetCurrentProcess() is a static method that returns "A new Process component associated with the process resource that is running the calling application."

Even though we maintain different copies for currentProcess here:

private readonly Diagnostics.Process currentProcess = Diagnostics.Process.GetCurrentProcess();

Those are referencing the same object.

But again, I don't think this is a case that could be hit frequently. And even it did occur, the time difference between the 2 snapshots are negligible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Snapshot is an instance method, under which condition would the same instance see concurrent calls?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah.... Snapshot() will not be accessed concurrently because it is an instance method.
What I was worried about is that whether those 2 instances are refreshing on the same object.

After further investigation, that is not the case:
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work with multiple MeterProvider setup?

In the scenario where you have multiple meterProviders and all of them add Process instrumentation, this would create the same set of instruments for each MeterProvider since they are all using the same Meter instance.

Please take a look:
https://github.com/open-telemetry/opentelemetry-dotnet-contrib/pull/664/files#r989475625

@utpilla utpilla added the comp:instrumentation.process Things related to OpenTelemetry.Instrumentation.Process label Sep 29, 2022
{
internal static readonly AssemblyName AssemblyName = typeof(ProcessMetrics).Assembly.GetName();
internal static readonly Meter MeterInstance = new(AssemblyName.Name, AssemblyName.Version.ToString());
private static readonly Diagnostics.Process CurrentProcess = Diagnostics.Process.GetCurrentProcess();
internal readonly Meter MeterInstance = new(AssemblyName.Name, AssemblyName.Version.ToString());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering, if creating 2 instances of ProcessMetrics class, do we expect every instance create its own meter with the same name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that in the SDK, MetricStreamIdentity is maintained, so there will be only one MetricStreamIdentity in this case.

What would happen is both of the callbacks from the instruments would be firing, but there will only be one export.
The second one will hit this https://github.com/open-telemetry/opentelemetry-dotnet/blob/7f71283c0bf2ee0e2c5e3c9617295ddaf20c7a0c/src/OpenTelemetry/Metrics/MetricReaderExt.cs#L51

Please refer to the testcases for the experiments I've done.
Thank you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there will be only one MetricStreamIdentity in this case.

I doubt that. MetricStreamIdentity is maintained at a MetricReader level within the SDK. In the scenario, where we have two instances of ProcessMetrics class due to multiple MeterProviders we would also have multiple readers each maintaining their own set of MetricStreamIdentity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I think each reader has its own storage with the current implementation - usage of instance for ProcessMetrics;
so it may not be required to maintain MetricStreamIdentity at the reader level.

Copy link
Member

@CodeBlanch CodeBlanch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

var metricA = exportedItemsA.FirstOrDefault(i => i.Name == "myGaugeName");
var metricB = exportedItemsB.FirstOrDefault(i => i.Name == "myGaugeName");

Assert.Equal(GetValue(metricA), GetValue(metricB));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect them to be equal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, for counters as the MetricStreamIdentity will be the same if the meter name matches, the value will be equal.

While, for gauges, my understanding is that since they do a spontaneous measurement at the time they were being collected, so it is hard to use InMemoryExporter to create a unit test for that.

But I checked that the running values in the aggregationStore match:
image

((OpenTelemetry.Metrics.Metric)(new System.Linq.SystemCore_EnumerableDebugView<System.Diagnostics.Metrics.ListenerSubscription>((new System.Collections.Generic.ICollectionDebugView<System.Diagnostics.Metrics.Instrument>(((OpenTelemetry.Instrumentation.Process.ProcessMetrics)(new System.Collections.Generic.ICollectionDebugView<object>(((OpenTelemetry.Metrics.MeterProviderSdk)meterProviderA).Instrumentations).Items[0])).MeterInstance._instruments).Items[0])._subscriptions).Items[0]).State).aggStore.metricPoints[0].runningValue.AsDouble



((OpenTelemetry.Metrics.Metric)(new System.Linq.SystemCore_EnumerableDebugView<System.Diagnostics.Metrics.ListenerSubscription>((new System.Collections.Generic.ICollectionDebugView<System.Diagnostics.Metrics.Instrument>(((OpenTelemetry.Instrumentation.Process.ProcessMetrics)(new System.Collections.Generic.ICollectionDebugView<object>(((OpenTelemetry.Metrics.MeterProviderSdk)meterProviderB).Instrumentations).Items[0])).MeterInstance._instruments).Items[0])._subscriptions).Items[0]).State).aggStore.metricPoints[0].runningValue.AsDouble

Copy link
Contributor

@utpilla utpilla Oct 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding MetricStreamIdentity: https://github.com/open-telemetry/opentelemetry-dotnet-contrib/pull/664/files#r991558434

You have two ObservableCounters here: one always returns 1 and the other always returns 2 yet this line: Assert.Equal(GetValue(metricA), GetValue(metricB)); succeeds. This means that only one of the ObservableCounters callbacks is being called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... what I see is that both callbacks will be triggered for the meterProviders, but the second callback will hit this:
image

test code here:
https://github.com/open-telemetry/opentelemetry-dotnet/compare/main...Yun-Ting:yunl/instrumentNames?expand=1

Copy link
Contributor

@utpilla utpilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall.

The case of verifying correctness for multiple MeterProviders could be investigated further. We don't have to block this PR for that.

@utpilla
Copy link
Contributor

utpilla commented Oct 10, 2022

Merging this PR to unblock progress. Still need to verify the correctness in case of multiple MeterProviders.

@utpilla utpilla merged commit b9e3d88 into open-telemetry:main Oct 10, 2022
@Yun-Ting Yun-Ting deleted the yunl/threadSafeRefresh branch October 10, 2022 21:42
},
unit: "By",
description: "The amount of committed virtual memory.");
description: "The amount of virtual memory allocated for this process that cannot be shared with other processes.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting "process.memory.virtual" to be Process.VirtualMemorySize64. I saw you linked to #673 which had a very nice explanation for why you made the change. I think you were attempting to be very thorough and precise in making this choice so I want to offer something equally detailed to give an alternate perspective on the problem.

When we get into details, I think 'committed memory' is unfortunately an ambiguous term and the best way to understand what the OTel spec writers intended will be for us to ask them and clarify the spec. In Mark Russinovich's blog post he repeatedly refers to memory that "counts against the commit limit", but he never actually says "Memory is 'committed memory' if and only if it counts against the commit limit". I think that would be an intuitive definition in context of his article and probably it is the definition he intended the reader to apply despite never stating it pedantically. For simplicity we can just call that his intended definition. His definition is probably technically precise for how some portion of the Windows Kernel uses the term, but I suspect it is not how most people understand and use it.

IMO the common definition of committed memory that most developers would use is that it is all the memory addresses a process can read from without triggering an access violation. This includes (at least) three categories of memory:

  • memory backed solely by RAM
  • memory backed by a swapfile/pagefile
  • memory backed by some other mapped file on disk

The first two agree with Mark's definition above so the difference is entirely whether the third bullet is included. To justify that it is common for most people to include file backed memory as part of 'committed memory' this article has a screenshot of the tool VMMap at the top. I consider VMMap to be the de-facto standard for investigating and categorizing virtual memory usage (incidentally its also a tool MarkR made). Looking at the table in that image is a column called 'Committed' and the first two rows are 'Image' and 'Mapped File'. If committed virtual memory did not include memory backed by files on disk then we'd expect those columns to be 0, but they aren't.

As a second example, this link is a help page for the !address command in windbg which shows information about the virtual address space. There is an example output part way down the page showing:

0:000> !address 75831234
Usage:                  Image
Base Address:           75831000
End Address:            758f6000
Region Size:            000c5000
Type:                   01000000MEM_IMAGE
State:                  00001000MEM_COMMIT
Protect:                00000020PAGE_EXECUTE_READ
More info:              lmv m kernel32
More info:              !lmi kernel32
More info:              ln 0x75831234

The debugger says that this address is part of the kernel32 image and it is on a memory page that is marked both as being MEM_IMAGE and MEM_COMMIT. Mark's blog post says specifically that "The virtual memory in Testlimit's address space where its executable and system DLL images are mapped therefore don't count toward the commit limit" and yet windbg describes this exact type of memory as being MEM_COMMIT.

I think the way VMMap and Windbg refer to committed memory is the way most developers would use the term. That definition corresponds to using Process.VirtualMemorySize64 as the metric value. Of course we can always ask the OTel spec writers if that is the definition they intended. Hope that helps!

"process.memory.usage",
() =>
{
CurrentProcess.Refresh();
return CurrentProcess.WorkingSet64;
if (!this.memoryUsage.HasValue)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these callbacks to be thread-safe they need to be inside a lock. For example:

() =>
{
    lock(this)
    {
        if (!this.memoryUsage.HasValue)
        {
            this.Snapshot();
        }
        var value = this.memoryUsage.Value;
        this.memoryUsage = null;
        return value;
    }
}

/// </summary>
/// <param name="options">The options to define the metrics.</param>
public ProcessMetrics(ProcessInstrumentationOptions options)
private void Snapshot()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you seen any practical examples where the lack of a snapshot caused trouble? For comparison, the CPU usage, GC memory usage, and total VM usage EventCounters have been present in .NET Core since .NET Core 3.1, there is nothing that forces them to be read as a single snapshot, and I am not aware of any user feedback saying that it has ever caused someone trouble.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressing the comment here:
#718

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:instrumentation.process Things related to OpenTelemetry.Instrumentation.Process
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants