[Instrumentation.Process] Implement threadSafe Refresh() #664

Yun-Ting · 2022-09-28T00:53:04Z

Changes

Working towards: #447
Background: #335
Following OTel spec.

Assumptions:

For callbacks of the instruments from the same collection of an exporter, the instrument values should come from the same snapshot of the current process. In this PR, this was achieved by calling Process.Refresh() exactly once for each collection and use the idea of a state machine to check whether the snapshot is expired or not.
Without this guarantee, there will be issues with the following cases:
If users are trying to use the data from the dashboard to debug OOM of their application, it will be misleading if physical memory usage and virtual memory usage were not taken from a single snapshot.
(Also, not in the scope of this PR yet, when calculating CPU utilization:
Process.GetTotalProcessorTime() / elapsedTimeSinceTheProcessStart * (number of processors)
Process.GetTotalProcessorTime() should be coming from the same snapshot for accuracy.
ProcessMetrics should be threadSafe for multi-reader scenarios.
Since .NET currently does not support multiple-instrument callback System.Diagnostics (Observability) work planned for .NET 7 dotnet/runtime#62027 (comment), this PR uses the state machine workaround to keep track of whether the instrument callbacks were called or not within one collection of a reader.

Attached the console output for 2-collection-cycle from #663 to elaborate the flow.

Console output for 1 reader case:

threadId: 11 invoke GetMemoryUsage() without the latest snapshot.
threadId: 11 Refresh()
threadId: 11 retrieve memory usage from snapshot.
threadId: 11 retrieve virtual memory usage from snapshot.
Resource associated with Metric:
    service.name: unknown_service:runtime-instrumentation

Export process.memory.usage, The amount of physical memory in use., Unit: By, Meter: OpenTelemetry.Instrumentation.Process/0.0.0.0
(2022-09-28T22:21:35.3619754Z, 2022-09-28T22:21:45.3608093Z] DoubleGauge
Value: 29290496

Export process.memory.virtual, The amount of committed virtual memory., Unit: By, Meter: OpenTelemetry.Instrumentation.Process/0.0.0.0
(2022-09-28T22:21:35.3643951Z, 2022-09-28T22:21:45.3608338Z] DoubleGauge
Value: 14835712
threadId: 11 invoke GetMemoryUsage() without the latest snapshot.
threadId: 11 Refresh()
threadId: 11 retrieve memory usage from snapshot.
threadId: 11 retrieve virtual memory usage from snapshot.

Export process.memory.usage, The amount of physical memory in use., Unit: By, Meter: OpenTelemetry.Instrumentation.Process/0.0.0.0
(2022-09-28T22:21:35.3619754Z, 2022-09-28T22:21:55.3418369Z] DoubleGauge
Value: 29736960

Export process.memory.virtual, The amount of committed virtual memory., Unit: By, Meter: OpenTelemetry.Instrumentation.Process/0.0.0.0
(2022-09-28T22:21:35.3643951Z, 2022-09-28T22:21:55.3418375Z] DoubleGauge
Value: 14942208

Console output for 2 readers (I preprocessed the output with the correct time sequence from console for eye-friendly reading experience)

src/OpenTelemetry.Instrumentation.Process/ProcessMetrics.cs

codecov · 2022-09-28T22:48:31Z

Codecov Report

Merging #664 (ec9a430) into main (46bad33) will increase coverage by 0.06%.
The diff coverage is 100.00%.

❗ Current head ec9a430 differs from pull request most recent head 6a90e75. Consider uploading reports for the commit 6a90e75 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #664      +/-   ##
==========================================
+ Coverage   77.59%   77.65%   +0.06%     
==========================================
  Files         175      175              
  Lines        5222     5237      +15     
==========================================
+ Hits         4052     4067      +15     
  Misses       1170     1170

Impacted Files	Coverage Δ
...entation.Process/MeterProviderBuilderExtensions.cs	`100.00% <100.00%> (ø)`
...elemetry.Instrumentation.Process/ProcessMetrics.cs	`100.00% <100.00%> (ø)`

src/OpenTelemetry.Instrumentation.Process/ProcessMetrics.cs

utpilla · 2022-09-29T01:40:29Z

src/OpenTelemetry.Instrumentation.Process/ProcessMetrics.cs

-                CurrentProcess.Refresh();
-                return CurrentProcess.WorkingSet64;
-            },
+            () => CurrentThreadInstrumentsValues.Value.GetMemoryUsage(),


How does this work with multiple MeterProvider setup?

In the scenario where you have multiple meterProviders and all of them add Process instrumentation, this would create the same set of instruments for each MeterProvider since they are all using the same Meter instance.

I am not sure for what intent would the user want to do .AddProcessInstrumentation() for multiple meterProviders to a given application/process. Everything in ProcessMetrics should ideally be static in the scope of a Process.

Could you share with me the use case you have in mind that ProcessMetrics was being added to more than one meterProvider?

One use case could be where the user wants to send data from different meters to different exporters along with the process metrics.

I think there will not be a problem if we mark the ctor of ProcessMetrics as static.
I still don't quite get why would the user want to monitor a static object via different exporters?
One case I could think of now is role-based access control - authorized users of the application can see a certain field like for example process.disk.io while non-authorized users cannot.

Updated to use instance members for meter and InstrumentValues with this commit 13434f5.

One caveat with this implementation is in Snapshot(), threadA calls Process.Refresh(), and before threadA does the assignment to this.memoryUsage, threadB calls Process.Refresh(). The odds are small as assignments should be very fast.
Even if we hit this situation, it should not be a big deal because the time difference between these 2 refreshes is tiny.

One caveat with this implementation is in Snapshot(), threadA calls Process.Refresh(), and before threadA does the assignment to this.memoryUsage, threadB calls Process.Refresh(). The odds are small as assignments should be very fast. Even if we hit this situation, it should not be a big deal because the time difference between these 2 refreshes is tiny.

When would this happen? Is there a concrete example?

The case I have in mind is:
At t3, the value got assigned to this.virtualMemoryUsage for threadA is from this.currentProcessRefresh() from t2 by threadB;
because GetCurrentProcess() is a static method that returns "A new Process component associated with the process resource that is running the calling application."

Even though we maintain different copies for currentProcess here:

opentelemetry-dotnet-contrib/src/OpenTelemetry.Instrumentation.Process/ProcessMetrics.cs

Line 51 in 13434f5

private readonly Diagnostics.Process currentProcess = Diagnostics.Process.GetCurrentProcess();

Those are referencing the same object.

But again, I don't think this is a case that could be hit frequently. And even it did occur, the time difference between the 2 snapshots are negligible.

Snapshot is an instance method, under which condition would the same instance see concurrent calls?

yeah.... Snapshot() will not be accessed concurrently because it is an instance method.
What I was worried about is that whether those 2 instances are refreshing on the same object.

After further investigation, that is not the case:

How does this work with multiple MeterProvider setup?

In the scenario where you have multiple meterProviders and all of them add Process instrumentation, this would create the same set of instruments for each MeterProvider since they are all using the same Meter instance.

Please take a look:
https://github.com/open-telemetry/opentelemetry-dotnet-contrib/pull/664/files#r989475625

tarekgh · 2022-10-06T20:56:10Z

src/OpenTelemetry.Instrumentation.Process/ProcessMetrics.cs

 {
    internal static readonly AssemblyName AssemblyName = typeof(ProcessMetrics).Assembly.GetName();
-    internal static readonly Meter MeterInstance = new(AssemblyName.Name, AssemblyName.Version.ToString());
-    private static readonly Diagnostics.Process CurrentProcess = Diagnostics.Process.GetCurrentProcess();
+    internal readonly Meter MeterInstance = new(AssemblyName.Name, AssemblyName.Version.ToString());


I am wondering, if creating 2 instances of ProcessMetrics class, do we expect every instance create its own meter with the same name?

My understanding is that in the SDK, MetricStreamIdentity is maintained, so there will be only one MetricStreamIdentity in this case.

What would happen is both of the callbacks from the instruments would be firing, but there will only be one export.
The second one will hit this https://github.com/open-telemetry/opentelemetry-dotnet/blob/7f71283c0bf2ee0e2c5e3c9617295ddaf20c7a0c/src/OpenTelemetry/Metrics/MetricReaderExt.cs#L51

Please refer to the testcases for the experiments I've done.
Thank you.

there will be only one MetricStreamIdentity in this case.

I doubt that. MetricStreamIdentity is maintained at a MetricReader level within the SDK. In the scenario, where we have two instances of ProcessMetrics class due to multiple MeterProviders we would also have multiple readers each maintaining their own set of MetricStreamIdentity.

Hmm... I think each reader has its own storage with the current implementation - usage of instance for ProcessMetrics;
so it may not be required to maintain MetricStreamIdentity at the reader level.

…opentelemetry-dotnet-contrib into yunl/threadSafeRefresh

CodeBlanch

LGTM

test/OpenTelemetry.Instrumentation.Process.Tests/ProcessMetricsTests.cs

utpilla · 2022-10-10T18:37:13Z

test/OpenTelemetry.Instrumentation.Process.Tests/ProcessMetricsTests.cs

+        var metricA = exportedItemsA.FirstOrDefault(i => i.Name == "myGaugeName");
+        var metricB = exportedItemsB.FirstOrDefault(i => i.Name == "myGaugeName");
+
+        Assert.Equal(GetValue(metricA), GetValue(metricB));


Do we expect them to be equal?

yes, for counters as the MetricStreamIdentity will be the same if the meter name matches, the value will be equal.

While, for gauges, my understanding is that since they do a spontaneous measurement at the time they were being collected, so it is hard to use InMemoryExporter to create a unit test for that.

But I checked that the running values in the aggregationStore match:

((OpenTelemetry.Metrics.Metric)(new System.Linq.SystemCore_EnumerableDebugView<System.Diagnostics.Metrics.ListenerSubscription>((new System.Collections.Generic.ICollectionDebugView<System.Diagnostics.Metrics.Instrument>(((OpenTelemetry.Instrumentation.Process.ProcessMetrics)(new System.Collections.Generic.ICollectionDebugView<object>(((OpenTelemetry.Metrics.MeterProviderSdk)meterProviderA).Instrumentations).Items[0])).MeterInstance._instruments).Items[0])._subscriptions).Items[0]).State).aggStore.metricPoints[0].runningValue.AsDouble ((OpenTelemetry.Metrics.Metric)(new System.Linq.SystemCore_EnumerableDebugView<System.Diagnostics.Metrics.ListenerSubscription>((new System.Collections.Generic.ICollectionDebugView<System.Diagnostics.Metrics.Instrument>(((OpenTelemetry.Instrumentation.Process.ProcessMetrics)(new System.Collections.Generic.ICollectionDebugView<object>(((OpenTelemetry.Metrics.MeterProviderSdk)meterProviderB).Instrumentations).Items[0])).MeterInstance._instruments).Items[0])._subscriptions).Items[0]).State).aggStore.metricPoints[0].runningValue.AsDouble

Regarding MetricStreamIdentity: https://github.com/open-telemetry/opentelemetry-dotnet-contrib/pull/664/files#r991558434

You have two ObservableCounters here: one always returns 1 and the other always returns 2 yet this line: Assert.Equal(GetValue(metricA), GetValue(metricB)); succeeds. This means that only one of the ObservableCounters callbacks is being called.

Hmm... what I see is that both callbacks will be triggered for the meterProviders, but the second callback will hit this:

test code here:
https://github.com/open-telemetry/opentelemetry-dotnet/compare/main...Yun-Ting:yunl/instrumentNames?expand=1

utpilla

LGTM overall.

The case of verifying correctness for multiple MeterProviders could be investigated further. We don't have to block this PR for that.

…opentelemetry-dotnet-contrib into yunl/threadSafeRefresh

utpilla · 2022-10-10T20:58:49Z

Merging this PR to unblock progress. Still need to verify the correctness in case of multiple MeterProviders.

noahfalk · 2022-10-12T12:22:53Z

src/OpenTelemetry.Instrumentation.Process/ProcessMetrics.cs

            },
            unit: "By",
-            description: "The amount of committed virtual memory.");
+            description: "The amount of virtual memory allocated for this process that cannot be shared with other processes.");


I was expecting "process.memory.virtual" to be Process.VirtualMemorySize64. I saw you linked to #673 which had a very nice explanation for why you made the change. I think you were attempting to be very thorough and precise in making this choice so I want to offer something equally detailed to give an alternate perspective on the problem.

When we get into details, I think 'committed memory' is unfortunately an ambiguous term and the best way to understand what the OTel spec writers intended will be for us to ask them and clarify the spec. In Mark Russinovich's blog post he repeatedly refers to memory that "counts against the commit limit", but he never actually says "Memory is 'committed memory' if and only if it counts against the commit limit". I think that would be an intuitive definition in context of his article and probably it is the definition he intended the reader to apply despite never stating it pedantically. For simplicity we can just call that his intended definition. His definition is probably technically precise for how some portion of the Windows Kernel uses the term, but I suspect it is not how most people understand and use it.

IMO the common definition of committed memory that most developers would use is that it is all the memory addresses a process can read from without triggering an access violation. This includes (at least) three categories of memory:

memory backed solely by RAM

memory backed by a swapfile/pagefile

memory backed by some other mapped file on disk

The first two agree with Mark's definition above so the difference is entirely whether the third bullet is included. To justify that it is common for most people to include file backed memory as part of 'committed memory' this article has a screenshot of the tool VMMap at the top. I consider VMMap to be the de-facto standard for investigating and categorizing virtual memory usage (incidentally its also a tool MarkR made). Looking at the table in that image is a column called 'Committed' and the first two rows are 'Image' and 'Mapped File'. If committed virtual memory did not include memory backed by files on disk then we'd expect those columns to be 0, but they aren't.

As a second example, this link is a help page for the !address command in windbg which shows information about the virtual address space. There is an example output part way down the page showing:

0:000> !address 75831234 Usage: Image Base Address: 75831000 End Address: 758f6000 Region Size: 000c5000 Type: 01000000MEM_IMAGE State: 00001000MEM_COMMIT Protect: 00000020PAGE_EXECUTE_READ More info: lmv m kernel32 More info: !lmi kernel32 More info: ln 0x75831234

The debugger says that this address is part of the kernel32 image and it is on a memory page that is marked both as being MEM_IMAGE and MEM_COMMIT. Mark's blog post says specifically that "The virtual memory in Testlimit's address space where its executable and system DLL images are mapped therefore don't count toward the commit limit" and yet windbg describes this exact type of memory as being MEM_COMMIT.

I think the way VMMap and Windbg refer to committed memory is the way most developers would use the term. That definition corresponds to using Process.VirtualMemorySize64 as the metric value. Of course we can always ask the OTel spec writers if that is the definition they intended. Hope that helps!

noahfalk · 2022-10-12T12:30:34Z

src/OpenTelemetry.Instrumentation.Process/ProcessMetrics.cs

            "process.memory.usage",
            () =>
            {
-                CurrentProcess.Refresh();
-                return CurrentProcess.WorkingSet64;
+                if (!this.memoryUsage.HasValue)


For these callbacks to be thread-safe they need to be inside a lock. For example:

() => { lock(this) { if (!this.memoryUsage.HasValue) { this.Snapshot(); } var value = this.memoryUsage.Value; this.memoryUsage = null; return value; } }

noahfalk · 2022-10-12T12:37:58Z

src/OpenTelemetry.Instrumentation.Process/ProcessMetrics.cs

-    /// </summary>
-    /// <param name="options">The options to define the metrics.</param>
-    public ProcessMetrics(ProcessInstrumentationOptions options)
+    private void Snapshot()


Have you seen any practical examples where the lack of a snapshot caused trouble? For comparison, the CPU usage, GC memory usage, and total VM usage EventCounters have been present in .NET Core since .NET Core 3.1, there is nothing that forces them to be read as a single snapshot, and I am not aware of any user feedback saying that it has ever caused someone trouble.

Addressing the comment here:
#718

initial commit

d451059

Yun-Ting requested a review from a team September 28, 2022 00:53

github-actions bot assigned Yun-Ting Sep 28, 2022

Yun-Ting commented Sep 28, 2022

View reviewed changes