-
Notifications
You must be signed in to change notification settings - Fork 293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Instrumentation.Process] Implement threadSafe Refresh() #664
[Instrumentation.Process] Implement threadSafe Refresh() #664
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #664 +/- ##
==========================================
+ Coverage 77.59% 77.65% +0.06%
==========================================
Files 175 175
Lines 5222 5237 +15
==========================================
+ Hits 4052 4067 +15
Misses 1170 1170
|
CurrentProcess.Refresh(); | ||
return CurrentProcess.WorkingSet64; | ||
}, | ||
() => CurrentThreadInstrumentsValues.Value.GetMemoryUsage(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this work with multiple MeterProvider
setup?
In the scenario where you have multiple meterProviders and all of them add Process instrumentation, this would create the same set of instruments for each MeterProvider
since they are all using the same Meter
instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure for what intent would the user want to do .AddProcessInstrumentation()
for multiple meterProviders to a given application/process. Everything in ProcessMetrics should ideally be static in the scope of a Process.
Could you share with me the use case you have in mind that ProcessMetrics was being added to more than one meterProvider?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One use case could be where the user wants to send data from different meters to different exporters along with the process metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there will not be a problem if we mark the ctor of ProcessMetrics as static.
I still don't quite get why would the user want to monitor a static object via different exporters?
One case I could think of now is role-based access control - authorized users of the application can see a certain field like for example process.disk.io while non-authorized users cannot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to use instance members for meter and InstrumentValues with this commit 13434f5.
One caveat with this implementation is in Snapshot()
, threadA calls Process.Refresh(), and before threadA does the assignment to this.memoryUsage, threadB calls Process.Refresh(). The odds are small as assignments should be very fast.
Even if we hit this situation, it should not be a big deal because the time difference between these 2 refreshes is tiny.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One caveat with this implementation is in
Snapshot()
, threadA calls Process.Refresh(), and before threadA does the assignment to this.memoryUsage, threadB calls Process.Refresh(). The odds are small as assignments should be very fast. Even if we hit this situation, it should not be a big deal because the time difference between these 2 refreshes is tiny.
When would this happen? Is there a concrete example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The case I have in mind is:
At t3, the value got assigned to this.virtualMemoryUsage
for threadA is from this.currentProcessRefresh() from t2 by threadB;
because GetCurrentProcess() is a static method that returns "A new Process component associated with the process resource that is running the calling application."
Even though we maintain different copies for currentProcess here:
opentelemetry-dotnet-contrib/src/OpenTelemetry.Instrumentation.Process/ProcessMetrics.cs
Line 51 in 13434f5
private readonly Diagnostics.Process currentProcess = Diagnostics.Process.GetCurrentProcess(); |
Those are referencing the same object.
But again, I don't think this is a case that could be hit frequently. And even it did occur, the time difference between the 2 snapshots are negligible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Snapshot
is an instance method, under which condition would the same instance see concurrent calls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this work with multiple
MeterProvider
setup?In the scenario where you have multiple meterProviders and all of them add Process instrumentation, this would create the same set of instruments for each
MeterProvider
since they are all using the sameMeter
instance.
Please take a look:
https://github.com/open-telemetry/opentelemetry-dotnet-contrib/pull/664/files#r989475625
{ | ||
internal static readonly AssemblyName AssemblyName = typeof(ProcessMetrics).Assembly.GetName(); | ||
internal static readonly Meter MeterInstance = new(AssemblyName.Name, AssemblyName.Version.ToString()); | ||
private static readonly Diagnostics.Process CurrentProcess = Diagnostics.Process.GetCurrentProcess(); | ||
internal readonly Meter MeterInstance = new(AssemblyName.Name, AssemblyName.Version.ToString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering, if creating 2 instances of ProcessMetrics
class, do we expect every instance create its own meter with the same name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that in the SDK, MetricStreamIdentity is maintained, so there will be only one MetricStreamIdentity
in this case.
What would happen is both of the callbacks from the instruments would be firing, but there will only be one export.
The second one will hit this https://github.com/open-telemetry/opentelemetry-dotnet/blob/7f71283c0bf2ee0e2c5e3c9617295ddaf20c7a0c/src/OpenTelemetry/Metrics/MetricReaderExt.cs#L51
Please refer to the testcases for the experiments I've done.
Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there will be only one MetricStreamIdentity in this case.
I doubt that. MetricStreamIdentity
is maintained at a MetricReader
level within the SDK. In the scenario, where we have two instances of ProcessMetrics
class due to multiple MeterProviders we would also have multiple readers each maintaining their own set of MetricStreamIdentity
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... I think each reader has its own storage with the current implementation - usage of instance for ProcessMetrics
;
so it may not be required to maintain MetricStreamIdentity at the reader level.
…opentelemetry-dotnet-contrib into yunl/threadSafeRefresh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
test/OpenTelemetry.Instrumentation.Process.Tests/ProcessMetricsTests.cs
Outdated
Show resolved
Hide resolved
var metricA = exportedItemsA.FirstOrDefault(i => i.Name == "myGaugeName"); | ||
var metricB = exportedItemsB.FirstOrDefault(i => i.Name == "myGaugeName"); | ||
|
||
Assert.Equal(GetValue(metricA), GetValue(metricB)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we expect them to be equal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, for counters as the MetricStreamIdentity will be the same if the meter name matches, the value will be equal.
While, for gauges, my understanding is that since they do a spontaneous measurement at the time they were being collected, so it is hard to use InMemoryExporter to create a unit test for that.
But I checked that the running values in the aggregationStore match:
((OpenTelemetry.Metrics.Metric)(new System.Linq.SystemCore_EnumerableDebugView<System.Diagnostics.Metrics.ListenerSubscription>((new System.Collections.Generic.ICollectionDebugView<System.Diagnostics.Metrics.Instrument>(((OpenTelemetry.Instrumentation.Process.ProcessMetrics)(new System.Collections.Generic.ICollectionDebugView<object>(((OpenTelemetry.Metrics.MeterProviderSdk)meterProviderA).Instrumentations).Items[0])).MeterInstance._instruments).Items[0])._subscriptions).Items[0]).State).aggStore.metricPoints[0].runningValue.AsDouble
((OpenTelemetry.Metrics.Metric)(new System.Linq.SystemCore_EnumerableDebugView<System.Diagnostics.Metrics.ListenerSubscription>((new System.Collections.Generic.ICollectionDebugView<System.Diagnostics.Metrics.Instrument>(((OpenTelemetry.Instrumentation.Process.ProcessMetrics)(new System.Collections.Generic.ICollectionDebugView<object>(((OpenTelemetry.Metrics.MeterProviderSdk)meterProviderB).Instrumentations).Items[0])).MeterInstance._instruments).Items[0])._subscriptions).Items[0]).State).aggStore.metricPoints[0].runningValue.AsDouble
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding MetricStreamIdentity
: https://github.com/open-telemetry/opentelemetry-dotnet-contrib/pull/664/files#r991558434
You have two ObservableCounters here: one always returns 1
and the other always returns 2
yet this line: Assert.Equal(GetValue(metricA), GetValue(metricB));
succeeds. This means that only one of the ObservableCounters callbacks is being called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... what I see is that both callbacks will be triggered for the meterProviders, but the second callback will hit this:
test code here:
https://github.com/open-telemetry/opentelemetry-dotnet/compare/main...Yun-Ting:yunl/instrumentNames?expand=1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall.
The case of verifying correctness for multiple MeterProviders could be investigated further. We don't have to block this PR for that.
Merging this PR to unblock progress. Still need to verify the correctness in case of multiple MeterProviders. |
}, | ||
unit: "By", | ||
description: "The amount of committed virtual memory."); | ||
description: "The amount of virtual memory allocated for this process that cannot be shared with other processes."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was expecting "process.memory.virtual" to be Process.VirtualMemorySize64. I saw you linked to #673 which had a very nice explanation for why you made the change. I think you were attempting to be very thorough and precise in making this choice so I want to offer something equally detailed to give an alternate perspective on the problem.
When we get into details, I think 'committed memory' is unfortunately an ambiguous term and the best way to understand what the OTel spec writers intended will be for us to ask them and clarify the spec. In Mark Russinovich's blog post he repeatedly refers to memory that "counts against the commit limit", but he never actually says "Memory is 'committed memory' if and only if it counts against the commit limit". I think that would be an intuitive definition in context of his article and probably it is the definition he intended the reader to apply despite never stating it pedantically. For simplicity we can just call that his intended definition. His definition is probably technically precise for how some portion of the Windows Kernel uses the term, but I suspect it is not how most people understand and use it.
IMO the common definition of committed memory that most developers would use is that it is all the memory addresses a process can read from without triggering an access violation. This includes (at least) three categories of memory:
- memory backed solely by RAM
- memory backed by a swapfile/pagefile
- memory backed by some other mapped file on disk
The first two agree with Mark's definition above so the difference is entirely whether the third bullet is included. To justify that it is common for most people to include file backed memory as part of 'committed memory' this article has a screenshot of the tool VMMap at the top. I consider VMMap to be the de-facto standard for investigating and categorizing virtual memory usage (incidentally its also a tool MarkR made). Looking at the table in that image is a column called 'Committed' and the first two rows are 'Image' and 'Mapped File'. If committed virtual memory did not include memory backed by files on disk then we'd expect those columns to be 0, but they aren't.
As a second example, this link is a help page for the !address command in windbg which shows information about the virtual address space. There is an example output part way down the page showing:
0:000> !address 75831234
Usage: Image
Base Address: 75831000
End Address: 758f6000
Region Size: 000c5000
Type: 01000000MEM_IMAGE
State: 00001000MEM_COMMIT
Protect: 00000020PAGE_EXECUTE_READ
More info: lmv m kernel32
More info: !lmi kernel32
More info: ln 0x75831234
The debugger says that this address is part of the kernel32 image and it is on a memory page that is marked both as being MEM_IMAGE and MEM_COMMIT. Mark's blog post says specifically that "The virtual memory in Testlimit's address space where its executable and system DLL images are mapped therefore don't count toward the commit limit" and yet windbg describes this exact type of memory as being MEM_COMMIT.
I think the way VMMap and Windbg refer to committed memory is the way most developers would use the term. That definition corresponds to using Process.VirtualMemorySize64 as the metric value. Of course we can always ask the OTel spec writers if that is the definition they intended. Hope that helps!
"process.memory.usage", | ||
() => | ||
{ | ||
CurrentProcess.Refresh(); | ||
return CurrentProcess.WorkingSet64; | ||
if (!this.memoryUsage.HasValue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these callbacks to be thread-safe they need to be inside a lock. For example:
() =>
{
lock(this)
{
if (!this.memoryUsage.HasValue)
{
this.Snapshot();
}
var value = this.memoryUsage.Value;
this.memoryUsage = null;
return value;
}
}
/// </summary> | ||
/// <param name="options">The options to define the metrics.</param> | ||
public ProcessMetrics(ProcessInstrumentationOptions options) | ||
private void Snapshot() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you seen any practical examples where the lack of a snapshot caused trouble? For comparison, the CPU usage, GC memory usage, and total VM usage EventCounters have been present in .NET Core since .NET Core 3.1, there is nothing that forces them to be read as a single snapshot, and I am not aware of any user feedback saying that it has ever caused someone trouble.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressing the comment here:
#718
Changes
Working towards: #447
Background: #335
Following OTel spec.
Assumptions:
For callbacks of the instruments from the same collection of an exporter, the instrument values should come from the same snapshot of the current process. In this PR, this was achieved by calling Process.Refresh() exactly once for each collection and use the idea of a state machine to check whether the snapshot is expired or not.
Without this guarantee, there will be issues with the following cases:
If users are trying to use the data from the dashboard to debug OOM of their application, it will be misleading if physical memory usage and virtual memory usage were not taken from a single snapshot.
(Also, not in the scope of this PR yet, when calculating CPU utilization:
Process.GetTotalProcessorTime() / elapsedTimeSinceTheProcessStart * (number of processors)
Process.GetTotalProcessorTime() should be coming from the same snapshot for accuracy.
ProcessMetrics should be threadSafe for multi-reader scenarios.
Since .NET currently does not support multiple-instrument callback System.Diagnostics (Observability) work planned for .NET 7 dotnet/runtime#62027 (comment), this PR uses the state machine workaround to keep track of whether the instrument callbacks were called or not within one collection of a reader.
Attached the console output for 2-collection-cycle from #663 to elaborate the flow.
Console output for 1 reader case:
Console output for 2 readers (I preprocessed the output with the correct time sequence from console for eye-friendly reading experience)