Low overhead heap profiling #49424
Tagging subscribers to this area: @dotnet/gc
Hi @discostu105, thanks for reaching out to start the discussion. GC profiling over ICorProfiler has historically had a very high performance impact, and we have done some work over the last couple of releases to improve it. I'm happy to work with you to identify ways we can iterate even more. Just to make sure you're aware of the recent work, I'll mention it here: we first added the concept of "lightweight GC profiling", which enables just GC start and end notifications and updates of generational bounds: dotnet/coreclr#22866. Then we added "medium weight GC profiling", providing some APIs to track objects more efficiently: dotnet/coreclr#24156. And, as you mention, in 5.0 we added the ability to get EventPipe events over ICorProfiler. For the specific issues you point out, I have a couple of follow-up questions.
Have you tried getting GC start events with lightweight GC profiling enabled? Hopefully that adds very little overhead. There is also the option of using only the EventPipe events and doing GC profiling the same way the dotnet team's tools (e.g. PerfView) already do. I am not an expert in them, but I can help wade through the details if you want to go that route.
Can you give specific numbers for what the overhead difference is when collecting the same type of data?
Hi @davmason, thank you for the link to the lightweight GC profiling. It is indeed possible to obtain the size of array objects in the GC-started callback with lightweight GC profiling enabled. However, array sizes can only be obtained at collection time, not at allocation time. There are situations where few garbage collections occur; in that case we must wait for the next garbage collection before we can obtain the size of allocated arrays. This makes it more complex to report the amount of allocated memory in a certain timeframe, since we may not know the sizes of allocated arrays at the end of the reporting timeframe if there was no GC run. So, while it's not ideal, it certainly already helps us to get array sizes most of the time, especially in times of high GC activity, which are the interesting situations anyway. In an ideal world, though, it would be preferable if we could obtain the object size of the last allocated object directly in the `AllocationTick` callback. Concerning overhead, there are two main differences between Java and .NET that can influence it.
Our Java solution generally adds very low overhead (below 1%). Compared to that, just enabling the necessary event pipe events described in the original post has a performance overhead of ~1-20%, depending on the number of allocations, the amount of allocated memory, and the number of GC runs.
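The bookkeeping d-schneider describes (record sampled allocations now, resolve array sizes at the next GC) can be sketched in a few lines. This is an illustrative sketch only: `DeferredSizes`, `onSampledAllocation`, and `onGcStarted` are hypothetical names, and Java stands in for the native profiler code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch (hypothetical names): sampled allocations whose size is unknown at
// allocation time are parked until the next GC, where a GetObjectSize-style
// lookup succeeds and the size is recorded.
public class DeferredSizes {
    private final List<Long> pending = new ArrayList<>();   // object ids awaiting a size
    private final Map<Long, Long> sizes = new HashMap<>();  // resolved id -> size in bytes

    // Called from the allocation-sampling callback; size not yet available.
    void onSampledAllocation(long objectId) {
        pending.add(objectId);
    }

    // Called from the GC-started callback, where sizes can be queried reliably.
    void onGcStarted(Map<Long, Long> heapSizes) {
        for (Long id : pending) {
            Long size = heapSizes.get(id);
            if (size != null) sizes.put(id, size);          // resolve now
        }
        pending.clear();
    }

    long reportedBytes() {
        return sizes.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        DeferredSizes d = new DeferredSizes();
        d.onSampledAllocation(1L);
        d.onSampledAllocation(2L);
        System.out.println("before GC: " + d.reportedBytes()); // prints 0, sizes unknown
        Map<Long, Long> heap = new HashMap<>();
        heap.put(1L, 4096L);
        heap.put(2L, 1024L);
        d.onGcStarted(heap);                                   // next GC resolves them
        System.out.println("after GC: " + d.reportedBytes());  // prints 5120
    }
}
```

The same shape would apply to the real callbacks: the sampling callback parks the `ObjectId`, and the GC-started callback resolves sizes, which is exactly why a reporting window that contains no GC run ends with unresolved entries.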
@d-schneider I'd definitely be interested to see how we can make this better for you. re your questions:
this can definitely be a config instead of hard coded 100k.
if you meant could we give you the size of the object that happened to trigger the AllocTick event, that's totally doable - GC has this info when an allocation triggered a GC.
can you please tell me a bit about your usage of this callback? I presume you are doing this all in native code, just like with .NET. do you normally register for this callback with say a few user specified objects? I can see how the overhead would totally go up if there were many objects that registered for this callback.
From a native ICorProfiler implementation point of view, the issue is that the event is fired before the MethodTable/array size is set on the object, so you can't call GetArrayObjectInfo on it from the AllocationTick event callback (since EventPipe events are delivered synchronously to ICorProfiler). Either including the size in the event or moving the event so it is fired after the object is published should work for this case.
Thanks for the replies!
It would be great for us if the sampling rate for the AllocationTick event were configurable at runtime. As mentioned, this would help us to dynamically reduce overhead or increase accuracy in situations with few allocations.
Yes, I meant the size of the object that triggered the AllocationTick event.
Both solutions would be ok for us.
Our Java implementation is also done in native code. We register each object that triggered a SampledObjectAlloc callback for the ObjectFree callback. This way, the number of ObjectFree callbacks sent is at most as high as the number of SampledObjectAlloc callbacks.
ahh, my question was if you just wanted the size, or if you needed it to be an object that's already constructed. if it's just the size that's trivial 'cause GC already knows the size. but if you need this to be a constructed object (eg, you can call some method on this object), that would require the event to be moved as @davmason mentioned - the place where it's fired now is in GC before the methodtable is filled in.

regarding the ObjectFree callback, you could implement this via GC handles. you can allocate a weak GC handle to hold onto objects of interest and during the GC done callback check if they are nulled by the GC, if so you know they are dead. obviously this requires you to be able to allocate a GC handle in your code. so if you currently already have some way to do that (ie, you already have managed code running and can pass a delegate back to native code to reverse pinvoke to create a GC handle to hold onto these objects), that's great; if not, it'd be some work to get this managed code infra running first. it's possible to make the profiling API provide this plumbing for you (but that'd be work on the diagnostics team :)).
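The weak-handle scheme described above maps closely onto Java's `WeakReference`, which allows a minimal runnable sketch of the mechanism; in .NET the handle would instead be a weak GC handle created from managed code, and the check would run in the GC-finished callback rather than a polling loop. `WeakLifetime` and `collected` are made-up names for illustration.

```java
import java.lang.ref.WeakReference;

// Sketch (hypothetical names): track a sampled object's lifetime with a weak
// reference, then check after a collection whether the reference was cleared,
// meaning the object is dead.
public class WeakLifetime {
    static boolean collected(WeakReference<?> handle) {
        return handle.get() == null;   // cleared by the GC once the object dies
    }

    public static void main(String[] args) throws InterruptedException {
        Object sampled = new byte[1024];                    // stands in for a sampled allocation
        WeakReference<Object> handle = new WeakReference<>(sampled);
        System.out.println("alive=" + !collected(handle)); // alive=true

        sampled = null;                                     // drop the only strong reference
        for (int i = 0; i < 20 && !collected(handle); i++) {
            System.gc();                                    // request (not force) a collection
            Thread.sleep(10);
        }
        // On most JVMs the reference is cleared by now; a real profiler would
        // perform this check in its GC-finished callback instead of polling.
        System.out.println("deadAfterGc=" + collected(handle));
    }
}
```

The .NET analogue of `collected` is "the weak GC handle now resolves to null", checked once per GC-finished callback, so the cost scales with the number of tracked objects rather than with allocation volume.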
@Maoni0 are the first two items (reporting size in the alloc event and configurable alloc tick frequency) things the GC team would take on? I'm happy to work with @d-schneider on how to best achieve the object tracking from the profiler.
@davmason yes, I don't feel like I have a confirmation whether this event would need to be moved though (it'd be really great to avoid it 'cause it means the code has to move from the GC side to the VM side).
@Maoni0 Just the size is sufficient for our use case. Concerning the AllocationTick sampling rate: The Java SampledObjectAlloc callback also uses a random variation for the sampling frequency, as described in JEP-331. If possible, a similar feature for the AllocationTick event would also be interesting for us. Description for this from https://openjdk.java.net/jeps/331: "Note that the sampling interval is not precise. Each time a sample occurs, the number of bytes before the next sample will be chosen will be pseudo-random with the given average interval. This is to avoid sampling bias; for example, if the same allocations happen every 512KB, a 512KB sampling interval will always sample the same allocations. Therefore, though the sampling interval will not always be the selected interval, after a large number of samples, it will tend towards it."
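For context, the behavior JEP-331 describes is commonly implemented by drawing each next byte threshold from an exponential distribution whose mean is the configured interval, so the long-run sampling rate still converges to one sample per interval bytes. The sketch below is an assumption about one reasonable implementation, not the actual HotSpot or CLR code; `RandomizedSampler` and its methods are hypothetical names.

```java
import java.util.Random;

// Sketch (hypothetical names) of byte-based allocation sampling with a
// randomized threshold: each time a sample fires, the next threshold is drawn
// from an exponential distribution whose mean is the configured interval, so
// the long-run rate tends toward one sample per `meanInterval` bytes.
public class RandomizedSampler {
    private final long meanInterval;
    private final Random rng;
    private long bytesUntilSample;

    RandomizedSampler(long meanIntervalBytes, long seed) {
        this.meanInterval = meanIntervalBytes;
        this.rng = new Random(seed);
        this.bytesUntilSample = nextThreshold();
    }

    private long nextThreshold() {
        // Exponential with mean `meanInterval`: -mean * ln(U), U uniform in (0, 1].
        double u = 1.0 - rng.nextDouble();      // avoid ln(0)
        return Math.max(1, (long) (-meanInterval * Math.log(u)));
    }

    // Returns true if this allocation should be sampled.
    boolean onAllocation(long sizeBytes) {
        bytesUntilSample -= sizeBytes;
        if (bytesUntilSample <= 0) {
            bytesUntilSample = nextThreshold();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        RandomizedSampler s = new RandomizedSampler(512 * 1024, 42);
        long allocated = 0, samples = 0;
        for (int i = 0; i < 1_000_000; i++) {
            allocated += 1024;                  // simulate 1 KB allocations
            if (s.onAllocation(1024)) samples++;
        }
        // Roughly one sample per 512 KB allocated.
        System.out.println("samples=" + samples + " expected~" + allocated / (512 * 1024));
    }
}
```

Because the thresholds are random, two identical allocation patterns are not sampled at identical points, which is exactly the bias the JEP text quoted above is avoiding; with a fixed threshold, the `if` test would simply compare against a constant.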
thanks for the info @d-schneider. have you observed that the random interval is needed often? in theory it sounds like a useful thing but in practice it should be completely rare that "same allocations happen every 512KB" - even if that happened, since we are almost always in a multi-threaded environment this means to the GC it won't see the same allocation every 512kb (ie, one thread could be doing the same alloc every 512kb but since it shares the same heap with another thread, GC won't see that alloc every 512kb on that heap).
@Maoni0 We don't really have data on how significant the bias would be without the random interval, as this is a built-in JVM feature that cannot be disabled.
@discostu105 then I would vote to not include this in our system 'cause I simply don't see it having a practical usage.
@discostu105 or @d-schneider, do you want to talk about the object tracking portion of your request? Maoni has a great idea to use weak references; if you are already doing IL rewriting then it would not be that much work. I'm also happy to discuss adding a new API to ICorProfiler*, but then it would only be available in .NET 6 or 7 and newer, depending on when it lands.
As I understand it, we would have to do a reverse pinvoke in the AllocationTick and GarbageCollectionFinished callbacks to allocate the GC handles and to check if they were nulled by the GC respectively. However, when trying this I ran into problems when calling the delegate:
Is there anything special to consider for the reverse pinvoke in those cases that I might have missed? The reverse pinvoke does work from a native worker thread, but then there could be race conditions, e.g., a GC run between the allocation and the moment the worker thread creates a GC handle. We would have to wait in the AllocationTick callback for the worker thread to finish creating the GC handle, but this is not an optimal solution. I haven't tried this yet, but another question is whether there could be any problems when creating a GC handle for the object that triggered the AllocationTick event, considering that we currently can't get the size of the object in that callback.
We think the native ICorProfiler API would be the better approach, as it would be simpler to consume. Preferably similar to Java, e.g., we can register an object for a callback when it's freed by the GC. Getting this added in a future .NET release would be great! We are happy to answer any questions regarding a possible ICorProfiler API.
Yeah, that makes sense. The GC is still considered in progress during the GarbageCollectionFinished callback, so managed code won't be able to run until you return from it and let the GC complete.
This makes sense, the AllocationTick event is going to be fired in the middle of the allocation, which would be in managed code. So even though your profiler is native code, there is managed code on the stack so it triggers that error.
I hadn't thought through exactly how you would have to accomplish this, but you're right that there are a lot of potential race conditions and deadlocks. I think the only way you could accomplish it right now is how you describe it: you would have to spin up a separate thread that has no managed code on it, pass the object to that thread, and then block in the AllocationTick event callback until the other thread is done allocating a handle to it. If you go that route, you would have to be very careful not to do any allocations and not to call any methods that allocate: since you would be blocking inside an allocation, a GC would be prevented from running, and any allocation can trigger a GC, which would lead to a deadlock.
I don't think there will be any issues with that.
After thinking about this for a while, I think it would make sense to add a general-purpose GC handle API to ICorProfiler - profilers could allocate weak handles to track object lifetime like you want to do, but could also allocate strong handles to keep alive objects they want to keep alive. It wouldn't give you a callback, but it would be more general purpose and provide benefit to more scenarios.
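The blocking handoff discussed a few paragraphs up (the callback hands the object to a dedicated worker thread with no managed frames, then blocks until the handle exists) is essentially a synchronous rendezvous. A rough Java sketch of that pattern, with hypothetical names (`HandleWorker`, `onAllocationTick`), using `SynchronousQueue` for the rendezvous:

```java
import java.util.concurrent.SynchronousQueue;

// Sketch (hypothetical names) of the rendezvous: the allocation callback hands
// the sampled object to a dedicated worker thread and blocks until the worker
// has "created a handle" for it (here, just a counter standing in for a GC
// handle).
public class HandleWorker {
    private final SynchronousQueue<Object> in = new SynchronousQueue<>();
    private final SynchronousQueue<Long> out = new SynchronousQueue<>();
    private long nextHandle = 1;   // touched only by the worker thread

    void start() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    Object obj = in.take();      // wait for a sampled object
                    long handle = nextHandle++;  // stands in for creating a GC handle
                    out.put(handle);             // unblock the waiting callback
                }
            } catch (InterruptedException e) { /* shut down */ }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // Called from the (simulated) AllocationTick callback; blocks until the
    // worker has processed the object, mirroring the scheme discussed above.
    long onAllocationTick(Object obj) throws InterruptedException {
        in.put(obj);
        return out.take();
    }

    public static void main(String[] args) throws InterruptedException {
        HandleWorker w = new HandleWorker();
        w.start();
        System.out.println("handle=" + w.onAllocationTick(new byte[16])); // handle=1
        System.out.println("handle=" + w.onAllocationTick(new byte[16])); // handle=2
    }
}
```

The caveat from the discussion above still applies to the real native version: the callback thread is blocked inside an allocation, so nothing on either thread may perform a managed allocation while the rendezvous is pending.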
@davmason Thanks for the API proposal. The described API would be great for our use case. How high is the expected performance impact of calling the GetObjectFromHandle method multiple times per GC run? Could a variant of GetObjectFromHandle that lets us get multiple objects at once be better from a performance perspective for this use-case?
Unless you are at the point where you are trying to micro-optimize at instruction level, the cost is negligible.
Thanks for the info about the `GetObjectFromHandle` overhead.
Thanks for the confirmation @d-schneider. I don't think I said this explicitly so far: there is about a month or two left to get features in for 6.0 and we are already completely booked on the diagnostics team, so as it stands this feature would be scheduled for 7.0 at the earliest. That being said, we always welcome PRs from the community and this is probably one of the easier ones to implement. If you or anyone on your team is feeling up for it, I would be more than happy to guide you through the process of implementing it.
@d-schneider while reviewing my instrumentation change #55888, @noahfalk brought up something that I hadn't thought of and wanted to check with you. in my PR I made the alloc tick threshold configurable via a runtime config (which can also be set as an env var), but he pointed out that it may not produce the desired effect for you because a profiler wouldn't have the freedom to do this config on the user's behalf, and you probably meant a profiling API for you to set this threshold instead? could you please confirm which is your preference? I presume you still would like the object size as part of the alloc tick regardless, right? which the new version of the event provides.
@Maoni0 Thanks for implementing this! We would prefer a profiling API for this configuration. It is also important for us that we can adjust the allocation tick threshold with this API while the application is running. Yes, we would still like the object size as part of the allocation tick event. The new AllocationTick_V4 event looks great in this regard!
@d-schneider thanks for confirming! that's the same as what @noahfalk told me. I've pulled out the runtime config and kept the new AllocationTick_V4 event in my PR. for adjusting the threshold with the profiling API, the diagnostics team will handle that (@davmason @noahfalk). it shouldn't be hard to add it and allow it to change the threshold while the process is running.
We've never had ETW events that had fields about time - we rely on the timestamps of the ETW events themselves to calculate time. This checkin introduces some new events/event fields that include time info instead of firing individual events that otherwise don't carry much info, ie, they'd just be there so we could use their timestamps, which would be a nuisance when we have many heaps. The rationale behind firing events with time already calculated is 1) it reduces overhead since we don't fire as many events, so we can fire these in fewer events for informational level; 2) firing individual events and having the tools interpret them isn't very useful, unlike events such as GCStart/GCEnd which can be correlated with other events (eg, you get GCStart, and then get a bunch of other events so you know those happened after a GC started), whereas very GC-internal things don't have this property, ie, we are not gonna care that "these other events happened during a GC and specifically during the relocation phase".

---

Added MarkWithType events for marking due to dependent handles, newly promoted due to dead finalizable objects, and mark_steal. PerfView needs to be updated to work with this, otherwise you can't see the GCStats view (I'm submitting a PR for that). Recorded time for marking roots (but sizedref is separate), short weak, ScanForFinalization, long weak, relocate, compact and sweep.

Added a new version of the alloc tick event that includes the size of the object that triggered the event. This is for a request from #49424 (comment).

Provided a new rundown GCSettings event that has info on settings hard to get from traces.

Added a GCLOHCompact event which is fired for all heaps (heaps that didn't actually do LOH compaction would have values of all 0s). I'm trying to add events that don't require a lot of correlation with other events to interpret. This is to help get an idea of how long it takes to compact LOH and how reference-rich it is.

Added a verbose-level GCFitBucketInfo event which helps us with FL fitting investigation. I'm firing this for 2 things in a gen1 GC: 1) for plugs allocated with allocate_in_condemned_generations, the event captures all of them with the same bucketing as we do for the gen2 FL; 2) for the gen2 FL we look at the largest free items that take up 25% of the FL space, or if there are too many of them we stop after walking a certain number of free items, as we have to limit the amount of time we spend here.

---

Fixed issues - For BGC we were reporting the pinned object count the same as the last FGC, which caused confusion, so fixed that. Fixed #45375. While fixing #45375, I noticed we have another bug related to alloc tick: we were not firing the alloc tick events correctly for LOH and POH since the ETW alloc tracking didn't separate them... fixed this too. Added the POH type for GCSegmentTypeMap which was missing in the manifest.

---

Did some cleanup in eventtrace.h - we don't need the info that's not used, which means we had just ended up duplicating things like _GC_ROOT_KIND in more places than needed.

---

Note, I realize that I do have some inconsistency with FEATURE_EVENT_TRACE here, as in, some code should be under an #ifdef check but is not. I will look into a remedy for that with a separate PR.
Just a heads up that over in #98167 I am starting to look into low overhead randomized heap sampling again.
We have a use-case for low overhead heap profiling (production ready, continuously on) that is currently hard to achieve in .NET. I would like to start a conversation about what would be needed in .NET to achieve this and whether there is a way forward to get such support in future .NET versions.
Use-Case
We have a .NET Profiler (for APM and CPU-profiling use cases) and want to extend it with memory/allocation profiling capabilities. We want to be able to tell our users what code leads to expensive allocations: in a production environment, continuously on, with low overhead.
What we would like to capture:
It shall have the following properties:
Here is an example of how such data could be visualized: https://www.dynatrace.com/support/help/how-to-use-dynatrace/transactions-and-services/analysis/memory-profiling/
Status Quo
We have researched multiple approaches, but none of them fully satisfied our requirements.
One approach is to use the `ObjectAllocated`, `MovedReferences`, `SurvivedReferences` and `GarbageCollectionFinished` profiler callbacks. However, this is not viable for production scenarios, since the performance overhead for just enabling these callbacks is extremely high (more than 100%).

Since .NET 5 we can also use the `EventPipeEventDelivered` profiler callback. There are the `AllocationTick_V3`, `GCBulkMovedObjectRanges` and `GCBulkSurvivedObjectRanges` event pipe events that provide similar data as the profiler callbacks mentioned above. The measured overhead for this was significantly lower (between ~1% and ~20%, depending on the number of allocations and GC runs; the 20% overhead was measured with a sample that allocates large arrays in a loop. For more realistic applications this overhead is closer to ~2%).

Problems of the event pipe approach:
- `AllocationTick_V3` sampling rate fixed at ~100KB (problematic for applications that allocate very low/high amounts of memory)

Array Size
Array size is critical for our use-case, as arrays can make up a significant portion of overall allocations. As mentioned in #43345, it is not possible to obtain the size of allocated array objects in the callback of the `AllocationTick_V3` event.

We can obtain the size in the `GarbageCollectionStarted` profiler callback with the `ICorProfilerInfo::GetObjectSize` method if we track the `ObjectId`. However, enabling this profiler callback increases the overhead significantly.

The `GCStart` event pipe event would have less overhead; however, it is not possible to reliably obtain the object size in that callback, since the `ICorProfilerInfo::GetObjectSize` method sometimes fails with a read access violation at:

```
coreclr.dll!Object::GetSize() Line 44
coreclr.dll!ProfToEEInterfaceImpl::GetObjectSize(unsigned __int64 objectId, unsigned long * pcSize) Line 1586
```
Comparable solutions
Since JDK 11, there are callbacks that provide the necessary information with minimal overhead. It matches our use-case really well.

It is possible to monitor allocated objects with the `SampledObjectAlloc` callback (https://docs.oracle.com/en/java/javase/11/docs/specs/jvmti.html#SampledObjectAlloc). The sampling rate for this callback can be configured with the `SetHeapSamplingInterval` method.

Additionally, there is the `ObjectFree` callback that is sent when a tagged object is freed by the garbage collector (https://docs.oracle.com/en/java/javase/11/docs/specs/jvmti.html#ObjectFree).

A detailed description of this can be found at https://openjdk.java.net/jeps/331
Summary
Currently it looks like our use-case cannot be fulfilled in .NET. With this ticket, we're hoping to have a discussion if such a capability makes sense in a future .NET version. If this isn't the right place/form to have such a discussion, please let us know :).
@discostu105 @d-schneider