
The AllocationTick threshold is computed by a Poisson process with a 100 KB mean. #85750

Closed

Conversation

chrisnas (Contributor) commented May 4, 2023

The fixed 100 KB threshold used to trigger the AllocationTick event does not give a good way to estimate the real allocations. The main issue is not the sampling itself but the fact that it is not possible to upscale the sampled sizes back to values of the same order of magnitude as the real ones. Keeping the relative size differences between allocated types is also important.

As a solution, it is possible to model allocation sampling as a Poisson process, because each sample we take has no influence on any other sample. In a Poisson process the inter-sample distances are exponentially distributed, meaning that the probability that the next sample has happened within x bytes is given by the exponential cumulative distribution function CDF = 1 - e^(-lambda*x).

In our context, this means that we need to compute x (the number of allocated bytes to wait before the next sample) from the mean of the distribution (lambda is the probability of sampling any given byte - in our case 1/100,000 for the current 100 KB threshold). With this sampling, it is possible to upscale/estimate the "real" allocated sizes from the sampled sizes with the following formula:
upscaled size = sampled size / (1 - e^(-average size / 100,000))

Three additional threshold sampling scenarios are implemented to compare against the fixed 100 KB one:

  • 100 KB +/- 50 KB random
  • exponential
  • exponential triggered within an allocation context

The last one seems to provide the best results once upscaled. Note that using the Poisson process to upscale the current fixed scenario already gives better results than the simple upscaling mechanism based on per-type allocation size / total allocated size.
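The exponential scenarios draw each new threshold by inverse transform sampling of the exponential CDF. The following is a hypothetical sketch of that draw; the identifiers are illustrative and are not the actual CoreCLR symbols (only `etw_allocation_tick_mean` is mentioned in the PR's code comment).

```cpp
// Hypothetical sketch of drawing the next AllocationTick threshold from an
// exponential distribution via inverse transform sampling. Names are
// illustrative, not the actual CoreCLR identifiers.
#include <cmath>
#include <cstdint>
#include <random>

// Mean of the distribution: 100 KB, matching the current fixed threshold.
constexpr double etw_allocation_tick_mean = 100 * 1024.0;

// Inverting CDF = 1 - e^(-x / mean) gives x = -mean * ln(1 - u),
// with u uniformly distributed in [0, 1).
uint64_t NextAllocationTickThreshold(std::mt19937_64& rng)
{
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    double u = uniform(rng);
    return static_cast<uint64_t>(-etw_allocation_tick_mean * std::log(1.0 - u));
}
```

Averaged over many draws, the thresholds come out around 100 KB, but individual thresholds vary, which removes the sampling bias that a fixed period introduces for repetitive allocation patterns.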

Relates to #49424.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label May 4, 2023
ghost commented May 4, 2023

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

The fixed 100 KB threshold used to trigger the AllocationTick event does not give a good way to estimate the real allocations. Using a Poisson process to compute the threshold provides much better statistical results.

This first step changes the fixed threshold to an exponential distribution for the SOH/LOH/POH thresholds/counters.

The next steps will be to:

  • add a new GC keyword to enable that new behaviour without changing the AllocationTick_V4 payload. The check should be made by calling GCEventStatus::IsEnabled(provider, keyword, level) instead of EVENT_ENABLED(GCAllocationTick_V4).
  • try to check for variable threshold when it ends within an allocation context

Relates to #49424.

Author: chrisnas
Assignees: -
Labels: area-GC-coreclr
Milestone: -

    if (EVENT_ENABLED(GCAllocationTick_V4))
    #endif
    {
        // compute the next threshold based on a Poisson process with a etw_allocation_tick_mean average
A Member commented:
Many of us aren't greatly familiar with statistics, so making this explanation less vague would be helpful. Instead of saying "based on a Poisson process", it'd be much more helpful to start with something like "we are treating this as a Poisson process because each sample we take has no influence on any other sample; the samples are exponentially distributed in a Poisson process, meaning that the probability of the next sample happening is calculated by (1 - e^(-lambda*x))", and then explain what lambda and x would be in this particular context so the readers know how the formula you are using came to be.

Also, -ln(1 - uniformly_random_number_between_0_and_1) is distributed the same as -ln(uniformly_random_number_between_0_and_1), so I don't think you need the "1 -" part.

Can you please show the results of running this on some workloads where this is much better compared to the current implementation? Also, have you tried with just a uniformly random distribution instead of an exponential distribution?

chrisnas (Contributor, Author) replied:

> Many of us aren't greatly familiar with statistics, so making this explanation less vague would be helpful. Instead of saying "based on a Poisson process", it'd be much more helpful to start with something like "we are treating this as a Poisson process because each sample we take has no influence on any other sample; the samples are exponentially distributed in a Poisson process, meaning that the probability of the next sample happening is calculated by (1 - e^(-lambda*x))", and then explain what lambda and x would be in this particular context.

I updated the description accordingly, with additional information about the upscaling formula.

chrisnas (Contributor, Author) replied:

> Can you please show the results of running this on some workloads where this is much better compared to the current implementation? Also, have you tried with just a uniformly random distribution instead of an exponential distribution?

I'm currently simulating the results based on a web application for which I'm recording ALL allocations using ICorProfilerCallback::ObjectAllocated() and checking them against the sampled-then-upscaled sizes. The variance of the results is almost random for the fixed threshold, much better for the variable threshold as in the first commit, and a little better still if sampling could happen within an allocation context.
Since the recorder is available in the Datadog profiler only, it will be complicated to generate the corresponding .balloc files (i.e. lists of allocations - type+size) used by the simulation to show results on an arbitrary application. BTW, is there any sample application that you would like to see used as an example?

chrisnas (Contributor, Author) commented May 7, 2023:

> Also, -ln(1 - uniformly_random_number_between_0_and_1) is distributed the same as -ln(uniformly_random_number_between_0_and_1), so I don't think you need the "1 -" part.

This sticks to the mathematical derivation of the formula. Since the result is the same either way, I would recommend keeping it as it is, but I have no problem changing it.
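The equivalence both sides agree on here can be checked numerically: if u is uniform on (0, 1) then so is 1 - u, so -ln(1 - u) and -ln(u) draw from the same exponential distribution. A small sketch (function name is illustrative):

```cpp
// Numerical check that -ln(1 - u) and -ln(u) give the same exponential
// distribution (mean 1 here); only the mapping from u to sample differs.
#include <cmath>
#include <cstdint>
#include <random>

double MeanOfDraws(bool useOneMinusU, int n, uint64_t seed)
{
    std::mt19937_64 rng(seed);
    // Lower bound slightly above 0 keeps log() away from log(0).
    std::uniform_real_distribution<double> uniform(1e-12, 1.0);
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
    {
        double u = uniform(rng);
        sum += useOneMinusU ? -std::log(1.0 - u) : -std::log(u);
    }
    return sum / n;
}
```

Both variants converge to the same mean; the "1 -" form simply mirrors the textbook inversion of the CDF, which is why the PR keeps it.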

Maoni0 (Member) commented Jun 6, 2023

Based on the GC team's current schedule and an offline discussion with @chrisnas, we have decided to close this PR and re-evaluate it during our .NET 9 planning. We definitely recognize the value of this idea and would like to pursue it in the future, but right now the GC team simply does not have the time to see it through. We really want to thank you for your contribution, and we will absolutely let you know about our plan in .NET 9.

@Maoni0 Maoni0 closed this Jun 6, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jul 7, 2023