Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inlined GC poll for methods marked with SuppressGCTransitionAttribute #13582

Closed
AaronRobinsonMSFT opened this issue Oct 14, 2019 · 33 comments
Closed
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@AaronRobinsonMSFT
Copy link
Member

AaronRobinsonMSFT commented Oct 14, 2019

The code responsible for creating and inserting GC_Poll instances appears to have degraded since porting over from .NET Framework.

void Compiler::fgCreateGCPolls()

Calling the above function when either GCPOLL_CALL or GCPOLL_INLINE is set results in multiple asserts being fired. This was discovered during the suppress GC work - dotnet/coreclr#26458. The result of this was to manually insert GCPOLL_CALL instances instead of GCPOLL_INLINE.

Additional issues:

/cc @briansull @dotnet/jit-contrib

category:cq
theme:optimization
skill-level:beginner
cost:medium

@jkotas jkotas changed the title The GC_Poll Create/Insert logic appears to be flawed Inlined GC poll for methods marked with SuppressGCTransitionAttribute Oct 15, 2019
@jkotas
Copy link
Member

jkotas commented Oct 15, 2019

The SuppressGCTransitionAttribute optimization leaves performance on the table because of this issue.

@briansull
Copy link
Contributor

This will analyze loops to prove that a managed call always occurs in the execution of one iteration.
If it proves this it will omit the GC_POLL call for the loop.
When we discussed this behavior during dotnet/coreclr#26458, you advocated for always calling the GC_POLL rather than optimizing it away.

@jkotas
Copy link
Member

jkotas commented Oct 15, 2019

How you recommend that we implement the inlined GC poll for SuppressGCTransitionAttribute? Ignore what is there today and just do it from scratch?

@AaronRobinsonMSFT
Copy link
Member Author

@briansull The issue here was that during some scenarios no assert was hit with GCPOLL_INLINE, but there were cases when it would assert. Asserts also occurred when setting GCPOLL_CALL. There are obviously multiple scenarios here.

Inline PInvoke in an inlineable function.

   public static int GetTime()
   {
       return InlineablePInvokeWithGCSuppression();
   }

PInvoke in non-inlined function

   public static int GetAccTime()
   {
       int acc = 0;
       foreach (...)
       {
          acc += InlineablePInvokeWithGCSuppression();
       }

       return acc;
   }

Then there are scenarios involving CrossGen which can't use GCPOLL_INLINE at the moment, but requires GCPOLL_CALL - see https://github.com/dotnet/coreclr/issues/27206.

@briansull
Copy link
Contributor

How you recommend that we implement the inlined GC poll for SuppressGCTransitionAttribute?

Yes, We should fix the codepath that we use for SuppressGCTransitionAttribute to support GCPOLL_INLINE in all cases.

The existing code fgCreateGCPolls() was last used to support the MacOS during the pre-RyuJIT time frame. That code should be deleted as it tries to eliminate the GCPOLL call when it can prove that a managed call will be made in every loop iteration, which is not what we want with this new feature.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@BruceForstall
Copy link
Member

@briansull You fixed some of the GCPOLL code with #34837.

Did you try changing the code in morph that generates the GCPOLL_CALL for the SuppressGCTransitionAttribute case to GCPOLL_INLINE? That is, here:

    if (fgGlobalMorph && call->IsUnmanaged() && call->IsSuppressGCTransition())
    {
        // Insert a GC poll.
        bool insertedBB = fgCreateGCPoll(GCPOLL_CALL, compCurBB, compCurStmt);
        assert(!insertedBB); // No new block should be inserted
    }

Presumably you could do this and see asm diffs where the polls are inlined in the tests Interop\PInvoke\Attributes\SuppressGCTransition\SuppressGCTransitionTest.cs and JIT\Methodical\gc_poll\InsertGCPoll.cs.

@briansull
Copy link
Contributor

I will take a look at making this change and seeing what asm diffs there are with it.

@BruceForstall
Copy link
Member

@AaronRobinsonMSFT @jkotas Is there a benchmark where we could measure a performance improvement by implementing inlined GC polls for this scenario?

@AaronRobinsonMSFT
Copy link
Member Author

I do not know of one at present. @VSadov was investigating using this attribute and it wasn't as fast as needed for him. He may have a scenario to validate.

@VSadov
Copy link
Member

VSadov commented Apr 28, 2020

I did not notice that this has merged.

One important scenario that is blocked right now is the Math functions. - #13820

That could use SuppressGCTransition, but would require that:

  • GC poll is inlined
  • We do something smart with GC polls in a block
    No need to poll for every function when user does:
Math.Acos(Math.Cos(Math.Acos(Math.Cos(Math.Acos(Math.Cos(Math.Cos(x)))))))

I assume the first part is now fixed?

@VSadov
Copy link
Member

VSadov commented Apr 28, 2020

A simpler scenario to validate the improvements is switch something very fast to use SuppressGCTransition and measure.

Environment.TickCount cold be a good candidate - it is a very fast call on all platforms. It must be.
A lot of code uses this for cheap timestamps when precision/frequency is not a concern, so if overhead could be reduced by skipping VM, then it would be a nice change.

@erozenfeld
Copy link
Member

@jkotas Is it acceptable not to insert a GC poll if the call to a method marked with SuppressGCTransitionAttribute is not in a loop?

@jkotas
Copy link
Member

jkotas commented Jul 2, 2020

I do not think it would be good enough. The loop can be in the caller and we would still end up with same problem. For example:

using System;
using System.Threading;
using System.Diagnostics;

class Test
{
    static void GCThread()
    {
        for (;;)
        {
            var sw = new Stopwatch();
            sw.Start();
            GC.Collect();
            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
            Thread.Sleep(100);
        }
    }

    static double Work(double x)
    {
        return
        Math.Pow(x,2.2)+Math.Pow(x,4.3)+Math.Pow(x,8.8)+
        Math.Pow(x,3.2)+Math.Pow(x,5.3)+Math.Pow(x,7.8)+
        Math.Pow(x,4.2)+Math.Pow(x,6.3)+Math.Pow(x,6.8)+
        Math.Pow(x,5.2)+Math.Pow(x,7.3)+Math.Pow(x,5.8)+
        Math.Pow(x,6.2)+Math.Pow(x,8.3)+Math.Pow(x,4.8)+
        Math.Pow(x,7.2)+Math.Pow(x,9.3)+Math.Pow(x,3.8);
    }

    static void Main()
    {
        new Thread(GCThread).Start();

        for (;;)
            Work(48929.47982);
    }
}

The GC pause times in milliseconds I get with this test today are:

697
1519
116
1148
214
1288
254
...

I would like to be able to convert Math.Pow to use DllImport w/ SuppressGCTransitionAttribute and get GC pause times under 1ms for this test.

@VSadov
Copy link
Member

VSadov commented Jul 2, 2020

In theory it seems acceptable, but what is exactly "not in a loop"? What if the call is the only statement in the containing method and that itself is in a loop? If the SGCT method is very fast, this could still be a challenge to hijack.

To elide polls, it may be necessary to consider what else is happening in the containing method (or block for simplicity).

If there is always a poll on the path leading to the call or on the path after, then it might be ok to not poll.

@VSadov
Copy link
Member

VSadov commented Jul 2, 2020

What I mean is - In the Jan's example one poll would be sufficient, but there should be at least one in that method.

It is also possible to think about scenario with 1000 SGCT calls in a sequence. - do we need polls every X calls?
I am not sure. That is a situation similar to just a long sequence of computation. - do we interleave those with polls too?

@jkotas
Copy link
Member

jkotas commented Jul 2, 2020

Right, giant functions have number of potential problems with GC suspension and performance that go beyond SuppressGCTransitionAttribute. I am less worried about that. It can be reasonably explained, we can build analyzers for it, etc.

@erozenfeld
Copy link
Member

erozenfeld commented Jul 2, 2020

There are several things that need to be addressed:

  1. Currently we are inserting non-inlined GC polls for methods with SuppressGCTransitionAttribute and inlined GC pols (which require creation of new basic blocks) for legacy (x86) helper-based tail calls in morph. The latter is a horrible hack, morph is not the right place to introduce new basic blocks. I definitely don't want to extend that hack for inserting inlined GC polls for calls to methods with SuppressGCTransitionAttribute.

  2. We have an fgCreateGCPolls phase that works in tandem with fgMarkGCPollBlocks. These phases currently do nothing unless COMPLUS_GCPollType is not 0. We are not testing this code path. I believe this option is for platforms that don't support hijaking. @jkotas Do we need to keep this code or should it be deleted?

  3. The inlined GC polls should be inserted after morph. fgCreateGCPolls as it exists today will avoid inserting GC polls outside of loops. If we decide that we don't need to support COMPLUS_GCPollType scenarios (see Get core-setup building in the consolidated repo. #2 above), we can re-purpose this phase to do GC poll insertion for calls to methods with SuppressGCTransitionAttribute and for x86 helper-based tail calls. Another option is to insert these GC polls in lower, which already has precedents for adding basic blocks.

  4. To be able to convert Math.Pow to use DllImport w/ SuppressGCTransitionAttribute we need the ability to emit inline GC polls with R2R crossgen. That is currently not supported (ZapInfo::getAddrOfCaptureThreadGlobal always returns NULL for R2R) and is tracked by CrossGen can't support GCPOLL_INLINE #13591. @jkotas, @davidwrighton What needs to be done to enable that?

@VSadov
Copy link
Member

VSadov commented Jul 2, 2020

platforms that don't support hijaking.

Last time I asked if there are such platforms, I think NetBSD was identified as one.

@jkotas
Copy link
Member

jkotas commented Jul 2, 2020

NetBSD was identified as one.

NetBSD was barely building. I do not think we need to worry about it.

Feel free to add this code under ifdef that is always off, or even delete it (in a dedicated PR).

tracked by #13591. @jkotas, @davidwrighton What needs to be done to enable that?

That is a bit of plumbing work in crossgen/crossgen2. I will be happy to take care of it myself once the JITed case works.

@erozenfeld
Copy link
Member

Two more questions:

  1. Is one GC poll per basic block sufficient even if there are multiple calls to methods with SuppressGCTransitionAttribute in that basic block?
  2. Is it important whether the GC poll is inserted before or after the call(s) to methods with SuppressGCTransitionAttribute?

@VSadov
Copy link
Member

VSadov commented Jul 2, 2020

1 - Yes. One per BB should be enough.
2- no, before/after is unimportant.

We mostly care about cases where these calls are fast.
Slow SGCT can trivially starve GC and it is a misuse for which we cant do much.

When properly used, only loops really matter, thus before/after does not matter much.

@EgorBo
Copy link
Member

EgorBo commented Jul 6, 2020

Another option is to insert these GC polls in lower, which already has precedents for adding basic blocks.

will fgReorderBlocks work in lower? in order to move "poll gc" as rarely-used (in case of inlined-poll).

if I understand the discussion correctly the final idea is to have something like this ?

if (g_TrapReturningThreads) // unlikely()
    goto bb_with_CORINFO_HELP_POLL_GC(); // slow path
Math.Pow(x);

@erozenfeld
Copy link
Member

I have a change that re-purposes fgCreateGCPolls for inserting GC polls for blocks that have calls to methods with SuppressGCTransitionAttributes as well as for x86 helper-based tail calls. So, fgReorderBlocks will be called. And yes, inlined GC polls look like what you describe. The call to Math.Pow(x) may appear before or after the inlined GC poll.

@EgorBo
Copy link
Member

EgorBo commented Jul 6, 2020

btw, Math.Pow is hoistable (from loops) so indeed would be nice to insert safepoints somewhere after basic optimizations 🙂 e.g.

double x = ...;
for (int i=0; i<n; i++)
{
    Foo(Math.Pow(x, 10));
}

^ should not call GCPoll inside the loop.

we also do it like this in mono-llvm, "place safepoints" happens after all optimizations.

@AaronRobinsonMSFT
Copy link
Member Author

may appear before or after the inlined GC poll.

@erozenfeld Will the placement deterministic?

@erozenfeld
Copy link
Member

Yes, the placement will be deterministic.

@erozenfeld
Copy link
Member

erozenfeld commented Jul 7, 2020

@jkotas @VSadov @AaronRobinsonMSFT Is it important to support emiiting gc polls when the call to a SuppressGCTransitionAttributes method is in a catch, fault, filter, or finally block? If so, does it have to be inlined?

@AaronRobinsonMSFT
Copy link
Member Author

AaronRobinsonMSFT commented Jul 7, 2020

@jkotas @VSadov @AaronRobinsonMSFT Is it important to support emiiting gc polls when the call to a SuppressGCTransitionAttributes method is in a catch, fault, filter, or finally block? If so, does it have to be inlined?

I don't think it really matters where the call is made. I could get behind the argument that in a catch, fault, or filter block it doesn't need to be inlined since the exception is so costly anyways - all of those are exceptions right? I believe finally blocks are used in non-exceptional cases so that seems like we want to make as fast as possible.

Does handling all these blocks complicate the implementation?

@jkotas
Copy link
Member

jkotas commented Jul 7, 2020

It would be nice to emit gc polls for these. I do not think the gc polls have to be inlined in the EH handler blocks. Regular PInvokes are not always inlined inside EH handler blocks either, so not inlining gc polls in EH handler blocks would be consistent.

@VSadov
Copy link
Member

VSadov commented Jul 7, 2020

Just wonder - "catch block" here means the basic block that is the body of the catch handler, in which case polling is unnecessary, but not a big deal if it is there, or call just happens to be in a catch handler - i.e. could be in a loop in a catch, then we need to poll, just for suspension pause guarantee.

For finallies, ideally, there should be no difference, running them typically does not require an exception.

I agree inlining is not very important if that simplifies the design or makes emit smaller.
Many things get slower around exception handling and it is generally possible to factor hot code out if needed.

@erozenfeld
Copy link
Member

The current implementation of GC polls that @AaronRobinsonMSFT added only inserts GC polls for inlined pinvokes to methods with [SuppressGCTransitionAttribute]. Was that intentional or should GC polls be emitted for all pinvokes to methods with [SuppressGCTransitionAttributes]?

Note that we currently don't inline pinvokes in handlers and, on 64 bit platforms, in try regions: see

bool Compiler::impCanPInvokeInlineCallSite(BasicBlock* block)

@jkotas
Copy link
Member

jkotas commented Jul 7, 2020

Yes, it is intentional. The non-inlined PInvoke is going to have the GC poll in the PInvoke stub.

@erozenfeld
Copy link
Member

Ok, then I don't have to worry about inserting polls in handlers since we'll never have inlined pinvokes there.

@erozenfeld
Copy link
Member

Fixed by #39111.

erozenfeld added a commit to erozenfeld/runtime that referenced this issue Sep 24, 2020
This is a follow-up to
dotnet#13582 (comment)
and
dotnet#13582 (comment)

The code to insert gc polls was added in desktop for gc suspension not based
on hijaking. All platforms we target support hijaking so this code is not exercised
or tested. It also clutters other code and adds a bit of runtime overhead.
This change removes all that code.

There are minimal asm diffs because of a removed call to `fgRenumberBlocks`.
erozenfeld added a commit that referenced this issue Sep 25, 2020
This is a follow-up to
#13582 (comment)
and
#13582 (comment)

The code to insert gc polls was added in desktop for gc suspension not based
on hijaking. All platforms we target support hijaking so this code is not exercised
or tested. It also clutters other code and adds a bit of runtime overhead.
This change removes all that code.

There are minimal asm diffs because of a removed call to `fgRenumberBlocks`.
@ghost ghost locked as resolved and limited conversation to collaborators Dec 12, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

8 participants