RyuJIT: Improving multi-dimensional array accesses #60785

Closed · BruceForstall opened this issue Oct 22, 2021 · 12 comments · Fixed by #70271
Assignee: BruceForstall
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), optimization, tenet-performance (Performance related issue)
Milestone: Future

@BruceForstall (Member)

It is well known that the code RyuJIT generates for multi-dimensional (hereafter, MD) array accesses is inefficient. This concerns true MD array accesses, not "jagged" array accesses. See the set of issues tagged "theme:md-arrays", for example #8569 and #5481.

Here's a simple example of a function copying a 3-dimensional array, assuming zero-based indices and a length in each dimension of n:

static void copy(int[,,] a, int[,,] b, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                a[i,j,k] = b[i,j,k];
            }
        }
    }
}

The x64 code generated for the inner loop is currently:

G_M4797_IG05:
       mov      r11d, eax
       sub      r11d, dword ptr [rcx+28]
       cmp      r11d, dword ptr [rcx+16]
       jae      G_M4797_IG09
       mov      esi, r9d
       sub      esi, dword ptr [rcx+32]
       cmp      esi, dword ptr [rcx+20]
       jae      G_M4797_IG09
       mov      edi, dword ptr [rcx+20]
       imul     rdi, r11
       mov      r11, rsi
       add      r11, rdi
       mov      esi, r10d
       sub      esi, dword ptr [rcx+36]
       cmp      esi, dword ptr [rcx+24]
       jae      G_M4797_IG09
       mov      edi, dword ptr [rcx+24]
       imul     rdi, r11
       mov      r11, rsi
       add      r11, rdi
       lea      r11, bword ptr [rcx+4*r11+40]
       mov      esi, eax
       sub      esi, dword ptr [rdx+28]
       cmp      esi, dword ptr [rdx+16]
       jae      SHORT G_M4797_IG09
       mov      edi, r9d
       sub      edi, dword ptr [rdx+32]
       cmp      edi, dword ptr [rdx+20]
       jae      SHORT G_M4797_IG09
       mov      ebx, dword ptr [rdx+20]
       imul     rbx, rsi
       mov      rsi, rdi
       add      rsi, rbx
       mov      edi, r10d
       sub      edi, dword ptr [rdx+36]
       cmp      edi, dword ptr [rdx+24]
       jae      SHORT G_M4797_IG09
       mov      ebx, dword ptr [rdx+24]
       imul     rbx, rsi
       mov      rsi, rdi
       add      rsi, rbx
       mov      esi, dword ptr [rdx+4*rsi+40]
       mov      dword ptr [r11], esi
       inc      r10d
       cmp      r10d, r8d
       jl       G_M4797_IG05

Things to note:

  1. There are 6 bounds checks, one for each dimension of each array. 4 of these (for i and j) should be hoisted out of the inner k loop. The other 2 could be removed by loop cloning.
  2. MD arrays, in contrast to single-dimension so-called "SZ" arrays, can have a non-zero lower bound in each dimension. Thus, the lower bound must be subtracted from the index before comparing against the dimension length. This feature is rarely used (C# has no syntax to create such an array; you need the System.Array.CreateInstance overload that allows specifying lower bounds; see the example after this list), but it must be supported.
  3. The base address of the k dimension data (a byref into the array) could be hoisted out of the inner loop.
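
For item 2, here is a minimal C# sketch of the only way to create an MD array with non-zero lower bounds, using the System.Array.CreateInstance overload mentioned above:

```
using System;

class LowerBoundDemo
{
    static void Main()
    {
        // A 2x3 array whose valid indices are [10..11, 20..22] rather than [0..1, 0..2].
        Array a = Array.CreateInstance(
            typeof(int),
            new[] { 2, 3 },      // lengths
            new[] { 10, 20 });   // lower bounds

        a.SetValue(42, 10, 20);                 // conceptually a[10, 20] = 42
        Console.WriteLine(a.GetLowerBound(0));  // 10
        Console.WriteLine(a.GetValue(10, 20));  // 42
    }
}
```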

Thus, we should be able to generate something like this in the inner loop:

G_M4797_IG05:
       mov      esi, dword ptr [rdx+4*r10]
       mov      dword ptr [rcx+4*r10], esi
       inc      r10d
       cmp      r10d, r8d
       jl       G_M4797_IG05

Currently, after importation, the array access looks like:

[000033] -A-X--------              *  ASG       int                     
[000032] ---X---N----              +--*  IND       int                  
[000031] ---X--------              |  \--*  ARR_ELEM[,,] byref          
[000021] ------------              |     +--*  LCL_VAR   ref    V00 arg0
[000022] ------------              |     +--*  LCL_VAR   int    V03 loc0
[000023] ------------              |     +--*  LCL_VAR   int    V04 loc1
[000024] ------------              |     \--*  LCL_VAR   int    V05 loc2
[000030] ---X--------              \--*  IND       int                  
[000029] ---X--------                 \--*  ARR_ELEM[,,] byref          
[000025] ------------                    +--*  LCL_VAR   ref    V01 arg1
[000026] ------------                    +--*  LCL_VAR   int    V03 loc0
[000027] ------------                    +--*  LCL_VAR   int    V04 loc1
[000028] ------------                    \--*  LCL_VAR   int    V05 loc2

This form persists until lowering, when it is expanded to:

N001 (  1,  1) [000021] ------------        t21 =    LCL_VAR   ref    V00 arg0         u:1 $80
N002 (  1,  1) [000022] ------------        t22 =    LCL_VAR   int    V03 loc0         u:3 $140
N003 (  1,  1) [000023] ------------        t23 =    LCL_VAR   int    V04 loc1         u:3 $141
N004 (  1,  1) [000024] ------------        t24 =    LCL_VAR   int    V05 loc2         u:3 $142
               [000092] -c----------        t92 =    CNS_INT   long   0
                                                  /--*  t21    ref    
                                                  +--*  t22    int    
               [000093] ---X--------        t93 = *  ARR_INDEX[i, , ] int   
               [000094] ------------        t94 =    LCL_VAR   ref    V00 arg0         
                                                  /--*  t92    long   
                                                  +--*  t93    int    
                                                  +--*  t94    ref    
               [000095] ---X--------        t95 = *  ARR_OFFSET[i, , ] long  
               [000096] ------------        t96 =    LCL_VAR   ref    V00 arg0         
                                                  /--*  t96    ref    
                                                  +--*  t23    int    
               [000097] ---X--------        t97 = *  ARR_INDEX[*,j, ] int   
               [000098] ------------        t98 =    LCL_VAR   ref    V00 arg0         
                                                  /--*  t95    long   
                                                  +--*  t97    int    
                                                  +--*  t98    ref    
               [000099] ---X--------        t99 = *  ARR_OFFSET[*,j, ] long  
               [000100] ------------       t100 =    LCL_VAR   ref    V00 arg0         
                                                  /--*  t100   ref    
                                                  +--*  t24    int    
               [000101] ---X--------       t101 = *  ARR_INDEX[*,*,k] int   
               [000102] ------------       t102 =    LCL_VAR   ref    V00 arg0         
                                                  /--*  t99    long   
                                                  +--*  t101   int    
                                                  +--*  t102   ref    
               [000103] ---X--------       t103 = *  ARR_OFFSET[*,*,k] long  
               [000104] ------------       t104 =    LCL_VAR   ref    V00 arg0         
                                                  /--*  t104   ref    
                                                  +--*  t103   long   
               [000105] ---X--------       t105 = *  LEA(b+(i*4)+40) byref 
N007 (  1,  1) [000025] ------------        t25 =    LCL_VAR   ref    V01 arg1         u:1 $81
N008 (  1,  1) [000026] ------------        t26 =    LCL_VAR   int    V03 loc0         u:3 $140
N009 (  1,  1) [000027] ------------        t27 =    LCL_VAR   int    V04 loc1         u:3 $141
N010 (  1,  1) [000028] ------------        t28 =    LCL_VAR   int    V05 loc2         u:3 $142
               [000106] -c----------       t106 =    CNS_INT   long   0
                                                  /--*  t25    ref    
                                                  +--*  t26    int    
               [000107] ---X--------       t107 = *  ARR_INDEX[i, , ] int   
               [000108] ------------       t108 =    LCL_VAR   ref    V01 arg1         
                                                  /--*  t106   long   
                                                  +--*  t107   int    
                                                  +--*  t108   ref    
               [000109] ---X--------       t109 = *  ARR_OFFSET[i, , ] long  
               [000110] ------------       t110 =    LCL_VAR   ref    V01 arg1         
                                                  /--*  t110   ref    
                                                  +--*  t27    int    
               [000111] ---X--------       t111 = *  ARR_INDEX[*,j, ] int   
               [000112] ------------       t112 =    LCL_VAR   ref    V01 arg1         
                                                  /--*  t109   long   
                                                  +--*  t111   int    
                                                  +--*  t112   ref    
               [000113] ---X--------       t113 = *  ARR_OFFSET[*,j, ] long  
               [000114] ------------       t114 =    LCL_VAR   ref    V01 arg1         
                                                  /--*  t114   ref    
                                                  +--*  t28    int    
               [000115] ---X--------       t115 = *  ARR_INDEX[*,*,k] int   
               [000116] ------------       t116 =    LCL_VAR   ref    V01 arg1         
                                                  /--*  t113   long   
                                                  +--*  t115   int    
                                                  +--*  t116   ref    
               [000117] ---X--------       t117 = *  ARR_OFFSET[*,*,k] long  
               [000118] ------------       t118 =    LCL_VAR   ref    V01 arg1         
                                                  /--*  t118   ref    
                                                  +--*  t117   long   
               [000119] -c-X--------       t119 = *  LEA(b+(i*4)+40) byref 
                                                  /--*  t119   byref  
N012 ( 21, 14) [000030] ---X--------        t30 = *  IND       int    <l:$103, c:$102>
                                                  /--*  t105   byref  
                                                  +--*  t30    int    
               [000084] -A-X--------              *  STOREIND  int   

There are several things to note, here:

  1. This expansion happens in Lower, which is after all the optimization phases (VN, CSE, hoisting). Thus, this directly represents the code that will be generated. The expansion is done this way mostly to ensure the register allocator has visibility into all the required register lifetimes.
  2. The ARR_INDEX node both creates an "effective index" (the program index minus the dimension's lower bound) and performs a bounds check. This means the bounds check can't be eliminated or hoisted (even if optimizations were run on this form), because the ARR_INDEX node is still required to produce the effective index value.

In addition (not shown here), System.Array has a number of methods used on multi-dimensional arrays that are not well optimized; e.g., they are not always inlined or treated as invariant. Specifically: Rank, GetLowerBound, GetUpperBound, and GetLength.

Also, loop cloning does not currently support cloning loops with MD indexing expressions. It's interesting to note that MD arrays are simpler in one way compared to jagged arrays: the bounds for each dimension are invariant for a given array object, whereas for jagged arrays, the bounds of the "child" arrays might change. That is, for a[i][j], a bounds check against a[i].Length must be done for every varying i in a loop before accessing a[i][j]. For a[i,j], however, it may be possible to hoist the check for the a[i,] dimension.
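
To make the contrast concrete, a small illustrative C# sketch (not from the issue):

```
static int SumLengths()
{
    int[][] jagged = { new int[2], new int[5], new int[3] }; // each jagged[i].Length can differ
    int[,]  md     = new int[3, 5];                          // md.GetLength(1) is fixed for the array

    int total = 0;
    for (int i = 0; i < jagged.Length; i++)
        total += jagged[i].Length;   // must be re-read whenever i changes
    total += md.GetLength(1);        // a single value, invariant for 'md', hence hoistable
    return total;
}
```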

For example, in the loop nest above, cloning could generate a single cloned "slow path" loop, with all the cloning condition checks (essentially, bounds checks) placed outside the outer loop. In pseudo-code (ignoring non-zero lower bounds):

if ((n <= a.GetLength(0)) && (n <= a.GetLength(1)) && (n <= a.GetLength(2))) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                a[i,j,k] = b[i,j,k]; // no bounds check in loop
            }
        }
    }
} else {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                a[i,j,k] = b[i,j,k]; // ALL bounds check
            }
        }
    }
}

Note that the slow path might be slower than optimal if only some cloning conditions fail, but code expansion would be less, and we assume that the slow path is rarely if ever run.

Currently, with jagged arrays, loop cloning might clone at every level of the loop nest, which creates far too much code expansion for nested loops.

Thus, to improve code generation:

  1. MD array expansion should happen earlier, either in morph, or possibly as a separate phase after all the loop optimization phases. We don't want the loop optimizer (or any phase, for that matter) to do pattern matching on these complex lowered forms to reconstruct an understanding of the operation.
  2. Split the ARR_INDEX node into one that does a bounds check, and one that computes the effective index. To avoid duplicating the effective index calculation (needed by both parts), this might require a temp and comma, e.g.,
comma(
  tmp1 = dim-effective-index(array, dim, index),
     // tmp1 = index - array.GetLowerBound(dim)
     // if the lower bound is known to be zero, this is just `tmp1 = index`. In that case, if `index` is simple
     // (lcl_var/lcl_field/constant), we don't need tmp1 or this comma.
  comma(
      dim-bounds-check(array, tmp1),
          // compare tmp1 against array.GetLength(dim). If we know it's in range, it can be removed. It can be CSE'ed/hoisted.
      tmp1 // result is tmp1, the effective index
  )
)
  3. (1) and (2) should allow CSE and hoisting to kick in on common and loop-invariant sub-expressions.
  4. Loop cloning should implement cloning of loops with MD array accesses, to eliminate unnecessary bounds checks. It should leverage the invariance of all loop dimension bounds to reduce code expansion.
  5. Consider marking MD array helpers as AggressiveInlining to improve their optimization (as prototyped here).

You can see a description of the internal data layout format in the comments for the RawArrayData type and the CORINFO_Array type.
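
As a rough, non-authoritative sketch (offsets inferred from the [rcx+16]..[rcx+36] loads and the +40 data offset in the disassembly above for an int[,,] on x64; RawArrayData and CORINFO_Array are the definitive references), the layout is approximately:

```
// Illustrative only; field names are invented.
struct MDArray3OfInt_LayoutSketch
{
    // IntPtr methodTable;      // +0
    // int    totalLength;      // +8  (padded out to +16)
    public int Length0;         // +16  length of dimension 0
    public int Length1;         // +20  length of dimension 1
    public int Length2;         // +24  length of dimension 2
    public int LowerBound0;     // +28  lower bound of dimension 0
    public int LowerBound1;     // +32  lower bound of dimension 1
    public int LowerBound2;     // +36  lower bound of dimension 2
    // int    data[...];        // +40  elements, row-major, 4 bytes each for int
}
```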

category:cq
theme:md-arrays
skill-level:expert
cost:large

@BruceForstall added the tenet-performance, area-CodeGen-coreclr, and optimization labels Oct 22, 2021
@BruceForstall added this to the Future milestone Oct 22, 2021
@dotnet-issue-labeler bot added the untriaged label (New issue has not been triaged by the area owner) Oct 22, 2021
@ghost commented Oct 22, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.


@BruceForstall removed the untriaged label Oct 22, 2021
@BruceForstall self-assigned this Oct 22, 2021
@BruceForstall (Member, Author)

cc @dotnet/jit-contrib

@jkotas (Member) commented Oct 23, 2021

Consider marking MD array helpers as AggressiveInlining to improve their optimization

It should be possible to do an even better job by expanding them as JIT intrinsics when you know the actual array type. I guess they will need to be JIT intrinsics anyway so that they can be recognized by the optimizations.

@BruceForstall (Member, Author)

It should be possible to do an even better job by expanding them as JIT intrinsics when you know the actual array type. I guess they will need to be JIT intrinsics anyway so that they can be recognized by the optimizations.

If the JIT knows it's an MD array, then we can eliminate the code in GetLength/GetUpperBound/GetLowerBound that checks for SZ arrays. If we know the dimension argument is in bounds (not sure how realistic that is), we can get rid of the dimension versus rank check (or maybe use a more efficient JIT exception path, like we have for bounds checks).

Interestingly, GetLowerBound/GetUpperBound would still need the rank value to find the data needed, and that's not a constant in the MethodTable -- it's a 4-instruction sequence on x64. But presumably it would be commonly hoistable.

I actually presume that most coders don't use these functions, and assume zero-based dimensions. I would guess they still use GetLength(), but maybe they "know" the dimension length from outside knowledge, like having a 3-D array NxNxN, and just using N directly everywhere.

@jkotas (Member) commented Oct 23, 2021

GetLowerBound/GetUpperBound would still need the rank value to find the data needed,

If the JIT knows the actual array type, which should always be the case in the situations we want to optimize, it can substitute the rank with a constant. It does not need to fetch it from the MethodTable.

@BruceForstall (Member, Author)

To-do #5 (optimize array helpers) is addressed by #60816 (not by marking them AggressiveInlining, but by making them JIT intrinsics).

@am11 (Member) commented May 6, 2022

Is an MD array stored as a single contiguous memory block? If so, can we use the row/column-major addressing formula (which takes care of a non-zero lower bound) to linearize the problem during LIH?

@BruceForstall (Member, Author)

Is an MD array stored as a single contiguous memory block? If so, can we use the row/column-major addressing formula (which takes care of a non-zero lower bound) to linearize the problem during LIH?

Yes, it's contiguous. The JIT does use essentially the referenced formula for constructing the index.
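
For concreteness, a sketch of that formula for a rank-3 array (parameter names are illustrative; this mirrors the computation visible in the disassembly at the top of the issue):

```
static int LinearIndex(int i, int j, int k,
                       int lb0, int lb1, int lb2,  // lower bounds per dimension
                       int len1, int len2)         // lengths of dimensions 1 and 2
{
    // Effective (lower-bound-adjusted) indices, combined row-major.
    return ((i - lb0) * len1 + (j - lb1)) * len2 + (k - lb2);
}
```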

@BruceForstall (Member, Author) commented May 6, 2022

Proposal

  1. Move expansion of multi-dimensional ARR_ELEM from Lower to a new phase immediately after the loop optimization phases (optCloneLoops, optUnrollLoops, optClearLoopIterInfo).
    a. Reason: we want to enable loop cloning based on multi-dimensional array accesses. However, we don't want loop cloning to need to recognize a complex expanded form.
    b. Later, we might want to move GT_INDEX expansion for non-MD array access to this phase as well, for the same reason.
    c. We want the expansion to happen before SSA/VN/CSE/hoisting.
  2. GT_ARR_INDEX does both a bounds check and an effective index calculation. Split this into new nodes so that these become separate operations that can be separately optimized.
  3. GT_ARR_OFFSET does a simple math operation but requires accessing a per-dimension MD array length. Expand out the math operation and make the MD array length explicit and separately optimizable.

Thus, GT_ARR_INDEX and GT_ARR_OFFSET will be removed. New nodes to be added:

  1. GT_MDARR_LOWER_BOUND(arrObj, rank, dim) -- load the lower bound of a MD array for the specified dimension. This is the value at offset eeGetMDArrayLowerBoundOffset(rank, dim) for an array.
  2. GT_MDARR_LENGTH(arrObj, rank, dim) -- load the length of a MD array for the specified dimension. Note that the existing GT_ARR_LENGTH only stores the offset from the array object base where the length can be found. This new node will also store that, as well as the dimension and rank. The offset is the value eeGetMDArrayLengthOffset(rank, dim) for an MD array. The reason we need a separate node is to be able to support the EarlyProp phase, to walk backwards from an MDARR_LENGTH node to a CORINFO_HELP_NEW_MDARR MD array creation call with constant arguments (though this presumably is rare). Support needs to be added to the EarlyProp phase to support this optimization.

Nodes to be generalized:
3. GT_BOUNDS_CHECK -- can probably be generalized to cover the MD array case; otherwise, a new GT_MDARR_BOUNDS_CHECK can be added.

For a 2-d access a[i,j]:

*  ARR_ELEM[,] byref           
+--*  LCL_VAR   ref    V00 arg0
+--*  LCL_VAR   int    V03 loc0
\--*  LCL_VAR   int    V04 loc1

We would generate:

ADD // compute `(i - a.GetLowerBound(0)) * a.GetLength(1) + (j - a.GetLowerBound(1))` 
    MUL // compute `(i - a.GetLowerBound(0)) * a.GetLength(1)`
        COMMA
            ASG
                LCL_VAR int tmp1 // compute effective index of dim 0 by subtracting lower bound
                SUB int
                    index0 // `i` variable
                    ARR_LOWER_BOUND [i,] // invariant with array
                        arrObj
            COMMA
                BOUNDS_CHECK
                    tmp1
                    MDARR_LENGTH [i,] // invariant with array
                        arrObj
                tmp1
        MDARR_LENGTH [*,j]
            arrObj
    COMMA
        ASG
            LCL_VAR int tmp2 // compute effective index of dim 1 by subtracting lower bound
            SUB int
                index1 // `j` variable
                ARR_LOWER_BOUND [*,j] // invariant with array
                    arrObj
        COMMA
            BOUNDS_CHECK
                tmp2
                MDARR_LENGTH [*,j] // invariant with array
                    arrObj
            tmp2

Note that many things in this computation are invariant or could be optimized. E.g., we could hoist invariant effective index calculations and bounds checks.

Loop cloning should be augmented to support MD arrays. One additional cloning condition should be added that is specific to MD arrays: checking that the lower bound of each dimension is zero (which is expected to be the case most of the time; only those cases will be optimized via cloning). With that optimization, and with cloning removing the bounds checks, we would see:

ADD // compute `i * a.GetLength(1) + j` 
    MUL // compute `i * a.GetLength(1)`
        index0
        MDARR_LENGTH [*,j]
            arrObj
    index1

The MDARR_LENGTH here could normally be hoisted.
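
In source terms, the extra cloning guard for the a[i,j] case would be roughly the following (a hypothetical sketch; iLimit/jLimit stand for the loop limits, and a second array in a copy loop would get the same checks):

```
static bool FastPathOk(int[,] a, int iLimit, int jLimit) =>
    a.GetLowerBound(0) == 0 && a.GetLowerBound(1) == 0 &&   // MD-specific: zero lower bounds
    iLimit <= a.GetLength(0) && jLimit <= a.GetLength(1);   // usual cloning bounds conditions
```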

This computation only computes the linear index. We also need to scale it by the array element size and add the array data offset to access the data itself. Thus, for a[i,j] = b[i,j], we see something like:

ASG
    IND
        ADD byref
            ADD int
                MUL int
                    -- scaled index (above)
                    -- elem size
                -- arrObj data offset (e.g., 32)
            arrObj
    IND
        ADD byref
            ADD int
                MUL int
                    -- scaled index (above)
                    -- elem size
                -- arrObj data offset (e.g., 32)
            arrObj

On xarch, this might take the form of an LEA; on arm64, there might be more computations. E.g., we might want to hoist arrObj + arrObj data offset out of a loop.

@AndyAyersMS @dotnet/jit-contrib Comments?

@AndyAyersMS (Member)

Seems reasonable.

The full address expression is something like

&arr + offset + elemSize * ((i - a.GetLowerBound(0)) * a.GetLength(1) + (j - a.GetLowerBound(1)))

I wonder if we should instead try and produce something like

&arr + offset + elemSize * ((i - a.GetLowerBound(0)) * a.GetLength(1) - a.GetLowerBound(1)) + elemSize * j 

with the expectation that hoisting will turn this into

;; outer
t =  &arr + offset + elemSize * ((i - a.GetLowerBound(0)) * a.GetLength(1) - a.GetLowerBound(1))
  ;; inner
  &a[i,j] = t + elemSize  *  j   // * likely handled by scaling / mul shift

(we'd need to be careful to ensure that t is not pointing below the base of the array, which could happen, so maybe we only do this rewrite in the clone where we know the lower bounds are indeed zero).

@BruceForstall (Member, Author)

Expressing the addressing as suggested makes sense, to expose more loop-invariant parts.

(Maybe it would be nice if we had an expression optimizer that could automatically do that distribution of multiplication across loop-invariant and non-invariant parts, but perhaps it would be hard, and difficult to measure the benefit for.)

Since in your example t is a byref, we definitely need to be careful that it points into the array. But it seems like hoisting as you show couldn't happen because it would be illegal and hoisting should prevent it. I.e.:

byref + loop-invariant-expression + loop-variant-expression

can't be split to:

t (byref) = byref + loop-invariant-expression
t + loop-variant-expression // a byref expression

because the definition of t might not be a legal byref. We could have:

t (int) = loop-invariant-expression
byref + t + loop-variant-expression // a byref expression

In the example above, leaving byref creation to the inner loop and only hoisting the loop-invariant index/offset calculation:

;; outer
t =  offset + elemSize * ((i - a.GetLowerBound(0)) * a.GetLength(1) - a.GetLowerBound(1))
  ;; inner
  &a[i,j] = &arr + t + elemSize  *  j

This might be better than what my tree above might generate, probably:

;; outer
t =  (i - a.GetLowerBound(0)) * a.GetLength(1)
  ;; inner
  index = t + (j - a.GetLowerBound(1))
  &a[i,j] = &arr + offset + elemSize  *  index

We need to compute j - a.GetLowerBound(1) to do a bounds check, so we can't hoist that unless we also eliminate the bounds check. So I think we probably should avoid any of these optimizations except under cloning, where the fast path both does the bounds checks and can (optimistically) check for zero lower bounds.

In that case, we could have:

;; outer
t =  &arr + offset + elemSize * i * a.GetLength(1)
  ;; inner
  &a[i,j] = t + elemSize  *  j

Considering a 3-d [i,j,k] loop nest is a little more interesting:

;; i
&a[*,*,*] = t = &arr + offset
t1 = elemSize * i * a.GetLength(1)
  ;; j
  t2 = (t1 + elemSize *  j) * a.GetLength(2)
     ;; k
     t3 = t2 + elemSize  *  k
     &a[i,j,k] = t + t3

or:

;; i
&a[i,*,*] = t = &arr + offset + elemSize * i * a.GetLength(1) * a.GetLength(2)
  ;; j
  &a[i,j,*] t1 = t + elemSize * j * a.GetLength(2)
     ;; k
     &a[i,j,k] = t1 + elemSize * k

Here, there are three multiplications by elemSize and two by GetLength(2), instead of having these accumulate recursively. Note that &arr, offset, elemSize, a.GetLength(1), and a.GetLength(2) are all loop-invariant. With enough temporary registers, we could have:

s1 = &arr + offset
s2 = elemSize * a.GetLength(1) * a.GetLength(2)
s3 = elemSize * a.GetLength(2)
;; i
&a[i,*,*] = t = s1 + i * s2
  ;; j
  &a[i,j,*] t1 = t + j * s3
     ;; k
     &a[i,j,k] = t1 + elemSize * k

Note that all the expressions create legal byrefs, since the indexes are non-negative and within range (already bounds checked).
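
As a source-level analogue of that final shape (illustrative only: a flat int[] stands in for the MD array's contiguous data, since C# cannot index MD storage linearly, and zero lower bounds are assumed):

```
static void CopyHoisted(int[] aData, int[] bData, int n, int len1, int len2)
{
    int s2 = len1 * len2;            // corresponds to s2 above (elemSize folded into int[] indexing)
    int s3 = len2;                   // corresponds to s3 above
    for (int i = 0; i < n; i++)
    {
        int ti = i * s2;             // hoisted out of the j and k loops
        for (int j = 0; j < n; j++)
        {
            int tj = ti + j * s3;    // hoisted out of the k loop
            for (int k = 0; k < n; k++)
                aData[tj + k] = bData[tj + k];
        }
    }
}
```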

@ghost added the in-pr label (There is an active PR which will close this issue when it is merged) Jun 6, 2022
@kunalspathak (Member)

So I think we probably should avoid any of these optimizations except under cloning

Will be interesting to see how this plays out with #68061

BruceForstall added a commit that referenced this issue Jul 5, 2022
Currently, multi-dimensional (MD) array access operations are treated as opaque to most of
the JIT; they pass through the optimization pipeline untouched. Lowering expands the `GT_ARR_ELEM`
node (representing a `a[i,j]` operation, for example) to `GT_ARR_OFFSET` and `GT_ARR_INDEX` trees,
to expand the register requirements of the operation. These are then directly used to generate code.

This change moves the expansion of `GT_ARR_ELEM` to a new pass that follows loop optimization but precedes
Value Numbering, CSE, and the rest of the optimizer. This placement allows for future improvement to
loop cloning to support cloning loops with MD references, but allows the optimizer to kick in on the new
expansion. One nice feature of this change: there is no machine-dependent code required; all the nodes
get lowered to machine-independent nodes before code generation.

The MDBenchI and MDBenchF micro-benchmarks (very targeted to this work) improve about 10% to 60%, but there is
one significant CQ regression in MDMulMatrix of over 20%. Future loop cloning, CSE, and/or LSRA work will be needed to get that back.

In this change, `GT_ARR_ELEM` nodes are morphed to appropriate trees. Note that an MD array `Get`, `Set`, or `Address`
operation is imported as a call, and, if all required conditions are satisfied, is treated as an intrinsic
and replaced by IR nodes, especially `GT_ARR_ELEM` nodes, in `impArrayAccessIntrinsic()`.

For example, a simple 2-dimensional array access like `a[i,j]` looks like:

```
\--*  ARR_ELEM[,] byref
   +--*  LCL_VAR   ref    V00 arg0
   +--*  LCL_VAR   int    V01 arg1
   \--*  LCL_VAR   int    V02 arg2
```

This is replaced by:

```
&a + offset + elemSize * ((i - a.GetLowerBound(0)) * a.GetLength(1) + (j - a.GetLowerBound(1)))
```

plus the appropriate `i` and `j` bounds checks.

In IR, this is:

```
*  ADD       byref
+--*  ADD       long
|  +--*  MUL       long
|  |  +--*  CAST      long <- uint
|  |  |  \--*  ADD       int
|  |  |     +--*  MUL       int
|  |  |     |  +--*  COMMA     int
|  |  |     |  |  +--*  ASG       int
|  |  |     |  |  |  +--*  LCL_VAR   int    V04 tmp1
|  |  |     |  |  |  \--*  SUB       int
|  |  |     |  |  |     +--*  LCL_VAR   int    V01 arg1
|  |  |     |  |  |     \--*  MDARR_LOWER_BOUND int    (0)
|  |  |     |  |  |        \--*  LCL_VAR   ref    V00 arg0
|  |  |     |  |  \--*  COMMA     int
|  |  |     |  |     +--*  BOUNDS_CHECK_Rng void
|  |  |     |  |     |  +--*  LCL_VAR   int    V04 tmp1
|  |  |     |  |     |  \--*  MDARR_LENGTH int    (0)
|  |  |     |  |     |     \--*  LCL_VAR   ref    V00 arg0
|  |  |     |  |     \--*  LCL_VAR   int    V04 tmp1
|  |  |     |  \--*  MDARR_LENGTH int    (1)
|  |  |     |     \--*  LCL_VAR   ref    V00 arg0
|  |  |     \--*  COMMA     int
|  |  |        +--*  ASG       int
|  |  |        |  +--*  LCL_VAR   int    V05 tmp2
|  |  |        |  \--*  SUB       int
|  |  |        |     +--*  LCL_VAR   int    V02 arg2
|  |  |        |     \--*  MDARR_LOWER_BOUND int    (1)
|  |  |        |        \--*  LCL_VAR   ref    V00 arg0
|  |  |        \--*  COMMA     int
|  |  |           +--*  BOUNDS_CHECK_Rng void
|  |  |           |  +--*  LCL_VAR   int    V05 tmp2
|  |  |           |  \--*  MDARR_LENGTH int    (1)
|  |  |           |     \--*  LCL_VAR   ref    V00 arg0
|  |  |           \--*  LCL_VAR   int    V05 tmp2
|  |  \--*  CNS_INT   long   4
|  \--*  CNS_INT   long   32
\--*  LCL_VAR   ref    V00 arg0
```

before being morphed by the usual morph transformations.

Some things to consider:
1. MD arrays have both a lower bound and length for each dimension (even if very few MD arrays actually have a
   non-zero lower bound)
2. The new `GT_MDARR_LOWER_BOUND(dim)` node represents the lower-bound value for a particular array dimension. The "effective index" for a dimension is the index minus the lower bound.
3. The new `GT_MDARR_LENGTH(dim)` node represents the length value (number of elements in a dimension) for a particular array dimension.
4. The effective index is bounds checked against the dimension length.
5. The lower bound and length values are 32-bit signed integers (`TYP_INT`).
6. After constructing a "linearized index", the index is scaled by the array element size, and the offset from
   the array object to the beginning of the array data is added.
7. Much of the complexity above is simply to assign temps to the various values that are used subsequently.
8. The index expressions are used exactly once. However, if they have side effects, they need to be copied,
   early, to preserve exception ordering.
9. Only the top-level operation adds the array object to the scaled, linearized index, to create the final
   address `byref`. As usual, we need to be careful not to create an illegal byref by adding any partial index
   calculation.
10. To avoid doing unnecessary work, the importer sets the global `OMF_HAS_MDARRAYREF` flag if there are any
   MD array expressions to expand. Also, the block flag `BBF_HAS_MDARRAYREF` is set on blocks where these exist,
   so only those blocks are processed.

Remaining work:
1. Implement `optEarlyProp` support for MD arrays.
2. Implement loop cloning support for MD arrays.
3. (optionally) Remove old `GT_ARR_OFFSET` and `GT_ARR_INDEX` nodes and related code, as well as `GT_ARR_ELEM`
code used after the new expansion.
4. Implement improvements in CSE and LSRA to improve codegen for the MDMulMatrix benchmark.

The new early expansion is enabled by default. It can be disabled (even in Release, currently), by setting
`COMPlus_JitEarlyExpandMDArrays=0`. If disabled, it can be selectively enabled using
`COMPlus_JitEarlyExpandMDArraysFilter=<method_set>` (e.g., as specified for `JitDump`).

Fixes #60785.
@ghost removed the in-pr label Jul 5, 2022
@ghost locked as resolved and limited conversation to collaborators Aug 5, 2022