Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Span bounds checking not fully eliminated in some cases when index is constant and length is known #1115

Closed
gdkchan opened this issue Dec 21, 2019 · 2 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@gdkchan
Copy link

gdkchan commented Dec 21, 2019

I have some code that basically reads and writes to a given memory region, with a offset. I had one for each integer type, and also a generic one for structs etc.

I decided, with the use of spans, check if I could remove all the integer read/write methods and just use the generic method instead. So instead of:

int x = ReadInt32(offset);

I could just do:

int x = Read<int>(offset);

After checking the code generated for both, I noticed that the one generated for the generic variant was almost as good as the specialized one, with the exception of an odd issue, the bounds check was not fully eliminated.

Here's some test code that highlights the issue:

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public class C
{
    private IntPtr _ptr;
    
    public C(IntPtr ptr)
    {
        _ptr = ptr;
    }
    
    public int ReadInt32(int offset)
    {
        return Read<int>(offset);
    }
    
    public unsafe int ReadInt32Fast(int offset)
    {
        return *(int*)(_ptr + offset);
    }
    
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public T Read<T>(int offset) where T : struct
    {
        return MemoryMarshal.Cast<byte, T>(GetDataSpan(offset, Unsafe.SizeOf<T>()))[0];
    }
    
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private unsafe Span<byte> GetDataSpan(int offset, int size)
    {
        return new Span<byte>((void*)(_ptr + offset), size);
    }
}

SharpLab link here.
Code generated for the ReadInt32 method:

C.ReadInt32(Int32)
    L0000: sub rsp, 0x28
    L0004: mov rax, [rcx+0x8]
    L0008: movsxd rdx, edx
    L000b: add rax, rdx
    L000e: mov edx, 0x1
    L0013: cmp edx, 0x0
    L0016: jbe L0025
    L0018: mov eax, [rax]
    L001a: add rsp, 0x28
    L001e: ret
    L001f: call 0x7fff231aed50
    L0024: int3
    L0025: call 0x7fff231aef00
    L002a: int3

This is the odd part:

    L000e: mov edx, 0x1
    L0013: cmp edx, 0x0
    L0016: jbe L0025

It's doing a comparison with constant values. As far I can tell, 1 is the length and 0 the index. Since they are constant, constant folding should evaluate 1U <= 0U == false, and so we know that the branch is never going to be taken, since the condition is always false. So the comparison and jump should be eliminated, and then the other part of the code would become unreachable. It looks like L001f is already unreachable aswell, so I don't know why the code is still there. It looks like that whichever optimization pass is responsible for optimizing length checks runs too late on the compiler pipeline, and so the other optimizations are not kicking in.

The goal is generating the same or as efficient code for ReadInt32 and ReadInt32Fast.

category:cq
theme:value-numbering
skill-level:intermediate
cost:medium

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Dec 21, 2019
@jkotas jkotas added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 22, 2019
@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Dec 28, 2019
@AndyAyersMS AndyAyersMS added this to the Future milestone Dec 28, 2019
@AndyAyersMS
Copy link
Member

AndyAyersMS commented Dec 28, 2019

Naively unblocking the constant eval VNF_CastOvf paths in value numbering fixes this (a real fix would need to be more careful about handling overflow). The issue is that we have a dead overflow cast but don't optimize that path away until too late.

G_M5540_IG01:
       sub      rsp, 40
						;; bbWeight=1    PerfScore 0.25
G_M5540_IG02:
       mov      rax, qword ptr [rcx+8]
       movsxd   rdx, edx
       add      rax, rdx
       mov      eax, dword ptr [rax]
						;; bbWeight=1    PerfScore 4.50
G_M5540_IG03:
       add      rsp, 40
       ret      
						;; bbWeight=1    PerfScore 1.25
G_M5540_IG04:
       call     CORINFO_HELP_OVERFLOW
       int3 

So if value numbering handled VNF_CastOvf with constant args (or if perhaps we did sparse conditional evaluation to avoid getting confused by at dead stuff), we would be able to simplify this.

Also would be nice to get rid of the dead overflow call.

@jakobbotsch
Copy link
Member

Codegen today:

; Assembly listing for method C:ReadInt32(int):int:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 1 inlinees with PGO data; 8 single block inlinees; 2 inlinees without PGO data
; Final local variable assignments
;
;  V00 this         [V00,T00] (  3,  3   )     ref  ->  rcx         this class-hnd single-def
;  V01 arg1         [V01,T01] (  3,  3   )     int  ->  rdx         single-def
;  V02 OutArgs      [V02    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;* V03 tmp1         [V03    ] (  0,  0   )  struct (16) zero-ref    "struct address for call/obj"
;* V04 tmp2         [V04    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inline stloc first use temp"
;* V05 tmp3         [V05    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "NewObj constructor temp"
;  V06 tmp4         [V06,T03] (  2,  4   )    long  ->  rax         "Inlining Arg"
;* V07 tmp5         [V07,T06] (  0,  0   )     int  ->  zero-ref    "Inline stloc first use temp"
;* V08 tmp6         [V08,T07] (  0,  0   )     int  ->  zero-ref    "Inline stloc first use temp"
;* V09 tmp7         [V09    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V10 tmp8         [V10,T08] (  0,  0   )     int  ->  zero-ref    "Inline stloc first use temp"
;* V11 tmp9         [V11    ] (  0,  0   )    long  ->  zero-ref    "Inline stloc first use temp"
;* V12 tmp10        [V12,T09] (  0,  0   )     int  ->  zero-ref    "Inline stloc first use temp"
;* V13 tmp11        [V13    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "NewObj constructor temp"
;  V14 tmp12        [V14,T02] (  2,  4   )   byref  ->  rax         single-def "Inlining Arg"
;* V15 tmp13        [V15    ] (  0,  0   )    bool  ->  zero-ref    "Inlining Arg"
;* V16 tmp14        [V16    ] (  0,  0   )    bool  ->  zero-ref    "Inlining Arg"
;* V17 tmp15        [V17    ] (  0,  0   )   byref  ->  zero-ref    V03._reference(offs=0x00) P-INDEP "field V03._reference (fldOffset=0x0)"
;* V18 tmp16        [V18    ] (  0,  0   )     int  ->  zero-ref    V03._length(offs=0x08) P-INDEP "field V03._length (fldOffset=0x8)"
;* V19 tmp17        [V19    ] (  0,  0   )   byref  ->  zero-ref    V04._reference(offs=0x00) P-INDEP "field V04._reference (fldOffset=0x0)"
;* V20 tmp18        [V20    ] (  0,  0   )     int  ->  zero-ref    V04._length(offs=0x08) P-INDEP "field V04._length (fldOffset=0x8)"
;  V21 tmp19        [V21,T04] (  2,  2   )   byref  ->  rax         V05._reference(offs=0x00) P-INDEP "field V05._reference (fldOffset=0x0)"
;* V22 tmp20        [V22    ] (  0,  0   )     int  ->  zero-ref    V05._length(offs=0x08) P-INDEP "field V05._length (fldOffset=0x8)"
;  V23 tmp21        [V23,T05] (  2,  2   )   byref  ->  rax         single-def V09._reference(offs=0x00) P-INDEP "field V09._reference (fldOffset=0x0)"
;* V24 tmp22        [V24    ] (  0,  0   )     int  ->  zero-ref    V09._length(offs=0x08) P-INDEP "field V09._length (fldOffset=0x8)"
;* V25 tmp23        [V25    ] (  0,  0   )   byref  ->  zero-ref    V13._reference(offs=0x00) P-INDEP "field V13._reference (fldOffset=0x0)"
;* V26 tmp24        [V26    ] (  0,  0   )     int  ->  zero-ref    V13._length(offs=0x08) P-INDEP "field V13._length (fldOffset=0x8)"
;
; Lcl frame size = 40

G_M5476_IG01:
       sub      rsp, 40
						;; size=4 bbWeight=1    PerfScore 0.25
G_M5476_IG02:
       movsxd   rax, edx
       add      rax, qword ptr [rcx+08H]
       mov      eax, dword ptr [rax]
						;; size=9 bbWeight=1    PerfScore 5.25
G_M5476_IG03:
       add      rsp, 40
       ret      
						;; size=5 bbWeight=1    PerfScore 1.25
G_M5476_IG04:
       call     CORINFO_HELP_OVERFLOW
       int3     
						;; size=6 bbWeight=0    PerfScore 0.00

; Total bytes of code 24, prolog size 4, PerfScore 9.15, instruction count 8, allocated bytes for code 24 (MethodHash=5011ea9b) for method C:ReadInt32(int):int:this
; ============================================================

Looks fixed, although we still leave the dead throw helper around.

@ghost ghost locked as resolved and limited conversation to collaborators Jan 15, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

No branches or pull requests

6 participants