Better integral range lowering: start..finish, start..step..finish #16650

Merged: 86 commits into dotnet:main on Mar 6, 2024

Contributor

@brianrourkeboll brianrourkeboll commented Feb 5, 2024

Description

This subsumes #16577 (and #13573).

Fixes #938.
Fixes #9548.

TL;DR

  • 5 – 6× speedup (for integral types other than int32, and int32 when step is not a constant 1 or -1)

    for … in start..finish do
    for … in start..step..finish do
  • 2.5 – 8× speedup

    [|start..finish|]
    [|start..step..finish|]
  • 1.25 – 5× speedup

    [start..finish]
    [start..step..finish]

Before

The compiler currently only optimizes for-loops over integral ranges when:

  1. The type is int/int32.
  2. The step is a compile-time constant 1 or -1.

for-loops over ranges for other integral types (int64, uint32, etc.), or when the step is not 1 or -1, use a relatively slow, IEnumerable<_>-based implementation.

List and array comprehension expressions initialized by an integral range also use the slow IEnumerable<_>-based implementation, and, in the case of arrays, incur additional unnecessary array allocations and copying.
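
To make the distinction concrete, here is a small sketch (the function names are illustrative and do not come from the PR) of one loop that already took the fast path before this change and two that fell back to the IEnumerable<_>-based implementation:

```fsharp
// int32 with the implicit step 1: already compiled to a fast loop before this PR.
let sumInt32 (n: int) =
    let mutable acc = 0
    for i in 1..n do
        acc <- acc + i
    acc

// int64 range: previously enumerated through IEnumerable<int64>.
let sumInt64 (n: int64) =
    let mutable acc = 0L
    for i in 1L..n do
        acc <- acc + i
    acc

// int32 with a step other than 1 or -1: previously also took the slow path.
let sumOdd (n: int) =
    let mutable acc = 0
    for i in 1..2..n do
        acc <- acc + i
    acc
```

All three produce the same results before and after the PR; only the generated code differs.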

After

With this PR, the compiler now lowers for-loops for all built-in integral types — with any arbitrary step — down to fast integral while-loops.

This optimization is in turn applied to the lowering of computed list collection expressions where an integral range is used to initialize the list, as in [start..finish] and [start..step..finish]; a similar technique is used for arrays in [|start..finish|] and [|start..step..finish|], where the total size of the array is computed from the start, step, and finish, at compile-time if possible. Such initialization expressions are now significantly faster and allocate significantly less, especially for arrays and for smaller lists.

  • for-loops over integral ranges of non-int32-types, and/or with arbitrary steps, are about 5 – 6 times as fast and no longer allocate at all
  • Array initialization from an integral range is about 2.5 – 8 times as fast and allocates less than half as much
  • List initialization from an integral range is about 1.25 – 5 times as fast and allocates a bit less

These optimizations are not visible in quotations.

The following are the approximate high-level transformations applied to for-loops and list and array comprehensions over integral ranges.

The basic idea is to precompute the iteration count like this:

start..finish ($step = 1$)

let count = if finish < start then 0 else unsigned (widen (finish - start)) + 1
// We "widen" to the next-biggest integral type, since the count
// won't fit in the original type if finish - start = MaxValue.
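
As a worked example of why widening is needed, the following sketch computes the count for an int32 range by hand. The name `countInt32` is hypothetical, and mapping "unsigned" and "widen" to `uint32` and `uint64` is one reading of the pseudocode, not the compiler's actual code:

```fsharp
// Number of elements in start..finish for int32, held in a uint64.
// The subtraction may wrap (F# arithmetic is unchecked by default), but
// reinterpreting the result as uint32 recovers the true distance.
let countInt32 (start: int) (finish: int) : uint64 =
    if finish < start then 0UL
    else uint64 (uint32 (finish - start)) + 1UL
```

For the full range `Int32.MinValue..Int32.MaxValue`, the distance is 4294967295 and the count is 4294967296, which fits in neither int32 nor uint32.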

start..step..finish

if step = 0 then
    ignore ((.. ..) start step finish) // Throws the appropriate localized exception at runtime.

let count =
    if 0 < step then
        if finish < start then 0 else unsigned (widen ((finish - start) / step)) + 1
    else // step < 0
        if start < finish then 0 else unsigned (widen ((start - finish) / (unsigned ~~~step + 1))) + 1
        // We use unsigned ~~~step + 1 to get the step's absolute value, since step might be the minimum value
        // for the given numeric type, and the neg instruction won't do what we want on the minimum value
        // of a two's complement number.
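
The stepped count can be sketched the same way for int32. The name `countStepped` and the exception message are placeholders (the real lowering calls the range operator so that the localized message is used), and the choice of `uint32`/`uint64` is an assumption about how the pseudocode maps to concrete types:

```fsharp
// Number of elements in start..step..finish for int32, held in a uint64.
let countStepped (start: int) (step: int) (finish: int) : uint64 =
    if step = 0 then
        // Placeholder: the real code lets the range operator throw.
        invalidArg "step" "The step of a range cannot be zero."
    elif 0 < step then
        if finish < start then 0UL
        else uint64 (uint32 (finish - start) / uint32 step) + 1UL
    else
        if start < finish then 0UL
        else
            // |step| computed as ~~~step + 1 in unsigned arithmetic, which is
            // also correct when step = Int32.MinValue (plain negation is not).
            let absStep = uint32 (~~~step) + 1u
            uint64 (uint32 (start - finish) / absStep) + 1UL
```

For example, `countStepped 10 (-3) 1` covers 10, 7, 4, 1, and `countStepped 0 Int32.MinValue Int32.MinValue` covers 0 and `Int32.MinValue`.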

This lets us handle any potential overflows just once, instead of in each iteration of the loop. The loop then becomes simply:

let mutable loopVar = start
let mutable i = 0

while i < count do
    …
    loopVar <- loopVar + step
    i <- i + 1

If the range type is already 64-bit, or might be (for native ints), we have nothing to widen the count to — so we check whether computing the full count would overflow, and, if it would, we start at 0 and loop until we hit 0 again, indicating that we have wrapped around.
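
The wrap-around idea is easier to test on a small type. The sketch below is an 8-bit analogy, not the PR's 64-bit code: it pretends there is nothing wider than `byte` to hold the count, so the one range whose count (256) does not fit is handled by looping until the counter wraps back to 0. The name `iterByteRange` is hypothetical:

```fsharp
// Iterates start..finish over bytes using only byte-sized arithmetic.
let iterByteRange (start: byte) (finish: byte) (body: byte -> unit) =
    if start <= finish then
        let mutable loopVar = start
        let mutable i = 0uy
        if finish - start = System.Byte.MaxValue then
            // 256 iterations: the count would wrap to 0, so run the body once
            // and then continue until the counter itself wraps back to 0.
            body loopVar
            loopVar <- loopVar + 1uy
            i <- i + 1uy
            while i <> 0uy do
                body loopVar
                loopVar <- loopVar + 1uy
                i <- i + 1uy
        else
            // The count fits, so the ordinary counted loop applies.
            let count = finish - start + 1uy
            while i < count do
                body loopVar
                loopVar <- loopVar + 1uy
                i <- i + 1uy
```

The overflow check runs once, before the loop, so neither branch needs a per-iteration overflow test.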

Additional optimizations are applied for various scenarios:

  • If start, step, and finish are constants, we can compute the count at build-time.
  • If step is constant, or if the integral type is unsigned, we can simplify the code we emit for computing the count at runtime: we potentially don't need to emit code to check whether $0 < step$, find $|step|$, etc.
  • Etc.
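
As an illustration of the unsigned simplification, here is a sketch for uint32 (the name `countUnsigned` and the exception message are placeholders). Because the step cannot be negative, there is no sign check and no absolute value to compute:

```fsharp
// Number of elements in start..step..finish for uint32, held in a uint64.
let countUnsigned (start: uint32) (step: uint32) (finish: uint32) : uint64 =
    if step = 0u then
        // Placeholder: the real code lets the range operator throw.
        invalidArg "step" "The step of a range cannot be zero."
    elif finish < start then 0UL
    else uint64 ((finish - start) / step) + 1UL
```

When `start`, `step`, and `finish` are all literals, the same arithmetic can be done by the compiler, leaving only a constant in the emitted code.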

Examples:

for-loops

for loopVar in start..finish do …

let count = if finish < start then 0 else unsigned (widen (finish - start)) + 1
let mutable loopVar = start
let mutable i = 0

while i < count do
    …
    loopVar <- loopVar + 1
    i <- i + 1

for loopVar in start..step..finish do …

if step = 0 then
    // Call the range operator so that it throws the appropriate localized exception.
    ignore ((.. ..) start step finish)

let count =
    if 0 < step then
        if finish < start then 0 else unsigned (finish - start) / unsigned step + 1
    else // step < 0
        if start < finish then 0 else unsigned (start - finish) / (unsigned ~~~step + 1) + 1

let mutable loopVar = start
let mutable i = 0

while i < count do
    …
    loopVar <- loopVar + step
    i <- i + 1
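
The stepped lowering above can be written out as an ordinary function to check that it visits the same values as the range. This is a sketch for int32 only: `iterLowered` is a hypothetical name, the exception message is a placeholder, and the `uint32`/`uint64` conversions are an assumed reading of "unsigned" and "widen":

```fsharp
// Calls body on each element of start..step..finish using a counted while-loop.
let iterLowered (start: int) (step: int) (finish: int) (body: int -> unit) =
    if step = 0 then
        invalidArg "step" "The step of a range cannot be zero."

    let count : uint64 =
        if 0 < step then
            if finish < start then 0UL
            else uint64 (uint32 (finish - start) / uint32 step) + 1UL
        else
            if start < finish then 0UL
            else uint64 (uint32 (start - finish) / (uint32 (~~~step) + 1u)) + 1UL

    let mutable loopVar = start
    let mutable i = 0UL

    while i < count do
        body loopVar
        loopVar <- loopVar + step // May wrap after the last iteration; the value is then unused.
        i <- i + 1UL
```

Because the loop is driven by `i` rather than by comparing `loopVar` with `finish`, a final increment that overflows is harmless.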

Lists

[start..finish]

let count = if finish < start then 0 else unsigned (finish - start) + 1
let mutable collector = ListCollector ()
let mutable loopVar = start
let mutable i = 0

while i < count do
    collector.Add loopVar
    loopVar <- loopVar + 1
    i <- i + 1

collector.Close ()

[start..step..finish]

if step = 0 then
    // Call the range operator so that it throws the appropriate localized exception.
    ignore ((.. ..) start step finish)

let count =
    if 0 < step then
        if finish < start then 0 else unsigned (finish - start) / unsigned step + 1
    else // step < 0
        if start < finish then 0 else unsigned (start - finish) / (unsigned ~~~step + 1) + 1

let mutable collector = ListCollector ()
let mutable loopVar = start
let mutable i = 0

while i < count do
    collector.Add loopVar
    loopVar <- loopVar + step
    i <- i + 1

collector.Close ()
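
For reference, the list lowering can be reproduced by hand with the `ListCollector<'T>` struct from FSharp.Core's `Microsoft.FSharp.Core.CompilerServices` namespace. This is a sketch for int32 with step 1; `listOfRange` is a hypothetical name, and the explicit `uint64` count is an assumption about the widening:

```fsharp
open Microsoft.FSharp.Core.CompilerServices

// Builds [start..finish] with a counted while-loop and a ListCollector.
let listOfRange (start: int) (finish: int) : int list =
    let count =
        if finish < start then 0UL
        else uint64 (uint32 (finish - start)) + 1UL

    // ListCollector is a mutable struct, so it must be bound with 'let mutable'.
    let mutable collector = ListCollector<int>()
    let mutable loopVar = start
    let mutable i = 0UL

    while i < count do
        collector.Add loopVar
        loopVar <- loopVar + 1
        i <- i + 1UL

    collector.Close()
```

`Close` returns the accumulated list, so an empty range yields `[]` without any special case.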

Arrays

[|start..finish|]

let count = if finish < start then 0 else unsigned (finish - start) + 1

if count < 1 then
    [||]
else
    let array = (# "newarr !0" type ('T) count : 'T array #)
    let mutable loopVar = start
    let mutable i = 0

    while i < count do
        array[i] <- loopVar
        loopVar <- loopVar + 1
        i <- i + 1

    array

[|start..step..finish|]

if step = 0 then
    // Call the range operator so that it throws the appropriate localized exception.
    ignore ((.. ..) start step finish)

let count =
    if 0 < step then
        if finish < start then 0 else unsigned (finish - start) / unsigned step + 1
    else // step < 0
        if start < finish then 0 else unsigned (start - finish) / (unsigned ~~~step + 1) + 1

if count < 1 then
    [||]
else
    let array = (# "newarr !0" type ('T) count : 'T array #)
    let mutable loopVar = start
    let mutable i = 0

    while i < count do
        array[i] <- loopVar
        loopVar <- loopVar + step
        i <- i + 1

    array
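
A hand-written equivalent of the array lowering, for int32, might look like the sketch below. It uses `Array.zeroCreate` in place of the inline `newarr` IL, assumes the count fits in an `int`, and uses a placeholder exception message; `arrayOfRange` is a hypothetical name:

```fsharp
// Builds [|start..step..finish|] by allocating the exact size up front.
let arrayOfRange (start: int) (step: int) (finish: int) : int[] =
    if step = 0 then
        invalidArg "step" "The step of a range cannot be zero."

    let count : uint64 =
        if 0 < step then
            if finish < start then 0UL
            else uint64 (uint32 (finish - start) / uint32 step) + 1UL
        else
            if start < finish then 0UL
            else uint64 (uint32 (start - finish) / (uint32 (~~~step) + 1u)) + 1UL

    if count < 1UL then
        [||]
    else
        // Assumption: count is small enough to be a valid array length.
        let result = Array.zeroCreate<int> (int count)
        let mutable loopVar = start
        let mutable i = 0

        while i < result.Length do
            result.[i] <- loopVar
            loopVar <- loopVar + step
            i <- i + 1

        result
```

Allocating once at the final size is what removes the intermediate buffers and copying that the IEnumerable<_>-based path incurred.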

Benchmarks

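In each pair of rows below, the first row (Ratio 1.00) is the baseline and the second is the result with this PR.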

for-loops

| Categories                                                    | Mean           | Ratio | Gen0   | Allocated | Alloc Ratio |
|-------------------------------------------------------------- |---------------:|------:|-------:|----------:|------------:|
| Int32,1,10..1                                                 |      0.4186 ns |  1.00 |      - |         - |          NA |
| Int32,1,10..1                                                 |      0.4429 ns |  1.06 |      - |         - |          NA |
|                                                               |                |       |        |           |             |
| Int32,2,1..256                                                |     60.3479 ns |  1.00 |      - |         - |          NA |
| Int32,2,1..256                                                |     60.3165 ns |  1.00 |      - |         - |          NA |
|                                                               |                |       |        |           |             |
| Int32,3,start..finish (start=1,finish=65536)                  | 14,026.4404 ns |  1.00 |      - |         - |          NA |
| Int32,3,start..finish (start=1,finish=65536)                  | 13,769.3827 ns |  0.98 |      - |         - |          NA |
|                                                               |                |       |        |           |             |
| Int32,4,1..2..256                                             |    200.5574 ns |  1.00 | 0.0076 |      96 B |        1.00 |
| Int32,4,1..2..256                                             |     39.6841 ns |  0.20 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| Int32,5,start..step..finish (start=1,step=2,finish=65536)     | 45,387.6799 ns |  1.00 |      - |      96 B |        1.00 |
| Int32,5,start..step..finish (start=1,step=2,finish=65536)     |  8,746.4382 ns |  0.19 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| Int64,1,10L..1L                                               |     12.9186 ns |  1.00 | 0.0096 |     120 B |        1.00 |
| Int64,1,10L..1L                                               |      0.4135 ns |  0.03 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| Int64,2,1L..256L                                              |    357.6166 ns |  1.00 | 0.0076 |      96 B |        1.00 |
| Int64,2,1L..256L                                              |     73.1035 ns |  0.20 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| Int64,3,start..finish (start=1L,finish=65536L)                | 93,050.3060 ns |  1.00 |      - |      96 B |        1.00 |
| Int64,3,start..finish (start=1L,finish=65536L)                | 17,280.7731 ns |  0.19 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| Int64,4,1L..2L..256L                                          |    196.9976 ns |  1.00 | 0.0095 |     120 B |        1.00 |
| Int64,4,1L..2L..256L                                          |     39.7791 ns |  0.20 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| Int64,5,start..step..finish (start=1L,step=2L,finish=65536L)  | 43,709.2550 ns |  1.00 |      - |     120 B |        1.00 |
| Int64,5,start..step..finish (start=1L,step=2L,finish=65536L)  |  8,638.8727 ns |  0.20 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| UInt32,1,10u..1u                                              |     12.4365 ns |  1.00 | 0.0076 |      96 B |        1.00 |
| UInt32,1,10u..1u                                              |      0.4143 ns |  0.03 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| UInt32,2,1u..256u                                             |    361.1339 ns |  1.00 | 0.0062 |      80 B |        1.00 |
| UInt32,2,1u..256u                                             |     73.0952 ns |  0.20 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| UInt32,3,start..finish (start=1u,finish=65536u)               | 93,666.9857 ns |  1.00 |      - |      80 B |        1.00 |
| UInt32,3,start..finish (start=1u,finish=65536u)               | 17,288.2377 ns |  0.18 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| UInt32,4,1u..2u..256u                                         |    200.0141 ns |  1.00 | 0.0076 |      96 B |        1.00 |
| UInt32,4,1u..2u..256u                                         |     39.4608 ns |  0.20 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| UInt32,5,start..step..finish (start=1u,step=2u,finish=65536u) | 44,989.2765 ns |  1.00 |      - |      96 B |        1.00 |
| UInt32,5,start..step..finish (start=1u,step=2u,finish=65536u) |  8,728.6997 ns |  0.19 |      - |         - |        0.00 |

Lists & arrays

| Categories                                                    | Mean            | Ratio | Gen0     | Gen1     | Gen2     | Allocated | Alloc Ratio |
|-------------------------------------------------------------- |----------------:|------:|---------:|---------:|---------:|----------:|------------:|
| Array,1,[|10..1|]                                             |      17.1115 ns |  1.00 |   0.0076 |        - |        - |      96 B |        1.00 |
| Array,1,[|10..1|]                                             |       0.4347 ns |  0.03 |        - |        - |        - |         - |        0.00 |
|                                                               |                 |       |          |          |          |           |             |
| Array,2,[|1..10|]                                             |      57.1416 ns |  1.00 |   0.0261 |        - |        - |     328 B |        1.00 |
| Array,2,[|1..10|]                                             |       6.8734 ns |  0.12 |   0.0051 |        - |        - |      64 B |        0.20 |
|                                                               |                 |       |          |          |          |           |             |
| Array,3,[|1..256|]                                            |     423.2502 ns |  1.00 |   0.1817 |        - |        - |    2280 B |        1.00 |
| Array,3,[|1..256|]                                            |     117.0278 ns |  0.28 |   0.0834 |        - |        - |    1048 B |        0.46 |
|                                                               |                 |       |          |          |          |           |             |
| Array,4,[|start..finish|] (start=1,finish=65536)              | 193,688.7386 ns |  1.00 | 124.7559 | 124.7559 | 124.7559 |  524754 B |        1.00 |
| Array,4,[|start..finish|] (start=1,finish=65536)              |  79,806.0994 ns |  0.41 |  83.2520 |  83.2520 |  83.2520 |  262196 B |        0.50 |
|                                                               |                 |       |          |          |          |           |             |
| Array,5,[|1..2..256|]                                         |     355.8656 ns |  1.00 |   0.0992 |        - |        - |    1248 B |        1.00 |
| Array,5,[|1..2..256|]                                         |      64.2190 ns |  0.18 |   0.0427 |        - |        - |     536 B |        0.43 |
|                                                               |                 |       |          |          |          |           |             |
| Array,6,[|start..step..finish|] (start=1,step=2,finish=65536) | 106,117.6473 ns |  1.00 |  41.6260 |  41.6260 |  41.6260 |  262574 B |        1.00 |
| Array,6,[|start..step..finish|] (start=1,step=2,finish=65536) |  40,447.5468 ns |  0.38 |  41.6260 |  41.6260 |  41.6260 |  131110 B |        0.50 |
|                                                               |                 |       |          |          |          |           |             |
| List,1,[10..1]                                                |      16.0046 ns |  1.00 |   0.0076 |        - |        - |      96 B |        1.00 |
| List,1,[10..1]                                                |       1.0626 ns |  0.07 |        - |        - |        - |         - |        0.00 |
|                                                               |                 |       |          |          |          |           |             |
| List,2,[1..10]                                                |      54.4243 ns |  1.00 |   0.0318 |        - |        - |     400 B |        1.00 |
| List,2,[1..10]                                                |      32.2665 ns |  0.59 |   0.0255 |        - |        - |     320 B |        0.80 |
|                                                               |                 |       |          |          |          |           |             |
| List,3,[1..256]                                               |     948.1524 ns |  1.00 |   0.6590 |   0.0191 |        - |    8272 B |        1.00 |
| List,3,[1..256]                                               |     748.6352 ns |  0.79 |   0.6523 |   0.0191 |        - |    8192 B |        0.99 |
|                                                               |                 |       |          |          |          |           |             |
| List,4,[start..finish] (start=1,finish=65536)                 | 452,512.8544 ns |  1.00 | 166.9922 | 154.7852 |        - | 2097232 B |        1.00 |
| List,4,[start..finish] (start=1,finish=65536)                 | 407,472.2145 ns |  0.89 | 166.9922 | 154.7852 |        - | 2097152 B |        1.00 |
|                                                               |                 |       |          |          |          |           |             |
| List,5,[1..2..256]                                            |     581.3088 ns |  1.00 |   0.3338 |   0.0048 |        - |    4192 B |        1.00 |
| List,5,[1..2..256]                                            |     380.2920 ns |  0.65 |   0.3262 |   0.0048 |        - |    4096 B |        0.98 |
|                                                               |                 |       |          |          |          |           |             |
| List,6,[start..step..finish] (start=1,step=2,finish=65536)    | 188,436.8229 ns |  1.00 |  83.4961 |  71.7773 |        - | 1048672 B |        1.00 |
| List,6,[start..step..finish] (start=1,step=2,finish=65536)    | 150,491.5234 ns |  0.80 |  83.4961 |  71.5332 |        - | 1048576 B |        1.00 |

Checklist

  • Test cases added.
    • All existing tests involving ranges or comprehensions of ranges, like these and these, pass.
  • Performance benchmarks added in case of performance changes.
    • I have included some basic illustrative benchmarks in this description, but I'm willing to run more if desired.
  • Release notes entry updated.

Design/implementation notes

  • In theory, these optimizations could subsume the existing optimization that applies only to for-loops over int32 ranges with constant 1 or -1 steps. The performance should be comparable, if not identical (and is, in a quick spot-check). I have not replaced or removed the existing transformation in this PR. Note that, for compatibility's sake, if we ever did want to switch int ranges over to use this codegen, we'd need to do it only when OptimizeForExpressionOptions = OptimizeAllForExpressions, since the existing lowering to TOp.IntegerForLoop is visible in quotations.

    Alternatively, the existing TOp.IntegerForLoop construct itself could have been extended to support arbitrary steps and integer types, although that may have compatibility implications since changes to or new usage of TOp.IntegerForLoop would be visible in quotations. Compare Optimize integer for loop code gen #13573.

  • The emitted IL looks reasonably good to me, but additional optimizations could be made in certain cases:

    • If we knew that $start = 0$ and $step = 1$, we could consolidate $idxVar$ and $loopVar$ and halve the number of increment operations.
    • If we knew that $|finish - start| \leq maxValue(ty)$, we wouldn't need to widen the $count$ variable.
    • Try to ensure we emit loop patterns that the JIT knows how to optimize.
    • Try to emit patterns that will let the JIT omit bounds checks when initializing arrays (if we know that $count \leq 2^{32}$).
    • Etc.

    The returns diminish quickly relative to the complexity they add to the compiler, though.

  • I'm emitting IL instructions for arithmetic, comparison, etc., rather than their library equivalents. That's mainly because the call to LowerComputedCollections.LowerComputedListOrArrayExpr is made in IlxGen.fs, which happens after type-checking and optimization, which means that calls to library-level operators won't work. If we extended the TOp.IntegerForLoop construct instead, though, perhaps we could defer the lower-level codegen to IlxGen.fs.

  • Rather than directly emitting loops for collection initialization, we could instead just emit code to compute the count, and then call Array.init/List.init as in Better lowering of [start..finish] & [|start..finish|] #16577. This wouldn't really save on code size, though, since closure types would end up being emitted instead. If we really wanted to minimize code size while keeping the performance boost, we could add some kind of new library method for fast collection initialization from ranges and just emit calls to that. I'm not exactly sure where that would go, though, or how it would be exposed.

* Lower for-loops over integral ranges to fast while-loops for all
  built-in integral types.

* Lower `[start..finish]`, `[start..step..finish]`, `[|start..finish|]`,
  `[|start..step..finish|]` to fast integral while-loop initializers.
Contributor

github-actions bot commented Feb 5, 2024

❗ Release notes required


✅ Found changes and release notes in following paths:

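| Change path          | Release notes path                                     | Description |
|----------------------|--------------------------------------------------------|-------------|
| src/Compiler         | docs/release-notes/.FSharp.Compiler.Service/8.0.300.md |             |
| LanguageFeatures.fsi | docs/release-notes/.Language/preview.md                |             |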

@brianrourkeboll brianrourkeboll marked this pull request as ready for review February 5, 2024 14:45
@brianrourkeboll brianrourkeboll requested a review from a team as a code owner February 5, 2024 14:45
@psfinaki
Member

psfinaki commented Feb 6, 2024

/azp run


Azure Pipelines successfully started running 2 pipeline(s).

@brianrourkeboll
Contributor Author

Ah. I see what the problem is. I'll put this back in draft mode until it's fixed.

@brianrourkeboll brianrourkeboll marked this pull request as draft February 7, 2024 23:52
@KevinRansom
Member

@brianrourkeboll,

> @KevinRansom Do you think it makes sense to keep them duplicated for consistency's sake, or should I extract the new tests that I added and put them under a separate EmittedIL/ForEachRangeLoop directory?

Good question. I think running the tests both ways is important to ensure that we don't inadvertently regress the reachability of the values. But whole-IL-file comparison makes the tests vulnerable to lots of other ilgen modifications.

The reason I provided both was that I wanted to ensure that the visibility feature, which was about as risky as any I have attempted in terms of "possible breaking changes", produced reviewable changes and that those changes were explainable.

Now that the realsig feature is in, perhaps less proving is necessary, and mere regression proofing is the way to go. Perhaps a smoke test with it on and off and the remainder with it off. We will have to remember to turn it to on when we enable realsig+ by default :-)

@T-Gro
Member

T-Gro commented Mar 4, 2024

(There were 3 failures, but I suspect all 3 of them were for flaky reasons - retrying to see)

@vzarytovskii vzarytovskii requested a review from KevinRansom March 4, 2024 18:05
@brianrourkeboll
Contributor Author

I have a few more improvements ready —

  • [for … in start..finish -> …], [|for … in start..finish -> …|], etc., using this optimization
  • [for x in xs -> …] using List.map when xs is a list
  • [|for x in xs -> …|] using Array.map when xs is an array

— but I guess I'll wait to open a separate PR for those to avoid making this one any bigger than it already is...

@vzarytovskii vzarytovskii merged commit 34df502 into dotnet:main Mar 6, 2024
31 checks passed
@psfinaki
Member

psfinaki commented Mar 6, 2024

@brianrourkeboll yeah feel free to open followups - these look promising!

Successfully merging this pull request may close these issues.

  • Using for-loops with the ".." operator produces suboptimal code
  • Striding loop performance