Performance of splatting a number #29114

Closed
KristofferC opened this issue Sep 10, 2018 · 21 comments · Fixed by #36684
Labels
compiler:inference Type inference compiler:optimizer Optimization passes (mostly in base/compiler/ssair/) performance Must go faster

Comments

@KristofferC

In a few places, the performance of splatting a number has shown up (#29060, #28764 (comment)), and working around it gives significant speed-ups.
The code is well inferred, but there is a call to _apply instead of the actual method. This disappears if the number is wrapped in a tuple.
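
For illustration, a minimal sketch of the workaround (hypothetical names, not code from the linked PRs):

```julia
g(args...) = +(args...)   # any function that splats its arguments

splat_number(x)  = g(x...)      # splats a Number directly: before the fix, @code_typed showed a Core._apply call here
splat_wrapped(x) = g((x,)...)   # wrapping in a tuple first lets the inliner handle the splat
```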

It would be nice to have this fixed so that these kinds of workarounds are no longer needed.

Ref #27434 (comment)

cc @mbauman, @martinholters

KristofferC added the performance (Must go faster) label on Sep 10, 2018

Keno commented Sep 10, 2018

We could special case numbers as we do tuples in the inliner. The more generic solution would have to emulate the iteration protocol.


mbauman commented Sep 10, 2018

Forgive me if this is a naive understanding of inference, but it seems like there's a middle ground here — isn't it fairly easy to see that iterate(::X, ::Any)::Const(nothing) for Xs that only iterate once?
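
(For reference, this is the existing Base behaviour being relied on here: a Number iterates exactly once, so the second iterate call always returns nothing.)

```julia
julia> iterate(1)
(1, nothing)

julia> iterate(1, nothing) === nothing   # second call: no more elements
true
```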


Keno commented Sep 10, 2018

Yes, but that's still emulating the iteration protocol in the optimizer (inference already does some of this), and the iterate calls aren't part of the code (since _apply does the iteration internally). Basically you need to implement this TODO:

# TODO: We could basically run the iteration protocol here

or special case it there.


c42f commented Jul 25, 2019

I just noticed we share basically the same problem over at JuliaArrays/StaticArrays.jl#361.

What if we change the lowering of splatting so that f(x...) turns into something like Core._apply(f, Base.splatarg(x))? We can have Base.splatarg(x) = x by default (or possibly collect into a tuple via iteration?). For the more trivial cases like Number and StaticVector, a couple of overrides would solve this issue:

Base.splatarg(x::Number) = (x,)
Base.splatarg(v::StaticVector) = Tuple(v)

This does mean that splatting would become configurable by user types, but that seems generally consistent with the way we usually lower syntax into function calls.

@yurivish

Nice idea. Maybe it can be Base.splattable for consistency with Base.broadcastable.


c42f commented Jul 26, 2019

Base.splattable ...

Interesting analogy; we could go with it, though I think the name is less evocative than splatarg.

@Keno's point about making inference aware of iteration in _apply might just be a better option if we want splatting to be inherently defined by iteration. Right now I can't think of a reason why you'd ever want splatting to be different, but maybe someone with a better vision of the Julian far future can (I'm looking at you @andyferris :-P)

@andyferris

Haha... no, I think I believe in strong interfaces, and in guarantees the compiler doesn't break for the purpose of performance optimizations.

Using splatarg at least puts control/correctness in the hands of users. On the other hand, it puts control/correctness in the hands of users...! It feels a bit like literal_pow: a clever and useful lowering trick, but one that relies on users' good behaviour to maintain the illusion of the fundamental guarantee of referential transparency (not saying I've thought of anything better...).

So, to my personal tastes we would only resort to that if Keno’s approach is too messy. But my opinion isn’t strong.


c42f commented Jul 26, 2019

Yes, something which can be overridden only really makes sense if there's a compelling reason to allow splatting to be different from iteration in some cases. Otherwise it's just a nasty performance hack. My question was really whether such a use case exists. In hindsight that's actually rather orthogonal to solving this issue.

Another thought: I don't know how hard it is to teach inference about iteration in _apply, but an interesting alternative would be to surface the C implementation of argument list construction in _apply to the Julia level. That is, to lower

f(x, y..., z...)

into something like

args = flatten_args((x,), y, z)
_apply(f, args)

where flatten_args applies the iteration protocol to construct the splatted argument list.
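
A minimal sketch of how such a flatten_args helper could behave (the name and the lowering are proposals from this comment, not existing Base functionality; this version is deliberately naive and type-unstable):

```julia
function flatten_args(parts...)
    args = ()
    for part in parts
        for item in part              # ordinary iteration protocol
            args = (args..., item)    # grow the flattened argument tuple
        end
    end
    return args
end

flatten_args((1,), (2, 3), 4:5)   # == (1, 2, 3, 4, 5)
```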

@andyferris

OK - that seems pretty interesting!


c42f commented Jul 29, 2019

It's currently tricky to surface the C implementation of _apply argument flattening in a useful way. Ideally we'd want something roughly like

  1. "When possible", arguments are flattened into a single tuple via iteration
  2. With a dynamically known small total length, fall back to using a stack-allocated buffer of Any. I guess that's not possible right now.
  3. With unknown total length, flatten args into a Vector{Any}

One approach to implementing "when possible" would be to generalize the IteratorSize trait to express lengths which are known based on the type of the input; a Base version of StaticArrays.Length{N}(). This would require that users know about and implement Base.IteratorSize for statically sized objects. A difficulty is that IteratorSize mixes together the expression of length via HasLength and size via HasShape.
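
A rough sketch of what such a length-from-type trait could look like, written here as a standalone helper rather than an actual change to IteratorSize (all names are hypothetical; Base's IteratorSize only distinguishes HasLength, HasShape{N}, IsInfinite and SizeUnknown):

```julia
struct StaticLength{N} end   # length recoverable from the type alone

static_length(::Type{<:Number}) = StaticLength{1}()
static_length(::Type{NTuple{N,T}}) where {N,T} = StaticLength{N}()
static_length(::Type) = nothing   # length not known from the type; dynamic fallback needed
```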

Doing more in inference to constant-propagate the iteration state for small finite iterations seems helpful, and we might get rid of Base.indexed_iterate as part of that. I found the output of the following confusing on a fairly recent julia-1.3:

julia> function iter_unpack(t)
           i1 = iterate(t)
           a = i1[1]
           i2 = iterate(t, i1[2])
           b = i2[1]
           b
       end
iter_unpack (generic function with 1 method)

julia> @code_warntype optimize=true iter_unpack((1,2))
Variables
  #self#::Core.Compiler.Const(iter_unpack, false)
  t::Tuple{Int64,Int64}
  i1::Union{Nothing, Tuple{Int64,Int64}}
  a::Int64
  i2::Union{Nothing, Tuple{Int64,Int64}}
  b::Int64

# big pile of code ...
# i1 and i2 are partially const but that seems not to be inferred


tkf commented Aug 8, 2019

something which can be overridden only really makes sense if there's a compelling reason to allow splatting to be different from iteration in some cases

If we are going to add an API to manipulate splatting behavior, it would be great if it could also be dispatched on from the callee side. For example, this would let you dispatch +(xs...) to sum(xs) when xs is an array. I think this is appealing because it would let any binary function op dispatch op(xs...) to reduce(op, xs) or foldl(op, xs). This would give us a concise and intuitive syntax for reduction. I'm thinking of something like op(xs::GenericVararg{<:AbstractArray}), while op(xs::GenericVararg{<:Tuple}) is equivalent to op(xs::Vararg). It would also be nice if f.(xs) in op(f.(xs)...) were not materialized; we could then fuse the mapping and reduction by defining op(xs::GenericVararg{<:Broadcasted}).


c42f commented Aug 9, 2019

@tkf that's a very interesting idea to get actual syntax for reduce and mapreduce while preserving the meaning of splatting.

+(xs...)        sum(xs)
*(xs...)        prod(xs)
min(xs...)      minimum(xs)
max(xs...)      maximum(xs)

+(abs.(xs)...)  mapreduce(abs, +, xs)
+(xs.^2...)     mapreduce(x->x^2, +, xs)

That last one is particularly appealing syntax. One problem is that it would mean defining a lot of op(xs::GenericVararg{<:Union{Broadcasted,AbstractArray}}) methods for many different ops, which doesn't seem scalable; it's just like the vectorized function problem again. Perhaps there's an acceptable lowering somewhere here that pairs some syntax with reduction, in a similar way to how . and broadcast work together.


tkf commented Aug 10, 2019

One problem is that it would mean defining a lot of op(xs::GenericVararg{<:Union{Broadcasted,AbstractArray}}) methods for many different ops, which doesn't seem scalable.

I think there are multiple solutions. For example:

Abstract-type based solution

abstract type AssociativeOperator <: Function end

(op::AssociativeOperator)(xs::GenericVararg{<:Union{AbstractArray, Broadcasted}}) =
    reduce(op, xs)

struct Plus <: AssociativeOperator end
struct Multiply <: AssociativeOperator end
struct Vcat <: AssociativeOperator end

const + = Plus()
const * = Multiply()
const vcat = Vcat()

+(x, y) = ...
*(x, y) = ...
vcat(x, y) = ...

Different lowering

Alternatively, you can lower f(xs...) to splatapply(f, GenericVararg(xs)) and define

const AssociativeOperator = Union{
    typeof(*),
    typeof(+),
    typeof(vcat),
    ...and so on...
}

splatapply(f::AssociativeOperator, xs::GenericVararg{<:Union{AbstractArray, Broadcasted}}) =
    reduce(f, xs)


tkf commented Aug 10, 2019

I guess it's probably a better idea to keep this issue focused on the shorter-term solution and the performance improvements, and I don't want to clutter it further. So I opened a new issue, #32860, to discuss the wilder ideas. @c42f, let's discuss the idea there.

@martinholters

Let me repeat the main points from #27434 (comment): inference already simulates the iteration protocol (to some degree); however, that is solely used to determine the return type of the _apply call, not to elide it:

julia> foo(x) = (x...,)
foo (generic function with 1 method)

julia> @code_typed foo(1)
CodeInfo(
1 ─ %1 = Core._apply(Core.tuple, x)::Tuple{Int64}
└──      return %1
) => Tuple{Int64}

As you can see, inference correctly determines that splatting an Int64 will yield exactly one Int64 and that calling tuple with it will give a Tuple{Int64}. The missing puzzle piece is having the inliner replace the _apply call with an appropriate sequence of iterate calls. Note that all the intermediate return types of these iterate calls have been determined during inference, but I think we lack the infrastructure to feed this information through to the optimizer. Given that information, it should be relatively easy (cough I guess) to synthesize the appropriate iterate sequence for cases with a statically known, small number of resulting elements (e.g. a Number).
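
As an editorial illustration, roughly the iterate sequence the inliner would need to synthesize for a length-1 iterable such as a Number, written here as ordinary source code (the real transformation would happen on typed IR):

```julia
function splat_one(f, x)
    y1 = iterate(x)                      # (x, nothing) for a Number
    y1 === nothing && return f()
    v1, s1 = y1
    y2 = iterate(x, s1)                  # inference already knows this is Const(nothing)
    y2 === nothing || error("expected a length-1 iterable")
    return f(v1)                         # the _apply collapses to a plain call
end

splat_one(tuple, 1)   # == (1,)
```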

That said, inference does not currently propagate constants during the iteration emulation, so that e.g.

julia> @code_typed foo(@SVector [1,2,3])
CodeInfo(
1 ─ %1 = Core._apply(Core.tuple, x)::Tuple{Int64,Vararg{Int64,N} where N}
└──      return %1
) => Tuple{Int64,Vararg{Int64,N} where N}

That might be fixable; however, recognizing that it is actually one of the elementary types (here a Tuple) that is being splatted might still prove advantageous, I'm not sure. In hindsight, it would have been nice if we had added an initial iterates_as call to the iteration protocol, defaulting to iterates_as(x) = x. Then one could easily, e.g., hand out the internal tuple from an SVector without risking inconsistency between iterating and splatting, as with the proposed splatarg.
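
A sketch of that hypothetical iterates_as hook (not part of the actual iteration protocol; SVector comes from the StaticArrays.jl package). Both iteration and splatting would consult it first, so the two could never disagree:

```julia
using StaticArrays

iterates_as(x) = x                      # default: iterate the object itself
iterates_as(v::SVector) = Tuple(v)      # hand out the wrapped tuple directly
```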

@martinholters

Regarding the more general issue, I did some experimenting that I also want to share.

I started by taking Tuple(x) == (x...,) as axiomatic and turning it around, lowering e.g. f(x..., y) as Core._apply(f, Tuple(x), (y,)). Of course, this needs a definition of Tuple(x) that does not rely on splatting, except for splatting Tuples, as Tuple(x::Tuple) = x is trivial. What I came up with is an iterate sequence (basically a manually unrolled loop) up to a certain limit, and if that is exceeded, calling a new built-in function that can turn a SimpleVector, NamedTuple, or Vector into a tuple: basically a trimmed-down version of _apply covering only the trivially inferred cases.

I got this working (mostly) with modest effort, and my conclusions are: excellent for cases where the result of Tuple(x) is inferable (i.e. the size is statically known), so the _apply call can be elided; but performance penalties otherwise, as having to store the intermediate values in a dynamically allocated tuple is less than ideal. E.g. splatting a vector became about 3x slower.

I have a feeling this is a dead end due to the unavoidable penalties in the size-not-statically-known case, but maybe someone has a good idea...
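
For illustration, a sketch of the partially unrolled Tuple-via-iterate fallback described above, capped at two elements for brevity; the overflow path here just collects instead of calling the proposed built-in:

```julia
function tuple_via_iterate(x)
    y1 = iterate(x);     y1 === nothing && return ()
    v1, s1 = y1
    y2 = iterate(x, s1); y2 === nothing && return (v1,)
    v2, s2 = y2
    y3 = iterate(x, s2); y3 === nothing && return (v1, v2)
    # unrolling limit exceeded: fall back to a dynamically sized path
    return (v1, v2, y3[1], collect(Iterators.rest(x, y3[2]))...)
end

tuple_via_iterate(1)            # == (1,)
tuple_via_iterate([1, 2, 3])    # == (1, 2, 3), but via the slow path
```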


c42f commented Aug 22, 2019

What I came up with is an iterate sequence (basically a manually unrolled loop) up to a certain limit

Right, the difficulty is all about the flattening "when possible" referred to in #29114 (comment).

We can try to get at that information via your partial unrolling approach, via traits, or via better inference. It's dissatisfying to introduce traits if we can solve the problem by being smarter with iteration in inference.

It would be interesting to see where you got to with the partial unrolling approach. It looks like part of that is on the mh/totuple_builtin branch?

@martinholters

The mh/totuple_builtin branch contains the code for the new builtin, but I'm not sure that's actually needed. While the code is simpler than for _apply, it doesn't seem to add any performance advantage compared to using _apply(tuple, x). Nevertheless, I'll try to bring my code into a less messy state and put it in a branch for others to experiment with. I'll post here once that's done.

I also had another idea that might save this approach. Say we lower f(x...) as _apply(f, Tuple(x)) and define Tuple(x) = _apply(tuple, x); then inlining would give _apply(f, _apply(tuple, x)), where the optimizer could elide the inner _apply and get back to _apply(f, x). However, we could still provide efficient Tuple constructors for specific types (and maybe a partially unrolled default fallback, as long as it is inlined so that the same elision of the inner _apply remains possible). Not sure when I'll find time to drive this further, though.
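
A sketch of the "efficient Tuple constructor for a specific type" part of this idea, using a made-up stand-in type (in practice this would be something like an SVector): a type that already stores its elements as a tuple can hand them out directly, so the proposed lowering _apply(f, Tuple(x)) would see a plain, statically sized tuple:

```julia
struct MyStaticVec{N,T}
    data::NTuple{N,T}
end

Base.Tuple(v::MyStaticVec) = v.data   # no iteration; size is known from the type
Base.iterate(v::MyStaticVec, i::Int = 1) =
    i > length(v.data) ? nothing : (v.data[i], i + 1)   # kept consistent with Tuple
```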

@martinholters

Just pushed the current state of my efforts to mh/rework_apply.


c42f commented Aug 26, 2019

Great, thanks @martinholters! Glancing at the unrolling, I'm a little more optimistic about that approach. To use Core._apply we need a sequence of collections which will all be flattened together into the final argument list. But there doesn't need to be a 1:1 correspondence between the input arguments in the surface syntax and the intermediate set of tuples, so I wonder whether we can use that to our advantage.

@martinholters

One inconvenience when wrapping the arguments in a function, whether it is splatarg or Tuple, is nested splatting. (I only learned we supported this yesterday, btw.) E.g. f((x...)...) becomes _apply(_apply, (f,), x). Sure, we could lower this to _apply(_apply, (f,), splatarg(x)) instead, but we would also need to call splatarg on the elements of x.

vtjnash added the compiler:inference (Type inference) and compiler:optimizer (Optimization passes, mostly in base/compiler/ssair/) labels on Mar 15, 2020
Keno added a commit that referenced this issue Jun 6, 2020
As noted in #36087 and #29114, splatting integers currently has a
performance penalty that is unexpected. For tuples and SimpleVectors,
we have special purpose inliners that will simply inline the
tuple/SimpleVector into the call being splatted. However, for
everything else we'd have to run the iteration protocol to find
out what the values to substitute are. This change does just that,
limited to the case of length-1 (and empty) iterables. Benchmark:

```
f(x) = (x...,)
@code_typed f(1)
@benchmark f(1)
```

Before:
```
julia> @code_typed f(1)
CodeInfo(
1 ─ %1 = Core._apply_iterate(Base.iterate, Core.tuple, x)::Tuple{Int64}
└──      return %1
) => Tuple{Int64}

julia> @benchmark f(1)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     209.357 ns (0.00% GC)
  median time:      213.404 ns (0.00% GC)
  mean time:        218.674 ns (0.16% GC)
  maximum time:     1.922 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     540
```

After:
```
julia> @code_typed f(1)
CodeInfo(
1 ─ %1 = invoke Base.iterate(_2::Int64)::Tuple{Int64,Nothing}
│   %2 = (getfield)(%1, 1)::Int64
│   %3 = (getfield)(%1, 2)::Nothing
│        invoke Base.iterate(_2::Int64, %3::Nothing)::Nothing
│   %5 = Core.tuple(%2)::Tuple{Int64}
└──      return %5
) => Tuple{Int64}

julia> @benchmark f(1)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.044 ns (0.00% GC)
  median time:      3.047 ns (0.00% GC)
  mean time:        3.049 ns (0.00% GC)
  maximum time:     7.700 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
```

Obviously this isn't 100% optimal yet, because the `iterate` calls themselves
don't get inlined, but it's a lot better. Inlining the `iterate` calls is
left for a follow up commit.
Keno added a commit that referenced this issue Jul 15, 2020
This supersedes #36169. Rather than re-implementing the iteration
analysis as done there, this uses the new stmtinfo infrastructure
to propagate all the analysis done during inference all the way
to inlining. As a result, it applies not only to splats of
singletons, but also to splats of any other short iterable
that inference can analyze. E.g.:

```
f(x) = (x...,)
@code_typed f(1=>2)
@benchmark f(1=>2)
```

Before:
```
julia> @code_typed f(1=>2)
CodeInfo(
1 ─ %1 = Core._apply_iterate(Base.iterate, Core.tuple, x)::Tuple{Int64,Int64}
└──      return %1
) => Tuple{Int64,Int64}

julia> @benchmark f(1=>2)
BenchmarkTools.Trial:
  memory estimate:  96 bytes
  allocs estimate:  3
  --------------
  minimum time:     242.659 ns (0.00% GC)
  median time:      246.904 ns (0.00% GC)
  mean time:        255.390 ns (1.08% GC)
  maximum time:     4.415 μs (93.94% GC)
  --------------
  samples:          10000
  evals/sample:     405
```

After:
```
julia> @code_typed f(1=>2)
CodeInfo(
1 ─ %1 = Base.getfield(x, 1)::Int64
│   %2 = Base.getfield(x, 2)::Int64
│   %3 = Core.tuple(%1, %2)::Tuple{Int64,Int64}
└──      return %3
) => Tuple{Int64,Int64}

julia> @benchmark f(1=>2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.701 ns (0.00% GC)
  median time:      1.925 ns (0.00% GC)
  mean time:        1.904 ns (0.00% GC)
  maximum time:     6.941 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
```

I also implemented the TODO I had left in #36169 to inline
the iterate calls themselves, which gives another 3x
improvement over the solution in that PR:

```
julia> @code_typed f(1)
CodeInfo(
1 ─ %1 = Core.tuple(x)::Tuple{Int64}
└──      return %1
) => Tuple{Int64}

julia> @benchmark f(1)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.696 ns (0.00% GC)
  median time:      1.699 ns (0.00% GC)
  mean time:        1.702 ns (0.00% GC)
  maximum time:     5.389 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
```

Fixes #36087
Fixes #29114
Keno added further commits that referenced this issue on Jul 15, Jul 17, and Jul 18, 2020, and simeonschaub pushed a commit to simeonschaub/julia that referenced this issue on Aug 11, 2020; each carried the same commit message as the commit above.