circshift(SArray) returns MArray #745

Open
wsshin opened this issue Feb 22, 2020 · 16 comments
Labels: feature (features and feature requests)

Comments

@wsshin
Contributor

wsshin commented Feb 22, 2020

Example:

julia> sv = SVector(1,2,3);

julia> circshift(sv, 1)
3-element MArray{Tuple{3},Int64,1,3} with indices SOneTo(3):
 3
 1
 2

I think the result should be SArray.

@c42f c42f added the feature (features and feature requests) label Feb 24, 2020
@c42f
Member

c42f commented Feb 24, 2020

Agreed. The reason for this is that it dispatches to the generic Base implementation, and similar(::SVector) returns an MVector.
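
For reference, a minimal REPL check of that fallback path (hedged: printing and exact types may vary between StaticArrays versions):

julia> using StaticArrays

julia> similar(SVector(1, 2, 3)) isa MVector   # Base's circshift fallback builds its output with similar
true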

@Micket

Micket commented Apr 13, 2020

I looked into this as I wanted to learn a bit more about Julia and this library, but, oh boy, this was (to a novice like me) a surprisingly difficult problem.
The immutability basically forces a much more inconvenient implementation (with significant overhead), just like setindex.

So, unless one goes for an approach with a mutable temporary, you'd have to end up with something along the lines of:

circshift(a::StaticArray, shift::Int) = _circshift(Length(a), a, shift)
@generated function _circshift(::Length{L}, a::StaticArray{<:Tuple,T}, shift::Int) where {L, T}
    exprs = [:(a[1+mod($i-shift, L)]) for i = 0:(L-1)]
    return quote
        @inbounds return typeof(a)(tuple($(exprs...)))
    end
end

So, given the forced marriage of stack allocation and immutability in Julia, I have started to think of this more philosophically. Basically, the need for a circshift (or a setindex) with a dynamic shift variable indicates an underlying problem.
A "compile time" length for the shift is what we should actually want most of the time with static variables (I apologize for the messy code here; I'm new to generated functions and barely know what I'm doing):

function circshift(a::StaticArray{Tuple{L}}, ::Length{N}) where {L, N}
    vcat(_slice(Length(L-N+1), Length(L), a), _slice(Length(1), Length(L-N), a))
end
@generated function _slice(::Length{L}, ::Length{U}, a::StaticArray{<:Tuple,T}) where {L, U, T}
    exprs = [:(a[$i]) for i = L:U]
    return quote
        @inbounds return SArray{Tuple{1+U-L},T,1,1+U-L}(tuple($(exprs...)))
    end
end

which, of course, is super-fast if one specifies a static shift: circshift(a, Length(3)).

@c42f
Member

c42f commented Apr 13, 2020

I looked into this as I wanted to learn a bit more about Julia and this library, but, oh boy, this was (to a novice like me) a surprisingly difficult problem.

Haha! Well, thanks for jumping right in. I think you're on the right track here :-)

So, unless one goes for an approach with a mutable temporary, you'd have to end up with something along the lines of [...]

Yes, I think that's basically the right way to implement it. However, it might not be as bad as you think: the compiler is quite good at constant propagation, so in the cases where shift is constant in the caller, the code for all the occurrences of mod in _circshift will be optimized away. For this to work you may need to encourage some extra inlining, though:

@inline Base.circshift(a::StaticArray, shift::Int) = _circshift(Length(a), a, shift)
@generated function _circshift(::Length{L}, a::StaticArray{<:Tuple,T}, shift::Int) where {L, T}
    exprs = [:(a[1+mod($i-shift, L)]) for i = 0:(L-1)]
    return quote
        $(Expr(:meta, :inline))
        @inbounds return typeof(a)(tuple($(exprs...)))
    end
end

Note the appearance of Expr(:meta, :inline) in the generated code, which is the confusing way you need to write @inline for @generated functions. It seemed I also needed to add @inline to circshift.

Thus we get extremely simple native code when the shift amount can be constant-propagated all the way inward. E.g., for a function foo:

julia> foo(a) = circshift(a, length(a)÷2)
foo (generic function with 1 method)

julia> foo(SA[1,2,3,4,5])
5-element SArray{Tuple{5},Int64,1,5} with indices SOneTo(5):
 4
 5
 1
 2
 3

julia> @code_native foo(SA[1,2,3,4,5])
	.text
; ┌ @ REPL[11]:1 within `foo'
	movq	%rdi, %rax
; │┌ @ REPL[9]:1 within `circshift'
; ││┌ @ REPL[3]:2 within `_circshift'
; │││┌ @ REPL[3]:5 within `macro expansion'
	vmovups	24(%rsi), %xmm0
	vmovups	(%rsi), %xmm1
	movq	16(%rsi), %rcx
; │└└└
	vmovups	%xmm0, (%rdi)
	vmovups	%xmm1, 16(%rdi)
	movq	%rcx, 32(%rdi)
	retq
	nop
; └

So, given the forced marriage of stack allocation and immutability in Julia

Luckily this isn't quite true :-) The compiler will commonly allocate MVector on the stack as well, but only if it can prove that it's a temporary which doesn't escape the stack frame. So one viable alternative to the generated function might be to use a temporary MVector, but re-wrap it in the type of the input vector before returning. This will convert back to SVector if an SVector was passed in.
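
A quick way to check that claim (hedged: exact results depend on the Julia and StaticArrays versions; the allocation should be elided once the temporary provably does not escape):

using StaticArrays, BenchmarkTools

# Toy example of the pattern: mutate an MVector temporary, re-wrap as the immutable type.
plus_one(v::SVector) = (m = similar(v); m .= v .+ 1; SVector(m))

@btime plus_one($(SVector(1, 2, 3)))  # expect 0 allocations when the MVector stays on the stack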

@Micket

Micket commented Apr 13, 2020

Yeah, I didn't manage to get it to inline properly, so I saw very poor performance.

The tricky case, though, is that performance is somewhat poor (as arrays get a bit larger) when the calls to mod can't be optimized away.

Experimenting a bit:

@inline function circshift(v::SVector{L}, shift::Integer) where L
    out = similar(v)
    shift = mod(shift, L)
    cut = L-shift
    @inbounds for i in 1:(L-cut)
        out[i] = v[i+cut]
    end
    @inbounds for i in 1:cut
        out[i+shift] = v[i]
    end
    return SVector(out)
end

So, this version seems to be sufficiently optimizer friendly, with the added benefit of not having to mess about with generated functions (a plus in I.M.H.O.) and the fact that it's fast in cases where shift isn't known at compile time.

@c42f
Member

c42f commented Apr 14, 2020

not having to mess about with generated functions (a plus in I.M.H.O.)

Yes it's always nice to avoid @generated!

function circshift(v::SVector{L}, shift::Integer) where L

You're not using L in dispatch here, so you can simplify the signature and just use length internally, which should be cleaner and is easily optimized since the length is always known from typeof(v):

@inline function circshift(v::StaticVector, shift::Integer)
    w = similar(v)
    L = length(v)
    shift = mod(shift, L)
    cut = L-shift
    @inbounds for i in 1:(L-cut)
        w[i] = v[i+cut]
    end
    @inbounds for i in 1:cut
        w[i+shift] = v[i]
    end
    return similar_type(v)(w)
end

@andyferris
Member

andyferris commented Apr 15, 2020

Really nice - so does that end up with good code these days? Do we end up with a bunch of extra move instructions on the last line, or is it elided entirely?

What happens for big numbers and other mutable things that might not work well inside an MArray?

@c42f seeing this I feel like we are edging closer to making this package simpler or even redundant - I mean that function is basically good for AbstractArray! I'd simply replace the last line with return freeze(w) (a la Keno's suggestion). To fix MArray we could really use a "mutable tuple" which freezes to a Tuple... and then maybe constant propagation is the only "magic" we need here.
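
A hedged sketch of what that could look like, assuming a hypothetical freeze function (not in Base or StaticArrays today) that turns a mutable temporary into its immutable counterpart:

# circshift_frozen is an illustrative name; freeze is hypothetical.
function circshift_frozen(v::AbstractVector, shift::Integer)
    w = similar(v)
    L = length(v)
    shift = mod(shift, L)
    cut = L - shift
    @inbounds for i in 1:(L - cut)
        w[i] = v[i + cut]
    end
    @inbounds for i in 1:cut
        w[i + shift] = v[i]
    end
    return freeze(w)  # hypothetical; today this is spelled similar_type(v)(w) for static arrays
end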

@c42f
Member

c42f commented Apr 15, 2020

What happens for big numbers and other mutable things that might not work well inside an MArray?

There's a WIP PR to Base (JuliaLang/julia#34126) which puts pointers inline, so that could maybe help fix MArray with BigFloat etc (though we'd still need extra care to preserve GC invariants). In fact that suggests #749 might be reasonable to merge.

seeing this I feel like we are edging closer to making this package simpler or even redundant - I mean that function is basically good for AbstractArray! I'd simply replace the last line with return freeze(w) (a la Keno's suggestion).

Yes agreed things are getting simpler! I think the larger way forward is to start using freeze (or some freeze-like thing) internally in StaticArrays and see whether we can get good performance plus code which would in principle work for AbstractArray. Then try to merge some of that back into Base once we know things work.

@andyferris
Member

There's a WIP PR to Base (JuliaLang/julia#34126) which puts pointers inline, so that could maybe help fix MArray with BigFloat etc (though we'd still need extra care to preserve GC invariants).

Unfortunately this doesn't affect all objects (structs with union types and unallocated fields may be treated differently).

Yes agreed things are getting simpler! I think the larger way forward is to start using freeze (or some freeze-like thing) internally in StaticArrays and see whether we can get good performance plus code which would in principle work for AbstractArray. Then try to merge some of that back into Base once we know things work.

Yeah! Let's do it :) I'd still be tempted to create a separate package with freeze and thaw and overload that, so everyone can explore.

@tkf
Member

tkf commented Apr 15, 2020

I'd still be tempted to create a separate package with freeze and thaw and overload that, so everyone can explore.

I've been thinking about doing this for structs and tuples for a while. It's super easy to implement with Setfield.jl: jw3126/Setfield.jl#56 (comment)

@andyferris
Member

Yes agreed things are getting simpler! I think the larger way forward is to start using freeze (or some freeze-like thing) internally in StaticArrays and see whether we can get good performance plus code which would in principle work for AbstractArray. Then try to merge some of that back into Base once we know things work.

Yeah! Let's do it :) I'd still be tempted to create a separate package with freeze and thaw and overload that, so everyone can explore.

So... I got overexcited. I think I need this for Dictionaries.jl anyways.

https://github.com/andyferris/Freeze.jl

@tkf
Member

tkf commented Apr 15, 2020

You beat me to it! I hope I'm not slowing down your excitement, but I think issettable is too coarse-grained. It might be better to be able to freeze/thaw different aspects of a container separately (e.g., shape-frozen vs value-frozen). For example, you'd want shape-frozen, value-mutable arrays for the columns of tables. Also, I've been thinking that it'd be nice to have an ownership-as-convention kind of API. That way, you can get a back from thaw(move(freeze(a))) "safely."
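
Purely as an illustration of that ownership-as-convention idea (freeze/thaw/move don't exist anywhere yet; names and semantics here are hypothetical):

# a  = [1, 2, 3]
# fa = freeze(a)    # immutable counterpart of a; by convention, a is no longer mutated directly
# ma = move(fa)     # asserts this is the last live reference to fa, so its storage may be reused
# b  = thaw(ma)     # mutable again, ideally reusing a's memory, so no copy was ever made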

@andyferris
Member

Haha!

Yes, we definitely need "shape frozen" vs "value frozen" semantics, and I meant that issettable is only for mutating values, not the shape. My experience with Dictionaries.jl is that containers whose keys change are significantly more challenging to design shared generic code for, and also all static arrays are shape-frozen anyway.

If you are going to think about these things, remember that append-only datasets are a thing and that we might want e.g. a vector that supports push! but not setindex!. Thinking of it this way, we can keep the two concepts simpler and orthogonal rather than complecting them. (That's also why setindex! doesn't insert new keys in Dictionaries.jl.)

For dictionaries I was thinking isinsertable, freezekeys and thawkeys could be the (orthogonal) “shape” interface. What do you think?
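
Roughly, the split being proposed (hedged sketch: isinsertable/issettable come from the comments above and Dictionaries.jl; freezekeys/thawkeys are only proposed names here):

# issettable(d)    -> can existing values be overwritten (setindex! on existing keys)?
# isinsertable(d)  -> can keys be added or removed (insert!, delete!)?
# freezekeys(d)    -> same values, fixed key set (shape-frozen, value-mutable)
# thawkeys(d)      -> the inverse: key insertion/deletion enabled again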

@KristofferC
Contributor

Note that it is possible to write something like

function matmul(a::SMatrix{I, J}, b::SMatrix{J, K}) where {I, J, K}
    c = zero(MMatrix{I, K})
    @inbounds for k in 1:K, j in 1:J
        b_jk = b[j, k]
        @simd for i in 1:I
            c[i, k] = muladd(a[i, j], b_jk, c[i, k])
        end
    end
    return SMatrix(c)
end

now, without causing allocations. It seems LLVM only unrolls the innermost loop now for some reason though.

@tkf
Member

tkf commented Apr 15, 2020

(I opened andyferris/Freeze.jl#1 to avoid derailing this issue too much.)

@andyferris
Member

So I started experimenting further: https://github.com/andyferris/StaticArraysLite.jl

So far the basic functionality is quite competitive with StaticArrays for SVector{3, Float64}. It would be very interesting to see how this goes for matrices and slightly larger arrays.

@wsshin
Contributor Author

wsshin commented Feb 22, 2022

I need to use circshift() on SVector, and I discovered someone submitted this issue a while ago. It turned out that it was actually myself 😁.

Back then I just reported the lack of this functionality and didn't think about a solution. Base.circshift() supports multi-dimensional arrays, but if we focus on vectors, would it be possible to add the following simple solutions?

Base.circshift(v::SVector{N,T}, shift::Integer) where {N,T} = SVector(ntuple(k->v[mod1(k-shift,N)], Val(N)))
Base.circshift(v::SVector{N,T}, ::Val{S}) where {N,T,S} = SVector(ntuple(k->v[mod1(k-S,N)], Val(N)))

Here are the benchmark results:

julia> VERSION
v"1.7.2-pre.0"

julia> v = SVector(ntuple(identity, 10))
10-element SVector{10, Int64} with indices SOneTo(10):
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

julia> @btime circshift($v, 3);
  31.051 ns (0 allocations: 0 bytes)

julia> @btime circshift($v, Val(3));
  2.641 ns (0 allocations: 0 bytes)
