Poor performance for linspace #13401

Closed
stevengj opened this issue Oct 1, 2015 · 18 comments
Labels: performance (Must go faster), potential benchmark (Could make a good benchmark in BaseBenchmarks)

Comments

@stevengj
Member

stevengj commented Oct 1, 2015

This seems odd to me:

julia> f(X) = [sin(x) + x^2 - 3 for x in X]
       x = linspace(0,1,10^7)
       @time f(x);
  0.286351 seconds (6 allocations: 76.294 MB, 0.21% gc time)

julia> x = collect(x)
       @time f(x);
  0.085037 seconds (6 allocations: 76.294 MB, 5.49% gc time)

I can understand why e.g. x.^2 might be slower for a linspace, since the computation is so cheap compared to computing the elements of the linspace on the fly, but I don't understand why a more complex operation like this, in a comprehension where the elements are calculated only once per computation, is so much slower.

See also the mailing list discussion.

@stevengj added the performance (Must go faster) label on Oct 1, 2015
@timholy
Member

timholy commented Oct 1, 2015

I'm not sure I understand the concern. f(X) is 3 times slower than collect(X), but don't you think that has a lot to do with sin(x)?

I don't think there's anything special about linspace here; the performance gap noted on the mailing list is just due to the bounds-check here. This is why we really need user-extensible bounds-check removal.
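(For context, the user-extensible mechanism Julia later gained looks roughly like the sketch below. @boundscheck and Base.@propagate_inbounds are from later Julia versions and were not available on 0.4, and MyRange is a hypothetical stand-in type, not anything from Base.)

immutable MyRange
    start::Float64
    step::Float64
    len::Int
end
Base.length(r::MyRange) = r.len

# The method declares its own check inside @boundscheck; when the call is
# inlined into an @inbounds region, that block is elided.
Base.@propagate_inbounds function Base.getindex(r::MyRange, i::Int)
    @boundscheck 1 <= i <= r.len || throw(BoundsError(r, i))
    return r.start + (i - 1)*r.step
end

# A caller that opts out of the checks:
function mysum(r::MyRange)
    s = 0.0
    @inbounds for i = 1:length(r)
        s += r[i]
    end
    return s
end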

@carlobaldassi
Member

It's a bit weirder than that. The comprehension syntax uses the iterator interface, so bounds checks should not be involved here. However, look at this: collect just calls vcat, which is implemented like this:

function vcat{T}(rs::Range{T}...)
    n::Int = 0
    for ra in rs
        n += length(ra)
    end
    a = Array(T,n)
    i = 1
    for ra in rs, x in ra
        @inbounds a[i] = x
        i += 1
    end
    return a
end

Now try to simplify it by just removing the varargs:

julia-0.4> function vc{T}(rs::Range{T})
               n::Int = length(rs)
               a = Array(T,n)
               i = 1
               for x in rs
                   @inbounds a[i] = x
                   i += 1
               end
               return a
           end

Now it takes more than twice the time!

julia-0.4> x = linspace(0,1,10^7);

julia-0.4> @time collect(x);
  0.046977 seconds (7 allocations: 76.294 MB, 17.88% gc time)

julia-0.4> @time vc(x);
  0.134994 seconds (5.46 k allocations: 76.516 MB, 21.09% gc time)

julia-0.4> c(x) = [y for y in x];

julia-0.4> @time c(x);
  0.135602 seconds (4.09 k allocations: 76.494 MB, 0.80% gc time)

By the way, I checked that forcing inlining of the next and done methods for LinSpace doesn't make any difference, and it doesn't seem to be related to type inference either.
This also seems to happen with StepRanges and FloatRanges, but not with UnitRanges.
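For reference, this is roughly what the iteration has to do per element (a paraphrase of the 0.4 LinSpace element formula using its field names, not the verbatim Base source); note the integer-to-float conversions and, in particular, the division:

# Sketch of the 0.4-era LinSpace element computation; next/getindex
# recompute this from scratch on every iteration.
linspace_elt(r, i) = ((r.len - i)*r.start + (i - 1)*r.stop) / r.divisor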

@timholy
Member

timholy commented Oct 1, 2015

I had followed the exact same trail, but for me collect, vc, and the comprehension have identical performance.

@musm
Contributor

musm commented Oct 1, 2015

@carlobaldassi I see no substantial timing difference.

@timholy is this a bounds-checking issue? I don't see any difference in the following:

import Base: getindex, unsafe_getindex
getindex{T}(r::LinSpace{T}, i::Integer) = unsafe_getindex(r, i)

f(X) = [sin(x) + x^2 - 3 for x in X];
x = linspace(0,1,10^7);
@time f(x);
 0.231700 seconds (5.04 k allocations: 76.532 MB, 18.22% gc time)

x = collect(x);
@time f(x);
 0.120973 seconds (4.38 k allocations: 76.500 MB, 5.05% gc time)

@timholy
Member

timholy commented Oct 1, 2015

None of those should check bounds. On the mailing list there was discussion of [x[i] for i = 1:length(x)], which (at least naively) does check bounds on each access.
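To make the mailing-list pattern concrete, here is a sketch of the two variants (hypothetical function names, 0.4-era syntax): the first checks bounds on every X[i] access, the second elides those checks with @inbounds.

function f_indexed(X)
    out = Array(Float64, length(X))
    for i = 1:length(X)
        out[i] = sin(X[i]) + X[i]^2 - 3   # X[i] and out[i] are bounds-checked on every access
    end
    return out
end

function f_inbounds(X)
    out = Array(Float64, length(X))
    @inbounds for i = 1:length(X)
        out[i] = sin(X[i]) + X[i]^2 - 3   # checks elided inside the @inbounds loop
    end
    return out
end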

@stevengj
Member Author

stevengj commented Oct 1, 2015

@timholy, in evaluating sin(x) in my function, x is a scalar, not a vector or a range. So it should be identical in the two cases. The only difference is the for x in X iteration.

@KristofferC
Member

julia> @time collect(x);
  0.048368 seconds (7 allocations: 76.294 MB, 18.70% gc time)

julia> @time vc(x);
  0.125762 seconds (6 allocations: 76.294 MB, 22.66% gc time)

LLVMs: https://gist.github.com/KristofferC/0d915e63097010a9bfa1

@timholy
Member

timholy commented Oct 1, 2015

Shoot, I missed that you were timing f, not collect.

Still, for me, those are identical. I'm running tk/backports5:

julia> versioninfo()
Julia Version 0.4.0-rc3+30
Commit 440c8c7* (2015-09-30 16:45 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7 CPU       L 640  @ 2.13GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Possibly fixed by #13355? (Which is included in tk/backports5, and the main reason I'm running it.)

@KristofferC
Member

Changed to the same commit, still a difference. Odd. I get the same LLVM code as before.

julia> @time collect(x);
  0.049824 seconds (7 allocations: 76.294 MB)

julia> @time vc(x);
  0.130104 seconds (6 allocations: 76.294 MB, 29.40% gc time)
julia> versioninfo()
Julia Version 0.4.0-rc3+30
Commit 440c8c7* (2015-09-30 16:45 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

@stevengj
Member Author

stevengj commented Oct 1, 2015

I just built with the latest 0.5 (0.5.0-dev+7982), and @time f(x) is the same for the LinSpace version but got slower (0.12s vs. 0.08s) for the collect version, so that's not good...

@carlobaldassi
Member

I just tested on the latest master branch (v"0.5.0-dev+564"), and the difference between collect, vc and the comprehension disappears. The original performance issue is still there, though:

julia-0.5> @time collect(x);
  0.046547 seconds (7 allocations: 76.294 MB, 15.26% gc time)

julia-0.5> @time vc(x);
  0.047345 seconds (6 allocations: 76.294 MB, 49.61% gc time)

julia-0.5> @time c(x);
  0.046630 seconds (6 allocations: 76.294 MB, 14.31% gc time)

julia-0.5> @time f(x);
  0.287785 seconds (6 allocations: 76.294 MB, 0.41% gc time)

julia-0.5> y = collect(x);

julia-0.5> @time f(y);
  0.085029 seconds (6 allocations: 76.294 MB, 27.70% gc time)

@stevengj I don't observe the slowdown you mention. We seem to have a different "latest" 0.5 version (I just pulled and compiled):

julia-0.5> versioninfo()
Julia Version 0.5.0-dev+564
Commit 966850c* (2015-10-01 19:15 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

@mbauman
Member

mbauman commented Oct 3, 2015

Similarly, I cannot see any difference between collect, vcat, and vc.

For SGJ's original f(X) = [sin(x) + x^2 - 3 for x in X] function, there doesn't seem to be anything obviously wrong in the LLVM inner loop:

julia> @code_llvm f(x)
define %jl_value_t* @julia_f_22480(%LinSpace*) {
; ...
L1:                                               ; preds = %pass3, %L1.preheader
  %"#s237.0" = phi i64 [ %43, %pass3 ], [ 0, %L1.preheader ]
  %"#s238.0" = phi i64 [ %44, %pass3 ], [ 1, %L1.preheader ]
  ; This is the LinSpace next calculation:
  %26 = load double* %6, align 8
  %27 = sitofp i64 %"#s238.0" to double
  %28 = fsub double %26, %27
  %29 = load double* %22, align 8
  %30 = fmul double %28, %29
  %31 = add i64 %"#s238.0", -1
  %32 = sitofp i64 %31 to double
  %33 = load double* %23, align 8
  %34 = fmul double %32, %33
  %35 = fadd double %30, %34
  %36 = load double* %24, align 8
  %37 = fdiv double %35, %36
  ; Call sin
  %38 = call double inttoptr (i64 13391592272 to double (double)*)(double %37)
  ; … identical after this

The thing that's really remarkable is that there's a bounds check when using the full array, and it's still faster (ranges don't check bounds in next, since they're just doing math):

julia> @code_llvm f(collect(x))
define %jl_value_t* @julia_f_22666(%jl_value_t*, %jl_value_t**, i32) {
; ...
L1:                                               ; preds = %pass, %L1.preheader
  %"#s237.0" = phi i64 [ %29, %pass ], [ 0, %L1.preheader ]
  %"#s238.0" = phi i64 [ %30, %pass ], [ 1, %L1.preheader ]
  ; Check bounds
  %16 = add i64 %"#s238.0", -1
  %17 = load i64* %10, align 8
  %18 = icmp ult i64 %16, %17
  br i1 %18, label %idxend, label %oob

oob:                                              ; preds = %L1
  %19 = alloca i64, align 8
  store i64 %"#s238.0", i64* %19, align 8
  call void @jl_bounds_error_ints(%jl_value_t* %8, i64* %19, i64 1)
  unreachable

idxend:                                           ; preds = %L1
  ; load element
  %20 = load i8** %14, align 8
  %21 = bitcast i8* %20 to double*
  %22 = getelementptr double* %21, i64 %16
  %23 = load double* %22, align 8
  ; Call sin
  %24 = call double inttoptr (i64 13391592272 to double (double)*)(double %23)
  ; … identical after this

The other strange thing about this is that there seems to be some sort of interaction between the LinSpace calculation and the math. If I just "collect" the LinSpace in a comprehension without doing anything it takes about 60ms longer than collecting a regular Array… but f takes more than 150ms longer!

julia> f(X) = [sin(x) + x^2 - 3 for x in X]
       c(X) = [x for x in X]
       x = linspace(0,1,10^7)
       y = collect(x);

julia> using Benchmarks

julia> @benchmark c(x)
================ Benchmark Results ========================
     Time per evaluation: 98.70 ms [93.92 ms, 103.48 ms]
# … snipped …

julia> @benchmark c(y)
================ Benchmark Results ========================
     Time per evaluation: 42.30 ms [41.58 ms, 43.02 ms]

julia> @benchmark f(x)
================ Benchmark Results ========================
     Time per evaluation: 357.58 ms [328.14 ms, 387.02 ms]

julia> @benchmark f(y)
================ Benchmark Results ========================
     Time per evaluation: 183.70 ms [180.26 ms, 187.13 ms]

Similarly, if we just compare the cost of computing one element to the cost of the scalar computation that we're doing in the comprehension, it would imply that there shouldn't be more than a 10-20% overhead:

julia> g(x) = sin(x) + x^2 - 3
g (generic function with 1 method)

julia> @benchmark g(1.)
================ Benchmark Results ========================
     Time per evaluation: 24.90 ns [24.75 ns, 25.05 ns]

julia> @benchmark next(x,1)
================ Benchmark Results ========================
     Time per evaluation: 6.73 ns [6.68 ns, 6.78 ns]

Even stranger: things get worse with ./julia -O. I think we're just hitting some really strange opcode latency interactions within the CPU, so this will probably vary by specific microarchitecture and LLVM version (I'm still running LLVM 3.3 on an old Nehalem i5). But I'm out of ideas.

@timholy
Member

timholy commented Nov 4, 2015

I wondered if there might be a fix similar to the one for #13866. That didn't pan out, but I did discover something interesting (release-0.4 branch): @simd with a LinSpace results in ~7x faster code, despite the fact that it doesn't successfully trigger vectorization. This turns out to be as fast as running it on the collected array with vectorization:

# The following function will be tested with and without the @simd
function foo(X)
    s = zero(eltype(X))
    @inbounds @simd for x in X
        s += x
    end
    s
end

With the @simd

julia> @time foo(linspace(1,5,10^7))
  0.011315 seconds (7 allocations: 240 bytes)
3.000000000000007e7

julia> @code_llvm foo(linspace(1,5,10^7))

define double @julia_foo_21514(%LinSpace*) {
L:
  %1 = load %LinSpace* %0, align 8
  %2 = extractvalue %LinSpace %1, 0
  %3 = extractvalue %LinSpace %1, 1
  %4 = extractvalue %LinSpace %1, 2
  %5 = extractvalue %LinSpace %1, 3
  %6 = fadd double %4, -1.000000e+00
  %7 = bitcast double %6 to i64
  %8 = icmp sgt i64 %7, -1
  %9 = fadd double %4, 1.000000e+00
  %10 = select i1 %8, double %4, double %9
  %11 = fptosi double %10 to i64
  %12 = sitofp i64 %11 to double
  %13 = fptosi double %12 to i64
  %14 = icmp eq i64 %11, %13
  %15 = fcmp oeq double %10, %12
  %16 = and i1 %15, %14
  %17 = icmp slt i64 %11, 1
  %18 = fmul double %2, %6
  %19 = fdiv double %18, %5
  %20 = fcmp ugt double %4, 0.000000e+00
  %21 = fsub double %3, %2
  %22 = fdiv double %21, %5
  %23 = select i1 %20, double %22, double 0x7FF8000000000000
  br i1 %16, label %pass, label %fail

fail:                                             ; preds = %L
  %24 = load %jl_value_t** @jl_inexact_exception, align 8
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %24, i32 67)
  unreachable

pass:                                             ; preds = %L
  br i1 %17, label %L9, label %L3

L3:                                               ; preds = %L3, %pass
  %"##i#7058.0" = phi i64 [ %29, %L3 ], [ 0, %pass ]
  %s.1 = phi double [ %28, %L3 ], [ 0.000000e+00, %pass ]
  %25 = sitofp i64 %"##i#7058.0" to double
  %26 = fmul double %25, %23
  %27 = fadd double %19, %26
  %28 = fadd fast double %s.1, %27
  %29 = add i64 %"##i#7058.0", 1
  %exitcond = icmp eq i64 %29, %11
  br i1 %exitcond, label %L9, label %L3

L9:                                               ; preds = %L3, %pass
  %s.3 = phi double [ 0.000000e+00, %pass ], [ %28, %L3 ]
  ret double %s.3
}

Without the @simd

julia> @time foo(linspace(1,5,10^7))
  0.083340 seconds (7 allocations: 240 bytes)
3.000000000000007e7

julia> @code_llvm foo(linspace(1,5,10^7))

define double @julia_foo_21534(%LinSpace*) {
top:
  %1 = load %LinSpace* %0, align 8
  %2 = extractvalue %LinSpace %1, 2
  %3 = fadd double %2, -1.000000e+00
  %4 = bitcast double %3 to i64
  %5 = icmp sgt i64 %4, -1
  %6 = fadd double %2, 1.000000e+00
  %7 = select i1 %5, double %2, double %6
  %8 = fptosi double %7 to i64
  %9 = sitofp i64 %8 to double
  %10 = fptosi double %9 to i64
  %11 = icmp eq i64 %8, %10
  %12 = fcmp oeq double %7, %9
  %13 = and i1 %12, %11
  br i1 %13, label %pass, label %fail

fail:                                             ; preds = %top
  %14 = load %jl_value_t** @jl_inexact_exception, align 8
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %14, i32 3)
  unreachable

pass:                                             ; preds = %top
  %15 = icmp eq i64 %8, 0
  br i1 %15, label %L5, label %L.preheader

L.preheader:                                      ; preds = %pass
  %16 = extractvalue %LinSpace %1, 0
  %17 = extractvalue %LinSpace %1, 1
  %18 = extractvalue %LinSpace %1, 3
  br label %pass4

pass4:                                            ; preds = %pass4, %L.preheader
  %"#s1.0" = phi i64 [ %28, %pass4 ], [ 1, %L.preheader ]
  %s.0 = phi double [ %27, %pass4 ], [ 0.000000e+00, %L.preheader ]
  %19 = sitofp i64 %"#s1.0" to double
  %20 = fsub double %2, %19
  %21 = fmul double %16, %20
  %22 = add i64 %"#s1.0", -1
  %23 = sitofp i64 %22 to double
  %24 = fmul double %17, %23
  %25 = fadd double %21, %24
  %26 = fdiv double %25, %18
  %27 = fadd double %s.0, %26
  %28 = add i64 %"#s1.0", 1
  %29 = icmp eq i64 %"#s1.0", %8
  br i1 %29, label %L5, label %pass4

L5:                                               ; preds = %pass4, %pass
  %s.1 = phi double [ 0.000000e+00, %pass ], [ %27, %pass4 ]
  ret double %s.1
}

Reference: using an Array

function foo2(X)
    s = zero(eltype(X))
    @inbounds @simd for i = 1:length(X)
        s += X[i]
    end
    s
end

julia> @time foo2(X)
  0.013635 seconds (5 allocations: 176 bytes)
3.0000000000000022e7

julia> @code_llvm foo2(X)

define double @julia_foo2_21527(%jl_value_t*) {
L:
  %1 = getelementptr inbounds %jl_value_t* %0, i64 1
  %2 = bitcast %jl_value_t* %1 to i64*
  %3 = load i64* %2, align 8
  %4 = icmp sgt i64 %3, 0
  %5 = select i1 %4, i64 %3, i64 0
  %6 = bitcast %jl_value_t* %0 to i8**
  %7 = call { i64, i1 } @llvm.ssub.with.overflow.i64(i64 %5, i64 1)
  %8 = extractvalue { i64, i1 } %7, 1
  br i1 %8, label %fail, label %pass

fail:                                             ; preds = %L
  %9 = load %jl_value_t** @jl_overflow_exception, align 8
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %9, i32 67)
  unreachable

pass:                                             ; preds = %L
  %10 = extractvalue { i64, i1 } %7, 0
  %11 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %10, i64 1)
  %12 = extractvalue { i64, i1 } %11, 1
  br i1 %12, label %fail1, label %pass2

fail1:                                            ; preds = %pass
  %13 = load %jl_value_t** @jl_overflow_exception, align 8
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %13, i32 67)
  unreachable

pass2:                                            ; preds = %pass
  %14 = extractvalue { i64, i1 } %11, 0
  %15 = icmp slt i64 %14, 1
  br i1 %15, label %L11, label %if3

if3:                                              ; preds = %pass2
  %16 = load i8** %6, align 8
  %17 = bitcast i8* %16 to double*
  %n.vec = and i64 %14, -4
  %cmp.zero = icmp eq i64 %n.vec, 0
  br i1 %cmp.zero, label %middle.block, label %vector.body

vector.body:                                      ; preds = %vector.body, %if3
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %if3 ]
  %vec.phi = phi <2 x double> [ %22, %vector.body ], [ zeroinitializer, %if3 ]
  %vec.phi13 = phi <2 x double> [ %23, %vector.body ], [ zeroinitializer, %if3 ]
  %18 = getelementptr double* %17, i64 %index
  %19 = bitcast double* %18 to <2 x double>*
  %wide.load = load <2 x double>* %19, align 8
  %.sum19 = or i64 %index, 2
  %20 = getelementptr double* %17, i64 %.sum19
  %21 = bitcast double* %20 to <2 x double>*
  %wide.load14 = load <2 x double>* %21, align 8
  %22 = fadd <2 x double> %vec.phi, %wide.load
  %23 = fadd <2 x double> %vec.phi13, %wide.load14
  %index.next = add i64 %index, 4
  %24 = icmp eq i64 %index.next, %n.vec
  br i1 %24, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body, %if3
  %resume.val = phi i64 [ 0, %if3 ], [ %n.vec, %vector.body ]
  %rdx.vec.exit.phi = phi <2 x double> [ zeroinitializer, %if3 ], [ %22, %vector.body ]
  %rdx.vec.exit.phi17 = phi <2 x double> [ zeroinitializer, %if3 ], [ %23, %vector.body ]
  %bin.rdx = fadd <2 x double> %rdx.vec.exit.phi17, %rdx.vec.exit.phi
  %rdx.shuf = shufflevector <2 x double> %bin.rdx, <2 x double> undef, <2 x i32> <i32 1, i32 undef>
  %bin.rdx18 = fadd <2 x double> %bin.rdx, %rdx.shuf
  %25 = extractelement <2 x double> %bin.rdx18, i32 0
  %cmp.n = icmp eq i64 %14, %resume.val
  br i1 %cmp.n, label %L11, label %L5

L5:                                               ; preds = %L5, %middle.block
  %"##i#6971.0" = phi i64 [ %29, %L5 ], [ %resume.val, %middle.block ]
  %s.1 = phi double [ %28, %L5 ], [ %25, %middle.block ]
  %26 = getelementptr double* %17, i64 %"##i#6971.0"
  %27 = load double* %26, align 8
  %28 = fadd fast double %s.1, %27
  %29 = add i64 %"##i#6971.0", 1
  %exitcond = icmp eq i64 %29, %14
  br i1 %exitcond, label %L11, label %L5

L11:                                              ; preds = %L5, %middle.block, %pass2
  %s.3 = phi double [ 0.000000e+00, %pass2 ], [ %28, %L5 ], [ %25, %middle.block ]
  ret double %s.3
}

@timholy
Member

timholy commented Nov 4, 2015

On master (and a different machine) things are even more dramatic, because it vectorizes with the @simd. I get a 14x difference. Also, foo2 is 7x slower than foo if you pass it the LinSpace object, presumably due to bounds-checking preventing the vectorization.

@simonster
Member

With @simd, x is being computed as first(r)+i*step(r) rather than ((r.len-i)*r.start + (i-1)*r.stop)/r.divisor. That's faster, but it's not quite right.
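Spelled out as a sketch (1-based index i; paraphrased, not verbatim Base code):

# Exact LinSpace formula: two int-to-float conversions, two multiplies,
# an add and a division per element.
elt_exact(r, i) = ((r.len - i)*r.start + (i - 1)*r.stop) / r.divisor
# What the @simd-transformed loop effectively computes: no division in the
# loop, but it can lose the ulp-level accuracy the exact form aims for.
elt_fast(r, i) = first(r) + (i - 1)*step(r)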

@timholy
Member

timholy commented Nov 4, 2015

Ah, I bet it all comes down to that division; one cannot account for a 7-fold effect from a couple of extra adds or multiplies. I don't have time to check now, but perhaps we should be storing, and multiplying by, 1/r.divisor? Or does that destroy the ulp-level accuracy that motivated this construct? CC @StefanKarpinski.
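As a minimal sketch of that idea (hypothetical type and field names, not how Base defines LinSpace): store the reciprocal at construction time so the hot loop multiplies instead of divides.

immutable RecipLinSpace
    start::Float64
    stop::Float64
    len::Float64
    invdivisor::Float64   # 1/divisor, computed once up front
end
# Same element formula, with the per-element fdiv replaced by an fmul;
# whether this keeps the ulp-level accuracy is exactly the open question above.
recip_elt(r, i) = ((r.len - i)*r.start + (i - 1)*r.stop) * r.invdivisor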

@jrevels added the potential benchmark (Could make a good benchmark in BaseBenchmarks) label on Nov 17, 2015
@c42f
Member

c42f commented Jan 26, 2016

Over at #14420 I've been reconsidering the LinSpace implementation for other reasons, but it seems relevant to this issue too, especially since I'm also seeing similar strange and confusing inconsistencies when it comes to benchmarking. With my current LinSpace2 hack, there's about a 10x speed advantage over LinSpace in Tim's foo simd test loop.

However, LinSpace2 generates a warning because of the somewhat bogus way I implemented last(). Fine, that's easy to fix and I did so... but that completely kills the performance boost (back to only 2x faster). With that in mind I've left last() as it is for now, in the hope that it'll shed some light here.

@jrevels closed this as completed on Jan 26, 2016
@jrevels reopened this on Jan 26, 2016
@jrevels
Member

jrevels commented Jan 26, 2016

(sorry, finger slipped when scrolling)

blakejohnson added a commit that referenced this issue Feb 7, 2016
Fix #13401 (disable inbound checking for arrays generated on the fly)