Poor performance for linspace #13401

Closed
stevengj opened this issue Oct 1, 2015 · 18 comments
Labels: performance (Must go faster), potential benchmark (Could make a good benchmark in BaseBenchmarks)

Comments

@stevengj
Member

stevengj commented Oct 1, 2015

This seems odd to me:

julia> f(X) = [sin(x) + x^2 - 3 for x in X]
       x = linspace(0,1,10^7)
       @time f(x);
  0.286351 seconds (6 allocations: 76.294 MB, 0.21% gc time)

julia> x = collect(x)
       @time f(x);
  0.085037 seconds (6 allocations: 76.294 MB, 5.49% gc time)

I can understand why e.g. x.^2 might be slower for a linspace, since the computation is so cheap compared to computing the elements of the linspace on the fly, but I don't understand why a more complex operation like this, in a comprehension where the elements are calculated only once per computation, is so much slower.

See also the mailing list discussion.

@stevengj added the performance (Must go faster) label on Oct 1, 2015
@timholy
Member

timholy commented Oct 1, 2015

I'm not sure I understand the concern. f(X) is 3 times slower than collect(X), but don't you think that has a lot to do with sin(x)?

I don't think there's anything special about linspace here; the performance gap noted on the mailing list is just due to the bounds-check here. This is why we really need user-extensible bounds-check removal.
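(For context, the user-extensible mechanism Julia later gained looks roughly like the sketch below. @boundscheck and Base.@propagate_inbounds are from later Julia versions and were not available on 0.4, and MyRange is a hypothetical stand-in type, not anything from Base.)

immutable MyRange
    start::Float64
    step::Float64
    len::Int
end
Base.length(r::MyRange) = r.len

# The method declares its own check inside @boundscheck; when the call is
# inlined into an @inbounds region, that block is elided.
Base.@propagate_inbounds function Base.getindex(r::MyRange, i::Int)
    @boundscheck 1 <= i <= r.len || throw(BoundsError(r, i))
    return r.start + (i - 1)*r.step
end

# A caller that opts out of the checks:
function mysum(r::MyRange)
    s = 0.0
    @inbounds for i = 1:length(r)
        s += r[i]
    end
    return s
end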

@carlobaldassi
Member

It's a bit weirder than that. The comprehension syntax uses the iterator interface, so bounds checks should not be involved here. However, look at this: collect just calls vcat, which is implemented like this:

function vcat{T}(rs::Range{T}...)
    n::Int = 0
    for ra in rs
        n += length(ra)
    end
    a = Array(T,n)
    i = 1
    for ra in rs, x in ra
        @inbounds a[i] = x
        i += 1
    end
    return a
end

Now try to simplify it by just removing the varargs:

julia-0.4> function vc{T}(rs::Range{T})
               n::Int = length(rs)
               a = Array(T,n)
               i = 1
               for x in rs
                   @inbounds a[i] = x
                   i += 1
               end
               return a
           end

Now it takes more than twice the time!

julia-0.4> x = linspace(0,1,10^7);

julia-0.4> @time collect(x);
  0.046977 seconds (7 allocations: 76.294 MB, 17.88% gc time)

julia-0.4> @time vc(x);
  0.134994 seconds (5.46 k allocations: 76.516 MB, 21.09% gc time)

julia-0.4> c(x) = [y for y in x];

julia-0.4> @time c(x);
  0.135602 seconds (4.09 k allocations: 76.494 MB, 0.80% gc time)

By the way, I checked that forcing inlining of the next and done methods for LinSpace doesn't make any difference, and it doesn't seem to be related to type inference either.
This also seems to happen with StepRanges and FloatRanges, but not with UnitRanges.
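For reference, this is roughly what the iteration has to do per element (a paraphrase of the 0.4 LinSpace element formula using its field names, not the verbatim Base source); note the integer-to-float conversions and, in particular, the division:

# Sketch of the 0.4-era LinSpace element computation; next/getindex
# recompute this from scratch on every iteration.
linspace_elt(r, i) = ((r.len - i)*r.start + (i - 1)*r.stop) / r.divisor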

@timholy
Member

timholy commented Oct 1, 2015

I had followed the exact same trail, but for me collect, vc, and the comprehension have identical performance.

@musm
Contributor

musm commented Oct 1, 2015

@carlobaldassi I see no substantial timing difference.

@timholy is this a bounds-checking issue? I don't see any difference in the following:

import Base: getindex, unsafe_getindex
getindex{T}(r::LinSpace{T}, i::Integer) = unsafe_getindex(r, i)

f(X) = [sin(x) + x^2 - 3 for x in X];
x = linspace(0,1,10^7);
@time f(x);
 0.231700 seconds (5.04 k allocations: 76.532 MB, 18.22% gc time)

x = collect(x);
@time f(x);
 0.120973 seconds (4.38 k allocations: 76.500 MB, 5.05% gc time)

@timholy
Member

timholy commented Oct 1, 2015

None of those should check bounds. On the mailing list there was discussion of [x[i] for i = 1:length(x)], which (at least naively) does check bounds on each access.
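To make the mailing-list pattern concrete, here is a sketch of the two variants (hypothetical function names, 0.4-era syntax): the first checks bounds on every X[i] access, the second elides those checks with @inbounds.

function f_indexed(X)
    out = Array(Float64, length(X))
    for i = 1:length(X)
        out[i] = sin(X[i]) + X[i]^2 - 3   # X[i] and out[i] are bounds-checked on every access
    end
    return out
end

function f_inbounds(X)
    out = Array(Float64, length(X))
    @inbounds for i = 1:length(X)
        out[i] = sin(X[i]) + X[i]^2 - 3   # checks elided inside the @inbounds loop
    end
    return out
end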

@stevengj
Member Author

stevengj commented Oct 1, 2015

@timholy, in evaluating sin(x) in my function, x is a scalar, not a vector or a range. So it should be identical in the two cases. The only difference is the for x in X iteration.

@KristofferC
Member

julia> @time collect(x);
  0.048368 seconds (7 allocations: 76.294 MB, 18.70% gc time)

julia> @time vc(x);
  0.125762 seconds (6 allocations: 76.294 MB, 22.66% gc time)

LLVMs: https://gist.github.com/KristofferC/0d915e63097010a9bfa1

@timholy
Member

timholy commented Oct 1, 2015

Shoot, I missed that you were timing f, not collect.

Still, for me, those are identical. I'm running tk/backports5:

julia> versioninfo()
Julia Version 0.4.0-rc3+30
Commit 440c8c7* (2015-09-30 16:45 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7 CPU       L 640  @ 2.13GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

Possibly fixed by #13355? (Which is included in tk/backports5, and the main reason I'm running it.)

@KristofferC
Member

Changed to the same commit, still a difference. Odd. I get the same LLVM code as before.

julia> @time collect(x);
  0.049824 seconds (7 allocations: 76.294 MB)

julia> @time vc(x);
  0.130104 seconds (6 allocations: 76.294 MB, 29.40% gc time)
julia> versioninfo()
Julia Version 0.4.0-rc3+30
Commit 440c8c7* (2015-09-30 16:45 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

@stevengj
Member Author

stevengj commented Oct 1, 2015

I just built with the latest 0.5 (0.5.0-dev+7982), and @time f(x) is the same for the LinSpace version but got slower (0.12s vs. 0.08s) for the collect version, so that's not good...

@carlobaldassi
Member

I just tested on the latest master branch (v"0.5.0-dev+564"), and the difference between collect, vc and the comprehension disappears. The original performance issue is still there, though:

julia-0.5> @time collect(x);
  0.046547 seconds (7 allocations: 76.294 MB, 15.26% gc time)

julia-0.5> @time vc(x);
  0.047345 seconds (6 allocations: 76.294 MB, 49.61% gc time)

julia-0.5> @time c(x);
  0.046630 seconds (6 allocations: 76.294 MB, 14.31% gc time)

julia-0.5> @time f(x);
  0.287785 seconds (6 allocations: 76.294 MB, 0.41% gc time)

julia-0.5> y = collect(x);

julia-0.5> @time f(y);
  0.085029 seconds (6 allocations: 76.294 MB, 27.70% gc time)

@stevengj I don't observe the slowdown you mention. We seem to have a different "latest" 0.5 version (I just pulled and compiled):

julia-0.5> versioninfo()
Julia Version 0.5.0-dev+564
Commit 966850c* (2015-10-01 19:15 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

@mbauman
Member

mbauman commented Oct 3, 2015

Similarly, I cannot see any difference between collect, vcat, and vc.

For SGJ's original f(X) = [sin(x) + x^2 - 3 for x in X] function, there doesn't seem to be anything obviously wrong in the LLVM inner loop:

julia> @code_llvm f(x)
define %jl_value_t* @julia_f_22480(%LinSpace*) {
; ...
L1:                                               ; preds = %pass3, %L1.preheader
  %"#s237.0" = phi i64 [ %43, %pass3 ], [ 0, %L1.preheader ]
  %"#s238.0" = phi i64 [ %44, %pass3 ], [ 1, %L1.preheader ]
  ; This is the LinSpace next calculation:
  %26 = load double* %6, align 8
  %27 = sitofp i64 %"#s238.0" to double
  %28 = fsub double %26, %27
  %29 = load double* %22, align 8
  %30 = fmul double %28, %29
  %31 = add i64 %"#s238.0", -1
  %32 = sitofp i64 %31 to double
  %33 = load double* %23, align 8
  %34 = fmul double %32, %33
  %35 = fadd double %30, %34
  %36 = load double* %24, align 8
  %37 = fdiv double %35, %36
  ; Call sin
  %38 = call double inttoptr (i64 13391592272 to double (double)*)(double %37)
  ; … identical after this

The thing that's really remarkable is that there's a bounds check when using the full array, and it's still faster (ranges don't check bounds in next, since they're just doing math):

julia> @code_llvm f(collect(x))
define %jl_value_t* @julia_f_22666(%jl_value_t*, %jl_value_t**, i32) {
; ...
L1:                                               ; preds = %pass, %L1.preheader
  %"#s237.0" = phi i64 [ %29, %pass ], [ 0, %L1.preheader ]
  %"#s238.0" = phi i64 [ %30, %pass ], [ 1, %L1.preheader ]
  ; Check bounds
  %16 = add i64 %"#s238.0", -1
  %17 = load i64* %10, align 8
  %18 = icmp ult i64 %16, %17
  br i1 %18, label %idxend, label %oob

oob:                                              ; preds = %L1
  %19 = alloca i64, align 8
  store i64 %"#s238.0", i64* %19, align 8
  call void @jl_bounds_error_ints(%jl_value_t* %8, i64* %19, i64 1)
  unreachable

idxend:                                           ; preds = %L1
  ; load element
  %20 = load i8** %14, align 8
  %21 = bitcast i8* %20 to double*
  %22 = getelementptr double* %21, i64 %16
  %23 = load double* %22, align 8
  ; Call sin
  %24 = call double inttoptr (i64 13391592272 to double (double)*)(double %23)
  ; … identical after this

The other strange thing about this is that there seems to be some sort of interaction between the LinSpace calculation and the math. If I just "collect" the LinSpace in a comprehension without doing anything it takes about 60ms longer than collecting a regular Array… but f takes more than 150ms longer!

julia> f(X) = [sin(x) + x^2 - 3 for x in X]
       c(X) = [x for x in X]
       x = linspace(0,1,10^7)
       y = collect(x);

julia> using Benchmarks

julia> @benchmark c(x)
================ Benchmark Results ========================
     Time per evaluation: 98.70 ms [93.92 ms, 103.48 ms]
# … snipped …

julia> @benchmark c(y)
================ Benchmark Results ========================
     Time per evaluation: 42.30 ms [41.58 ms, 43.02 ms]

julia> @benchmark f(x)
================ Benchmark Results ========================
     Time per evaluation: 357.58 ms [328.14 ms, 387.02 ms]

julia> @benchmark f(y)
================ Benchmark Results ========================
     Time per evaluation: 183.70 ms [180.26 ms, 187.13 ms]

Similarly, if we just compare the cost of computing one element to the cost of the scalar computation that we're doing in the comprehension, it would imply that there shouldn't be more than a 10-20% overhead:

julia> g(x) = sin(x) + x^2 - 3
g (generic function with 1 method)

julia> @benchmark g(1.)
================ Benchmark Results ========================
     Time per evaluation: 24.90 ns [24.75 ns, 25.05 ns]

julia> @benchmark next(x,1)
================ Benchmark Results ========================
     Time per evaluation: 6.73 ns [6.68 ns, 6.78 ns]

Even stranger: things get worse with ./julia -O. I think we're just hitting some really strange opcode latency interactions within the CPU, so this will probably vary by specific microarchitecture and LLVM version (I'm still running LLVM 3.3 on an old Nehalem i5). But I'm out of ideas.

@timholy
Member

timholy commented Nov 4, 2015

I wondered if there might be a fix similar to the one for #13866. That didn't pan out, but I did discover something interesting (release-0.4 branch): @simd with a LinSpace results in ~7x faster code, despite the fact that it doesn't successfully trigger vectorization. This turns out to be as fast as running it on the collected array with vectorization:

# The following function will be tested with and without the @simd
function foo(X)
    s = zero(eltype(X))
    @inbounds @simd for x in X
        s += x
    end
    s
end

With the @simd

julia> @time foo(linspace(1,5,10^7))
  0.011315 seconds (7 allocations: 240 bytes)
3.000000000000007e7

julia> @code_llvm foo(linspace(1,5,10^7))

define double @julia_foo_21514(%LinSpace*) {
L:
  %1 = load %LinSpace* %0, align 8
  %2 = extractvalue %LinSpace %1, 0
  %3 = extractvalue %LinSpace %1, 1
  %4 = extractvalue %LinSpace %1, 2
  %5 = extractvalue %LinSpace %1, 3
  %6 = fadd double %4, -1.000000e+00
  %7 = bitcast double %6 to i64
  %8 = icmp sgt i64 %7, -1
  %9 = fadd double %4, 1.000000e+00
  %10 = select i1 %8, double %4, double %9
  %11 = fptosi double %10 to i64
  %12 = sitofp i64 %11 to double
  %13 = fptosi double %12 to i64
  %14 = icmp eq i64 %11, %13
  %15 = fcmp oeq double %10, %12
  %16 = and i1 %15, %14
  %17 = icmp slt i64 %11, 1
  %18 = fmul double %2, %6
  %19 = fdiv double %18, %5
  %20 = fcmp ugt double %4, 0.000000e+00
  %21 = fsub double %3, %2
  %22 = fdiv double %21, %5
  %23 = select i1 %20, double %22, double 0x7FF8000000000000
  br i1 %16, label %pass, label %fail

fail:                                             ; preds = %L
  %24 = load %jl_value_t** @jl_inexact_exception, align 8
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %24, i32 67)
  unreachable

pass:                                             ; preds = %L
  br i1 %17, label %L9, label %L3

L3:                                               ; preds = %L3, %pass
  %"##i#7058.0" = phi i64 [ %29, %L3 ], [ 0, %pass ]
  %s.1 = phi double [ %28, %L3 ], [ 0.000000e+00, %pass ]
  %25 = sitofp i64 %"##i#7058.0" to double
  %26 = fmul double %25, %23
  %27 = fadd double %19, %26
  %28 = fadd fast double %s.1, %27
  %29 = add i64 %"##i#7058.0", 1
  %exitcond = icmp eq i64 %29, %11
  br i1 %exitcond, label %L9, label %L3

L9:                                               ; preds = %L3, %pass
  %s.3 = phi double [ 0.000000e+00, %pass ], [ %28, %L3 ]
  ret double %s.3
}

Without the @simd

julia> @time foo(linspace(1,5,10^7))
  0.083340 seconds (7 allocations: 240 bytes)
3.000000000000007e7

julia> @code_llvm foo(linspace(1,5,10^7))

define double @julia_foo_21534(%LinSpace*) {
top:
  %1 = load %LinSpace* %0, align 8
  %2 = extractvalue %LinSpace %1, 2
  %3 = fadd double %2, -1.000000e+00
  %4 = bitcast double %3 to i64
  %5 = icmp sgt i64 %4, -1
  %6 = fadd double %2, 1.000000e+00
  %7 = select i1 %5, double %2, double %6
  %8 = fptosi double %7 to i64
  %9 = sitofp i64 %8 to double
  %10 = fptosi double %9 to i64
  %11 = icmp eq i64 %8, %10
  %12 = fcmp oeq double %7, %9
  %13 = and i1 %12, %11
  br i1 %13, label %pass, label %fail

fail:                                             ; preds = %top
  %14 = load %jl_value_t** @jl_inexact_exception, align 8
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %14, i32 3)
  unreachable

pass:                                             ; preds = %top
  %15 = icmp eq i64 %8, 0
  br i1 %15, label %L5, label %L.preheader

L.preheader:                                      ; preds = %pass
  %16 = extractvalue %LinSpace %1, 0
  %17 = extractvalue %LinSpace %1, 1
  %18 = extractvalue %LinSpace %1, 3
  br label %pass4

pass4:                                            ; preds = %pass4, %L.preheader
  %"#s1.0" = phi i64 [ %28, %pass4 ], [ 1, %L.preheader ]
  %s.0 = phi double [ %27, %pass4 ], [ 0.000000e+00, %L.preheader ]
  %19 = sitofp i64 %"#s1.0" to double
  %20 = fsub double %2, %19
  %21 = fmul double %16, %20
  %22 = add i64 %"#s1.0", -1
  %23 = sitofp i64 %22 to double
  %24 = fmul double %17, %23
  %25 = fadd double %21, %24
  %26 = fdiv double %25, %18
  %27 = fadd double %s.0, %26
  %28 = add i64 %"#s1.0", 1
  %29 = icmp eq i64 %"#s1.0", %8
  br i1 %29, label %L5, label %pass4

L5:                                               ; preds = %pass4, %pass
  %s.1 = phi double [ 0.000000e+00, %pass ], [ %27, %pass4 ]
  ret double %s.1
}

Reference: using an Array

function foo2(X)
    s = zero(eltype(X))
    @inbounds @simd for i = 1:length(X)
        s += X[i]
    end
    s
end

julia> @time foo2(X)
  0.013635 seconds (5 allocations: 176 bytes)
3.0000000000000022e7

julia> @code_llvm foo2(X)

define double @julia_foo2_21527(%jl_value_t*) {
L:
  %1 = getelementptr inbounds %jl_value_t* %0, i64 1
  %2 = bitcast %jl_value_t* %1 to i64*
  %3 = load i64* %2, align 8
  %4 = icmp sgt i64 %3, 0
  %5 = select i1 %4, i64 %3, i64 0
  %6 = bitcast %jl_value_t* %0 to i8**
  %7 = call { i64, i1 } @llvm.ssub.with.overflow.i64(i64 %5, i64 1)
  %8 = extractvalue { i64, i1 } %7, 1
  br i1 %8, label %fail, label %pass

fail:                                             ; preds = %L
  %9 = load %jl_value_t** @jl_overflow_exception, align 8
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %9, i32 67)
  unreachable

pass:                                             ; preds = %L
  %10 = extractvalue { i64, i1 } %7, 0
  %11 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %10, i64 1)
  %12 = extractvalue { i64, i1 } %11, 1
  br i1 %12, label %fail1, label %pass2

fail1:                                            ; preds = %pass
  %13 = load %jl_value_t** @jl_overflow_exception, align 8
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %13, i32 67)
  unreachable

pass2:                                            ; preds = %pass
  %14 = extractvalue { i64, i1 } %11, 0
  %15 = icmp slt i64 %14, 1
  br i1 %15, label %L11, label %if3

if3:                                              ; preds = %pass2
  %16 = load i8** %6, align 8
  %17 = bitcast i8* %16 to double*
  %n.vec = and i64 %14, -4
  %cmp.zero = icmp eq i64 %n.vec, 0
  br i1 %cmp.zero, label %middle.block, label %vector.body

vector.body:                                      ; preds = %vector.body, %if3
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %if3 ]
  %vec.phi = phi <2 x double> [ %22, %vector.body ], [ zeroinitializer, %if3 ]
  %vec.phi13 = phi <2 x double> [ %23, %vector.body ], [ zeroinitializer, %if3 ]
  %18 = getelementptr double* %17, i64 %index
  %19 = bitcast double* %18 to <2 x double>*
  %wide.load = load <2 x double>* %19, align 8
  %.sum19 = or i64 %index, 2
  %20 = getelementptr double* %17, i64 %.sum19
  %21 = bitcast double* %20 to <2 x double>*
  %wide.load14 = load <2 x double>* %21, align 8
  %22 = fadd <2 x double> %vec.phi, %wide.load
  %23 = fadd <2 x double> %vec.phi13, %wide.load14
  %index.next = add i64 %index, 4
  %24 = icmp eq i64 %index.next, %n.vec
  br i1 %24, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body, %if3
  %resume.val = phi i64 [ 0, %if3 ], [ %n.vec, %vector.body ]
  %rdx.vec.exit.phi = phi <2 x double> [ zeroinitializer, %if3 ], [ %22, %vector.body ]
  %rdx.vec.exit.phi17 = phi <2 x double> [ zeroinitializer, %if3 ], [ %23, %vector.body ]
  %bin.rdx = fadd <2 x double> %rdx.vec.exit.phi17, %rdx.vec.exit.phi
  %rdx.shuf = shufflevector <2 x double> %bin.rdx, <2 x double> undef, <2 x i32> <i32 1, i32 undef>
  %bin.rdx18 = fadd <2 x double> %bin.rdx, %rdx.shuf
  %25 = extractelement <2 x double> %bin.rdx18, i32 0
  %cmp.n = icmp eq i64 %14, %resume.val
  br i1 %cmp.n, label %L11, label %L5

L5:                                               ; preds = %L5, %middle.block
  %"##i#6971.0" = phi i64 [ %29, %L5 ], [ %resume.val, %middle.block ]
  %s.1 = phi double [ %28, %L5 ], [ %25, %middle.block ]
  %26 = getelementptr double* %17, i64 %"##i#6971.0"
  %27 = load double* %26, align 8
  %28 = fadd fast double %s.1, %27
  %29 = add i64 %"##i#6971.0", 1
  %exitcond = icmp eq i64 %29, %14
  br i1 %exitcond, label %L11, label %L5

L11:                                              ; preds = %L5, %middle.block, %pass2
  %s.3 = phi double [ 0.000000e+00, %pass2 ], [ %28, %L5 ], [ %25, %middle.block ]
  ret double %s.3
}

@timholy
Member

timholy commented Nov 4, 2015

On master (and a different machine) things are even more dramatic, because it vectorizes with the @simd. I get a 14x difference. Also, foo2 is 7x slower than foo if you pass it the LinSpace object, presumably due to bounds-checking preventing the vectorization.

@simonster
Member

With @simd, x is being computed as first(r)+i*step(r) rather than ((r.len-i)*r.start + (i-1)*r.stop)/r.divisor. That's faster, but it's not quite right.
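Spelled out as a sketch (1-based index i; paraphrased, not verbatim Base code):

# Exact LinSpace formula: two int-to-float conversions, two multiplies,
# an add and a division per element.
elt_exact(r, i) = ((r.len - i)*r.start + (i - 1)*r.stop) / r.divisor
# What the @simd-transformed loop effectively computes: no division in the
# loop, but it can lose the ulp-level accuracy the exact form aims for.
elt_fast(r, i) = first(r) + (i - 1)*step(r)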

@timholy
Member

timholy commented Nov 4, 2015

Ah, I bet it all comes down to that division; one cannot account for a 7-fold effect from a couple of extra adds or multiplies. I don't have time to check now, but perhaps we should be storing, and multiplying by, 1/r.divisor? Or does that destroy the ulp-level accuracy that motivated this construct? CC @StefanKarpinski.
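As a minimal sketch of that idea (hypothetical type and field names, not how Base defines LinSpace): store the reciprocal at construction time so the hot loop multiplies instead of divides.

immutable RecipLinSpace
    start::Float64
    stop::Float64
    len::Float64
    invdivisor::Float64   # 1/divisor, computed once up front
end
# Same element formula, with the per-element fdiv replaced by an fmul;
# whether this keeps the ulp-level accuracy is exactly the open question above.
recip_elt(r, i) = ((r.len - i)*r.start + (i - 1)*r.stop) * r.invdivisor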

@jrevels added the potential benchmark (Could make a good benchmark in BaseBenchmarks) label on Nov 17, 2015
@c42f
Member

c42f commented Jan 26, 2016

Over at #14420 I've been reconsidering the LinSpace implementation for other reasons, but it seems relevant to this issue too, especially since I'm also seeing similar strange and confusing inconsistencies when it comes to benchmarking. With my current LinSpace2 hack, there's about a 10x speed advantage over LinSpace in Tim's foo simd test loop.

However, LinSpace2 generates a warning because of the somewhat bogus way I implemented last(). Fine, that's easy to fix and I did so... but that completely kills the performance boost (back to only 2x faster). With that in mind I've left last() as it is for now, in the hope that it'll shed some light here.

@jrevels closed this as completed on Jan 26, 2016
@jrevels reopened this on Jan 26, 2016
@jrevels
Member

jrevels commented Jan 26, 2016

(sorry, finger slipped when scrolling)

blakejohnson added a commit that referenced this issue Feb 7, 2016
Fix #13401 (disable inbound checking for arrays generated on the fly)