
julia parallel accumulator #11228

Closed

fabioramponi opened this issue May 11, 2015 · 14 comments
Labels: parallelism (Parallel or distributed computation), performance (Must go faster)

Comments

@fabioramponi

Hi,
I am testing the parallel programming features in Julia (version 0.4.0-dev+4629).

I started 4 local worker processes on my laptop with addprocs(4), then defined this simple function on every process:

@everywhere function twice(n::Int32)
    2 * n 
end

Now, if I run an accumulating reduction that sums 2*x over random booleans x, using either the function twice or an inline multiplication by 2, I see a huge difference in performance:

@time tot1 = @parallel (+) for i in 1:200000000
    2*Int32(rand(Bool))
end

returns

elapsed time: 0.167981744 seconds (364 kB allocated),

while

@time tot1 = begin
    @parallel (+) for i in 1:200000000
        begin 
            twice(Int32(rand(Bool)))
        end
    end
end

returns

elapsed time: 1.94347008 seconds (382 kB allocated)

The two loops take the same amount of time when run on a single process.
Is this behaviour expected?

This is the output of versioninfo():

Julia Version 0.4.0-dev+4629
Commit 85582bd* (2015-05-04 00:27 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin14.3.0)
CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
WORD_SIZE: 64
BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.3

@oxinabox
Contributor

Did you (in both cases) remember to run the benchmark twice and discard the first result? Always do this when timing Julia code, because the overhead of compiling the functions is considerable (but it only happens on the first call).
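
For example, a minimal warm-up pattern (just a sketch; twice_all is a made-up name):

# run once to trigger JIT compilation, then time the second call
twice_all(N) = sum(i -> 2 * i, 1:N)

twice_all(10)          # warm-up call compiles the method
@time twice_all(10^8)  # this measurement excludes compilation overhead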

Second idea:

During compilation LLVM picks up on instances of SIMD (single instruction, multiple data) parallelism and optimises them to use the CPU's vector machinery. The loop could then be translated into a (very fast) SIMD 2*Array operation; or, since it runs under @parallel, the many per-worker loops would each become a SIMD 2*Array operation.
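
If you want to check whether that happened, you can inspect the generated IR (a sketch; double_sum is a made-up name, and the vector types only show up if LLVM actually vectorized the loop):

function double_sum(v::Vector{Int32})
    s = Int32(0)
    @inbounds for x in v
        s += Int32(2) * x   # Int32 literal keeps the accumulator type stable
    end
    return s
end

# vectorized code contains LLVM vector types such as <8 x i32> in the IR
@code_llvm double_sum(Int32[1, 2, 3])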

@fabioramponi
Author

Thanks for the reply.
To answer your question: yes, in both cases I reported only the results of the second run.

The problem I see here is that calling the function twice, rather than writing 2*Int32(rand(Bool)) directly in the @parallel loop, degrades performance by a factor of ten.

@jiahao
Member

jiahao commented May 15, 2015

From what I can tell, the compiler applies an optimization to the first code block that it fails to apply to the second. It looks like some constant-folding optimizations didn't happen with twice(); the need to inline the function probably confused the optimization passes.
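
(The exact definitions of f and g weren't posted; presumably they wrap the two benchmarks from the original report, roughly like the following sketch.)

f() = @parallel (+) for i in 1:200000000
    2 * Int32(rand(Bool))
end

g() = @parallel (+) for i in 1:200000000
    twice(Int32(rand(Bool)))
end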

julia> @code_native f()
Source line: 1
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %rbx
        subq    $40, %rsp
        movq    $4, -40(%rbp)
Source line: 1
        movabsq $jl_pgcstack, %rbx
        movq    (%rbx), %rax
        movq    %rax, -32(%rbp)
        leaq    -40(%rbp), %rax
        movq    %rax, (%rbx)
        movq    $0, -24(%rbp)
        movabsq $4486589424, %rax ## imm = 0x10B6BEBF0
Source line: 1504
        movq    %rax, -16(%rbp)
        movabsq $4486589424, %rdi ## imm = 0x10B6BEBF0
        xorl    %esi, %esi
        xorl    %edx, %edx
        callq   *(%rax)
        movabsq $13038518624, %rcx ## imm = 0x309280160
        movabsq $4444474056, %rdx ## imm = 0x108E94AC8
        movq    %rax, -24(%rbp)
        movq    (%rdx), %rdi
        movq    %rax, %rsi
        movl    $200000000, %edx ## imm = 0xBEBC200
        callq   *%rcx
        movq    -32(%rbp), %rcx
        movq    %rcx, (%rbx)
        addq    $40, %rsp
        popq    %rbx
        popq    %rbp
        ret

julia> @code_native g()
Source line: 1
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %rbx
        subq    $40, %rsp
        movq    $6, -48(%rbp)
Source line: 1
        movabsq $jl_pgcstack, %rbx
        movq    (%rbx), %rax
        movq    %rax, -40(%rbp)
        leaq    -48(%rbp), %rax
        movq    %rax, (%rbx)
        movq    $0, -16(%rbp)
        movq    $0, -32(%rbp)
        movabsq $4486663920, %rax ## imm = 0x10B6D0EF0
Source line: 1504
        movq    %rax, -24(%rbp)
        leaq    -16(%rbp), %rsi
        movabsq $4469143928, %rcx ## imm = 0x10A61B978
        movq    (%rax), %rax
        movq    (%rcx), %rcx
        movq    %rcx, -16(%rbp)
        movabsq $4486663920, %rdi ## imm = 0x10B6D0EF0
        movl    $1, %edx
        callq   *%rax
        movabsq $13038518624, %rcx ## imm = 0x309280160
        movabsq $4444474056, %rdx ## imm = 0x108E94AC8
        movq    %rax, -32(%rbp)
        movq    (%rdx), %rdi
        movq    %rax, %rsi
        movl    $200000000, %edx ## imm = 0xBEBC200
        callq   *%rcx
        movq    -40(%rbp), %rcx
        movq    %rcx, (%rbx)
        addq    $40, %rsp
        popq    %rbx
        popq    %rbp
        ret

Hopefully one of the compiler experts can have a look. @vtjnash maybe?

@kshyatt kshyatt added the parallelism Parallel or distributed computation label Jul 2, 2015
@simonster simonster added the performance Must go faster label Jul 2, 2015
@amitmurthy
Contributor

Qualifying twice as Main.twice gives the same performance as 2*:

@time tot1 = @parallel (+) for i in 1:200000000
    Main.twice(Int32(rand(Bool)))
end

I don't know the reason for this. @vtjnash ?

@fabioramponi
Author

Further analysis of this performance issue:

The output of macroexpand in the two cases is not identical:

macroexpand(
         :(@time tot1 = begin
             @parallel (+) for i in 1:20000000
             begin 
               Main.twice(Int32(rand(Bool)))
             end
           end
        end))

gives

...    
local #5#val = (tot1 = begin  # none, line 3:
                    $(Expr(:localize, :(()->begin  # expr.jl, line 113:
 ...
        end)))
                end) # line 156:
...
end

while

macroexpand(
  :(@time tot1 = begin
      @parallel (+) for i in 1:20000000
      begin 
        twice(Int32(rand(Bool)))
      end
    end
 end))

gives this output:

...    
local #5#val = (tot1 = begin  # none, line 3:
                    $(Expr(:localize, :(()->begin  # expr.jl, line 113:
 ...
        end), :twice, :twice))
                end) # line 156:
...
end

The difference is in the arguments passed to localize in addition to the anonymous function: there are no extra arguments when the call is written as Main.twice, and two extra arguments (:twice, :twice) when it is written as twice.

The macro @parallel calls localize_vars, which in turn calls find_vars; following the macro expansion, find_vars is called in one case on :(Main.twice), returning an empty list, and in the other on :twice, returning [:twice].
The behaviour here seems inconsistent: I would expect the same result in both cases.
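
The two call targets also have different AST shapes, which presumably is what find_vars keys on: a bare twice is a Symbol, while Main.twice is a dotted Expr (a sketch; the exact dump output varies across Julia versions):

julia> dump(:twice)
Symbol twice

julia> dump(:(Main.twice))
Expr
  head: Symbol .
  args: Array{Any}((2,))
    1: Symbol Main
    2: QuoteNode
      value: Symbol twice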

@malmaud
Contributor

malmaud commented Oct 14, 2015

Good detective work. If the root problem does indeed turn out to be the implementation of localize_vars and find_vars, I can probably get around to looking at this in the next week. If it's a lower-level issue with codegen, then probably not.

@malmaud
Contributor

malmaud commented Oct 16, 2015

OK, I think I have a handle on this. @fabioramponi is right about the inconsistency - here's a more minimal demo:

julia> x=1
1

julia> Base.find_vars(:x)
1-element Array{Any,1}:
 :x

julia> Base.find_vars(:(Main.x))
0-element Array{Any,1}

Since Main.twice isn't "found", it won't be localized.
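
(Roughly speaking, localization rebinds each found variable in a let around the generated thunk, so a bare twice ends up shaped like the sketch below, while the qualified Main.twice is left untouched.)

let twice = twice
    () -> twice(Int32(rand(Bool)))
end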

Here's a demo of the performance effect of localization (which I believe but am not totally sure is ultimately the same as what will happen to the body of the parallel loop):

julia> twice(x)=2x
twice (generic function with 1 method)

julia> function f1(N)
         for n=1:N
           twice(n)
         end
       end
f1 (generic function with 1 method)

julia> function f2(N)
         let twice=twice
           for n=1:N
             twice(n)
           end
         end
       end
f2 (generic function with 1 method)

# .. Elided warmup timings...
julia> @time f1(10^6)
  0.000005 seconds (5 allocations: 176 bytes)
julia> @time f2(10^6)
  0.030310 seconds (2.00 M allocations: 30.506 MB, 6.83% gc time)

Putting these two pieces together explains what's happening here.
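
As a practical workaround until this is fixed, either of the forms already shown in this thread avoids the slow path (a sketch reusing the setup from the original report):

# qualified call: find_vars returns nothing, so no localization happens
@time @parallel (+) for i in 1:200000000
    Main.twice(Int32(rand(Bool)))
end

# inline arithmetic: no function variable to localize at all
@time @parallel (+) for i in 1:200000000
    2 * Int32(rand(Bool))
end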

@StefanKarpinski
Member

Is this relevant anymore?

@tkelman
Contributor

tkelman commented Aug 27, 2016

JuliaLang/Distributed.jl#30 absolutely is, and should be milestoned for either 0.6.0 or 1.0. This might be considered a subset of that, however.

@ViralBShah
Member

cc @amitmurthy

@andreasnoack
Member

This issue seems fixed on 0.5.

@amitmurthy
Contributor

Only because localize_vars is broken on 0.5 and master. 😄

@StefanKarpinski
Member

So should we reopen, or just defer this issue to deprecating @parallel for altogether?

@amitmurthy
Contributor

Let's keep it closed. I do expect localize_vars to be removed (#19594) anyway.
