Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

julia parallel accumulator #11228

Closed
fabioramponi opened this issue May 11, 2015 · 14 comments
Closed

julia parallel accumulator #11228

fabioramponi opened this issue May 11, 2015 · 14 comments
Labels
parallelism Parallel or distributed computation performance Must go faster
Milestone

Comments

@fabioramponi
Copy link

Hi,
I am testing the parallel programming features in Julia (version 0.4.0-dev+4629).

I started 4 local worker processes on my laptop with addprocs(4) then I defined on every process this simple function:

@everywhere function twice(n::Int32)
    2 * n 
end

Now, if I run an accumulator function which adds 2*x for x random booleans by using the function twice or multiplying by 2 inline I found a huge difference in performances:

@time tot1 = @parallel (+) for i in 1:200000000
    2*Int32(rand(Bool))
end

returns

elapsed time: 0.167981744 seconds (364 kB allocated),

while

@time tot1 = begin
    @parallel (+) for i in 1:200000000
        begin 
            twice(Int32(rand(Bool)))
        end
    end
end

returns

elapsed time: 1.94347008 seconds (382 kB allocated)

the 2 loops take the same time for running on a single process.
Is this behaviour expected?

This is the output of versioninfo()

Julia Version 0.4.0-dev+4629
Commit 85582bd* (2015-05-04 00:27 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin14.3.0)
CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
WORD_SIZE: 64
BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.3

@oxinabox
Copy link
Contributor

Did you (in both cases),
remember to run twice, and discard the first result?
Always do this when profiling julia,
because the overhead of compiling the functions is considerable (but only happens the first time)

Second idea:

During compilation LLVM picks up on instances of SIMD (single instruction multiple data), and optimises them to use the CPU machinery for such. So the loop would be translated into a (very fast) SIMD 2*Array operation. Or since @parallel, the many loops would be made into many SIMD 2*Array operations.

@fabioramponi
Copy link
Author

Thanks for the reply.
Answering to your question: yes, in both cases I reported the results only for the second run.

The problem I see here is that using the function twice w.r.t. writing directly 2*Int32(rand(Bool)) in the @parallel loop causes a degradation of performances by a factor of 10.

@jiahao
Copy link
Member

jiahao commented May 15, 2015

From what I can tell, the compiler isn't smart enough to detect an optimization that it did to the first code block that it didn't do to the second code block. It looks there are some constant folding optimizations that didn't happen with twice(); the need to inline the function probably confused the optimization passes.

julia> @code_native f()                                 julia> @code_native g()         
Source line: 1                                          Source line: 1          
        pushq   %rbp                                    pushq   %rbp
        movq    %rsp, %rbp                                      movq    %rsp, %rbp
        pushq   %rbx                                    pushq   %rbx
        subq    $40, %rsp                                       subq    $40, %rsp
        movq    $4, -40(%rbp)                                   movq    $6, -48(%rbp)
Source line: 1                                          Source line: 1          
        movabsq $jl_pgcstack, %rbx                      movabsq $jl_pgcstack, %rbx
        movq    (%rbx), %rax                            movq    (%rbx), %rax
        movq    %rax, -32(%rbp)                         movq    %rax, -40(%rbp)
        leaq    -40(%rbp), %rax                         leaq    -48(%rbp), %rax
        movq    %rax, (%rbx)                            movq    %rax, (%rbx)
        movq    $0, -24(%rbp)                           movq    $0, -16(%rbp)
        movabsq $4486589424, %rax ## imm = 0x10B6BEBF0  movq    $0, -32(%rbp)
Source line: 1504                                       movabsq $4486663920, %rax ## imm = 0x10B6D0EF0
        movq    %rax, -16(%rbp)                         Source line: 1504               
        movabsq $4486589424, %rdi ## imm = 0x10B6BEBF0  movq    %rax, -24(%rbp)
        xorl    %esi, %esi                              leaq    -16(%rbp), %rsi
        xorl    %edx, %edx                              movabsq $4469143928, %rcx ## imm = 0x10A61B978
        callq   *(%rax)         movq                    (%rax), %rax
        movabsq $13038518624, %rcx ## imm = 0x309280160 movq    (%rcx), %rcx
        movabsq $4444474056, %rdx ## imm = 0x108E94AC8  movq    %rcx, -16(%rbp)
        movq    %rax, -24(%rbp)                         movabsq $4486663920, %rdi ## imm = 0x10B6D0EF0
        movq    (%rdx), %rdi                            movl    $1, %edx
        movq    %rax, %rsi                              callq   *%rax
        movl    $200000000, %edx ## imm = 0xBEBC200     movabsq $13038518624, %rcx ## imm = 0x309280160
        callq   *%rcx                                   movabsq $4444474056, %rdx ## imm = 0x108E94AC8
        movq    -32(%rbp), %rcx                         movq    %rax, -32(%rbp)
        movq    %rcx, (%rbx)                            movq    (%rdx), %rdi
        addq    $40, %rsp                               movq    %rax, %rsi
        popq    %rbx                                    movl    $200000000, %edx ## imm = 0xBEBC200
        popq    %rbp                                    callq   *%rcx
        ret                                             movq    -40(%rbp), %rcx
                                                        movq    %rcx, (%rbx)
                                                        addq    $40, %rsp
                                                        popq    %rbx
                                                        popq    %rbp
                                                        ret     

Hopefully one of the compiler experts can have a look. @vtjnash maybe?

@kshyatt kshyatt added the parallelism Parallel or distributed computation label Jul 2, 2015
@simonster simonster added the performance Must go faster label Jul 2, 2015
@amitmurthy
Copy link
Contributor

Qualifying twice as Main.twice gives the same performance as 2*

@time tot1 = @parallel (+) for i in 1:200000000
    Main.twice(Int32(rand(Bool)))
end

I don't know the reason for this. @vtjnash ?

@fabioramponi
Copy link
Author

Further analysis of this performance issue:

the output of macroexpand in the 2 cases is not identical:

macroexpand(
         :(@time tot1 = begin
             @parallel (+) for i in 1:20000000
             begin 
               Main.twice(Int32(rand(Bool)))
             end
           end
        end))

gives

...    
local JuliaLang/julia#5#val = (tot1 = begin  # none, line 3:
                    $(Expr(:localize, :(()->begin  # expr.jl, line 113:
 ...
        end)))
                end) # line 156:
...
end

while

macroexpand(
  :(@time tot1 = begin
      @parallel (+) for i in 1:20000000
      begin 
        twice(Int32(rand(Bool)))
      end
    end
 end))

gives this output:

...    
local JuliaLang/julia#5#val = (tot1 = begin  # none, line 3:
                    $(Expr(:localize, :(()->begin  # expr.jl, line 113:
 ...
        end), :twice, :twice))
                end) # line 156:
...
end

The difference is on the arguments passed to localize in addition to the lambda function: no extra args when called with Main.twice, two extra arguments (:twice, :twice) when called with twice.

The macro @parallel calls localize_vars which in turn calls find_vars; by following the macro expansion, find_vars is called in one case on :(Main.twice) returning an empty list, in the other on :twice returning [:twice].
Here the behaviour seems to be inconsistent: what I would expect is the same result in both cases.

@malmaud
Copy link
Contributor

malmaud commented Oct 14, 2015

Good detective work. If the root problem does indeed turn out to be the implementation of localize_vars and find_vars, I can probably get around to looking at this in the next week. If it's a lower-level issue with codegen, then probably not.

@malmaud
Copy link
Contributor

malmaud commented Oct 16, 2015

OK, I think I have a handle on this. @fabioramponi is right about the inconsistency - here's a more minimal demo:

julia> x=1
1

julia> Base.find_vars(:x)
1-element Array{Any,1}:
 :x

julia> Base.find_vars(:(Main.x))
0-element Array{Any,1}

Since Main.twice isn't "found", it won't be localized.

Here's a demo of the performance effect of localization (which I believe but am not totally sure is ultimately the same as what will happen to the body of the parallel loop):

julia> twice(x)=2x
twice (generic function with 1 method)

julia> function f1(N)
         for n=1:N
           twice(n)
         end
       end
f1 (generic function with 1 method)

julia> function f2(N)
         let twice=twice
           for n=1:N
             twice(n)
           end
         end
       end
f2 (generic function with 1 method)

# .. Elided warmup timings...
julia> @time f1(10^6)
  0.000005 seconds (5 allocations: 176 bytes)
julia> @time f2(10^6)
  0.030310 seconds (2.00 M allocations: 30.506 MB, 6.83% gc time)

Putting these two pieces together explains what's happening here.

@StefanKarpinski
Copy link
Member

Is this relevant anymore?

@tkelman
Copy link
Contributor

tkelman commented Aug 27, 2016

JuliaLang/Distributed.jl#30 absolutely is, and should be either 0.6.0 or 1.0. This might be considered a subset of that however.

@ViralBShah
Copy link
Member

cc @amitmurthy

@andreasnoack
Copy link
Member

This issue seems fixed on 0.5.

@amitmurthy
Copy link
Contributor

Only because localize_vars is broken on 0.5 and master. 😄

@StefanKarpinski
Copy link
Member

So should we reopen or just defer this issue to deprecating @parallel for altogether?

@amitmurthy
Copy link
Contributor

Let's keep it closed. I do expect localize_vars to be removed (#19594) anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parallelism Parallel or distributed computation performance Must go faster
Projects
None yet
Development

No branches or pull requests