julia parallel accumulator #11228
Comments
Did you … (in both cases)? Second idea: during compilation, LLVM picks up on instances of SIMD (single instruction, multiple data) and optimises them to use the CPU machinery for such, so the loop would be translated into a (very fast) SIMD loop.
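As a way to act on that suggestion, here is a minimal sketch, not from the thread, of how one can check whether LLVM actually vectorised a loop; the function name sumarray and the test vector are made up for illustration.

```julia
# Minimal sketch (names are illustrative, not from the issue): vectorised
# loops show SIMD vector types such as "<4 x i64>" in the emitted LLVM IR.
function sumarray(a::Vector{Int})
    s = 0
    @inbounds for x in a
        s += x      # simple integer reduction the auto-vectoriser can handle
    end
    return s
end

@code_llvm sumarray(collect(1:1000))   # scan the output for <N x i64> operations
```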
Thanks for the reply. The problem I see here is that using the function …
From what I can tell, the compiler applies an optimization to the first code block that it does not apply to the second. It looks like some constant-folding optimizations that happened for f() did not happen for g():

```
julia> @code_native f()
Source line: 1
    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %rbx
    subq    $40, %rsp
    movq    $4, -40(%rbp)
Source line: 1
    movabsq $jl_pgcstack, %rbx
    movq    (%rbx), %rax
    movq    %rax, -32(%rbp)
    leaq    -40(%rbp), %rax
    movq    %rax, (%rbx)
    movq    $0, -24(%rbp)
    movabsq $4486589424, %rax    ## imm = 0x10B6BEBF0
Source line: 1504
    movq    %rax, -16(%rbp)
    movabsq $4486589424, %rdi    ## imm = 0x10B6BEBF0
    xorl    %esi, %esi
    xorl    %edx, %edx
    callq   *(%rax)
    movabsq $13038518624, %rcx   ## imm = 0x309280160
    movabsq $4444474056, %rdx    ## imm = 0x108E94AC8
    movq    %rax, -24(%rbp)
    movq    (%rdx), %rdi
    movq    %rax, %rsi
    movl    $200000000, %edx     ## imm = 0xBEBC200
    callq   *%rcx
    movq    -32(%rbp), %rcx
    movq    %rcx, (%rbx)
    addq    $40, %rsp
    popq    %rbx
    popq    %rbp
    ret
```

```
julia> @code_native g()
Source line: 1
    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %rbx
    subq    $40, %rsp
    movq    $6, -48(%rbp)
Source line: 1
    movabsq $jl_pgcstack, %rbx
    movq    (%rbx), %rax
    movq    %rax, -40(%rbp)
    leaq    -48(%rbp), %rax
    movq    %rax, (%rbx)
    movq    $0, -16(%rbp)
    movq    $0, -32(%rbp)
    movabsq $4486663920, %rax    ## imm = 0x10B6D0EF0
Source line: 1504
    movq    %rax, -24(%rbp)
    leaq    -16(%rbp), %rsi
    movabsq $4469143928, %rcx    ## imm = 0x10A61B978
    movq    (%rax), %rax
    movq    (%rcx), %rcx
    movq    %rcx, -16(%rbp)
    movabsq $4486663920, %rdi    ## imm = 0x10B6D0EF0
    movl    $1, %edx
    callq   *%rax
    movabsq $13038518624, %rcx   ## imm = 0x309280160
    movabsq $4444474056, %rdx    ## imm = 0x108E94AC8
    movq    %rax, -32(%rbp)
    movq    (%rdx), %rdi
    movq    %rax, %rsi
    movl    $200000000, %edx     ## imm = 0xBEBC200
    callq   *%rcx
    movq    -40(%rbp), %rcx
    movq    %rcx, (%rbx)
    addq    $40, %rsp
    popq    %rbx
    popq    %rbp
    ret
```

Hopefully one of the compiler experts can have a look. @vtjnash maybe?
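To make the constant-folding remark concrete, here is a small hedged example with made-up names showing how a folded function looks next to one that cannot be folded; it is only an illustration of the general effect, not a reconstruction of f() and g():

```julia
# Illustration only (h1, h2 and K are invented names, not from the issue).
const K = 4
h1() = 2 * K            # the result is known at compile time and folds to 8
h2() = 2 * rand(Bool)   # depends on runtime input, so nothing can be folded

@code_llvm h1()   # the body reduces to returning a constant
@code_llvm h2()   # the call to rand remains in the generated code
```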
I don't know the reason for this. @vtjnash?
Further analysis of this performance issue: the output of

```julia
macroexpand(
    :(@time tot1 = begin
        @parallel (+) for i in 1:20000000
            begin
                Main.twice(Int32(rand(Bool)))
            end
        end
    end))
```

differs from the output of

```julia
macroexpand(
    :(@time tot1 = begin
        @parallel (+) for i in 1:20000000
            begin
                twice(Int32(rand(Bool)))
            end
        end
    end))
```

The difference is in the arguments passed to … The macro …
Good detective work. If the root problem does indeed turn out to be the implementation of …
OK, I think I have a handle on this. @fabioramponi is right about the inconsistency; here's a more minimal demo:

```julia
julia> x = 1
1

julia> Base.find_vars(:x)
1-element Array{Any,1}:
 :x

julia> Base.find_vars(:(Main.x))
0-element Array{Any,1}
```

Since :(Main.x) yields nothing while :x is found, a Main-qualified variable is not picked up for localization, whereas an unqualified one is.

Here's a demo of the performance effect of localization (which I believe, but am not totally sure, is ultimately the same as what will happen to the body of the parallel loop):

```julia
julia> twice(x) = 2x
twice (generic function with 1 method)

julia> function f1(N)
           for n = 1:N
               twice(n)
           end
       end
f1 (generic function with 1 method)

julia> function f2(N)
           let twice = twice
               for n = 1:N
                   twice(n)
               end
           end
       end
f2 (generic function with 1 method)

# .. elided warmup timings ...

julia> @time f1(10^6)
  0.000005 seconds (5 allocations: 176 bytes)

julia> @time f2(10^6)
  0.030310 seconds (2.00 M allocations: 30.506 MB, 6.83% gc time)
```

Putting these two pieces together explains what's happening here.
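One hedged way, not shown in the thread, to see where f2's slowdown comes from on the 0.4-era Julia being discussed: the let-bound twice is an ordinary local of the abstract type Function, so each call is dispatched dynamically and boxes its argument and result, which is consistent with the roughly two allocations per iteration in the timing above. On such a version, @code_warntype makes the non-concrete call site visible; current Julia versions specialise this pattern and will not reproduce the effect.

```julia
# Sketch for the 0.4-era behaviour discussed above; on current Julia the
# let-bound function has a concrete singleton type and this runs fast.
twice(x) = 2x

function f2(N)
    let twice = twice        # local binding loses the information needed to specialise
        for n = 1:N
            twice(n)         # dispatched at runtime on old versions, causing allocations
        end
    end
end

@code_warntype f2(10^6)      # look for variables annotated with abstract types
```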
Is this relevant anymore?

JuliaLang/Distributed.jl#30 absolutely is, and should be either 0.6.0 or 1.0. This might be considered a subset of that, however.

cc @amitmurthy

This issue seems fixed on 0.5.

Only because …

So should we reopen, or just defer this issue to deprecating …?

Let's keep it closed. I do expect …
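For anyone reading this on a current Julia version, a hedged sketch of the replacement for the old @parallel reducer form; the helper definition mirrors the one used elsewhere in this thread.

```julia
using Distributed

addprocs(4)                      # start 4 local worker processes
@everywhere twice(x) = 2x        # define the helper on every process

# @distributed (+) is the modern spelling of the old `@parallel (+) for ...`
tot = @distributed (+) for i in 1:20000000
    twice(Int32(rand(Bool)))
end
```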
Hi,
I am testing the parallel programming features in Julia (version 0.4.0-dev+4629).
I started 4 local worker processes on my laptop with
addprocs(4)
then I defined this simple function on every process:

…

Now, if I run an accumulator that adds 2*x for random boolean x, either by calling the function twice or by multiplying by 2 inline, I find a huge difference in performance:

…

returns

…

while

…

returns

…

The two loops take the same time when run on a single process.
Is this behaviour expected?

This is the output of versioninfo():

…
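The code blocks in the issue text above were lost in formatting; the following is a hedged reconstruction of the comparison being described, using the 0.4-era API, with the loop body and iteration count taken from the macroexpand comment earlier in the thread and everything else assumed.

```julia
# Hedged reconstruction, not the original issue code.
addprocs(4)
@everywhere twice(x) = 2x

# Accumulator that calls the function defined with @everywhere
@time tot1 = @parallel (+) for i in 1:20000000
    twice(Int32(rand(Bool)))
end

# Accumulator with the multiplication written inline
@time tot2 = @parallel (+) for i in 1:20000000
    2 * Int32(rand(Bool))
end
```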