Speed of data movement in @spawn #9992

Closed
andreasnoack opened this issue Feb 1, 2015 · 16 comments
Labels
parallelism (Parallel or distributed computation), performance (Must go faster)

Comments

@andreasnoack
Member

UPDATE 9 February: After merging #10073, the speed of @spawn for large arrays has improved a lot, so I have updated the plots below.

This is part of #9167 but deserves a separate issue, as it is a well-defined problem, whereas some of the other bullet points are less specific. In short, the issue is that we move data between processes slowly compared to MPI.

I don't know how, or if, this can be fixed, so my best bet is to provide data that illustrates the issue. I hope you can suggest improvements; I can then offer to run benchmarks.

The essence of the issue is in this graph
[plot: time of parallel data movement vs. Vector{Float64} size for @spawn, MPI-TCP, and MPI-SM]
which is an updated version of the graph in #9167. It shows the time of parallel data movement against the size of a Vector{Float64} for three different schemes, of which the first is our @spawn and the other two are MPI-based. In contrast to the plot in #9167, I have now included timings where I force MPI to use TCP instead of shared memory for the data transport.

  • MPI-TCP uses MPI.jl's Send and Recv!. TCP is used for data transport.
  • MPI-SM is the same, but here the data transport is over shared memory instead of TCP.
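
For reference, here is a minimal sketch of how the @spawn-style timing can be measured. It is not the exact benchmark code (that is shared later in the thread); spawn_time is just an illustrative helper, and it assumes one worker process and the pre-0.5 parallel API used throughout this issue.

# Minimal sketch, not the exact benchmark code.
# Assumes at least one worker has been added with addprocs.
addprocs(1)
p = workers()[1]

# Move a Vector{Float64} of length n to worker p and wait for the remote
# task to finish (the @sync), reporting the elapsed time.
function spawn_time(p, n)
    a = rand(n)
    @elapsed @sync @spawnat p length(a)
end

for n in (10, 10^3, 10^5, 10^7)
    spawn_time(p, n)                  # warm up / compile
    println(n, "  ", spawn_time(p, n))
end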

This overhead makes it difficult to benefit from our parallel functionality, e.g. in parallel linear algebra, where significant data movement is unavoidable. Below are some further comments on the graph.

Large arrays

The relative timing between our @spawn and MPI-TCP is now approximately 2x for large arrays.

The difference between @spawn and MPI when working within a node with shared memory is over 3x for the largest arrays and the difference grows as the arrays become smaller.

Small arrays

When the array has fewer than approximately 1000 elements, the size doesn't have an effect on the time it takes to move the array. I don't know exactly where this time is spent, as it is difficult to profile parallel code. However, the bottom line is that @spawn is 10x slower than MPI when using TCP and over 40x slower when MPI is using shared memory for the transport.

Example: Symmetric tridiagonal solver

@alanedelman, his PhD student Eka, and I have made some implementations of parallel symmetric tridiagonal solvers in Julia. For the same parallel algorithm, Eka did an implementation with DArrays and I did an implementation based on MPI.jl. For a problem of dimension 100000 solved on 1, 2, 4, and 8 processors, the timings are shown below
[plot: timings of the symmetric tridiagonal solver for the DArray and MPI.jl implementations vs. number of processors]

and a power-law fit of the timings gives

  • DArray: time = 0.0082 * nprocs^0.4
  • MPI.jl: time = 0.0033 * nprocs^(-0.78)

where MPI uses TCP for transport. Notice that the exponent for the DArray is positive, i.e. the overhead dominates the benefit from parallelization, in contrast to MPI.jl, where the problem scales as expected with the number of processors.

cc: @ViralBShah, @amitmurthy

@jiahao added the parallelism (Parallel or distributed computation) label on Feb 1, 2015
@amitmurthy
Contributor

I'll put together a PR that combines #6876 and #9181. That should address some of the issues.

@amitmurthy
Contributor

If possible, could you share your benchmarking code?

@andreasnoack
Member Author

@amitmurthy
Contributor

@sync @spawnat in the benchmarking code results in both 1) transferring the array to the remote process (the @spawnat) and 2) waiting for an acknowledgement that the data has indeed been transferred (the @sync),

while the MPI code was calling MPI_Send, which according to http://www.mcs.anl.gov/research/projects/mpi/sendmode.html only needs to block until the buffer can be reused.

So, I changed the test to do an echo of the sent buffer: a remotecall_fetch of the same array for Julia and a send-recv combination for MPI.

The changed tests are here - https://github.com/amitmurthy/ParallelBenchmarks.jl
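
Roughly, the Julia side of such an echo timing looks like the following sketch; it is a paraphrase, not the code in the repository above, and echo_time is just an illustrative helper.

# Sketch of the Julia echo timing (see the linked repository for the real code).
# remotecall_fetch sends `a` to worker p and fetches the result of x->x back,
# so the array crosses the wire twice.
function echo_time(p, a)
    remotecall_fetch(p, x->x, a)              # warm up / compile
    @elapsed remotecall_fetch(p, x->x, a)
end

p = workers()[1]
for n in (10, 10^3, 10^5, 10^7)
    println(n, "  ", echo_time(p, rand(n)))
end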

The results I get with an echo test (single worker) are:
[plot: echo test timings, single worker]

@andreasnoack
Member Author

Okay. That might be more fair, but I think the relative timings appear very similar. Notice that I've removed the "MPI_serialize" series from my last plot.

Two questions: do you see any possibilities for improvement in the left part of the graph, and how much of this speedup can be achieved when a more complicated object, e.g. a factorization, is moved?

@amitmurthy
Contributor

The MPI library may be using gather-send and scatter-recv to send its header+data in a single socket call, avoiding an intermediate buffer. I don't see a straightforward way of doing this via libuv currently.

As for complicated objects, I think it is an issue we will see with both MPI.jl and @spawn. Unless isbits returns true for the types, serialization will be an overhead. Maybe #7568 will help with having complicated bits types?

I'll add a plot of serialization-only times to the above graph. That should give us an idea of the serialization overhead.

@amitmurthy
Contributor

[plot: echo timings with an added serialization-only series]

Added a timing of serializing and deserializing the request (basically the values :call_fetch, Base.next_id(), x->x, a) twice. This is what a remotecall_fetch sends over the wire. It is interesting that the bulk of the overhead is not on the network side but in serialization.

The timings are all minimum values in this plot.
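
For illustration, the serialization-only measurement corresponds to something like this sketch. It is a paraphrase, not the exact code; request_ser_time is an illustrative helper and Base.next_id() is the internal mentioned above.

# Sketch of the serialization-only timing (paraphrased; not the exact code).
# Serializes/deserializes roughly what remotecall_fetch puts on the wire
# (:call_fetch, a request id, the function, the array) plus the echoed array.
function request_ser_time(a)
    io = PipeBuffer()
    f = x->x
    serialize(io, (:call_fetch, Base.next_id(), f, a)); deserialize(io)      # warm up
    @elapsed begin
        serialize(io, (:call_fetch, Base.next_id(), f, a)); deserialize(io)  # request
        serialize(io, a); deserialize(io)                                    # echoed reply
    end
end

println(request_ser_time(rand(10^6)))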

@amitmurthy
Contributor

@spawn and typical usages of remotecall* all serialize anonymous functions, which seems to be the culprit:

function ser_timings(x)
    io = PipeBuffer()
    serialize(io, x)          # warm up: compile serialize/deserialize for this type
    deserialize(io)

    # time 10^5 serialize/deserialize round-trips through an in-memory buffer
    @elapsed for n in 1:10^5
        serialize(io, x)
        deserialize(io)
    end
end

function all_timings()
    ser_timings(1)            # warm up ser_timings itself
    anon_func = x->x

    for t in (1, 1.0, "Hello", 'c', :a_symbol, x->x, anon_func, myid, (1,1), ones(1), ones(10), ones(1000), fill(1, 1), fill(1, 10), fill(1, 1000))
        println(isa(t, Array) ? string(typeof(t), ":", length(t)) : typeof(t), "   : ", ser_timings(t))
    end
end

all_timings()

prints

Int64   : 0.01174819
Float64   : 0.05571949
ASCIIString   : 0.112928149
Char   : 0.037121222
Symbol   : 0.057779137
Function   : 3.080522616
Function   : 3.094634032
Function   : 0.100687586
(Int64,Int64)   : 0.059134811
Array{Float64,1}:1   : 0.13216678
Array{Float64,1}:10   : 0.123837175
Array{Float64,1}:1000   : 0.545716207
Array{Int64,1}:1   : 0.110177278
Array{Int64,1}:10   : 0.111104982
Array{Int64,1}:1000   : 0.476917077

@ViralBShah
Member

Wow, anonymous functions have a 30x higher overhead for serialization/deserialization. Seems that both serialization and deserialization are equally to blame.

@ViralBShah
Member

I guess this makes sense. In the case of a regular function, the serialization just sends the symbol name, whereas in the case of an anonymous function, it sends over the entire AST, which even for x->x is 20-30x bigger than a symbol name.
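
A quick sketch that makes the size difference visible (ser_size is just an illustrative helper; position(io) on a freshly written IOBuffer gives the number of serialized bytes):

# Sketch: compare serialized sizes of a named (generic) function and an
# anonymous function.
ser_size(x) = (io = IOBuffer(); serialize(io, x); position(io))

println("myid : ", ser_size(myid))    # generic function: essentially just its name
println("x->x : ", ser_size(x->x))    # anonymous function: carries its whole AST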

@amitmurthy
Contributor

Not really, since serializing and deserializing even an array of 1000 floats takes considerably less time than the anonymous function.

@ViralBShah
Member

The time is all going into serializing LambdaStaticData, which spends all its time in uncompressed_ast(). serialize_array_data has much less work to do in comparison.

@ViralBShah
Member

I guess we could cache the uncompressed ASTs in serialization. Perhaps the benchmark example is only good for benchmarking, and for real usage one can, for now, avoid using anonymous functions.

@amitmurthy
Contributor

Caching anonymous functions does not make sense. For small arrays, even if we use the pattern of calling only defined functions, via @everywhere foo(x)=x; remotecall_fetch(p, foo, a), the overhead of serializing defined functions will still be quite large compared to an MPI model, where the remote code that works on the sent array is necessarily part of the program flow.

julia> io=PipeBuffer()

julia> @elapsed for n in 1:10^5; serialize(io, x->x); deserialize(io); end
3.560338764

julia> echo(x)=x
echo (generic function with 1 method)

julia> @elapsed for n in 1:10^5; serialize(io, echo); deserialize(io); end
0.197402999

julia> @elapsed for n in 1:10^5; serialize(io, myid); deserialize(io); end
0.118048503

@ggggggggg
Contributor

Is it possible that the type ambiguity in the specification of Worker is part of the speed issue? r_stream and w_stream are both of type AsyncStream, which is abstract. Is there a reason it isn't something more like

type Worker{T<:AsyncStream}
    id::Int
    r_stream::T
    w_stream::T
    ...
end

@vtjnash
Sponsor Member

vtjnash commented Jan 8, 2019

Closing issue as stale. Please re-open new issues as appropriate.

@vtjnash closed this as completed on Jan 8, 2019