Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating RemoteRefs to Futures and RemoteChannels #13923

Merged
merged 1 commit into from
Nov 18, 2015
Merged

Conversation

amitmurthy
Copy link
Contributor

This PR is ready for a review:

The intention of this PR is to optimize the messaging overhead w.r.t. the current RemoteRef model.

A Future is a write-once remote reference and all remote calls now return Futures. The existing RemoteRef has been renamed to RemoteChannel, since in effect, a RemoteChannel points to an implementation of a AbstractChannel on a remote node.

Both Future and RemoteChannel are implementations of AbstractRemoteRef.

The below section is the same as the updated manual in this PR:

Remote References and Distributed Garbage Collection

Objects referred to by remote references can be freed only when all held references in the cluster
are deleted.

The node where the value is stored keeps tracks of which of the workers have a reference to it.
Every time a RemoteChannel or a (unfetched) Future is serialized to a worker, the node pointed
to by the reference is notified. And every time a RemoteChannel or a (unfetched) Future
is garbage collected locally, the node owning the value is again notified.

The notifications are done via sending of "tracking" messages - an "add reference" message when
a reference is serialized to a different process and a "delete reference" message when a reference
is locally garbage collected.

Since Future\ s are write-once and cached locally, the act of fetching a Future also updates
reference tracking information on the node owning the value.

The node which owns the value, frees it once all references to it are cleared.

With Future\ s, serializing an already fetched Future to a different node also sends the value
since the original remote store may have collected the value by this time.

It is important to note that when an object is locally garbage collected is dependent
upon the size of the object and the current memory pressure in the system.

In case of remote references, the size of the local reference object is quite small, while the remote value
stored on the remote node may be quite large. Since the local object may not be collected immediately, it is
a good practice to explicitly call finalize on local instances of a RemoteChannel, or on unfetched
Future\ s. Since calling fetch on a Future also removes its reference from the remote store, this
is not required on fetched Future\ s. Explicitly calling finalize immediately sends a message to
the remote node to go ahead and remove its reference to the value.

@amitmurthy amitmurthy changed the title RFC/WIP : Futures Updating RemoteRefs to Futures and RemoteChannels Nov 11, 2015
@amitmurthy
Copy link
Contributor Author

cc: @JeffBezanson , @jakebolewski , @malmaud , @carnaval

@tkelman tkelman added the parallelism Parallel or distributed computation label Nov 11, 2015
@malmaud
Copy link
Contributor

malmaud commented Nov 11, 2015

This looks really promising! Do you have any benchmarking numbers to share?

@StefanKarpinski
Copy link
Sponsor Member

Shouldn't really change performance – the main advantage is that it can allow memory to be freed sooner in certain circumstances, without needing to wait for a global garbage collection.

@malmaud
Copy link
Contributor

malmaud commented Nov 11, 2015

It might though by changing the timing and distribution of messages which are sent as a result of a @spawn

@jakebolewski
Copy link
Member

It would be good to check that the performance doesn't decrease though.

@amitmurthy
Copy link
Contributor Author

For

addprocs(1)

@everywhere echo(x) = x

function foo(n)
    for i in 1:10^n
        f=remotecall(echo, 2, 1)
        fetch(f)
    end
end

foo(1)
@time foo(5)

On current master:

julia> @time foo(5)
 14.342809 seconds (19.87 M allocations: 1.115 GB, 1.85% gc time)

julia> @time foo(5)
 14.446782 seconds (19.88 M allocations: 1.115 GB, 2.09% gc time)

With this PR:

julia> @time foo(5)
 10.433722 seconds (18.25 M allocations: 1.006 GB, 1.66% gc time)

julia> @time foo(5)
  9.475284 seconds (18.25 M allocations: 1.006 GB, 1.90% gc time)

This is probably due to the absence of "del reference " messages in this test case.

@amitmurthy
Copy link
Contributor Author

The bulk of the above time is in socket communication. Running https://gist.github.com/amitmurthy/4b01abb0ced926252b9c

julia> mean([(tic();foo(5);toq()) for i in 1:5])
9.931510753

@amitmurthy
Copy link
Contributor Author

I'll merge this in 4-5 days. This will be a breaking change. Hope to get some detailed reviews by then.

@tkelman tkelman added the breaking This change will break code label Nov 13, 2015
@amitmurthy
Copy link
Contributor Author

Will be merging this tomorrow if there are no objections raised.

@StefanKarpinski
Copy link
Sponsor Member

+1

@@ -355,18 +331,6 @@ General Parallel Computing Support
* ``put!`` on a closed channel.
* ``take!`` and ``fetch`` on an empty, closed channel.

.. function:: RemoteRef()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change these to the signatures for RemoteChannel and rerun genstdlib

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the changes in base/docs/helpdb.jl (see changes above), built julia and then ran ./julia doc/genstdlib.jl. I don't see the changes reflected in doc/stdlib/parallel.rst. Am I missing a step?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to change the old signature (whatever after .. function::) to the new signature (which should be the first line in your doc) so that the doc system can find it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@amitmurthy amitmurthy force-pushed the amitm/futures branch 2 times, most recently from 005b6e1 to 03a56c0 Compare November 18, 2015 04:23
@amitmurthy
Copy link
Contributor Author

AppVeyor is failing very early with

.......
/c/projects/julia/contrib/windows/msys_build.sh: line 36: ./jq: Bad file number
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0 43 37293   43 16123    0     0  41235      0 --:--:-- --:--:-- --:--:-- 41235 
curl: (23) Failed writing body (0 != 16123)
Command exited with code 1

Lines 35-36 in msys_build.sh are

  query=".builds | map(select(.pullRequestId == \"$APPVEYOR_PULL_REQUEST_NUMBER\"))[0].buildNumber"
  latestbuild="$(curl $av_api_url | ./jq "$query")"

Restarting build on AV didn't help. How do I reset AppVeyor?

@amitmurthy
Copy link
Contributor Author

Pushing a new commit has not helped either...

amitmurthy added a commit that referenced this pull request Nov 18, 2015
Updating RemoteRefs to Futures and RemoteChannels
@amitmurthy amitmurthy merged commit 708f6ab into master Nov 18, 2015
@amitmurthy amitmurthy deleted the amitm/futures branch November 18, 2015 10:53
@tkelman
Copy link
Contributor

tkelman commented Nov 18, 2015

AppVeyor apparently reset the build cache. This meant it had to redownload the toolchain and jq.exe. The toolchain download still worked, but the jq.exe was a bad link and it was saving a 404 page instead of an executable. I fixed it by using powershell instead of jq to parse json api responses.

@amitmurthy
Copy link
Contributor Author

Thanks. For fixing AV and the doc review.

@josefsachsconning
Copy link
Contributor

Will there not be a deprecation warning for the disappearance of RemoteRef?
How about inclusion in Compat.jl?
(I meant to comment here instead of #12174)

@amitmurthy
Copy link
Contributor Author

I'll be unable to make a PR for the next week or so. Will do so once I am back.

A Future is a write-once ref, which is slightly different from the old RemoteRef. The older RemoteRef can be replaced with a RemoteChannel. But since all the remote calls now return Futures, any code that would declare a RemoteRef[] to collect return values from remote calls will still need to be changed.

@yuyichao
Copy link
Contributor

This probably needs to update the netload/nettest test as well. This also breaks the JSON test.

@dehann
Copy link
Contributor

dehann commented Nov 27, 2016

I've been using RemoteRef as a type pretty heavily in 0.3 and 0.4, but now a large portion of my parallel code is broken since RemoteRef no longer is a type. For example this one line breaks:

RR = Dict{Int, RemoteRef}()

I have been using the following to distribute computation over a graphical tree model just fine in 0.4, IncrementalInference.jl

function helloworld()
  "Hello world"
end

function atest!(rr::Dict{Int,RemoteRef})
  rr[1] = remotecall(1, helloworld)
  nothing
end

RR = Dict{Int, RemoteRef}()
@sync begin
  @async atest!(RR)
end

@show fetch(RR[1])

I was hoping to either maintain the ability to distribute computation while using type inference, or switch to an improved dispatch mechanism without too much rework. Maybe type inference / dispatch should be able to handle the new Base.#RemoteRef?
As an aside, I was forced to use RemoteRef as a type since previous versions of Julia failed at type inference during a @async remotecall. I started down the path of trying to use pmap with composite types, but ended up with the example above. Any suggestions on how to convert this code for staying with Julia 0.5?

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.5.0 (2016-09-19 18:14 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> typeof(RemoteRef)
Base.#RemoteRef
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.7 (2016-09-18 16:17 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> typeof(RemoteRef)
DataType

The error trace starts off with:

WARNING: An error occurred during inference. Type inference is now partially disabled.
TypeError(func=:subtype, context="", expected=Type, got=Base.#RemoteRef())
rec_backtrace at /home/centos/buildbot/slave/package_tarball64/build/src/stackwalk.c:84
record_backtrace at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:232
jl_throw at /home/centos/buildbot/slave/package_tarball64/build/src/task.c:550
jl_type_error_rt at /home/centos/buildbot/slave/package_tarball64/build/src/builtins.c:119
jl_type_error at /home/centos/buildbot/slave/package_tarball64/build/src/builtins.c:125
jl_f_issubtype at /home/centos/buildbot/slave/package_tarball64/build/src/builtins.c:386
typeseq at ./reflection.jl:120 [inlined]
record_slot_type! at ./inference.jl:2021
...

Details:

julia> versioninfo()
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
  System: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-3920XM CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, ivybridge)

@amitmurthy
Copy link
Contributor Author

The remotecalls now return a Future. Changing RR = Dict{Int, RemoteRef}() to RR = Dict{Int, Future}() should get your code working under 0.5 . Futures behave differently from the previous RemoteRefs in the sense that values once fetched are cached locally.

@dehann
Copy link
Contributor

dehann commented Nov 29, 2016

Thanks @amitmurthy, I't doesn't seem to be only a direct swap of RemoteRef to Future. I'm still getting some strange behaviour which looks to be related to first run compile delay... I'll post here once I have figured out what is going on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking This change will break code parallelism Parallel or distributed computation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants