
Atomic operations on array elements #32455

Open
chengchingwen opened this issue Jun 30, 2019 · 23 comments · Fixed by #37847
Labels
multithreading Base.Threads and related functionality

Comments

@chengchingwen
Member

The atomic methods for Atomic call pointer_from_objref and then invoke the operation with llvmcall. If we exposed the pointer API, we could do atomic operations on arrays more easily. Is there any reason to dispatch the atomic operations only on the Atomic type?

@JeffBezanson
Sponsor Member

Ah, I see, this differs from #29943. That is asking for atomic operations on pointers, while this is about doing atomic operations via pointers to other structures like arrays. I will try to clarify the title.

@JeffBezanson JeffBezanson changed the title Expose the atomic methods for pointer? atomic operations on array elements and object fields Dec 16, 2019
@oschulz
Contributor

oschulz commented Jun 13, 2020

This recently came up here: https://discourse.julialang.org/t/parallelizing-a-histogram/41260/9

Would be great to have atomic_add!(A::Array, x, idxs), etc.

@tkf
Member

tkf commented Jun 15, 2020

Quoting my comment in the discourse thread:

OK, so I uploaded the code I used for trying out atomic increments on array elements: https://github.com/tkf/ParallelIncrements.jl

(Comment added: I just realized that the name of the package is completely misleading since it was about single thread performance. Initially, I was thinking about trying it with multiple threads.)

This package contains a function ref = atomicref(::Array, indices...) which can be used for atomic operations like Threads.atomic_add!(ref, x).

As you can see in the README, the overhead is 10x on my laptop. It can go up to 100x and possibly more depending on the kind of optimization the compiler can do. It might be possible to get a lower overhead using other memory orderings, but I haven't tried it yet.

I'm not a big fan of this direction, but I'm interested in whatever is beneficial for HPC in Julia. So I'm sharing it in case some people want to try it out (and maybe find some performance bugs I made in the sample code or something).
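
As a concrete usage sketch (hypothetical code, assuming atomicref is accessible as in the README; the accumulator array and index computation are made up for illustration):

using ParallelIncrements: atomicref
using Base.Threads

acc = zeros(Int, 1000)              # made-up shared accumulator

Threads.@threads for n in 1:10^6
    i = mod1(n, length(acc))        # some per-iteration index
    ref = atomicref(acc, i)         # atomic "view" of acc[i]
    Threads.atomic_add!(ref, 1)     # atomically increment that element
end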

@oschulz
Contributor

oschulz commented Jun 15, 2020

Came across this - that's on a GPU, though: https://devblogs.nvidia.com/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/

@oschulz
Contributor

oschulz commented Jun 15, 2020

A slowdown of 10x may actually be quite acceptable in some cases, compared to the alternative.

Obviously, when you deal with things like small histograms (histograms are just a nice example use case for this), it's most efficient for every thread to histogram separately in 1st-level cache and then merge afterwards. But if the histogram is large (multi-dimensional, many bins), atomics with a 10x overhead (compared to non-atomic writes) may still be faster than maintaining one histogram per thread, especially if the number of threads is large. Just a hunch, of course. I guess it'll come down to memory bandwidths, latencies and caching behavior in the individual use case ... hard to predict.

@tfk, do you think the atomicref comes with some additional overhead, compared to having an atomic_add that directly operates on an array and index, or will the compiler optimize that away?

@tkf
Member

tkf commented Jun 15, 2020

I agree there might be some situations that this can be beneficial. I uploaded the code because I thought it'd be great if people can explore the use of it.

But, let me also note that my benchmark is about single-thread performance. If you are using multiple threads, atomic can hurt even more because it has to invalidate the cache of all CPUs. So, I'd guess 10x is more like a lower bound of the overhead.

atomicref comes with some additional overhead, compared to having an atomic_add that directly operates on array and index

I didn't see anything suspicious about it when I quickly looked at the LLVM IR.

(BTW, you might want to use @tkf to ping me, to avoid pinging another user.)

@oschulz
Contributor

oschulz commented Jun 15, 2020

If you are using multiple threads, atomic can hurt even more because it has to invalidate the cache of all CPUs.

Yes, I've been thinking about that - however, if the access pattern is scattered access to a fairly large amount of memory, the threads would, for the most part, not share cache lines, right?

@tkf
Member

tkf commented Jun 15, 2020

Ah, yes, you might be right.

@vtjnash
Sponsor Member

vtjnash commented Jun 15, 2020

For a histogram-type operation, I'd expect a (histogram-shaped distribution) of cache conflicts, resulting in a large slowdown for each additional thread you add—starting at 10x slower just for going from single-threaded to 1 thread-but-threadsafe-with atomics. Unlike @tkf, I have no numbers though, and I'm just somewhat parroting his comment.

This is actually relevant, because I'm currently thinking about not (yet) having atomics for individual array elements (just too likely to be slow and/or incorrect), so that you should perhaps instead just use a lock over the whole array. https://docs.google.com/document/d/e/2PACX-1vT7Ibthj9WyM8s5bcQbiKsVK6MtvzqmnPFMy-bcjZLlbqv55a0_sTJ99AkbvPIZk3t7MbhZ57NzaIzC/pub
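
For reference, the lock-over-the-whole-array pattern would look roughly like this minimal sketch (the array size, binning, and loop body are placeholders for illustration only):

using Base.Threads

hist = zeros(Int, 201, 201, 201)    # placeholder output array
hist_lock = Threads.SpinLock()      # one lock guarding the whole array

Threads.@threads for n in 1:10^6
    i, j, k = rand(1:201), rand(1:201), rand(1:201)   # placeholder bin indices
    lock(hist_lock)
    try
        hist[i, j, k] += 1          # plain (non-atomic) update under the lock
    finally
        unlock(hist_lock)
    end
end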

@vtjnash
Sponsor Member

vtjnash commented Jun 15, 2020

(Note: the idealized cost of a lock is on the order of a couple of atomic operations, since that's about all it takes to implement one; it's important to realize that the cost of a lock and an atomic are therefore roughly the same.)

@oschulz
Contributor

oschulz commented Jun 15, 2020

Well, you kinda dash my hopes for efficient scattered atomic adds (maybe I can try to do some parallel benchmarking to get us numbers, though). :-)

But I'm very excited about the ideas in your manifest!

@tkf
Member

tkf commented Jun 15, 2020

I wonder what the best strategy is for implementing a histogram-like function with a very large output array using threads. Maybe (1) create batches of local index lists (for the entries to be incremented) in each thread, grouped by sub-regions of the output array, and then (2) send them to "writer" tasks that write them to their sub-region of the output array? This way, I guess we can distribute the contention a bit by dividing it by the number of sub-regions (the contention in this strategy would come from the queue used to communicate with each writer task). A sketch follows below.
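
A rough, hypothetical sketch of that two-phase strategy (the chunk size, channel capacity, and region partitioning below are all made up):

using Base.Threads

function histogram_via_writers(indices::Vector{Int}, nbins::Int; nregions::Int = nthreads())
    out = zeros(Int, nbins)
    region_len = cld(nbins, nregions)
    region_of(i) = (i - 1) ÷ region_len + 1

    # One channel plus one "writer" task per sub-region of the output array.
    channels = [Channel{Vector{Int}}(32) for _ in 1:nregions]
    writers = map(1:nregions) do r
        @spawn for batch in channels[r]
            for i in batch
                out[i] += 1                  # only this task writes to region r
            end
        end
    end

    # Producers: build local index lists grouped by region, then hand them off.
    @sync for chunk in Iterators.partition(indices, 10_000)
        @spawn begin
            local_batches = [Int[] for _ in 1:nregions]
            for i in chunk
                push!(local_batches[region_of(i)], i)
            end
            for r in 1:nregions
                isempty(local_batches[r]) || put!(channels[r], local_batches[r])
            end
        end
    end

    foreach(close, channels)                 # let the writer loops terminate
    foreach(wait, writers)
    return out
end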

@oschulz
Contributor

oschulz commented Jun 15, 2020

EDIT: I messed something up here, see below for corrections. Using a global lock is not fast, for this use case.

@vtjnash, you were right, of course: Using a global lock is indeed fastest. Here's a benchmark I ran
(using Julia v1.5.0-beta1.0):

https://gist.github.com/oschulz/dce9d2f2104deb2ff42edaa814bb3790

I'm filling a histogram with 201 x 201 x 201 bins with 10^6 (randn(rng), randn(rng), randn(rng)) variates, using a separate Random123.Philox4x RNG for each thread.

I defined

Base.@propagate_inbounds function atomic_addindex!(A::Array{Int64}, v::U, I...) where {U}
    T = Int64
    i = LinearIndices(size(A))[I...]
    v_conv = convert(T, v)
    ptr = pointer(A, i)
    Base.Threads.llvmcall("%ptr = inttoptr i64 %0 to i64*\n%rv = atomicrmw add i64* %ptr, i64 %1 acq_rel\nret i64 %rv\n", T, Tuple{Ptr{T},T}, ptr, v_conv)::T
    A
end

I hope that's correct.
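
(For context, a hypothetical threaded fill loop using this helper would be along these lines; the binning below is a stand-in for illustration, not the actual code from the Gist, which uses per-thread Random123.Philox4x RNGs:)

using Base.Threads

hist = zeros(Int64, 201, 201, 201)
edges = range(-6, 6; length = 202)            # stand-in bin edges

Threads.@threads for n in 1:10^6
    x, y, z = randn(), randn(), randn()
    i = clamp(searchsortedlast(edges, x), 1, 201)
    j = clamp(searchsortedlast(edges, y), 1, 201)
    k = clamp(searchsortedlast(edges, z), 1, 201)
    atomic_addindex!(hist, 1, i, j, k)        # atomic increment of hist[i, j, k]
end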

With JULIA_NUM_THREADS=64 (Epyc-2 CPU, 64 physical cores, single socket), I get:

  • No multi-threading: 205.771 ms
  • Using atomic_addindex!: 519.649 ms
  • Using a global SpinLock(): 4.784 ms

So with the global lock, I get a 43x speedup on 64 threads - I wouldn't have expected it to scale that well.

The random number generation itself is about 20 ms using one thread and 630 μs using all 64 threads, so it's only a relatively small fraction of the total time.

When I run with JULIA_NUM_THREADS=1:

  • No multi-threading: 203.243 ms
  • Using atomic_addindex!: 224.736 ms
  • Using a global SpinLock(): 195.301 ms

So if there's a bit of work to do (computing the random numbers and hist-bin index), the overhead of atomic_addindex! has little impact, single threaded. But with many threads, even on a large array, atomic access seems to cause way too much cache invalidation/contention.

Yay for global locks! :-)

EDIT: I messed something up here, see below for corrections. Using a global lock is not fast, for this use case.

@biochem-fan

@oschulz I looked at your code on Gist.

Your fill_randn_mt_atomic! actually uses a global SpinLock, while fill_randn_mt_lock! uses atomic_push!. Is this intentional?

Where is your atomic_addindex! used? Shouldn't your atomic_push! call it?

@biochem-fan

@oschulz I confirmed that without using atomic_addindex!, the result is not correct.

I added sanity checks https://gist.github.com/biochem-fan/19072418e90dd783227a347efe892860/revisions and repaired function names.

The results with 24 threads on Xeon E5-2643 v2 are:

  • No multi-threading: 255.538 ms
  • Atomic add: 38.816 ms
  • Global spin lock: 578.194 ms

So, atomic add is the fastest.

@oschulz
Contributor

oschulz commented Apr 8, 2021

Hi @biochem-fan, thanks for catching this. No idea what I did back there; it looks like I definitely mixed the function names up, and atomic_addindex! isn't even used in the old version of my Gist. Maybe I accidentally uploaded a debugging state of the code or so - must have been late ... sorry!

In any case, I just updated my Gist (https://gist.github.com/oschulz/dce9d2f2104deb2ff42edaa814bb3790) to use atomic_addindex! (similar to your change) and to use locking only during the actual memory manipulation (for locked_push!).

Here's some benchmarks on a 128-core machine (2x AMD EPYC 7662 64-core). The locking approach clearly doesn't scale, while atomic_addindex! does scale, though not linearly.

(We also have to take reduced CPU clock boost under heavy multi-threading into account: the CPU ran at 3.3 GHz during fill_randn_mt_atomic! with 1 thread, but only at 2.4 GHz on average with 64 threads.)

Using 1 thread (mean time):

  • gen_randn_st: 12.353 ms
  • gen_randn_mt: 12.271 ms
  • fill_randn_st!: 122.237 ms
  • fill_randn_mt_atomic!: 124.936 ms
  • fill_randn_mt_lock!: 147.139 ms

Using 2 threads (mean time):

  • gen_randn_st: 12.383 ms
  • gen_randn_mt: 6.404 ms
  • fill_randn_st!: 123.549 ms
  • fill_randn_mt_atomic!: 67.433 ms
  • fill_randn_mt_lock!: 145.086 ms

Using 4 threads (mean time):

  • gen_randn_st: 12.395 ms
  • gen_randn_mt: 4.330 ms
  • fill_randn_st!: 122.585 ms
  • fill_randn_mt_atomic!: 35.574 ms
  • fill_randn_mt_lock!: 161.797 ms

Using 8 threads (mean time):

  • gen_randn_st: 12.326 ms
  • gen_randn_mt: 3.102 ms
  • fill_randn_st!: 122.782 ms
  • fill_randn_mt_atomic!: 23.418 ms
  • fill_randn_mt_lock!: 300.381 ms

Using 16 threads (mean time):

  • gen_randn_st: 12.154 ms
  • gen_randn_mt: 1.648 ms
  • fill_randn_st!: 123.109 ms
  • fill_randn_mt_atomic!: 15.220 ms
  • fill_randn_mt_lock!: 424.703 ms

Using 32 threads (mean time):

  • gen_randn_st: 12.397 ms
  • gen_randn_mt: 911.524 μs
  • fill_randn_st!: 122.468 ms
  • fill_randn_mt_atomic!: 8.906 ms
  • fill_randn_mt_lock!: 455.146 ms

Using 64 threads (mean time):

  • gen_randn_st: 12.153 ms
  • gen_randn_mt: 647.767 μs
  • fill_randn_st!: 121.362 ms
  • fill_randn_mt_atomic!: 5.512 ms
  • fill_randn_mt_lock!: 503.848 ms

Using 128 threads (mean time):

  • gen_randn_st: 12.120 ms
  • gen_randn_mt: 639.225 μs
  • fill_randn_st!: 149.159 ms
  • fill_randn_mt_atomic!: 5.153 ms
  • fill_randn_mt_lock!: 844.348 ms

I don't understand the results for 128 threads, though - even though things are spread over two NUMA domains now, it should be faster than with 64 threads, at least the pure RNG generation should scale well. I did use thread pinning, so each thread should have a separate physical core (no hyper-threading). Can Julia somehow only schedule 64 threads even if started with 128?

@biochem-fan

@oschulz Thank you very much for confirmation and more test results.

Indeed your result for 128 threads is interesting. Unfortunately I don't have access to machines with so many cores and I cannot try it.

These new results demonstrate that atomic_addindex! is useful at least in some cases. I am new to the Julia developer community. How should we move forward? PR #37683 seems stale.

Until atomic_addindex! gets incorporated to Julia, may I use your implementation in my projects?

@oschulz
Contributor

oschulz commented Apr 8, 2021

Until atomic_addindex! gets incorporated to Julia, may I use your implementation in my projects?

Of course!

I agree that having something like atomic_addindex! in Julia would be beneficial. I think the Julia dev team has some long-term plans regarding parallel access and memory mutability in general, but I don't know any details. But having something simple like atomic_addindex! IMHO couldn't hurt.

Alternatively, this could go into a little package. I'm not sure how fragile this would be, with the LLVM intrinsics, regarding newer Julia (and therefore newer LLVM versions). The advantage would be that other packages could use it without depending on a newer Julia version.

@oschulz
Contributor

oschulz commented Apr 8, 2021

Maybe a little package (e.g. "AtomicArrays.jl") would be best for now.

@jpsamaroo
Member

Does #37847 actually add atomic array accesses? I don't see any tests for this functionality, although I could just be overlooking them.

@vtjnash
Sponsor Member

vtjnash commented Jun 3, 2021

No, I eventually noticed that there aren't often algorithms that would benefit from it. Though internally we do use structures that benefit from it: the access patterns used by IdDict (and others) internally already benefit from something similar. However, I'm already proposing to make that particular access pattern (concurrent-reader-with-writer, single-writer-with-lock) safe by default, such that it wouldn't require anything different. But since that can now be simulated with an atomic-release fence, it isn't strictly necessary to add either.

@tkf
Member

tkf commented Jun 18, 2021

I don't think this issue is fully addressed. Let me reopen this issue since I believe it's reasonable to keep tracking it (but, of course, I'm open to being convinced otherwise).

I do agree that supporting atomic operations on array elements can be misleading for people who are not familiar with parallel programming. The best approach is to create per-task mutable states and merge them wisely. I'm very serious about this, to the degree that I built a GitHub org, JuliaFolds, and a series of posts (https://juliafolds.github.io/data-parallelism/) around this principle.

Having said that, I believe that atomic operations on arrays are an important ingredient for trickier and interesting parallel and concurrent programs. I think addressing this issue would be important for Julia to keep staying relevant in high-performance technical computing.

there aren't often algorithms that would benefit from it

I don't believe this is the case. There are plenty of practical nonblocking algorithms that require, e.g., CAS on an entry of an array. For example:

These data structures would be relevant in high-performance programming when the access pattern is sparse (= very little contention) and/or the resulting object is relatively large compared to available memory.

An extreme example is the GPU, where there are so many threads that even a moderately sized object (by CPU standards) cannot be allocated on a per-thread basis. This necessitates the use of atomics when computing, e.g., a histogram. Indeed, GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell | NVIDIA Developer Blog does not even have a "non-atomic" example and explains why adding atomics to one more level of the memory hierarchy was beneficial.

Edit: I forgot to mention that when it comes to asynchronous updates on a large data structure, Chaotic Relaxation (Chazan and Miranker, 1969) is also an interesting example. There are also relatively recent examples of similar concepts in stochastic gradient descent (Recht et al., 2011) and parallel graph analytics (Lenharth, Nguyen, and Pingali, 2016).
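
To make the CAS-on-an-array-entry point concrete, here is a hypothetical helper in the same llvmcall style as the atomic_addindex! snippet earlier in this thread (not an existing Base or package API; the per-field @atomic/@atomicreplace macros in Julia 1.7+ cover struct fields but not array elements):

# Hypothetical compare-and-swap on an array element, mirroring atomic_addindex! above.
function atomic_casindex!(A::Array{Int64}, expected::Int64, desired::Int64, I...)
    i = LinearIndices(size(A))[I...]
    ptr = pointer(A, i)
    old = GC.@preserve A Base.Threads.llvmcall("""
        %ptr = inttoptr i64 %0 to i64*
        %pair = cmpxchg i64* %ptr, i64 %1, i64 %2 acq_rel acquire
        %old = extractvalue { i64, i1 } %pair, 0
        ret i64 %old
        """, Int64, Tuple{Ptr{Int64}, Int64, Int64}, ptr, expected, desired)
    return old == expected    # true if the swap succeeded
end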

@tkf tkf reopened this Jun 18, 2021
@tkf
Member

tkf commented Nov 13, 2021

FWIW, just as a stopgap measure until we get this in julia proper, there's https://github.com/JuliaConcurrent/UnsafeAtomics.jl. There's also a wrapper https://github.com/JuliaConcurrent/AtomicArrays.jl but it's not yet registered.

@tkf tkf changed the title atomic operations on array elements and object fields Atomic operations on array elements Nov 13, 2021