Discussion: A more feature-complete histogram, where and how? #704

Moelf · 2021-07-30T10:35:00Z

The current Histogram is great, but it falls a bit short in some areas:

We have weights, but in order to track error properly, we need to keep track of sum of weights .^ 2 (in each bin)
This is more critical when our data doesn't fit in memory and we had to push!(hist, val, weight). This is also not supported right now
For 2 to work nicely, we want to make histograms thread-safe. The easy way to do this is to add a SpinLock, but this can be slow in certain workload. Another way is to have a buffer, which is what ROOT does, but I think that's a bit overkill. But in any case push! should be thread-safe.
Overflow bins support. (for 1D only, have 2 extra bins to track overflow and underflow data points)

My question for StatsBase community is: Do we welcome these changes (of course it will not be breaking)? Or this is too "bloated" that we should make a new type of histogram (I would hate this option because having two sets of normal histogram is gonna be annoying)

cc for perspective: @oschulz

The text was updated successfully, but these errors were encountered:

oschulz · 2021-07-30T11:24:38Z

May be related: #650

oschulz · 2021-07-30T11:46:53Z

in order to track error properly, we need to keep track of sum of weights .^ 2

Hm, maybe I misunderstand, but sum of weights .^ 2 is only a rough estimate of that "error" (more precisely, the uncertainty on the assumption that the weight equals the expected value), and is anyhow only appropriate if the number of events is very high, right?

we want to make histograms thread-safe

Yes, I think that should be non-controversial.

Overflow bins support.

I guess we'd want a specialized histogram type for that, not everyone will want overflow bins, and a lot of current code makes assumptions on what's in weights.

Moelf · 2021-07-30T12:07:34Z

Hm, maybe I misunderstand, but sum of weights .^ 2 is only a rough estimate of that "error"

it's not the error itself, but it will be needed if you want to compute errors. say you have 2 entries, with weight 0.3 and 0.7. If you only track count, you will see 1.0, which is the same than if you have 1 entry with unit weight. But sqrt(0.3^2 + 0.7^2) != 1 will be useful for doing error computation down the road and for combining histograms with correct error propagation.

oschulz · 2021-07-30T12:15:53Z

Ah, I see - yes, but the validity of that calculation is dependent on your statistical model. I have a feeling that would be a bit too "biased" for Histogram itself. But maybe we could have thread-safely in general, and then a specialized histogram type with both overflow bins and sum of weights .^ 2? That should probably leave in a separate package, to decouple it from StatsBase release cycles.

Moelf · 2021-07-30T16:18:47Z

I don't think it's that "biased". Basically the only alternative is to record ALL weights of the bin, which I think isn't happening -- or, if it's that simple, the user can easily do that in-memory.

Btw, I think if we are gonna make a separate histogram pkg anyway for the general physics (maybe just HEP?) community, we can just do it without adding thread safety here.

Because adding thread-safe push!(hist, val, weight) is almost useless if we can't track weight other than knowing the final bin count.

nalimilan · 2021-08-31T08:25:06Z

My question for StatsBase community is: Do we welcome these changes (of course it will not be breaking)? Or this is too "bloated" that we should make a new type of histogram (I would hate this option because having two sets of normal histogram is gonna be annoying)

I'm not super familiar with the histograms code, but this kind of improvements sounds seem worth discussing for a possible inclusion in StatsBase. If you two can agree on features that are generic enough (and/or can be enabled only for those who need them) we can have a look at what kind of API would be needed.

Regarding thread safety, we will have to check that the overhead isn't too large when you work from a single thread.

Moelf · 2021-08-31T11:07:05Z

checkout unsafe_push!() and push!() from FHist

nalimilan · 2021-08-31T11:44:51Z

I see the definition, but that doesn't tell me whether the overhead is high or not. :-) I assume that you added unsafe_push! because it's relatively high? I'm hesitant to adopt this design, as I'm not aware of any precedent of using unsafe_* to indicate a non-thread-safe method (neither in StatsBase, nor in Base, nor in other packages). Why should this method be thread-safe when the rest of the API isn't?

Moelf · 2021-08-31T11:48:19Z

julia> const ary = rand(10^5);

julia> @benchmark for a in ary
           push!(h1, a)
       end setup=(h1=Histogram(0:0.03:1)) evals=1
BenchmarkTools.Trial: 1994 samples with 1 evaluation.
 Range (min … max):  2.474 ms …  2.778 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.504 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.507 ms ± 14.866 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▅▇█▄▅▃          ▁                           
  ▂▁▁▂▁▂▂▁▁▁▂▂▂▃▅█████████▅▆▆▆▅█▆███████▇▇▅▄▄▄▄▂▃▂▂▂▂▁▂▂▂▂▂▂ ▄
  2.47 ms        Histogram: frequency by time        2.54 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.


julia> @benchmark for a in ary
           push!(h1, a)
       end setup=(h1=Hist1D(Int; bins=0:0.03:1)) evals=1
BenchmarkTools.Trial: 4180 samples with 1 evaluation.
 Range (min … max):  1.120 ms …  1.389 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.192 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.194 ms ± 27.280 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▁▁▃▂▂▁▃▆▅▆▆▇██▅▂ ▁  ▁                       
  ▂▂▁▁▁▁▂▂▂▃▅▆▇▆▆███████████████████▇▇██▆▅▆▄▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂ ▄
  1.12 ms        Histogram: frequency by time        1.28 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark for a in ary
           unsafe_push!(h1, a)
       end setup=(h1=Hist1D(Int; bins=0:0.03:1)) evals=1
BenchmarkTools.Trial: 8811 samples with 1 evaluation.
 Range (min … max):  123.033 μs … 590.720 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     565.413 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   565.395 μs ±  20.235 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                             █   
  ▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▆ ▂
  123 μs           Histogram: frequency by time          575 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Moelf · 2021-08-31T11:49:32Z

because it's relatively high?

you assumed wrong, in the sense that it's still faster (at least no slower) than StatsBase's thread unsafe push!. Hope one day the @atomic work in base would help array element increment here.

unsafe_* to indicate a non-thread-safe method.

good point, maybe a re-name before 1.0. But the naming comes from thread-safety

Why should this method be thread-safe when the rest of the API isn't?

you mean this method unsafe_push!() isn't thread-safe and the rest of API is? I guess we could let push!() be thread-unsafe and have a safe_push!()?

nalimilan · 2021-08-31T13:08:17Z

you assumed wrong, in the sense that it's still faster (at least no slower) than StatsBase's thread unsafe push!. Hope one day the @atomic work in base would help array element increment here.

Well that's not what I said. I mean that push! is slower than unsafe_push!. I'm afraid atomics can't help here as that would mean all functions working on histograms would have to work with atomic types, which are generally noticeably slower (otherwise we would always use atomic types).

good point, maybe a re-name before 1.0. But the naming comes from thread-safety

Yes, I get it, but do you have other examples of this terminology? In general it would be interesting to see whether a convention emerges in the ecosystem to distinguish thread-safe from non-thread-safe functions.

you mean this method unsafe_push!() isn't thread-safe and the rest of API is? I guess we could let push!() be thread-unsafe and have a safe_push!()?

No I mean that none of the StatsBase API is thread-safe currently, so why should push! be thread-safe, or why should we provide a thread-safe safe_push! function but not equivalent functions in other areas? Basically any function in the API that mutates its argument could gain a safe_* variant that takes a lock (for example addcounts!, setindex! for weight vectors...). That approach doesn't really scale.

Moelf · 2021-08-31T14:21:08Z

mean all functions working on histograms would have to work with atomic types

I thought @atomic fields has to pair with @atomic at access site too, other wise it's just normal.

have other examples of this terminology

no, do you have other recommendation?

why should we provide a thread-safe safe_push! function but not equivalent functions in other areas

because AFAIK only histogram has the practical demand of "push into from multiple threads because we're reducing over large amount of data and benefits from parallelism".

so why should push! be thread-safe. That approach doesn't really scale.

ok, let's not have push! to be thread-safe.

nalimilan · 2021-08-31T14:44:53Z

I thought @atomic fields has to pair with @atomic at access site too, other wise it's just normal.

No, AFAIK if you don't pair them with @atomic you get an error:

julia> mutable struct A
           @atomic x::Int
       end

julia> x = A(1);

julia> x.x = 1
ERROR: ConcurrencyViolationError("setfield!: atomic field cannot be written non-atomically")
Stacktrace:
 [1] setproperty!(x::A, f::Symbol, v::Int64)
   @ Base ./Base.jl:39
 [2] top-level scope
   @ REPL[10]:1

(AFAIK the storage is different.)

no, do you have other recommendation?

Unfortunately not. :-/

because AFAIK only histogram has the practical demand of "push into from multiple threads because we're reducing over large amount of data and benefits from parallelism".

I'm not sure that's the only case. For example, addcounts! may also quite similarly be called from multiple threads that do independent computations. Also in Base push! on Dict is very similar and it's not thread-safe because that hurts performance.

ok, let's not have push! to be thread-safe.

OK. Are you still interested in other features though? Maybe worth filing separate issues to discuss them?

bkamins · 2021-08-31T15:32:58Z

My thinking on this issue is:

following Julia Base contract for push! it should be made as fast possible (i.e. if thread safety would reduce speed we do not have to make it thread safe)
unsafe prefix in Julia Base means usually that the operation does not perform bounds checking or pointer validity checking; Therefore I would restrict the use of unsafe for such cases;
regarding threaded/distributed computation of histogram, I find a map-reduce pattern natural to use (i.e. compute several histograms in parallel, and then merge them as I assume merging would have a negligible cost in comparison to initial histogram construction that would be parallelized)

Moelf closed this as completed Aug 31, 2021

Moelf mentioned this issue Aug 31, 2021

Naming of push/unsafe_push Moelf/FHist.jl#48

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discussion: A more feature-complete histogram, where and how? #704

Discussion: A more feature-complete histogram, where and how? #704

Moelf commented Jul 30, 2021 •

edited

Loading

oschulz commented Jul 30, 2021

oschulz commented Jul 30, 2021

Moelf commented Jul 30, 2021 •

edited

Loading

oschulz commented Jul 30, 2021

Moelf commented Jul 30, 2021 •

edited

Loading

nalimilan commented Aug 31, 2021

Moelf commented Aug 31, 2021

nalimilan commented Aug 31, 2021

Moelf commented Aug 31, 2021

Moelf commented Aug 31, 2021 •

edited

Loading

nalimilan commented Aug 31, 2021

Moelf commented Aug 31, 2021 •

edited

Loading

nalimilan commented Aug 31, 2021

bkamins commented Aug 31, 2021

Discussion: A more feature-complete histogram, where and how? #704

Discussion: A more feature-complete histogram, where and how? #704

Comments

Moelf commented Jul 30, 2021 • edited Loading

oschulz commented Jul 30, 2021

oschulz commented Jul 30, 2021

Moelf commented Jul 30, 2021 • edited Loading

oschulz commented Jul 30, 2021

Moelf commented Jul 30, 2021 • edited Loading

nalimilan commented Aug 31, 2021

Moelf commented Aug 31, 2021

nalimilan commented Aug 31, 2021

Moelf commented Aug 31, 2021

Moelf commented Aug 31, 2021 • edited Loading

nalimilan commented Aug 31, 2021

Moelf commented Aug 31, 2021 • edited Loading

nalimilan commented Aug 31, 2021

bkamins commented Aug 31, 2021

Moelf commented Jul 30, 2021 •

edited

Loading

Moelf commented Jul 30, 2021 •

edited

Loading

Moelf commented Jul 30, 2021 •

edited

Loading

Moelf commented Aug 31, 2021 •

edited

Loading

Moelf commented Aug 31, 2021 •

edited

Loading