Discussion: A more feature-complete histogram, where and how? #704
Comments
May be related: #650 |
Hm, maybe I misunderstand, but sum of weights .^ 2 is only a rough estimate of that "error" (more precisely, the uncertainty on the assumption that the weight equals the expected value), and is anyhow only appropriate if the number of events is very high, right?
Yes, I think that should be non-controversial.
I guess we'd want a specialized histogram type for that, not everyone will want overflow bins, and a lot of current code makes assumptions on what's in weights. |
it's not the error itself, but it will be needed if you want to compute errors. say you have 2 entries, with weight 0.3 and 0.7. If you only track count, you will see |
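To make the 0.3 / 0.7 example concrete, here is a minimal sketch in plain Julia (no StatsBase or FHist calls). It assumes the common convention of estimating a bin's uncertainty as the square root of the sum of squared weights; as noted in this thread, whether that estimate is appropriate depends on the statistical model.

```julia
# Two entries fall into the same bin, with weights 0.3 and 0.7.
ws = [0.3, 0.7]

sumw  = sum(ws)        # bin content: the sum of weights
sumw2 = sum(ws .^ 2)   # the extra quantity to track: sum of squared weights
err   = sqrt(sumw2)    # assumed error estimate for the weighted bin

# One unweighted entry (weight 1.0) gives the same bin content,
# but a different sum of squared weights and hence a different error.
sumw_single  = 1.0
sumw2_single = 1.0^2
err_single   = sqrt(sumw2_single)

println("weighted:   content = ", sumw, ", error = ", err)
println("unweighted: content = ", sumw_single, ", error = ", err_single)
```

The bin content alone cannot distinguish the two cases, which is why a per-bin running sum of `weights .^ 2` has to be stored alongside it.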
Ah, I see - yes, but the validity of that calculation is dependent on your statistical model. I have a feeling that would be a bit too "biased" for |
I don't think it's that "biased". Basically the only alternative is to record ALL weights of the bin, which I think isn't happening -- or, if it's that simple, the user can easily do that in-memory. Btw, I think if we are gonna make a separate histogram pkg anyway for the general physics (maybe just HEP?) community, we can just do it without adding thread safety here. Because adding thread-safe |
I'm not super familiar with the histograms code, but these kinds of improvements seem worth discussing for possible inclusion in StatsBase. If you two can agree on features that are generic enough (and/or can be enabled only for those who need them) we can have a look at what kind of API would be needed. Regarding thread safety, we will have to check that the overhead isn't too large when you work from a single thread. |
check out unsafe_push!() and push!() from FHist |
I see the definition, but that doesn't tell me whether the overhead is high or not. :-) I assume that you added |
julia> const ary = rand(10^5);
julia> @benchmark for a in ary
           push!(h1, a)
end setup=(h1=Histogram(0:0.03:1)) evals=1
BenchmarkTools.Trial: 1994 samples with 1 evaluation.
Range (min … max): 2.474 ms … 2.778 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.504 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.507 ms ± 14.866 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅▇█▄▅▃ ▁
▂▁▁▂▁▂▂▁▁▁▂▂▂▃▅█████████▅▆▆▆▅█▆███████▇▇▅▄▄▄▄▂▃▂▂▂▂▁▂▂▂▂▂▂ ▄
2.47 ms Histogram: frequency by time 2.54 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark for a in ary
           push!(h1, a)
end setup=(h1=Hist1D(Int; bins=0:0.03:1)) evals=1
BenchmarkTools.Trial: 4180 samples with 1 evaluation.
Range (min … max): 1.120 ms … 1.389 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.192 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.194 ms ± 27.280 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁▃▂▂▁▃▆▅▆▆▇██▅▂ ▁ ▁
▂▂▁▁▁▁▂▂▂▃▅▆▇▆▆███████████████████▇▇██▆▅▆▄▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂ ▄
1.12 ms Histogram: frequency by time 1.28 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark for a in ary
           unsafe_push!(h1, a)
end setup=(h1=Hist1D(Int; bins=0:0.03:1)) evals=1
BenchmarkTools.Trial: 8811 samples with 1 evaluation.
Range (min … max): 123.033 μs … 590.720 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 565.413 μs ┊ GC (median): 0.00%
Time (mean ± σ): 565.395 μs ± 20.235 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█
▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▆ ▂
123 μs Histogram: frequency by time 575 μs <
Memory estimate: 0 bytes, allocs estimate: 0. |
you assumed wrong, in the sense that it's still faster (at least no slower) than StatsBase's thread unsafe
good point, maybe a re-name before 1.0. But the naming comes from thread-safety
you mean this method |
Well that's not what I said. I mean that
Yes, I get it, but do you have other examples of this terminology? In general it would be interesting to see whether a convention emerges in the ecosystem to distinguish thread-safe from non-thread-safe functions.
No I mean that none of the StatsBase API is thread-safe currently, so why should |
I thought
no, do you have another recommendation?
because AFAIK only histogram has the practical demand of "push into from multiple threads because we're reducing over a large amount of data and benefit from parallelism".
ok, let's not have |
No, AFAIK if you don't pair them with
julia> mutable struct A
           @atomic x::Int
       end
julia> x = A(1);
julia> x.x = 1
ERROR: ConcurrencyViolationError("setfield!: atomic field cannot be written non-atomically")
Stacktrace:
 [1] setproperty!(x::A, f::Symbol, v::Int64)
   @ Base ./Base.jl:39
 [2] top-level scope
   @ REPL[10]:1
(AFAIK the storage is different.)
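For contrast with the error in the transcript, here is a short sketch of the access forms that an `@atomic` field does accept. It assumes Julia 1.7 or later (where per-field atomics were introduced) and reuses the same toy struct; it says nothing about how a histogram type should use them.

```julia
# Same toy struct as in the transcript; requires Julia 1.7+.
mutable struct A
    @atomic x::Int
end

a = A(1)

@atomic a.x = 2     # atomic store (a plain `a.x = 2` throws ConcurrencyViolationError)
@atomic a.x += 1    # atomic read-modify-write
v = @atomic a.x     # atomic load

println(v)
```

Every access site has to be written with `@atomic`, so declaring a field atomic is not a drop-in change for existing code that assigns to the field directly.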
Unfortunately not. :-/
I'm not sure that's the only case. For example,
OK. Are you still interested in other features though? Maybe worth filing separate issues to discuss them? |
My thinking on this issue is:
|
The current Histogram is great, but it falls a bit short in some areas:
- … weights, but in order to track errors properly, we need to keep track of the sum of weights .^ 2 (in each bin)
- … push!(hist, val, weight). This is also not supported right now
- … SpinLock, but this can be slow in certain workloads. Another way is to have a buffer, which is what ROOT does, but I think that's a bit overkill. But in any case push! should be thread-safe.
My question for the StatsBase community is: Do we welcome these changes (of course they will not be breaking)? Or is this too "bloated", so that we should make a new type of histogram? (I would hate this option because having two sets of normal histograms is gonna be annoying.)
cc for perspective: @oschulz
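The three requests (per-bin sum of squared weights, a weighted push!, and a lock-protected push!) can be sketched together using only Base. Everything here is hypothetical and for illustration: the `SketchHist` type, its field names, and the choice to drop out-of-range values are assumptions, not the StatsBase or FHist implementation. The `unsafe_push!` / `push!` naming simply mirrors the split mentioned in this thread.

```julia
using Base.Threads: SpinLock, @threads

# Hypothetical 1D histogram; not the StatsBase.Histogram or FHist.Hist1D layout.
struct SketchHist{E<:AbstractRange}
    edges::E
    sumw::Vector{Float64}    # per-bin sum of weights
    sumw2::Vector{Float64}   # per-bin sum of squared weights
    lock::SpinLock
end

SketchHist(edges::AbstractRange) =
    SketchHist(edges, zeros(length(edges) - 1), zeros(length(edges) - 1), SpinLock())

# Unlocked update. Bins are left-closed, [edges[i], edges[i+1]);
# values outside the edges are silently dropped (no overflow bins in this sketch).
function unsafe_push!(h::SketchHist, val::Real, w::Real = 1.0)
    i = searchsortedlast(h.edges, val)
    if 1 <= i < length(h.edges)
        h.sumw[i]  += w
        h.sumw2[i] += w^2
    end
    return h
end

# Locked update, intended to be callable from several threads at once.
function Base.push!(h::SketchHist, val::Real, w::Real = 1.0)
    lock(h.lock)
    try
        unsafe_push!(h, val, w)
    finally
        unlock(h.lock)
    end
    return h
end

# Usage: fill from multiple threads with a constant weight of 0.5.
h = SketchHist(0.0:0.25:1.0)
vals = repeat([0.1, 0.3, 0.3, 0.9], 1000)
@threads for v in vals
    push!(h, v, 0.5)
end

errors = sqrt.(h.sumw2)   # assumed per-bin error estimate
println(h.sumw)
println(errors)
```

A single lock per histogram is the simplest design, but it serializes every fill, which is the slowdown the issue body anticipates for some workloads; the buffered approach attributed to ROOT trades that contention for extra memory and a merge step.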