RFC: Simpler array hashing #26022
Conversation
Thanks for this. I certainly appreciate the simplicity of this approach compared with the current one. I think we should keep in mind a third possible approach, though, which is to just drop O(1) hashing for ranges (it's not clear that's very useful in practice) and use a simple solution similar to what was implemented before #16401. As a hybrid solution, we could also continue to hash all elements for multidimensional arrays (as done both before and after #16401), and only change how vectors are hashed, since that's what is needed to hash ranges in O(1) time. Assuming we adopt the general approach from this PR (i.e. do not hash all elements nor all differences between subsequent elements, but only a part of the information), several other quantities could be included in the hash, with the objective of making collisions less likely when modifying a single entry in the middle of the array:
These quantities have the advantage that they can be computed in O(1) time for (integer, at least) ranges and relatively efficiently for sparse matrices. But they require going over the full array, making the computation of hashes O(N) for general arrays while your PR is currently O(1). Overall I think we should identify a few typical use cases to decide which approach is the best one. Without that it is difficult to know the best tradeoff between computation cost and probability of collision. In general I'm a bit concerned about the fact that changing the value of entries in the middle of the array wouldn't change the hash: ideally collisions shouldn't happen in such trivial cases IMHO. But maybe that's fine in practice and the reduction in hashing cost is worth it. I'll just leave a pointer to JuliaLang/Distributed.jl#48, which AFAICT isn't a very correct use of array hashing, but which is going to become even more problematic if we make collisions much more likely (it could use a custom algorithm if needed). Finally, let me note that we still have the possibility of continuing to hash all elements for multidimensional arrays (as was done before and after #16401), and only change how vectors are hashed, since that's what is needed to hash ranges in O(1) time. Doing some research, I've found only two examples of array hashing in main languages:
base/abstractarray.jl
Outdated
# Efficient O(1) method equivalent to the O(N) AbstractArray fallback,
# which works only for ranges with regular step (RangeStepRegular)
function hash_range(r::AbstractRange, h::UInt)
    h += hashaa_seed
Note to self: I forgot to include the `hashaa_seed` in the computation.
base/abstractarray.jl
Outdated
# But we cannot define those here, due to bootstrapping. They're defined later in iterators.jl.
function hashndistinct(A, I, n, hx)
    _hashcheckbounds(A, I)
    seen = Set()
Would probably be better to use a Vector for this since the number of items is small, and it will avoid hashing them multiple times.
Yeah, I was thinking of using a specialized 3-vector to avoid allocations entirely, but figured we could decide on the direction and exact behavior first before optimizing it further.
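A minimal sketch of what that Vector-based bookkeeping could look like (`hashndistinct_sketch` and its arguments are illustrative, not the PR's actual code):

```julia
# Sketch: track the first `n` distinct values in a small Vector and test
# membership with a linear isequal scan, so candidates are never hashed
# just to answer a set-membership query.
function hashndistinct_sketch(A, n::Int, h::UInt)
    seen = Vector{eltype(A)}()        # tiny, so O(n) lookup beats a Set
    for v in A
        length(seen) >= n && break
        if !any(x -> isequal(x, v), seen)
            push!(seen, v)
            h = hash(v, h)
        end
    end
    return h
end
```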
This is a good optimization. Here are some hash admixes worth consideration; used in some manner, one from one or more of these pairs may help: (a) the shape (dims): reducingly hash each dim size; (b) the initial 1st, 2nd, 7th and final 1st (end), 2nd, 7th (end-6) elements (or missings). (edited to comport with reality, thank you for the note, Milan)
@JeffreySarnoff (0), (a) and (α) do not matter for
So the options and their tradeoffs here are:
I'm becoming more convinced that an O(1) hashing method is the way to go here. I'd happily extend the number of elements to help
I generally agree, the current approach is too brittle and complex. I'm still a bit sad to move to an approach which makes it so easy to get collisions. Maybe I just need some time to accept this. ;-) Another less fragile possibility which occurred to me would be to use another property of ranges, which is less strict than computing differences between subsequent elements but also less problematic to check: the fact that ranges are monotonically increasing or decreasing. We could continue to hash all elements for vectors which are not sorted and for multidimensional arrays (keeping RLE for efficiency with sparse matrices, of course), but use the O(1) approach for sorted vectors (for element types which support
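A minimal sketch of that idea, assuming the O(1) data is just the length plus the endpoints (the exact fields are a guess, not something proposed above):

```julia
# Sketch: O(1)-style hashing for monotonic vectors, full hashing otherwise.
# `issorted` is O(N) for a generic vector, but ranges could answer it in O(1).
function hash_monotonic(v::AbstractVector, h::UInt)
    if !isempty(v) && (issorted(v) || issorted(v; rev=true))
        # hash only what a range would expose: length and endpoints
        return hash(first(v), hash(last(v), hash(length(v), h)))
    end
    for x in v                        # unsorted: hash every element
        h = hash(x, h)
    end
    return h
end
```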
I agree that losing discrimination is a negative (there are algorithms that thrive on some probable assurance of different values hashing differently, so things need only be visited once to be useful with other things of a group). Increasing the span at the front and back that is hashed, and speckling in some of the ith-quartile positions (or a few thereabout), should do a decent job of discrimination, especially where `length >> some mod a_length_relevant_prime` is used to pick the span between successive hash-sampled values (or value pairs). Or, for speed, keep a sequence of indices to pluck and let them wrap on the length.
One option might be to use exponentially-spaced indices, and hash
... and interweave those indices, taking them from [1] increasing and from [end] decreasing, to disambiguate small vectors and similar arrays more successfully?
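A sketch of what interleaved, exponentially spaced sampling might look like (hypothetical, and assuming 1-based indexing; the truncated comments above may have had different details in mind):

```julia
# Sketch: hash ~2*log2(N) elements, alternating between indices that grow
# exponentially from the front (1, 2, 4, ...) and shrink from the back
# (N, N-1, N-3, ...), to help disambiguate small and similar vectors.
function hash_expspaced(v::AbstractVector, h::UInt)
    h = hash(length(v), h)
    n, i = length(v), 1
    while i <= n
        h = hash(v[i], h)             # front sample
        h = hash(v[n - i + 1], h)     # matching back sample
        i *= 2
    end
    return h
end
```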
Fails for the
Taking … What if we used a combination of these approaches?
Of course, you would never compute it this way, but it can be done iteratively. This has the slightly strange effect that we compute the hash of every element for
The proposal in my last post still leaves a bad taste in my mouth. If we're going to do
I'm a bit concerned that hashing
Right — that's my second proposal. In short: Change this PR (which does O(1) hashing of some small number of distinct elements from the front and back) to ignore the forward iteration and augment the backwards iteration with log(N) hashes after it's hashed the small number of distinct elements.
That's an interesting solution. There are a few things I don't understand:
Great questions:
But you're right — it's probably not worth hashing elements that don't get used in the total hash, and we probably don't want to use a
I think we should proceed with any reasonable solution so that the API is stabilized and
Unclear that log(N) is the "hatrack"… the same would be true of sqrt(N) samples. Milan's push to get this one done so other contingencies can move along makes sense. It all sounds non-blocking and pathway-providing. Maybe it is worthwhile to keep a short vector of linearized indices to nonzero values to provide informed indirection. 32 or 64 Int64s could do that for a sparse array of 10 trillion elements. If the data structure knows the index of its first and its final nonzero entries, two of those indices could be rewritten when that internal state is rewritten. Two can get two more (the next innermost) and the structure may find that useful too, giving the hash at worst 4 nonzero values and their linearized indexings -- 8 nonzero Ints to hash, or better.
@mbauman Do you plan to finish this, or do you need help?
Yes, let's resurrect this. I have a WIP branch locally where I started implementing #26022 (comment) a while ago, but I stalled on it (and then forgot about it) because I didn't like how hard it was to either implement or describe. I'll take a second look this afternoon, and see if I can find some simplifications.
Here's another thought: hash arrays in such a manner that if an array approximates a range, it hashes like it, and just take the hit of those two objects being in the same bucket. After all, what are the chances that you're hashing both arrays and ranges together and have an array that is almost equal to a range but not quite exactly equal to it? That's the only bad case with such an approach.
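A sketch of the exact-match half of that idea, assuming the O(1) data a range hashes is (length, first element, step) and assuming numeric elements; the "almost equal to a range" case is exactly where the bucket-edge concern in the next reply applies:

```julia
# Sketch: if a vector is exactly arithmetic, hash the O(1) triple a range
# would hash, so an equal range/vector pair lands in the same bucket and
# the collision is resolved by ==; otherwise hash elementwise.
function hash_rangelike(v::AbstractVector, h::UInt)
    n = length(v)
    if n >= 2
        s = v[2] - v[1]               # assumes numeric, 1-based indexing
        if all(i -> v[i] - v[i-1] == s, 3:n)
            return hash(first(v), hash(s, hash(n, h)))
        end
    end
    for x in v
        h = hash(x, h)
    end
    return h
end
```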
Won't that fail because any bucket will have an edge that will have the same precision problems that you were trying to avoid?
base/abstractarray.jl
Outdated
i = last(I)
v1 = A[i]
h = hash(i=>v1, h)
i = let v1=v1; findprev(x->!isequal(x, v1), A, i); end
Possibly should use `Ref{eltype(A)}(v1)` here, or just open-code the loop (to make heterogeneous arrays fast). The same would apply to `h = hash(i=>v1, h)` though (vs. writing it as `h = hash(i, hash(v1, h))`), so I'm not sure of the importance of it.
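One reading of that suggestion, as a sketch (the helper name is hypothetical):

```julia
# Sketch: capture v1 behind a Ref{eltype(A)} so the closure's captured field
# has one concrete type even for heterogeneous arrays (eltype Any), rather
# than the closure specializing on whatever concrete type v1 happens to have.
function prev_distinct(A::AbstractVector, v1, i::Int)
    v1r = Ref{eltype(A)}(v1)
    return findprev(x -> !isequal(x, v1r[]), A, i)
end
```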
I added a simple type-unstable case to the matrix above — this is already smoking the status quo (largely by simply touching fewer unstable elements). I get your point about constructing the `Pair`s, but I don't fully understand the `Ref` in the closure. In any case, this should be a relatively small constant cost.
Maybe add a comment explaining why `let` is used.
This is the complexity for inserting the whole set of hash-equal items into the dictionary. For each item, it must compare
Thanks! The code and the benchmarks look good to me. Of course it's somewhat slower for ranges, and much slower when there are collisions, as expected. But the main question, which is hard to address, is whether collisions are frequent in practice. If that were really the case (which isn't very likely), an ideal approach would maybe be to have …

BTW, if this solution is retained, we'll need to do something about JuliaLang/Distributed.jl#48. Maybe just use the old algorithm there, to limit the risk that the array wouldn't be updated because of collisions (but of course it's not completely correct either).
base/abstractarray.jl
Outdated
i = last(I)
v1 = A[i]
h = hash(i=>v1, h)
i = let v1=v1; findprev(x->!isequal(x, v1), A, i); end
Maybe add a comment explaining why `let` is used.
base/abstractarray.jl
Outdated
fibskip = prevfibskip = oneunit(j)
while j > fibskip
    j -= fibskip
    h = hash(A[J[j]], h)
`@inbounds` would be OK here, right?

Also, have you considered hashing the first entry which differs from the previous/next one starting from `J[j]`? That would improve detection of differences between sparse matrices. I'm concerned that hashing the last distinct values isn't enough.
Sure, this could be `@inbounds`, but I wouldn't expect it to make a difference in performance — profiling hashing a Float64 array shows `checkbounds` accounts for 8 samples out of ~5000. I'm inclined to leave it.

> Also, have you considered hashing the first entry which differs from the previous/next one

Yes, something along those lines is compelling, but I had avoided it due to the challenge of efficiently skipping linear offsets across cartesian indices. Now that I have coded that computation here, though, perhaps it wouldn't be so bad. I'll see if it can be done reasonably.
…ments Goal: Hash approximately log(N) entries with a higher density of hashed elements weighted towards the end and special consideration for repeated values. Colliding hashes will often subsequently be compared by equality -- and equality between arrays works elementwise forwards and is short-circuiting. This means that a collision between arrays that differ by elements at the beginning is cheaper than one where the difference is towards the end. Furthermore, blindly choosing log(N) entries from a sparse array will likely only choose the same element repeatedly (zero in this case). To achieve this, we work backwards, starting by hashing the last element of the array. After hashing each element, we skip the next `fibskip` elements, where `fibskip` is pulled from the Fibonacci sequence -- Fibonacci was chosen as a simple ~O(log(N)) algorithm that ensures we don't hit a common divisor of a dimension and only end up hashing one slice of the array (as might happen with powers of two). Finally, we find the next distinct value from the one we just hashed.
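A condensed sketch of that scheme (simplified to vectors with 1-based indexing and made-up starting values for the skip; the real code walks `keys(A)` so that IndexLinear and IndexCartesian arrays hash identically, and hashes index => value pairs):

```julia
# Sketch of the backwards Fibonacci-skip hash.
function fibhash(v::AbstractVector, h::UInt)
    h = hash(length(v), h)
    j = lastindex(v)
    fibskip, prevfibskip = 1, 1
    while j >= firstindex(v)
        elt = v[j]
        h = hash(j => elt, h)         # index => value pair
        j -= fibskip                  # Fibonacci skip: ~log(N) samples overall
        fibskip, prevfibskip = fibskip + prevfibskip, fibskip
        j < firstindex(v) && break
        # back up to the previous value distinct from the one just hashed, so
        # runs of repeated values (e.g. sparse zeros) aren't sampled repeatedly
        prev = findprev(x -> !isequal(x, elt), v, j)
        prev === nothing && break
        j = prev
    end
    return h
end
```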
Thanks @nalimilan — I like the design you proposed in #26022 (comment) much better.
base/abstractarray.jl
Outdated
fibskip, prevfibskip = fibskip + prevfibskip, fibskip

# Find a key index with a value distinct from `elt` -- might be `keyidx` itself
keyidx = findprev(!isequal(elt), A, keyidx)
So IIUC this finds the first entry after `keyidx` which differs from the last hashed element. That's a bit different from what I had in mind: I was thinking about finding the first element which differs from the one at `keyidx`. I'm not sure it's really better, but the idea was that since in a sparse array it's likely that `keyidx` hits a structural zero, looking for the previous distinct element makes it likely you'll hash a non-zero entry. With your approach, if you hash a zero the first time, you will look for the previous non-zero entry the next time, but if you hit a zero entry the time after that, you'll happily hash it; so you'll end up hashing a zero half of the time, right?
Yes, that's exactly right. I think it makes the behavior a little more robust — otherwise the hashes of sparse arrays with a nonzero last element will more likely hash the same. I also think it's most likely that diagonals of sparse matrices are filled.
That said, this is now clearly not hashing enough elements. The test failures are from four-element arrays colliding. Gotta slow down the exponential a little bit.
> Yes, that's exactly right. I think it makes the behavior a little more robust — otherwise the hashes of sparse arrays with a nonzero last element will more likely hash the same. I also think it's most likely that diagonals of sparse matrices are filled.

Sorry, I'm not sure I follow. Could you elaborate?

> That said, this is now clearly not hashing enough elements. The test failures are from four-element arrays colliding. Gotta slow down the exponential a little bit.

Agreed.
BTW, why use `findprev`, which requires all this keys vs. linear indices dance, instead of a plain loop? I imagine one reason could be that `findprev` could have a specialized method for sparse arrays which would skip empty ranges, but currently that's not the case. Is that what you have in mind?
Ah, I misunderstood your comment at first — I thought you were suggesting only unique'ing against the very first element. Now I understand that you mean to access the value after each skip, and then hash the next element that's different from it.

There are two reasons to do the keys vs. linear indices dance: one is `findprev`, but the other is that I want to hash index-value pairs to add a bit more information about the structure of the array. And of course, we cannot introduce a hashing difference between IndexLinear and IndexCartesian arrays. The fact that we then also allow arrays to opt into optimizations via `findprev` is a nice bonus, especially since it's going to be a pain to completely re-write your own `hash` optimization that has exactly the same behavior.
Ok, I've slowed down that exponential — by quite a bit. I now only grab the next number in the Fibonacci sequence after every 4096 hashes. I chose this number by considering:
This is the balance point that seems reasonable to me — a power of two just because it's slightly cheaper to perform the mod.
Makes sense. I wonder whether we couldn't find a function which increases slowly for common sizes, but much faster when sizes get really large. Or maybe we could just apply an arbitrary threshold beyond which we stop hashing entries? A …

EDIT: the goal being to ensure even the largest possible range doesn't take one second to hash.
base/abstractarray.jl
Outdated
# entirely-distinct 8000-element array will have ~75% of its elements hashed,
# with every other element hashed in the first half of the array. At the same
# time, hashing a `typemax(Int64)`-length Float64 range takes about a second.
if mod(n, 4096) == 0
Since performance is important here, why not replace the `mod` with explicit bit twiddling? In this case I would suggest replacing this line with `if (Int==Int64 && n<<52==0) || n<<20==0`. That's because 4096 = 2^12, 52 = 64 - 12, and 20 = 32 - 12. Also, the `Int==Int64` part will get inlined, so this will be effectively replaced by either `n<<52==0` or `n<<20==0`.
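A quick sanity check of that equivalence on a 64-bit system: shifting left by 52 keeps only the low 12 bits in the word, so the result is zero exactly when n is a multiple of 4096 = 2^12.

```julia
# Verify n << 52 == 0 matches mod(n, 4096) == 0 (assumes Int === Int64).
for n in (0, 1, 4095, 4096, 8192, 123456, 2^44)
    @assert (n << 52 == 0) == (mod(n, 4096) == 0)
end
```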
If you use `rem(n, 4096) == 0` the compiler takes care of the rest.
I always forget about the `rem` and `%`
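For what it's worth, the reason `rem` suffices: for a power-of-two divisor, a truncating remainder-is-zero test is the same as masking the low bits, which the compiler emits directly, whereas `mod` also needs a sign fix-up for negative operands. A quick check:

```julia
# rem(n, 4096) == 0 is equivalent to testing the low 12 bits, for any sign of n.
for n in -10_000:10_000
    @assert (rem(n, 4096) == 0) == ((n & 4095) == 0)
end
```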
Bump. Could we add the test cases from #27865, #26011, and #16401 (comment) and merge this?
Fixes #26034
Ok, test cases pushed. There are, of course, many other schemes that we could dream up here… but triage was in support of merging this as it stands. To help avoid JuliaLang/Distributed.jl#48, I'm putting together another PR that will basically make Distributed use its own hashing system — the constraints are a little different there since we don't need to worry about hash equality across different types. I don't think the two need to be merged at the same time, but I should have that ready to go by tomorrow.
Goal: Hash approximately log(N) entries with a higher density of hashed elements weighted towards the end and special consideration for repeated values. Colliding hashes will often subsequently be compared by equality -- and equality between arrays works elementwise forwards and is short-circuiting. This means that a collision between arrays that differ by elements at the beginning is cheaper than one where the difference is towards the end. Furthermore, blindly choosing log(N) entries from a sparse array will likely only choose the same element repeatedly (zero in this case). To achieve this, we work backwards, starting by hashing the last element of the array. After hashing each element, we skip the next `fibskip` elements, where `fibskip` is pulled from the Fibonacci sequence -- Fibonacci was chosen as a simple ~O(log(N)) algorithm that ensures we don't hit a common divisor of a dimension and only end up hashing one slice of the array (as might happen with powers of two). Finally, we find the next distinct value from the one we just hashed. Fixes #27865 and fixes #26011. Fixes #26034
Implementing `widen` isn't a requirement any more, since JuliaLang#26022.
This is a straw-man implementation of a simpler array hashing scheme. It's very basic and has room for further optimizations, but its intention is to provide a starting point as an alternative to #25822. In short: This hashes the size of the array, then the first three distinct elements and their linear indices, and then the last three distinct elements and their linear distance from the end of the array.
The exact scheme here is open to bikeshedding -- we could use more or fewer distinct elements. I use "distinct elements" as a way of ensuring that relatively empty sparse arrays don't all hash similarly, and I hash elements at both the beginning and end of the array because they're the simplest to get to, and the final elements are the most expensive place to discover a difference if we have to fall back to an equality check. The most complicated part is keeping track of what you hashed in order to prevent running through the entire array twice (once forwards and once backwards).
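A sketch of that strawman scheme as described (the helper name is hypothetical, and it is simplified to vectors with 1-based indexing; the overlap guard at the end is a crude version of the bookkeeping mentioned above):

```julia
# Sketch: hash the length, then the first three distinct index => value
# pairs, then the last three distinct values keyed by distance from the end.
function strawman_hash(v::AbstractVector, h::UInt)
    h = hash(length(v), h)
    isempty(v) && return h
    n = length(v)
    i = firstindex(v)
    for _ in 1:3                      # first three distinct elements
        x = v[i]
        h = hash(i => x, h)
        nxt = findnext(y -> !isequal(y, x), v, i)
        nxt === nothing && return h   # fewer than three distinct values
        i = nxt
    end
    j = lastindex(v)
    for _ in 1:3                      # last three distinct elements
        j <= i && break               # don't re-hash the front section
        x = v[j]
        h = hash((n - j) => x, h)     # distance from the end, not the index
        prv = findprev(y -> !isequal(y, x), v, j)
        prv === nothing && break
        j = prv
    end
    return h
end
```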