RFC: Simpler array hashing #26022

Merged · 8 commits · Aug 2, 2018

Changes from 1 commit
72 changes: 39 additions & 33 deletions base/abstractarray.jl
@@ -2067,40 +2067,46 @@ function hash(A::AbstractArray, h::UInt)
h = hash(map(last, axes(A)), h)
isempty(A) && return h

# Now work backwards and hash (up to) three distinct key-value pairs
# Working backwards introduces an asymmetry with isequal; in many cases
# arrays that hash equally will be compared via isequal, which iteratively
# works forwards and _short-circuits_. Therefore the elements at the
# beginning of the array are not as valuable to include in the hash computation
# as they are "cheaper" to compare within `isequal`.
# A small number of distinct elements are included in the hashing algorithm
# in order to emphasize distinctions between arrays that are nearly all the
# same constant value but have a handful of differences the O(log(n)) skipping
# algorithm might miss (in particular, this includes sparse matrices).
I = keys(A)
i = last(I)
v1 = A[i]
h = hash(i=>v1, h)
i = let v1=v1; findprev(x->!isequal(x, v1), A, i); end
i === nothing && return h
v2 = A[i]
h = hash(i=>v2, h)
i = let v1=v1, v2=v2; findprev(x->!isequal(x, v1) && !isequal(x, v2), A, i); end
i === nothing && return h
h = hash(i=>A[i], h)

# Now launch into an ~O(log(n)) hashing of values, continuing from the
# last-found distinct index. The Fibonacci series is used here to avoid
# repeating common divisors and potentially only including a single slice
# of an array (as might be the case with powers of two and a matrix with
# an evenly divisible size).
J = vec(I) # Reshape the (potentially cartesian) keys to more efficiently compute the linear skips
j = LinearIndices(I)[i]
fibskip = prevfibskip = oneunit(j)
while j > fibskip
j -= fibskip
h = hash(A[J[j]], h)
# Goal: Hash approximately log(N) entries with a higher density of hashed elements
# weighted towards the end and special consideration for repeated values. Colliding
# hashes will often subsequently be compared by equality -- and equality between arrays
# works elementwise forwards and is short-circuiting. This means that a collision
# between arrays that differ by elements at the beginning is cheaper than one where the
# difference is towards the end. Furthermore, blindly choosing log(N) entries from a
# sparse array will likely only choose the same element repeatedly (zero in this case).

# To achieve this, we work backwards, starting by hashing the last element of the
# array. After hashing each element, we skip the next `fibskip` elements, where
# `fibskip` is pulled from the Fibonacci sequence -- Fibonacci was chosen as a simple
# ~O(log(N)) algorithm that ensures we don't hit a common divisor of a dimension and
# only end up hashing one slice of the array (as might happen with powers of two).
# Finally, we find the next distinct value from the one we just hashed.

# This is a little tricky since skipping an integer number of values inherently works
# with linear indices, but `findprev` uses `keys`. Hoist out the conversion "maps":
ks = keys(A)
key_to_linear = LinearIndices(ks) # Index into this map to compute the linear index
linear_to_key = vec(ks) # And vice-versa

# Start at the last index
keyidx = last(ks)
linidx = key_to_linear[keyidx]
fibskip = prevfibskip = oneunit(linidx)
while true
# Hash the current key-index and its element
elt = A[keyidx]
h = hash(keyidx=>elt, h)

# Skip backwards a Fibonacci number of indices -- this is a linear index operation
linidx = key_to_linear[keyidx]
linidx <= fibskip && break
linidx -= fibskip
keyidx = linear_to_key[linidx]
fibskip, prevfibskip = fibskip + prevfibskip, fibskip

# Find a key index with a value distinct from `elt` -- might be `keyidx` itself
keyidx = findprev(!isequal(elt), A, keyidx)
Member:
So IIUC this finds the first entry after keyidx which differs from the last hashed element. That's a bit different from what I had in mind: I was thinking about finding the first element which differs from the one at keyidx. I'm not sure it's really better, but the idea was that since in a sparse array it's likely that keyidx hits a structural zero, looking for the previous distinct element makes it likely you'll hash a non-zero entry. With your approach, if you hash a zero the first time, you will look for the previous non-zero entry the next time, but if you hit a zero entry the time after that, you'll happily hash it; so you'll end up hashing a zero half of the time, right?
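The behavior the reviewer describes can be traced with a small helper. This is a hypothetical sketch (the function name and test vector are mine, not part of the PR) that mirrors the loop in this diff but records the visited values instead of hashing them:

```julia
# Hypothetical trace of the PR's loop: record each element that would be
# hashed, on a mostly-zero vector with a single nonzero entry.
function visited_values(A)
    vals = eltype(A)[]
    ks = keys(A)
    key_to_linear = LinearIndices(ks)
    linear_to_key = vec(ks)
    keyidx = last(ks)
    fibskip = prevfibskip = 1
    while true
        elt = A[keyidx]
        push!(vals, elt)                 # stand-in for h = hash(keyidx=>elt, h)
        linidx = key_to_linear[keyidx]
        linidx <= fibskip && break
        linidx -= fibskip
        keyidx = linear_to_key[linidx]
        fibskip, prevfibskip = fibskip + prevfibskip, fibskip
        # Seek the previous value distinct from the one just hashed
        keyidx = findprev(!isequal(elt), A, keyidx)
        keyidx === nothing && break
    end
    return vals
end

v = zeros(Int, 100); v[50] = 7
visited_values(v)  # → [0, 7, 0]
```

Two of the three recorded values are zero: after `findprev` lands on the nonzero `7`, the next Fibonacci skip lands on a zero that differs from `7` and is hashed, which is exactly the reviewer's point.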

Member (author):
Yes, that's exactly right. I think it makes the behavior a little more robust — otherwise the hashes of sparse arrays with a nonzero last element will more likely hash the same. I also think it's most likely that diagonals of sparse matrices are filled.

That said, this is now clearly not hashing enough elements. The test failures are from four-element arrays colliding. Gotta slow down the exponential a little bit.
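One conceivable way to "slow down the exponential" (an illustration only, not necessarily the fix the PR ultimately adopted) is to advance the Fibonacci pair only every other iteration. The helper below is hypothetical and ignores the distinct-value search:

```julia
# Hypothetical variant: advance the Fibonacci skip only every second hashed
# element, so more entries are visited before the skip grows large.
function fib_visited_slow(n::Integer)
    visited = Int[]
    linidx = n
    fibskip = prevfibskip = 1
    step = 0
    while true
        push!(visited, linidx)
        linidx <= fibskip && break
        linidx -= fibskip
        step += 1
        if step % 2 == 0
            fibskip, prevfibskip = fibskip + prevfibskip, fibskip
        end
    end
    return visited
end

length(fib_visited_slow(100))  # → 14 entries, vs. 9 with the unmodified skip
```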

Member:

> Yes, that's exactly right. I think it makes the behavior a little more robust — otherwise the hashes of sparse arrays with a nonzero last element will more likely hash the same. I also think it's most likely that diagonals of sparse matrices are filled.

Sorry, I'm not sure I follow. Could you elaborate?

> That said, this is now clearly not hashing enough elements. The test failures are from four-element arrays colliding. Gotta slow down the exponential a little bit.

Agreed.

Member:

BTW, why use findprev, which requires all this keys vs. linear indices dance, instead of a plain loop? I imagine one reason could be that findprev could have a specialized method for sparse arrays which would skip empty ranges, but currently that's not the case. Is that what you have in mind?

Member (author):

Ah, I misunderstood your comment at first — I thought you were suggesting only unique'ing against the very first element. Now I understand that you mean to access the value after each skip, and then hash the next element that's different from it.

There are two reasons for the keys vs. linear indices dance: one is findprev, but the other is that I want to hash index-value pairs to add a bit more information about the structure of the array. And of course, we cannot introduce a hashing difference between IndexLinear and IndexCartesian arrays. The fact that we then also allow arrays to opt into optimizations via findprev is a nice bonus, especially since it would otherwise be a pain to completely rewrite your own hash optimization with exactly the same behavior.
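The keys vs. linear indices "dance" can be seen in isolation. A minimal sketch (the array contents are arbitrary):

```julia
A = reshape(collect(1:12), 3, 4)    # any 3×4 array
ks = keys(A)                        # CartesianIndices((3, 4))
key_to_linear = LinearIndices(ks)   # CartesianIndex(i, j) -> linear Int
linear_to_key = vec(ks)             # linear Int -> CartesianIndex(i, j)

keyidx = last(ks)                   # CartesianIndex(3, 4)
linidx = key_to_linear[keyidx]      # 12
linear_to_key[linidx - 1]           # CartesianIndex(2, 4): one linear step back

# Hashing the key-value pair is well defined for either index style:
h = hash(keyidx => A[keyidx], zero(UInt))
```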

keyidx === nothing && break
end

return h
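For intuition about how many entries the loop above touches, here is a hypothetical helper (not part of the PR) that records just the linear indices the backwards Fibonacci skip visits, ignoring the `findprev` distinct-value search, which can only move the cursor further left:

```julia
# Hypothetical helper: linear indices visited by the backwards Fibonacci skip
# for a length-n array, with no distinct-value search.
function fib_visited(n::Integer)
    visited = Int[]
    linidx = n
    fibskip = prevfibskip = 1
    while true
        push!(visited, linidx)
        linidx <= fibskip && break
        linidx -= fibskip
        fibskip, prevfibskip = fibskip + prevfibskip, fibskip
    end
    return visited
end

fib_visited(100)  # → [100, 99, 97, 94, 89, 81, 68, 47, 13]
```

The visited set is densest near the end of the array and thins out at a Fibonacci rate, giving roughly O(log n) hashed entries.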