Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NaN when hashed should have a random hash value #4510

Closed
dgryski opened this issue Jan 16, 2013 · 12 comments
Closed

NaN when hashed should have a random hash value #4510

dgryski opened this issue Jan 16, 2013 · 12 comments
Labels
E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue.
Milestone

Comments

@dgryski
Copy link
Contributor

dgryski commented Jan 16, 2013

20:55:17 <dgryski> I saw the discussion about NaN's in the weekly meeting minutes.  Did anybody bring up Go's behaviour?  NaN's get a random hash value: http://research.swtch.com/randhash
20:56:07  <bstrie> dgryski: interesting
20:59:55  <bstrie> dgryski: would you mind filing an issue to add this to the table(s) in the stdlib? I don't know if if solves the general issue that was being  discussed, but it sounds neat regardless

http://research.swtch.com/randhash for more discussion.

@graydon
Copy link
Contributor

graydon commented Jan 24, 2013

I believe this is exactly addressing the problem that came up. Thanks for pointing it out! I'm happy to follow Russ' lead here.

@nikomatsakis
Copy link
Contributor

I... am not sure about this! At least let's discuss it. I am trying to remember what precisely we concluded when discussing this after a recent meeting. @pcwalton ?

@nikomatsakis
Copy link
Contributor

I am going to close this, since I believe the problem is adequately and better addressed by #5283.

@graydon graydon reopened this Mar 11, 2013
@graydon
Copy link
Contributor

graydon commented Mar 11, 2013

No float-keyed hashtables?

@thestinger
Copy link
Contributor

@graydon: I think there are ways to work around it, like implementing TotalOrd for floats but failing (or using a condition?) if a NaN shows up.

@graydon
Copy link
Contributor

graydon commented Mar 11, 2013

The underlying issue isn't the equality, it's the hashing itself. It's worth reading the linked article in detail if you haven't yet -- the behavior it describes is intended to maintain the operational invariants of a map.

@thestinger
Copy link
Contributor

@graydon: The hashing isn't the direct cause of the issue though, NaN != NaN so every one inserted into the hash table will be a new entry. Randomizing the hash maintains the O(1) expected search time, but either preventing the insertion of NaN or not using the IEEE754 equality rules (and saying NaN == NaN) would also solve it (and you wouldn't end up with keys you can't retrieve).

@nikomatsakis
Copy link
Contributor

@graydon TotalOrd would have a defined ordering and equality for NaN (just as e.g. Java does; check out the rules for Double.equals(), Double.compareTo() and Double.hashCode()). So a float-keyed hashtable would work just fine.

@nikomatsakis
Copy link
Contributor

Note that there is no need to fail dynamically when a NaN is encountered. The behavior of TotalOrd (or just Ord, as pcwalton suggested) is just different from the behavior of partial-comparisons (i.e., the < and == operators) when it comes to NaN. This is what Java and Racket do and it seems like a reasonable compromise. Certainly better than randomly hashing NaN---this just leads to filling up your hashtable rather than actual sensible behavior.

@nikomatsakis
Copy link
Contributor

(Not that the situations in Java and Racket are directly analogous either. Well, Java basically is. Racket somewhat less so, and I don't claim a deep understand of their solution, but @jbclements and I did some experimentation and found that there is some special treatment around NaN to deal with these inconsistencies.)

@graydon
Copy link
Contributor

graydon commented Mar 12, 2013

Ok, so after some discussion I'm reasonably OK with closing this, but I'll make notes here:

  • IEEE 754 has two kinds of equality: partial and total. The totalOrder predicate is defined to match the partial order but extend it to cover a few cases (paraphrasing doc -- it's not a free spec):
    • totalOrder(-0, +0) == true and totalOrder(+0, -0) == false
    • totalOrder(-NaN, y) == true and totalOrder(x, +Nan) == true
    • signaling NaN sorts below quiet, negative below positive, lesser payload below greater
    • other exponent comparison cases to handle multiple-representation possibilities, hopefully only relevant in decimal floating point
  • totalOrder is either slow or simply unexpected to users much of the time when dealing with floats
    • In particular, we would like <, <=, ==, !=, >= and > to use the partial order (for speed alone?)
  • for containers, a total order is required for the data structure API to work reasonably (eg. inserting an element lets you retrieve it, inserting it twice overwrites, rebalancing operations don't lose nodes, etc.)
  • other languages have ad-hoc tricks to "totalize" their comparators when inserting into hashtables and trees. Java overrides the boxed Double type (but not the primitive double), haskell defines a compare operator that behaves differently on NaN than the relational operators, but is used for containers, etc.
  • In our case we're going to do the following:
    • define a nice TotalOrd trait
    • have it raise a condition when faced with partially-ordered values such as NaN: float::partial_order::cond.raise_default((a,b) || float::ieee754r::totalOrder(a,b))
    • a user who wants something other than the 754 totalOrder predicate can install one (something faster? unlikely going via conditions; more likely they'll just want to fail)
    • otherwise they get that predicate, which might be slow but is at least total

This matches the precedent of other languages and I think the intent of the IEEE committee in adding totalOrder. It does not involve random hashing; the hash value arrived-at by "just hashing the byte-representation" should respect the same equality that totalOrder respects; at least I think that's what the intent of totalOrder is (to sort representations more-or-less by their bit patterns). So I think we can just hash the bytes.

@graydon graydon closed this as completed Mar 12, 2013
@nikomatsakis
Copy link
Contributor

Just to add something I wrote on IRC: my one concern with using conditions is that the method used to hash/compare NaN should really be associated with the data structure, but conditions are dynamically scoped, so making things customizable via conditions opens the door to data structure incoherence.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue.
Projects
None yet
Development

No branches or pull requests

4 participants