Power Collections - Specialized Hashmaps API Proposal #2415
Comments
Thanks for kicking this off.

API

I assume you mean on the comparers, because you made FindEntry and RemoveEntry virtual. For the comparer, you are calling through an interface. I am not sure of the relative cost of that vs. a delegate or virtual call (@jkotas?). If it's non-trivial, one could imagine a dictionary specialized specifically for, say, StringComparer.OrdinalIgnoreCase. As an alternative to inheritance, partial classes might be sufficient to create specializations like this.

There is always a tradeoff between customization and performance. I suggest a strong bias here toward performance rather than trying to find another balance different to […]

As a general point, ideally the API would be as similar as possible to the existing […]

Performance

All that matters for performance is actual measurements (especially against […]

For collision resolution (chaining, linear probing, quadratic probing, robin hood, cuckoo, etc.), the paper finds robin hood fastest or near fastest for lookup-heavy work, and competitive for read/write. Footprint seems comparable for all probing strategies. For robin hood, it discarded various probing heuristics in favor of linear search, and rehash instead of tombstones (I am not sure whether this is the same as this).

For growth, if I read correctly (Fig 5), performance at 90% capacity does not seem much worse than at 70%, so perhaps resizing can be delayed a little longer than the 72% that System.Collections.Hashtable used.

For layout, your prototype has key and value together (AoS). Since we are getting a chance to modernize, it would be nice to consider possible SIMD friendliness: the paper suggests SoA is more friendly because the keys are already packed. Without SIMD, they are similar except at low load factors, where AoS wins.
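As a rough illustration of the robin hood strategy with linear search mentioned above, here is a minimal sketch. The type `RobinHoodIntMap` and all of its members are hypothetical names invented for this example; it is not the prototype's code and not a proposed API. It assumes int keys with usable low bits, a fixed power-of-two capacity with no resizing, no removal, and an SoA layout (separate key, value, and distance arrays).

```csharp
using System;

// Hypothetical illustration type: fixed-capacity robin hood map with linear probing.
public sealed class RobinHoodIntMap
{
    private readonly int[] _keys;
    private readonly int[] _values;
    private readonly int[] _dist;   // 0 = empty slot; otherwise 1 + distance from the home slot
    private readonly int _mask;

    public int Count { get; private set; }

    public RobinHoodIntMap(int capacity)
    {
        if (capacity < 2 || (capacity & (capacity - 1)) != 0)
            throw new ArgumentException("Capacity must be a power of two.", nameof(capacity));
        _keys = new int[capacity];
        _values = new int[capacity];
        _dist = new int[capacity];
        _mask = capacity - 1;
    }

    private int FindSlot(int key)
    {
        int slot = key & _mask;   // assumption: low bits of the key are good enough
        for (int d = 1; d <= _keys.Length; d++, slot = (slot + 1) & _mask)
        {
            // An empty slot, or a resident closer to its home than we are to ours,
            // proves the key is absent: this is the robin hood early exit.
            if (_dist[slot] < d) return -1;
            if (_keys[slot] == key) return slot;
        }
        return -1;
    }

    public bool TryGetValue(int key, out int value)
    {
        int slot = FindSlot(key);
        value = slot >= 0 ? _values[slot] : 0;
        return slot >= 0;
    }

    public void Set(int key, int value)
    {
        // Separate lookup for clarity; a tuned version would fold this into the probe loop.
        int existing = FindSlot(key);
        if (existing >= 0) { _values[existing] = value; return; }
        if (Count == _keys.Length)
            throw new InvalidOperationException("Map is full; a real implementation would grow and rehash.");

        int slot = key & _mask;
        int d = 1;
        while (true)
        {
            if (_dist[slot] == 0)
            {
                _keys[slot] = key; _values[slot] = value; _dist[slot] = d;
                Count++;
                return;
            }
            if (_dist[slot] < d)
            {
                // The resident is "richer" (closer to home): take its slot and carry it onward.
                (key, _keys[slot]) = (_keys[slot], key);
                (value, _values[slot]) = (_values[slot], value);
                (d, _dist[slot]) = (_dist[slot], d);
            }
            slot = (slot + 1) & _mask;
            d++;
        }
    }
}
```

Any conclusion about whether this beats chaining for a given workload would still have to come from measurements.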
Any particular reason? Some collision resolution strategies require a prime size, right? And growth strategy should be based on measurements. We should probably use ArrayPool, although if the arrays are of structs specific to this implementation, it would need its own pool. Depending on the real-world importance of excess space and the cost of expansion, it may be useful to be able to specify initial capacity, and possibly to offer EnsureCapacity() and TrimExcess() (as […]

Performance scenarios

It would be helpful to create (or reuse) some benchmark scenarios, bearing in mind: […]
For the last two we can probably dig up some statistical data based on a corpus we have. @benaadams @jkotas @vancem for thoughts.
I think the most common case by far is to use a dictionary internally rather than pass it to a 3rd party API. Also, as a data point, in a quick grep of corefx\src*\ref**cs, not many APIs accept […]
Do you mean a bidirectional map? That seems something to worry about later. (If by PropertyDictionary you mean this, then it is not a bidirectional map, it is a map of objects that know their keys). I think the way forward is to "modernize" the API of […]
@danmosemsft
I was not intending for FindEntry and RemoveEntry to be virtual; is there something that implies virtual that I am not aware of? The comparer interface is applied as a generic type parameter, so that a struct comparer implementation can be used to get the compiler to specialize, de-virtualize, and likely inline as well, without too much API compromise.
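A minimal sketch of the struct-comparer technique described here, under stated assumptions: `OrdinalIgnoreCaseStructComparer` and `KeySearch.IndexOf` are hypothetical names made up for this example, and the linear scan stands in for a real hash lookup. Because `TComparer` is constrained to a struct, the runtime generates a separate instantiation per comparer type, so the `Equals` call is a direct call on a known type rather than an interface dispatch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical struct comparer used only for illustration.
public readonly struct OrdinalIgnoreCaseStructComparer : IEqualityComparer<string>
{
    public bool Equals(string? x, string? y) =>
        string.Equals(x, y, StringComparison.OrdinalIgnoreCase);

    public int GetHashCode(string obj) =>
        obj.GetHashCode(StringComparison.OrdinalIgnoreCase);
}

public static class KeySearch
{
    // The comparer is a generic type parameter, not an IEqualityComparer<TKey> field,
    // so each struct comparer gets its own specialized copy of this method.
    public static int IndexOf<TKey, TComparer>(TKey[] keys, TKey key, TComparer comparer)
        where TComparer : struct, IEqualityComparer<TKey>
    {
        for (int i = 0; i < keys.Length; i++)
        {
            if (comparer.Equals(keys[i], key)) return i;
        }
        return -1;
    }
}
```

A call such as `KeySearch.IndexOf(new[] { "alpha", "Beta" }, "beta", default(OrdinalIgnoreCaseStructComparer))` lets the caller pick the comparison policy at compile time. Whether the call is actually inlined depends on the JIT and should be verified by measurement.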
Modulo prime is a pretty good strategy for handling hash keys with poor entropy in the low bits (memory addresses of objects, for example). The problem is that it requires a DIV instruction with a best case latency of several dozen cycles, which is an unwelcome price to pay when you've already got good low bit entropy. A less resilient but faster entropy mixing function, like multiplying by 2654435769 (2^32/phi), can be used, or left to the consumer.
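The multiplicative mixing mentioned above can be sketched as follows. `HashMixing.BucketIndex` is a hypothetical helper written for this example, not part of the prototype; it assumes a power-of-two bucket count expressed as its base-2 logarithm, and it takes the high bits of the product because those are the best mixed.

```csharp
using System;

public static class HashMixing
{
    // 2^32 / phi, rounded to an integer (0x9E3779B9).
    private const uint GoldenRatio32 = 2654435769u;

    // Maps a hash code to a bucket in [0, 2^log2BucketCount) with one multiply and one shift.
    public static int BucketIndex(int hashCode, int log2BucketCount)
    {
        if (log2BucketCount < 1 || log2BucketCount > 31)
            throw new ArgumentOutOfRangeException(nameof(log2BucketCount));

        // Overflow is intended: the product wraps modulo 2^32.
        uint mixed = unchecked((uint)hashCode * GoldenRatio32);
        return (int)(mixed >> (32 - log2BucketCount));
    }
}
```

With 16 buckets, hash codes 16, 32, 48 and 64 would all land in bucket 0 under a plain `hashCode & 15` mask, whereas this mixing spreads them across different buckets, without a division.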
I mean a map of objects that know their keys (which at least in my use cases means a set of values/objects with an "index" of some trait for fast lookup). I would like a bidirectional map too, but I think it would have to be a different structure.
API Proposal for #2406
Rationale
The BCL Dictionary<,> class serves fairly well as a "safe default" hashmap implementation. But that flexibility and resistance to misuse comes at a substantial cost to best case performance, and it is further weighed down by the need to maintain backwards compatibility with decisions that, after 15+ years of hindsight, seem rather less than ideal. The performance ceiling for even naive hash map implementations is easily several times faster than Dictionary<K,V>; scenarios that demand high performance need better performing collections and will gladly tolerate fewer guard rails to get them.

Goals

[…] (int, TKey) tuple keys can be used in the rare scenarios where it does make sense
[…] unsafe behaviors
[…] protected so that library builders (such as myself) can safely & confidently hand these out to consumers without copying.

Proposed API

Base hashmap class. No public members - support a quality hashmap implementation to serve as the base for a wide variety of specialized collections, both power collections library provided and consumer implemented.
Basic Dictionary implementation, except faster & with single lookup add & update support
Counting Dictionary
TransformationCache - where you look up transformed values using the transformation input (e.g. a span of UTF8 bytes looking up strings)
Indexed Set - support indexed lookup, where the key is derived from the value (e.g. msbuild PropertyDictionary)
\\TODO
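To make the "single lookup add & update" and counting-dictionary ideas concrete, here is a small sketch of the behavior using the existing `Dictionary<TKey,TValue>`. It relies on `CollectionsMarshal.GetValueRefOrAddDefault` from `System.Runtime.InteropServices`, which is available in .NET 6 and later; it is not the API proposed in this issue, and `WordCounter` is a hypothetical name for this example.

```csharp
using System.Collections.Generic;
using System.Runtime.InteropServices;

public static class WordCounter
{
    public static Dictionary<string, int> Count(IEnumerable<string> words)
    {
        var counts = new Dictionary<string, int>();
        foreach (string word in words)
        {
            // One hash lookup: returns a ref to the existing value, or to a
            // newly added default (0) entry when the key was not present.
            ref int count = ref CollectionsMarshal.GetValueRefOrAddDefault(counts, word, out _);
            count++;
        }
        return counts;
    }
}
```

The returned ref is only valid until the dictionary is next modified, which is the kind of reduced guard rail the rationale above says high performance scenarios will tolerate.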
Super Basic, Broken Prototype
This prototype is provided as a comprehension aid & basis for discussion; it is entirely untested and has rather substantial deficiencies.
https://gist.github.com/Zhentar/eac2d9078860c29c58575e04fbe1deca
Open Questions
[…] IDictionary<K,V> implementation? Or should it be read-only, given less robust handling of poor implementations?
[…] PropertyDictionary reminded me that it's not an uncommon scenario (I've done it plenty of times myself), and would be easily supported