Skip to content
This repository has been archived by the owner on Aug 2, 2023. It is now read-only.

Proposal: Provide specialized collections aka Power Collections #2406

Closed
danmoseley opened this issue Jul 30, 2018 · 39 comments
Closed

Proposal: Provide specialized collections aka Power Collections #2406

danmoseley opened this issue Jul 30, 2018 · 39 comments

Comments

@danmoseley
Copy link
Member

@ViktorHofer commented on Thu Jul 19 2018

Proposal

We are hitting limitations with our in-built collections all over the places, either performance based (i.e. allocation) or feature based). After talking with @jkotas offline we agreed that it would be ideal if we would provide commonly used specialized collections (like a Dictionary<string, string> or a PriorityQueue) as a source based nuget package. That would allow us to provide basic enhanced functionality over the general purpose inbox types but still the ability to tweak them on demands.

Considerations:

  • What kind of collection types would benefit from specialization?
  • Where do we see gaps to our competition (again, performance or feature wise)?
  • Where would such a package be hosted and how would it be deployed? (corefx vs community corefx repo)
  • Is a source based nuget package feasible and are we easily able to use it inside corefx?

Example

Currently working in Regex to avoid allocation I stumbled upon places where Unicode categories are statically stored in a Dictionary<string, string>. Trying to get the value out of it involves allocation because there is currently no way to pass a ReadOnlySpan as a key in to get the value. Providing a specialized version would allow us to add an additional overload for ReadOnlySpan.

References

Dictionary/List ref api additions: dotnet/coreclr#17405 (comment)
PriorityQueue: https://github.com/dotnet/corefx/issues/574#issuecomment-308491122 & #1850
OrderedDictionary<,>: https://github.com/dotnet/corefx/issues/26634

cc @jkotas @karelz @benaadams @danmosemsft


@danmosemsft commented on Thu Jul 19 2018

Perhaps we could have a Dictionary implementation specialized for keys of type int -- that would be used by k-nucleotide (given they accept use of Long2IntOpenHashMap by Java 😕 ).

Why do you propose a source based package?

@stephentoub


@danmosemsft commented on Thu Jul 19 2018

cc @AnthonyLloyd for that.


@prajaybasu commented on Thu Jul 19 2018

I hope adding OrderedDictionary<TKey, TValue> is also being considered. (Related: #26634)


@ViktorHofer commented on Thu Jul 19 2018

OrderedDictionary<TKey, TValue> would also be nice. (Related: #26634)

Thanks I added it to the links.

Why do you propose a source based package?

One of the requirements for such a package would be that we can change it frequently for our inbox stuff (i.e. adding new apis that accept types that we just added in master) but still have a stable version that ideally targets netstandard. We did the same with Span by cross-compiling for netstandard and netcoreapp. That is definitely an approach we can think about but during my discussion with @jkotas he pointed out that source based nuget packages are more versatile as consumer will always want to fine-tune their collections which is easily possible with that approach.


@AnthonyLloyd commented on Fri Jul 20 2018

Possibly:

  1. A counting Dictionary<K,int> (or even Dictionary<int,int>) with Increment(K key) and Decrement(K key). For reference counting etc.
  2. An aggregating Dictionary<K,V,R>(Func<V,R> seed, Func<R,V,R> accum) with Add(K v, V v).

Possibly 2) could be used in Enumerable.GroupBy.


@benaadams commented on Thu Jul 19 2018

A fast Type Dictionary<Type, ...> e.g. https://github.com/benaadams/Ben.TypeDictionary

For DI etc


@AnthonyLloyd commented on Fri Jul 20 2018

@benaadams I always assumed the following through some magic had good perf

static class TypeHolder<T>
{
    public static string Value;
}

TypeHolder<int>.Value = "int";
TypeHolder<string>.Value = "string";

@benaadams commented on Fri Jul 20 2018

@AnthonyLloyd alas the generic is a process wide singleton (can be only one); and doesn't work well with GC


@benaadams commented on Fri Jul 20 2018

Where would such a package be hosted and how would it be deployed?

corefxlab, myget; graduate to corefx, nuget

Being in corefx doesn't mean it can't be a nuget only package

community corefx repo

Only if MS doesn't want to support it going forward and/or wants to hand commit/management rights over to non-MS. Don't think that necessarily helps the .NET OSS community as now you have an officially sanctioned OSS section; where does that leave regular repos?


@shaggygi commented on Sun Jul 22 2018

I still think it would be beneficial to have Immutable Parent/Child Collection, but understand it is not specifically related to this thread.

@ViktorHofer, @jkotas to expand on what I'm looking for, it appears @sharwell was working on something similar here... https://github.com/tunnelvisionlabs/dotnet-trees

Few links:
https://stackoverflow.com/questions/9304404/how-to-create-a-completely-immutable-tree-hierarchy-construction-chicken-and-eg
http://concurrencyfreaks.blogspot.com/2013/10/immutable-data-structures-are-not-as.html


@AndyAyersMS commented on Sun Jul 22 2018

For fast dictionaries consider different hash -> bucket mapping strategies. The current dictionary's modulo a prime is slow (about 1/3 the cycle cost for a successful TKey=int lookup these days).

Custom vs/default comparators should be supported via subtyping or parameterization and not by options passed to the constructor, so the jit doesn't need heroics to inline comparators. If via subyping, the fast dictionary classes should be sealed.

For types where hash recomputation is cheap (ints), avoid storing the hash.

.Net settled long ago on the parallel array setup but different approaches may have appeal, eg a chunked array setup that is able to avoid LOH and should be cheaper to grow. Maybe also consider approaches that allow some mutate on read so if there are collisions frequently accessed elements can be made cheaper to find by migrating them earlier in the collision chains.


@ViktorHofer commented on Mon Jul 23 2018

Thanks Andy, whoever is going to implement a fast dictionary will benefit a lot from your info.

I propose we talk about this in the Design/API review meeting. If we agree to provide these collections either inbox or as a nuget package I believe we could implement them (together with community members who are interested) fairly quick.


@Zhentar commented on Tue Jul 24 2018

There are two specialized hash maps I'd like to see.

  1. Fast ulong keyed map. Basically, applying all the optimizations Andy mentioned. My use case is string atomization while parsing Utf8 streams; when I've got a token I need to stringify, I hash the raw bytes with a high quality 64-bit hash function so that the risk of collision is low enough (even accounting for the birthday paradox) that I can treat hash equality as value equality. I'd also prefer an open addressing hash table algorithm that only takes one cache miss on random lookups ( unfortunately this last part is incompatible with a fast K-Nuc dictionary mentioned in Dictionary CQ coreclr#15419 because K-Nuc benefits tremendously from coalesced chaining preserving insertion order locality)

  2. string keyed map which assumes reference equality for equal values. As long as my strings are atomized it would be nice to get an extra bit of efficiency out of dictionaries using them as keys.


@danmosemsft commented on Tue Jul 24 2018

Last time we discussed this it got wedged on the discussion of whether there should be a new repo. @joshfree @weshaggard is corefxlab a suitable place to put experimental code that has some infrastructure (CI) and the ability to package for nuget and might graduate to corefx later?
@karelz


@ViktorHofer commented on Tue Jul 24 2018

If we ship this as a normal nuget package (not source based) this should live in corefx as we want to use it internally in master. Also as a buid from source requirement I believe we must live build netcoreapp asset.


@joshfree commented on Tue Jul 24 2018

@danmosemsft yes corefxlab is the appropriate place for something like this to begin while we’re unsure if it should be promoted to corefx or not.


@mikedn commented on Wed Jul 25 2018

string keyed map which assumes reference equality for equal values

How's that different from a normal dictionary using an appropriate key comparer? (apart from the cost of interface calls that the normal dictionary incurs).


@Zhentar commented on Wed Jul 25 2018

(apart from the cost of interface calls that the normal dictionary incurs).

In a sufficiently tight loop, that's enough :) (combined with the assumption that it's worthwhile to store both the key and the hash code and check for the equality of both)


@danmosemsft commented on Mon Jul 30 2018

Here's what we are going to do. We do want such a package for extended collections. I will move this issue to corefxlab. We will create a package with a name something like System.Collections.Extensions (or System.Collections.Experimental) and put types in a temporary namespace (eg System.Collections.Extensions.Experimental) so we have the option of moving some into the BCL if they prove worthy. This will publish to myget to start.

The goal of this is to develop useful collections, but it would be nice if the kNucleotide maintainer would allow it also.

@joshfree
Copy link
Member

We may want to consider resurrecting the MultiValueDictionary corefxlab prototype as part of a System.Collections.Specialized / Power Collections package

@sharwell
Copy link
Member

I'm a big fan of supporting B+-trees for storage density, low worst-case allocation rates, and ability to avoid the large object heap. I'm not sure about the source-based distribution though.

@terrajobst
Copy link
Member

I'm generally supportive, provided:

  1. We don't use the System namespace. Otherwise, we'll run into trouble if we ever have to put collection types proven to be useful into the product. Yes, we can type forward down, but I'd rather have a clean-split to avoid having to deal with that level of fragility.

  2. We shouldn't be using a source distribution model. For starters, this means supporting any other language but C# is virtually impossible. And second, we're investing in linker technology right now. We should make sure it plays nicely with it, which should be doable as there shouldn't be any crazy, statically undiscoverable interdependencies.

@danmosemsft @joshfree Can we ship signed packages out of CoreFxLab?

@Drawaes
Copy link
Contributor

Drawaes commented Jul 30, 2018

Should be able to design a single writer multiple reader dictionary. This is a pattern that comes up to and time again for caches but concurrent dictionary adds a decent overhead.

@danmoseley
Copy link
Member Author

Can we ship signed packages out of CoreFxLab?

@dagood can maybe answer that. For now what matters is that corefxlab already supports packaging and pushing to myget automatically.

@danmoseley
Copy link
Member Author

@terrajobst what is the API review process like for this work. My assumption is that it does not come into play unless and until we want to graduate types into CoreFX. Is that right?

@joshfree
Copy link
Member

Can we ship signed packages out of CoreFxLab?

We did for the SignalR alpha releases.

@dagood
Copy link
Member

dagood commented Jul 30, 2018

Can we ship signed packages out of CoreFxLab?

As a check, the latest System.Binary.Base64 0.1.0-preview2-180728-5 isn't signed, and the DLL inside isn't authenticode signed either. I don't know what it would take to enable this/these in the corefxlab build.

dotnet/corefx#28991 is where package signing was added to corefx. Maybe @mmitche knows more.

@mmitche
Copy link
Member

mmitche commented Jul 30, 2018

I don't know about corefxlab, but it's really just another signing submission with a different cert.

@joshfree
Copy link
Member

@dagood @mmitche we can chat offline

@omariom
Copy link
Contributor

omariom commented Jul 31, 2018

How about specialized concurrent collections? Bounded vs unbounded, single producer/consumer vs multiple ones.

@cdmihai
Copy link

cdmihai commented Aug 1, 2018

How about a linked block list for lists that just grow and never / seldomly get removed from? MSBuild's evaluator captures evaluated state in lists that grow up to hundreds of thousands of elements (and in out-of-VS scenarios never have elements removed). In this case, it's wasteful to grow the list. Instead, the linked block list can link the arrays instead of discarding, re-allocating, copying.

@danmoseley
Copy link
Member Author

@cdmihai I'm quite proud of those two!

@danmoseley
Copy link
Member Author

@terrajobst I'm thinking we drive this as follows:

  1. Solicit issues in this repo for proposed types, using roughly the API review format. Some already have had good discussion in Corefx that we can build from.
  2. We select two or three to start with that have clear value and a reasonably clear shape, and invite volunteers to offer PR's, with tests, perf measurements etc
  3. Iterate over time based on feedback on the experimental package. Add more types.
  4. Where types seem proven and worthy, graduate them to CoreFX proposals where they would go through formal API review for the BCL.
    @jfree @ahsonkhan is that roughly the pattern Span, Pipeline,... followed? Without formal API review until graduation? Was that a good model?

@ahsonkhan
Copy link
Member

Where types seem proven and worthy, graduate them to CoreFX proposals where they would go through formal API review for the BCL.
@jfree @ahsonkhan is that roughly the pattern Span, Pipeline,... followed? Without formal API review until graduation? Was that a good model?

We had architects guiding the project along within corefxlab (stewards), but yes, that is the pattern that Span, Pipelines, etc., followed especially during the experimentation and prototype phase. Yes, it worked relatively well. If the API sets starts getting too large, I would stage the graduation to CoreFX and the formal API review by splitting into components.

@Drawaes
Copy link
Contributor

Drawaes commented Aug 2, 2018

And I think it helped with community involvement as the early stages could take heavy contributions from the community without large upfront debates.

@Zhentar
Copy link

Zhentar commented Aug 3, 2018

Solicit issues in this repo for proposed types, using roughly the API review format

Do you have a template or examples? I'd be willing to get something started for a low overhead hashmap.

@danmoseley
Copy link
Member Author

danmoseley commented Aug 3, 2018

@Zhentar look at the API review markdown doc in the corefx repo, or prismatic easier to look at active issues in corefx marked with API-ready-for-review label. Top post on those will be examples of the format:
https://github.com/dotnet/corefx/issues?q=is%3Aopen+is%3Aissue+label%3Aapi-ready-for-review

@akraines
Copy link

akraines commented Aug 4, 2018

@cdmihai See https://github.com/jamesqo/BlockList

@danmoseley
Copy link
Member Author

As a datapoint, I made a dump of the types that were in the Wintellect PowerCollections https://gist.github.com/danmosemsft/aa88e4ad837a98ab786e679e446a28ea
Amongst those there were Deque. MultiDictionary, OrderedBag, OrderedDictionary, OrderedSet. Some are since obsolete (Triple, Set)

@shaggygi
Copy link
Contributor

@danmosemsft From the list, RedBlackTree, and link (http://www.jot.fm/issues/issue_2005_03/column6/) is interesting. This might be something similar I link above. 🤞🤞🤞

@danmoseley
Copy link
Member Author

@shaggygi that is an internal implementation type for the OrderedBag, etc. You can browser with ILSPY if you are interested: https://www.nuget.org/packages/SoftUni.Wintellect.PowerCollections/

@benaadams
Copy link
Member

A string "intern" dictionary; that you give it Spans and it gives strings back and has Least frequent recently used discards (possibility with ageing so they stay away from Gen2)

@danmoseley
Copy link
Member Author

@benaadams I think a collection optimized for string interning would be interesting. We added HashSet<T>.TryGetValue to help but that wouldn't help if you had a Span. The policy and how to configure it seems like the hard part. There was some discussion here and in my comment below. Perhaps you'd like to start an issue and sketch a proposal? I can't find an existing one. It seems like potentially a good fit for this package.

@Zhentar
Copy link

Zhentar commented Aug 19, 2018

@benaadams @danmosemsft one of the use cases I intended to include in #2415 (but apparently left out, updated now) was to support my own string intern collection, although I didn't include any sort of expiration mechanism (for my usage keeping all of the strings for the life of the collection is suitable).

@TylerBrinkley
Copy link
Contributor

TylerBrinkley commented Aug 20, 2018

@benaadams While not a true string "intern" dictionary, my proposal for a StringKeyedDictionary<T> type, #2438, may work for what you want.

@cdmihai
Copy link

cdmihai commented Aug 20, 2018

Regarding the string cache, that would be quite interesting for MSBuild. We have a complicated string cache (bucketed MRU lists) which is addressed via a very Span looking interface. Many of my recent Span-ification attempts died because at some point, I needed to put the span in some lookup dictionary, and many times that lookup dictionary was the string cache.

It would be really awesome if I could change the string lookup to be able to be addressed by spans. Most of our Span.ToStrings, come from small atoms representing function names or types, which are highly re-ocurring, so being able to cache the Span.ToStrings would probably save a bunch of strings. https://github.com/dotnet/corefx/issues/31302 would help me write a ROM based comparer. The only drawback would be calling ROM.Span a million (literally) times.

@benaadams
Copy link
Member

I was seeing something like:

public partial class InternDictionary
{
    public string this[ReadOnlySpan<char> key] { get; }
}

So you'd always get a string back; if it already existed you'd get that one, if it was new it would add the ROS.ToString() and give you it back.

Having ReadOnlyMemory keys would be problematic as the would retain the whole chunk of data that they referred to the window of; rather than just the window, which might be quite undesirable.

@Zhentar
Copy link

Zhentar commented Aug 20, 2018

For my own intern dictionary, I just use an xxHash64 of the UTF8 bytes as the key, to avoid keeping the bytes around and to cut the cost of comparing equality of UTF8 bytes against UTF16 strings.

@ViktorHofer
Copy link
Member

So you'd always get a string back; if it already existed you'd get that one, if it was new it would add the ROS.ToString() and give you it back.

Essentially that's the desired data structure that made me write up the proposal to begin with ;)

@svick
Copy link
Contributor

svick commented Aug 21, 2018

@Zhentar With that approach, what happens when there is a collision?

@Zhentar
Copy link

Zhentar commented Aug 21, 2018

@svick theoretically speaking, you would get the wrong string back. But the reason I use a 64-bit hash function is that it makes the probability of collision negligible.

@mgravell
Copy link
Member

mgravell commented Aug 21, 2018 via email

@jnm2
Copy link
Contributor

jnm2 commented Aug 23, 2018

@Zhentar I've actually seen that approach go painfully wrong in a commercial third party library.

@Drawaes
Copy link
Contributor

Drawaes commented Aug 23, 2018

It's basically a non starter for this approach for any system or application I have ever written. Performance is important but correctness is usually more so.

@safern
Copy link
Member

safern commented Aug 27, 2018

The package is now live, containing MultiValueDictionary as the only type as of now:

https://dotnet.myget.org/feed/dotnet-corefxlab/package/nuget/Microsoft.Experimental.Collections

I guess it is a good start. I will close this issue, since the purpose of it was to track the package addition and have some discussion regarding interesting types. But to have better isolated/specialized discussions I think we should open separate issues for type proposals to discuss them and then if agreed, add them to the new package 😄

@safern safern closed this as completed Aug 27, 2018
@TimLovellSmith
Copy link

64-bit hash function is that it makes the probability of collision negligible

For those who like maths, from table courtesy of 'https://en.wikipedia.org/wiki/Birthday_attack'
You have a 50% chance of a collision within 5.1 × 10^9 entries. And that's definitely not a negligible chance, at a dictionary size which is quite achievable on modern hardware.

@jnm2
Copy link
Contributor

jnm2 commented Sep 4, 2018

Nice. Looks like I'd start getting worried around 200k entries.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests