Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API Proposal: Add AddOrUpdate to Dictionary<K,V> #26848

Closed
AnthonyLloyd opened this issue Jul 17, 2018 · 33 comments
Closed

API Proposal: Add AddOrUpdate to Dictionary<K,V> #26848

AnthonyLloyd opened this issue Jul 17, 2018 · 33 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Collections
Milestone

Comments

@AnthonyLloyd
Copy link

Add the AddOrUpdate method that is on ConcurrentDictionary to Dictionary. This would improve performance in use cases such as counting keys.

K-Nucleotide in the benchmarksgame does this and has an ugly workaround currently.

Proposed API

I'm proposing some kind of Update method for Dictionary<K,V> which could up to double the performance of this operation.

After looking at ConcurrentDictionary<K,V> I think it could be the same as AddOrUpdate:

public TValue AddOrUpdate(TKey key, TValue addValue, Func<TKey, TValue, TValue> updateValueFactory);

Example

The usage example for counting keys would look like:

dictionary.AddOrUpdate(key, 1, (_,v) => v + 1)

Discussion

there are other common cases where the value in the dictionary is another collection such as a List<T> which would need the other AddOrUpdate on ConcurrentDictionary<K,V>:

public TValue AddOrUpdate(TKey key, Func<TKey, TValue> addValueFactory, Func<TKey, TValue, TValue> updateValueFactory);

dictionary.AddOrUpdate(key, _ => new List<T>{value}, (_,v) => v.Add(value));

since this request is for performance the second function may not be so relevant and would require further evaluation.

Other options could be:

  1. as @omariom says ref could be possible. GetOrAdd(key)++ but may only be useful for value types?
  2. A method that returns the entry (or a new one) in the dictionary so you can set the value.
  3. More specific Dictionary in specialized such as a count dictionary with increment (only if this is common use case). I've also thought that specific int or long key dictionary could be optimized since the storage of the hash could be removed.
@dsyme
Copy link

dsyme commented Jul 17, 2018

Here's a link to the access-the-internals hack that @AnthonyLloyd used to get the perf benefits of AddOrUpdate https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/knucleotide-csharpcore-9.html

@dsyme
Copy link

dsyme commented Jul 17, 2018

@AnthonyLloyd Could you give an indication of the degree of the perf benefits available here?

@danmoseley
Copy link
Member

How is this different to just setting via the indexer?

@dsyme
Copy link

dsyme commented Jul 17, 2018

Thats what I was wondering too :)

@AnthonyLloyd
Copy link
Author

AnthonyLloyd commented Jul 17, 2018

It's roughly twice as fast because you need to effectively do the same operation twice. (The Func on AddOrUpdate may bring this back though)
In the benchmarksgame it comes out more than twice as fast because the hack needs to be done with pointers.

You need to do the following:

dictionary[key] = dictionary.TryGetValue(key, out var i) ? i + 1 : 1;

but you get a perf improvement if you do the following:

public class Wrapper
{
    public int V;
    public Wrapper(int v) { V = v; }
}

if (dictionary.TryGetValue(key, out var w))
    w.V++;
else
    dictionary.Add(key, new Wrapper(1));

@karelz
Copy link
Member

karelz commented Jul 17, 2018

@AnthonyLloyd it would be useful to post the API shape you expect - see API review process for details and example.

@omariom
Copy link
Contributor

omariom commented Jul 17, 2018

May be ref TValue GetOrAdd(TKey key);?

@danmoseley
Copy link
Member

@AnthonyLloyd so the motivation is, setting a value that may depend on any existing value? And this new API would allow this to be done with only one key lookup? If so that makes sense, although there is a high bar for touching Dictionary as it gets instantiated many times (for many different value types) so every extra method bloats our assemblies significantly. We would need good evidence that it is worth it.

Would this eliminate the need for the reflection in your latest KNucleotide submission (which makes it very hard to read 😃 )? Would it give better perf for that scenario, or comparable perf?

For API review, the top post would need to be in the format described at https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/api-review-process.md

@danmoseley
Copy link
Member

By the way @AnthonyLloyd , many thanks for your work on the C# submissions to CLBG.

@AnthonyLloyd
Copy link
Author

Yes it's an update based on a prior value. It would simplify the KNucleotide code and would possibly have almost comparable perf.

I'm not sure of other evidence out there. Personally in most systems I have added this as an extension method using the TryGetValue code above. I've also seen it often discussed on stackoverflow.

As you say Dictionary is used a lot hence performance should be carefully considered but any improvements could possibly have a large impact for people.

@karelz
Copy link
Member

karelz commented Jul 18, 2018

I've also seen it often discussed on stackoverflow.

Can you share some top links demonstrating how popular and upvoted it is?

@vancem
Copy link
Contributor

vancem commented Jul 18, 2018

First, I think that the basic use case (A table where you want to accumulation (typically increment, but you may be adding to a list or doing some other accumulation), is pretty common (I have run into it myself on numerous occasions and have always lamented the fact that that I either have to look up the key twice (once to get the value and once to update it), or create some sort of wrapper (this is what I do when I care about perf).

I note that Concurrent Dictionary already has an AddOrUpdate method. (They needed it for atomicity).

So the existing proposal is reasonable as it 'follows suit'. However the CurrnetDictionary API was created before we had the ref-return capability in C#, Now that we have it I think the @omariom idea of

     ref TValue GetOrAdd(TKey key)

Is better (simpler, more efficent, no delgate in the API, covers the importnat scenarios). The caller can do the creation or update as needed, but in common scnearios (e.g. when TValue is an int or other numeric type), nothing special is needed. If you wanted to have an overload to supply a factory to create TValue I would not have a problem with that (since I think it will be the rare case, and avoids testing logic after the call).

Thus I am conflicted. What I can say is I believe in the scenario (we should do something). I guess my best case would be ad GetOrAdd to both DIctionary and ConcurrentDictionary.

To address @danmosemsft point. It is definately true that Dictionary<K, V> is instantiated alot, but the cost of an additional unused method on this instantiation is mostly pay-for-play (you do have to reserve a slot for the method (probably a pointer and view bytes), but otherwise it is pay for play.

But in general I like it.

@vancem
Copy link
Contributor

vancem commented Jul 18, 2018

@svick
Copy link
Contributor

svick commented Jul 18, 2018

@vancem From the discussion at the end of https://github.com/dotnet/corefx/issues/25189, it seems to me it's not clear yet that having ref-returning methods on mutable collections is a good idea, since the ref local can become stale if the underlying collection is resized. I think that should be resolved first.

Also, how would GetOrAdd work for ConcurrentDictionary? How could it possibly be thread-safe?

@stephentoub
Copy link
Member

Also, how would GetOrAdd work for ConcurrentDictionary? How could it possibly be thread-safe?

If a delegate is provided, it may run multiple times, but the dictionary guarantees that only one will be published/returned and in a way that maintains the consistency of the data structure.

@svick
Copy link
Contributor

svick commented Jul 18, 2018

@stephentoub Right, but ConcurrentDictionary already has AddOrUpdate for that. How would a delegate-based ConcurrentDictionary.GetOrAdd be different from that?

I was assuming @vancem was suggesting adding ref-returning GetOrAdd to ConcurrentDictionary.

@stephentoub
Copy link
Member

stephentoub commented Jul 18, 2018

Right, but ConcurrentDictionary already has AddOrUpdate for that. How would a delegate-based ConcurrentDictionary.GetOrAdd be different from that?

I don't understand the question. It already has both AddOrUpdate:
https://github.com/dotnet/corefx/blob/84d044411b400044f75ee90c744a7f9bd6e4fe1d/src/System.Collections.Concurrent/src/System/Collections/Concurrent/ConcurrentDictionary.cs#L1143
and GetOrAdd:
https://github.com/dotnet/corefx/blob/84d044411b400044f75ee90c744a7f9bd6e4fe1d/src/System.Collections.Concurrent/src/System/Collections/Concurrent/ConcurrentDictionary.cs#L999

@svick
Copy link
Contributor

svick commented Jul 18, 2018

@stephentoub I didn't realize a method called GetOrAdd already existed in ConcurrentDictionary.

@vancem said:

I guess my best case would be [to add] GetOrAdd to both DIctionary and ConcurrentDictionary.

My question is: what is the API that would be added to ConcurrentDictionary?

@stephentoub
Copy link
Member

My question is: what is the API that would be added to ConcurrentDictionary?

I don't know; I don't think there's anything to add there for this. Whether or not we add a ref-returning method to Dictionary, we definitely should not add such a method to ConcurrentDictionary.

@omariom
Copy link
Contributor

omariom commented Jul 18, 2018

To compare ref returning method with what we normally do for the case when value is collection.

if (!dict.TryGetValue(key, out var values))
{
    dict.Add(key, values = new List<int>());
}
values.Add(value);
ref var values = ref dict.GetOrAdd(key);

if (values == null)
{
    values = new List<int>();
}
values.Add(value);

The last lines can be replaced with

(values = values ?? new List<int>()).Add(5);

but at the cost of additional read and write every time (writing reference values to the heap has it own cost). This issue could be cured by null coalescing assignment..

(values ??= new List<int>()).Add(5);

@vancem
Copy link
Contributor

vancem commented Jul 18, 2018

Given the issue with the byref being silently invalidated by updates to the dictionary, is pretty ugly. Also given that ConcurrentDictionary has exactly the AddOrUpdate method, already we should stick with that rather than a ref-returning version.

@AnthonyLloyd
Copy link
Author

@vancem tho this is primarily for performance since we can add extension methods to get the API.

@danmoseley
Copy link
Member

The original proposal was for AddOrUpdate taking an add value and an update delegate. Is GetOrAdd also proposed - @AnthonyLloyd I guess it is potentially useful but just doesn't happen to come up in kNucleotide?

I agree that AddOrUpdate taking an add delegate is more marginal (because it is only useful when the value cannot be trivially set) and I suggest we not take that further (in this issue)

@danmoseley
Copy link
Member

To address @danmosemsft point. It is definately true that Dictionary<K, V> is instantiated alot, but the cost of an additional unused method on this instantiation is mostly pay-for-play (you do have to reserve a slot for the method (probably a pointer and view bytes), but otherwise it is pay for play.

That is true for JIT'd environments, but is it also true for AOT? I had assumed @jkotas was pushing back on any addition to the Dictionary class, not just growth in existing methods, for this reason.

@AnthonyLloyd
Copy link
Author

Yes I'd say this should be thought of much wider. I like what @vancem said about it being an accumulation.
It made me think that this could improve performance of Enumerable.GroupBy but I could be very wrong.

@dotnetchris
Copy link

GetOrAdd is far more useful than AddOrUpdate, Please don't lose GetOrAdd.

@ufcpp
Copy link
Contributor

ufcpp commented Nov 1, 2018

@AnthonyLloyd
Copy link
Author

@dotnetchris @ufcpp We have RefDictionary (dotnet/corefxlab#2458) which shows very good perf. Around 50% fast as thought above.

@dotnetchris
Copy link

@AnthonyLloyd are you bringing up RefDictionary because RefDictionary is built such that you can use the indexer to check for existence safely? And that GetOrAdd only becomes relevant to thread safety concerns of ConcurrentDictionary?

@AnthonyLloyd
Copy link
Author

RefDictionary (now DictionarySlim) returns a ref so you can either set the new value or update. It removes the need to TryGetValue and then Add/Update which is essentially twice the work you should need to do.

It looks like DictionarySlim is the right solution to fixing this type of perf problem and the PR seems close complete. I think I should close this now.

@dotnetchris
Copy link

dotnetchris commented Nov 12, 2018

@AnthonyLloyd That's a much better solution.

Thank you all for stepping back and actually curing the disease instead of just making the medicine taste better.

@dotnetchris
Copy link

Thinking about this further, this also makes DictionarySlim support simple concurrent writes very easily since you can lock on the instance that is returned from update.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 16, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Collections
Projects
None yet
Development

No branches or pull requests