Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: Add readonly Capacity property to HashSet<T> and Dictionary<K,V> and TrimExcess to HashSet<T> #66426

Closed
theodorzoulias opened this issue Mar 10, 2022 · 14 comments · Fixed by #85877
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Collections
Milestone

Comments

@theodorzoulias
Copy link
Contributor

theodorzoulias commented Mar 10, 2022

Background and motivation

The complexity of enumerating a HashSet<T> or a Dictionary<TKey, TValue> is determined not by its Count, but by its internal capacity. There are scenarios where both of these structures are enumerated frequently, and during their lifetime they might grow very large and then shrink to a small size. In these scenarios it makes sense to call TrimExcess at a point when the current Count has become too small compared to the current capacity, in order to speed up the enumeration of the collection.

If a Capacity property was exposed, it could be used for example like this:

dict.Remove(key);
if (dict.Count < dict.Capacity >> 2) dict.TrimExcess(dict.Count << 1);

In the above example the capacity of the dictionary is reduced to double its size, when its size becomes a quarter of its capacity. So a dictionary with capacity 1000 will have its capacity reduced to 500 when the Count drops below 250.

Currently it's not possible to know the internal capacity of these collections, without keeping track of their Count after every operation that affects their size. Which is quite cumbersome. Or by using reflection, which is not efficient and safe.

Also currently it's not possible to trim a HashSet<T> to a specific capacity, like it's possible with a Dictionary<TKey, TValue>. The HashSet<T>.TrimExcess method has no overload with a capacity parameter.

API Proposal

namespace System.Collections.Generic
{
    public class HashSet<T>
    {
        /// <summary>
        /// Gets the total number of elements the internal data structure can hold
        /// without resizing.
        /// </summary>
        public int Capacity { get; }

        /// <summary>
        /// Sets the capacity of this set to hold up a specified number of elements
        /// without any further expansion of its backing storage.
        /// </summary>
        public void TrimExcess(int capacity);
    }

    public class Dictionary<TKey, TValue>
    {
        /// <summary>
        /// Gets the total number of elements the internal data structure can hold
        /// without resizing.
        /// </summary>
        public int Capacity { get; }
    }
}

Alternative Designs

A TrimExcess overload with a loadFactor parameter would also do the job:

public void TrimExcess(double loadFactor);

...but its usage would not be completely obvious, and it could collide with the existing overload that has an int capacity parameter. Related issue: #23744

Risks

None that I am aware of.

@theodorzoulias theodorzoulias added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Mar 10, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Collections untriaged New issue has not been triaged by the area owner labels Mar 10, 2022
@ghost
Copy link

ghost commented Mar 10, 2022

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

The complexity of enumerating a HashSet<T> or a Dictionary<TKey, TValue> is determined not by its Count, but by its internal capacity. There are scenarios where both of these structures are enumerated frequently, and during their lifetime they might grow very large and then shrink to a small size. In these scenarios it makes sense to call TrimExcess at a point when the current Count has become too small compared to the current capacity, in order to speed up the enumeration of the collection.

If a Capacity property was exposed, it could be used for example like this:

dict.Remove(key);
if (dict.Count < dict.Capacity >> 2) dict.TrimExcess();

Currently it's not possible to know the internal capacity of these collections, without keeping track of their Count after every operation that affects their size. Which is quite cumbersome. Or by using reflection, which is not efficient and safe.

API Proposal

public class HashSet<T>
{
    /// <summary>
    /// Gets the total number of elements the internal data structure can hold without resizing.
    /// </summary>
    public int Capacity { get; }
}

public class Dictionary<TKey, TValue>
{
    /// <summary>
    /// Gets the total number of elements the internal data structure can hold without resizing.
    /// </summary>
    public int Capacity { get; }
}

API Usage

As shown above.

Alternative Designs

A TrimExcess overload with a loadFactor parameter would also do the job:

public void TrimExcess(double loadFactor);

...but its usage would not be completely obvious, and it could collide with the existing overload that has an int capacity parameter. Related issue: #23744

Risks

None that I am aware of.

Author: theodorzoulias
Assignees: -
Labels:

api-suggestion, area-System.Collections, untriaged

Milestone: -

@danmoseley
Copy link
Member

The usual concern with these very commonly instantiated generic types is that making them larger grows the generated code many times over in a typical app. Clearly the added code here would be minimal but we would need some evidence that this is a common enough need.

@theodorzoulias
Copy link
Contributor Author

@danmoseley my actual recent case of needing this API was with a HashSet<T>. I don't remember ever needing it with Dictionary<TKey, TValue>, so as far as I am concerned adding it only on the HashSet<T> would be OK.

@danmoseley
Copy link
Member

@jkotas what would your concern level be about the bytes generated by a simple getter over a field? int Capacity { get { return _entries.Length; }}

@jkotas
Copy link
Member

jkotas commented Mar 10, 2022

I am ok with it. It has to also check for _entries == null.

@theodorzoulias
Copy link
Contributor Author

In retrospect adding the HashSet<T>.Capacity alone will not result in a totally satisfactory solution to my problem, without the HashSet<T>.TrimExcess overload that has an int capacity parameter. This overload currently exists only in the Dictionary<TKey, TValue>, and not in the HashSet<T>. The existing parameterless HashSet<T>.TrimExcess method leaves (almost) no space for future expansion in the collection, and as a result the next Add is likely to trigger an expensive doubling of the collection's capacity.

The scenario that I want to cover involves a HashSet<T> with a constantly changing Count, that never settles down to a final Count during its life time. The desirable shrinking behavior would be to half its capacity when the Count has become a quarter of its capacity. For example a collection with capacity 1000 should have its capacity reduced to 500 when the Count drops below 250.

set.Remove(entry);
if (set.Count < set.Capacity >> 2) set.TrimExcess(set.Count << 1);

An API that might be an even better solution to my problem, if it existed, could be a HashSet<T> with linked nodes, similar to the LinkedDictionary<TKey, TValue> proposed here. That API would offer an enumeration complexity determined by its Count, and not by its capacity. And in my case I care mainly about the performance of enumerating the collection, not about the memory it consumes.

@eiriktsarpalis
Copy link
Member

I'm seeing a couple of options here:

  1. Update the API proposal to include a HashSet<T>.TrimExcess(int capacity) overload (seems fairly straighforward to incorporate and there is precedent in Dictionary).
  2. Close this issue in favor of other solutions?

@eiriktsarpalis eiriktsarpalis added the needs-author-action An issue or pull request that requires more info or actions from the author. label Mar 14, 2022
@ghost
Copy link

ghost commented Mar 14, 2022

This issue has been marked needs-author-action since it may be missing important information. Please refer to our contribution guidelines for tips on how to report issues effectively.

@theodorzoulias
Copy link
Contributor Author

@eiriktsarpalis having the HashSet<T>.TrimExcess(int capacity) API would be really nice. It would open more options at managing the capacity of the collection. Should I include it in this proposal, or create a new API proposal?

@ghost ghost added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Mar 14, 2022
@eiriktsarpalis
Copy link
Member

Updating this one should suffice, thanks.

@theodorzoulias
Copy link
Contributor Author

@eiriktsarpalis done!

@eiriktsarpalis eiriktsarpalis removed untriaged New issue has not been triaged by the area owner needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration labels Mar 23, 2022
@eiriktsarpalis eiriktsarpalis added this to the Future milestone Mar 23, 2022
@eiriktsarpalis eiriktsarpalis self-assigned this Mar 23, 2022
@eiriktsarpalis eiriktsarpalis added the api-ready-for-review API is ready for review, it is NOT ready for implementation label Mar 23, 2022
@danmoseley danmoseley changed the title [API Proposal]: Add readonly Capacity property to HashSet<T> and Dictionary<K,V> [API Proposal]: Add readonly Capacity property to HashSet<T> and Dictionary<K,V> and TrimExcess to HashSet<T> Mar 30, 2022
@theodorzoulias
Copy link
Contributor Author

I found the related issue #27618, where an additional usage scenario for this API was mentioned by @Grauenwolf in a comment: object pooling scenarios.

@danmoseley danmoseley removed the api-suggestion Early API idea and discussion, it is NOT ready for implementation label May 3, 2022
@bartonjs
Copy link
Member

bartonjs commented Apr 27, 2023

Video

We looked at some other types in the namespace that had the same pattern of EnsureCapacity(int)+TrimExcess(void) and squared it off to have both get_Capacity+TrimExcess(int).

namespace System.Collections.Generic
{
    public partial class HashSet<T>
    {
        /// <summary>
        /// Gets the total number of elements the internal data structure can hold
        /// without resizing.
        /// </summary>
        public int Capacity { get; }

        /// <summary>
        /// Sets the capacity of this set to hold up a specified number of elements
        /// without any further expansion of its backing storage.
        /// </summary>
        public void TrimExcess(int capacity);
    }

    public partial class Dictionary<TKey, TValue>
    {
        /// <summary>
        /// Gets the total number of elements the internal data structure can hold
        /// without resizing.
        /// </summary>
        public int Capacity { get; }
    }

    public partial class Queue<T>
    {
        public int Capacity { get; }
        public void TrimExcess(int capacity);
    }

    public partial class Stack<T>
    {
        public int Capacity { get; }
        public void TrimExcess(int capacity);
    }
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Apr 27, 2023
@hrrrrustic
Copy link
Contributor

What about PriorityQueue? It also has EnsureCapacity and TrimExsess(), does it make sense to include it to this API for consistency?

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label May 6, 2023
@jeffhandley jeffhandley modified the milestones: Future, 9.0.0 Aug 10, 2023
@ghost ghost added in-pr There is an active PR which will close this issue when it is merged and removed in-pr There is an active PR which will close this issue when it is merged labels Nov 22, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Nov 22, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Dec 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Collections
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants