proposal: slices: add a reusable slices.Contains() #66022
Comments
Change https://go.dev/cl/568095 mentions this issue.
Can’t this already be achieved by calling slices.Sort and then slices.BinarySearch? Binary search also assumes the slice is orderable, while slices.Contains just requires comparability.
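For reference, the pattern this comment describes might look like the following minimal sketch (illustrative data; note that slices.Sort mutates the slice and requires an ordered element type):

```go
package main

import (
	"fmt"
	"slices"
)

func main() {
	s := []int64{42, 7, 19, 3}
	slices.Sort(s) // one-time O(n log n) cost; mutates s

	// Each subsequent lookup is O(log n) instead of O(n).
	_, found := slices.BinarySearch(s, int64(19))
	fmt.Println(found) // true
}
```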
The key to the problem is reusability, which does not seem achievable with slices.BinarySearch.
This approach seems practical. I frequently encounter a similar scenario: I have two substantial slices, 'a' and 'b'. The elements of 'a' are instances of a UserInfo struct, while 'b' consists of string names. I need to identify which elements of 'a' have a UserInfo.Name matching any entry in 'b'. Initially I used two nested loops, calling slices.Contains() in the inner loop, and noticed a significant performance drawback. Subsequently I experimented with a map or binary search at the outer layer to improve efficiency. That did improve performance, but it introduced additional code that was less intuitive (more intricate than using Contains). It would be great if there were a similar method in the standard library, but I suspect that, across different scenarios, it is difficult to guarantee it performs better than calling Contains repeatedly in a loop.
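To illustrate the trade-off this comment describes, here is a sketch of both versions (UserInfo and all data are hypothetical reconstructions, not the commenter's actual code):

```go
package main

import (
	"fmt"
	"slices"
)

type UserInfo struct{ Name string } // hypothetical type reconstructed from the comment

func main() {
	a := []UserInfo{{Name: "alice"}, {Name: "bob"}, {Name: "carol"}}
	b := []string{"bob", "carol"}

	// Intuitive but O(len(a) * len(b)): Contains re-scans b for every element of a.
	var matched []UserInfo
	for _, u := range a {
		if slices.Contains(b, u.Name) {
			matched = append(matched, u)
		}
	}

	// Faster for large inputs but less direct: one O(len(b)) pass to build a
	// set, then O(1) lookups.
	set := make(map[string]struct{}, len(b))
	for _, name := range b {
		set[name] = struct{}{}
	}
	var matched2 []UserInfo
	for _, u := range a {
		if _, ok := set[u.Name]; ok {
			matched2 = append(matched2, u)
		}
	}

	fmt.Println(matched, matched2)
}
```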
This seems like it will have surprising performance characteristics. Copying a large slice, whether into another slice or a map, uses a lot of memory. That does not seem implied by a name like ContainsUseManyTimes. Also, of course, we can only use binary search on a slice of an ordered type, whereas Contains only requires comparability. In general it seems simpler to have a function that converts a slice to a map (as in the functions in #61899 and #61900) and then just use the map. That is explicit about what is happening and seems about as easy to use as this proposal.
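A slice-to-map conversion in that spirit might look like the sketch below; sliceToSet is a hypothetical name chosen for illustration, not the API actually proposed in #61899 or #61900:

```go
package main

import "fmt"

// sliceToSet is a hypothetical helper: it copies the slice's elements into a
// map once, making the cost of that copy explicit at the call site.
func sliceToSet[T comparable](s []T) map[T]struct{} {
	m := make(map[T]struct{}, len(s))
	for _, v := range s {
		m[v] = struct{}{}
	}
	return m
}

func main() {
	set := sliceToSet([]string{"a", "b", "c"})
	_, ok := set["b"] // O(1) membership test after the one-time conversion
	fmt.Println(ok)   // true
}
```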
Thanks for the comment. This situation does occur very frequently in code. The existing methods either do not perform well enough or require additional complexity. However, it is indeed difficult to measure the performance accurately.
Thanks for the comments and guidance. I did initially consider using a map instead of binary search, because a map has the lowest time complexity. But after benchmarking, binary search performance seems better and more stable. Maybe I should keep investigating the reason for the sudden deterioration in map performance and wait for the conclusions of #61899 and #61900.
Hi @cuishuang, also, if a generic set ends up landing in the standard library (#47331), that could be a simple and intuitive way to do this; I would guess there would be an easy way to turn a slice into a set.
Thanks! I have been thinking about how to reduce the cost of writing a slice's elements into a map or set. That should be the biggest cost when using a hash (though overall it should still be lower than the cost of sorting the slice).
For constant-time performance, sets are typically implemented internally as a hash map. If that ends up being the case in Go, then you’d end up with the same performance concerns.
Also, there should be a container/heap/v2. If anyone has time, they can write up the proposal for that, which would basically be a similar API but generic and with a rangefunc iterator.
@earthboundkid that sounds like #47632, or a mutually exclusive alternative to it. I'm not convinced it would need a range func iterator, but we can discuss on that issue if you like. |
If you care about performance and your algorithm repeatedly searches a slice to perform membership tests, you're using the wrong data structure. The solution is to use the right data structure (a map, as others have pointed out), not to make the slow operation on the wrong data structure more complicated.
Thanks, I will change the binary search method to the map method later.
Proposal Details
Since the introduction of the slices package into the standard library, slices.Contains() has become one of its most frequently used functions. However, this function is merely a simple wrapper around slices.Index(), which iterates through the slice to find the result, leading to a time complexity of O(n).
While slices.Index() provides more information (the position of the element within the slice), slices.Contains() only needs to know whether the element is present. Therefore, there are more efficient options than wrapping slices.Index(), since the exact position is redundant information for slices.Contains(). This matters most when dealing with a vast number of elements and repeated loops, as in the following kind of scenario:
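(The example originally shown here did not survive extraction; the following sketch reconstructs the shape of the scenario described, with illustrative names and data:)

```go
package main

import (
	"fmt"
	"slices"
)

func main() {
	// With len(needles) == M and len(haystack) == N, the loop below performs
	// M full O(N) scans of haystack: O(M*N) work overall.
	haystack := []int64{5, 3, 8, 1, 9}
	needles := []int64{3, 9, 7}

	for _, v := range needles {
		fmt.Println(v, slices.Contains(haystack, v))
	}
}
```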
(I have also looked into bytes/strings.Contains(), which are likewise wrappers around bytes/strings.Index(); but unlike slices.Index(), strings.Index() does not simply iterate through the input, instead adjusting its approach based on the size of the parameters.)
There are two approaches: binary search, or a map.
(For more details, refer to sort: add Find #50340. Although not highly relevant, it can provide some ideas.)
My initial attempt to optimize slices.Contains() involved using either binary search or a map. However, the issue is that each slices.Contains() call is used only once and retains no state between calls, so a single use is far from offsetting the sorting overhead of binary search or the high cost of creating and maintaining a map.
For this common scenario, I suggest adding a new function, slices.ContainsUseManyTimes(), which takes a slice as input and returns a function, similar to:
func ContainsUseManyTimes(sli []int64) func(int64) bool
(though the function name could be more elegant). This way, for the initial scenario mentioned, we can do the following:
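The call pattern would presumably look something like this (ContainsUseManyTimes does not exist in the standard library; this fragment, reusing the haystack/needles names from the sketch above, only illustrates the shape of the proposed API):

```go
// Pay the preprocessing cost (sorting or map construction) exactly once.
contains := slices.ContainsUseManyTimes(haystack)

// Then every membership test reuses that work: O(log N) or O(1) per call,
// depending on the implementation, instead of a fresh O(N) scan.
for _, v := range needles {
	if contains(v) {
		// handle the match
	}
}
```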
Regarding performance:
ContainsUseManyTimes could use either binary search or a map, but performance tests show that the map version suffers a sudden performance drop at around 9 million elements (the reason is still under investigation), while binary search remains efficient and stable.
Here is a simplified performance test:
contains.go:
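The file contents were lost in extraction; a minimal binary-search-based sketch consistent with the signature above might be:

```go
// contains.go — illustrative reconstruction, not the author's original code.
package contains

import "slices"

// ContainsUseManyTimes sorts a copy of sli once (O(N log N)) and returns a
// closure that answers each membership query in O(log N).
func ContainsUseManyTimes(sli []int64) func(int64) bool {
	sorted := slices.Clone(sli) // copy so the caller's slice is left untouched
	slices.Sort(sorted)
	return func(x int64) bool {
		_, found := slices.BinarySearch(sorted, x)
		return found
	}
}
```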
contains_test.go:
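Likewise, the benchmark file did not survive; a sketch of the comparison it presumably made (sizes and data are illustrative):

```go
// contains_test.go — illustrative reconstruction of the benchmark's shape.
package contains

import (
	"math/rand"
	"slices"
	"testing"
)

const n = 1 << 20 // haystack size; the real test presumably varied this

var haystack = func() []int64 {
	s := make([]int64, n)
	for i := range s {
		s[i] = rand.Int63()
	}
	return s
}()

func BenchmarkContains(b *testing.B) {
	for i := 0; i < b.N; i++ {
		slices.Contains(haystack, int64(i)) // O(N) scan on every call
	}
}

func BenchmarkContainsUseManyTimes(b *testing.B) {
	contains := ContainsUseManyTimes(haystack) // one-time O(N log N) setup
	b.ResetTimer()                             // exclude setup from the timing
	for i := 0; i < b.N; i++ {
		contains(int64(i)) // O(log N) per call
	}
}
```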
Note: the above benchmark does not cover slice sorting or the cost of creating the map. The essence of this optimization is that when we must determine, many times, whether each of M elements is in a slice of N elements (with M and N both very large), the one-time overhead of sorting the slice or building and maintaining the map is amortized over the number of lookups. Roughly, sort-plus-binary-search costs O(N log N + M log N) versus O(M*N) for repeated Contains calls, so depending on N, M, and the element type there is a break-even point; only beyond that point is this approach an absolute improvement.
The final code is similar to the CL: https://go-review.googlesource.com/c/go/+/568095
Suggestions and comments are welcome.