Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: slices: add a reusable slices.Contains() #66022

Open
cuishuang opened this issue Feb 29, 2024 · 14 comments
Open

proposal: slices: add a reusable slices.Contains() #66022

cuishuang opened this issue Feb 29, 2024 · 14 comments
Labels
Milestone

Comments

@cuishuang
Copy link
Contributor

cuishuang commented Feb 29, 2024

Proposal Details

Since the introduction of the slices package into the standard library, slices.Contains() has become one of its most frequently used functions. However, this function is merely a simple wrapper around slices.Index(), which iterates through the slice to find the result, leading to a time complexity of O(n).

While slices.Index() provides more information (the position of the element within the slice), slices.Contains() only needs to know whether the element is present in the slice. Therefore, there are more efficient ways than wrapping slices.Index(), especially since the exact position of the element is redundant information for slices.Contains(). This is particularly true when dealing with a vast number of elements and multiple loops, as seen in:

(I have also looked into bytes/strings.Contains(), which are wrappers around bytes/strings.Index() too, but unlike slices.Index(), strings.Index() does not simply iterate through but adjusts its approach based on the size of the parameters.)

func bar() {
    var slice []int

    rand.Seed(time.Now().UnixNano())

    for i := 0; i < 100_000_000; i++ {
        slice = append(slice, rand.Intn(100_000_000))
    }

    var existSli []int

    for i := 0; i < 100_000; i++ {
        for j := 0; j < 100_000; j++ {

            exist := slices.Contains(slice, j)
            if exist {
                existSli = append(existSli, j)
            }
        }
    }
}

There are two approaches:

  1. If the slice is ordered, then use binary search with a time complexity of O(logn). However, sorting incurs additional overhead.

(For more details, refer to sort: add Find #50340. Although not highly relevant, it can provide some ideas)

  1. Use a hash map, turning slice elements into map keys, which can reduce the time complexity to O(1). However, it's important to note that creating and maintaining a map incurs higher overhead than using a slice. Therefore, for smaller n values and only use once case, using a map might not be more efficient.

My initial attempt to optimize slices.Contains() involved using either binary search or a map. However, the issue is that slices.Contains() is only used once, lacking memorability, and a single use is far from offsetting the sorting overhead of binary search and the high cost of creating and maintaining a map.

For this common scenario, I suggest adding a new function, slices.ContainsUseManyTimes(), which takes a slice as an input and returns a function, similar to func ContainsUseManyTimes(sli []int64) func(int64) bool (though the function name could be more elegant).

This way, for the initial scenario mentioned, we can:

func bar() {
	var slice []int

	rand.Seed(time.Now().UnixNano())

	for i := 0; i < 100_000_000; i++ {
		slice = append(slice, rand.Intn(100_000_000))
	}

	f := ContainsUseManyTimes(slice)

	var existSli []int

	for i := 0; i < 100_000; i++ {
		for j := 0; j < 100_000; j++ {

			if f(j) {
				existSli = append(existSli, j)
			}
		}
	}

}

func ContainsUseManyTimes(sli []int64) func(int64) bool {

	var sortSli []int64
	copy(sortSli, sli)

	sort.Slice(sortSli, func(i, j int) bool { return sortSli[i] < sortSli[j] })

	return func(target int64) bool {
		low := 0
		high := len(sortSli) - 1

		for low <= high {
			mid := low + (high-low)/2
			midVal := sortSli[mid]

			if midVal < target {
				low = mid + 1
			} else if midVal > target {
				high = mid - 1
			} else {
				return true
		}
		return false
	}
}

Regarding performance:

ContainsUseManyTimes could use either binary search or a map, but performance tests show that using a map leads to a sudden performance drop with 9 million elements (the reason is still under consideration), while binary search remains efficient and stable.

Here is a simplified performance test:

contains.go:

func ContainsFuncByMap(sli []int64) func(int64) bool {

	m := make(map[int64]struct{})
	for _, v := range sli {
		m[v] = struct{}{}
	}

	return func(target int64) bool {
		if _, ok := m[target]; ok {
			return true
		} else {
			return false
		}
	}
}


func ContainsFuncBinarySearch(sli []int64) func(int64) bool {

	sortSli := make([]int64, len(sli))
	copy(sortSli, sli)

	sort.Slice(sortSli, func(i, j int) bool { return sortSli[i] < sortSli[j] })

	return func(target int64) bool {
		low := 0
		high := len(sortSli) - 1

		for low <= high {
			mid := low + (high-low)/2
			midVal := sortSli[mid]

			if midVal < target {
				low = mid + 1
			} else if midVal > target {
				high = mid - 1
			} else {
				return true
			}
		}
		return false
	}
}

contains_test.go:

func BenchmarkContains(b *testing.B) {
	for i := 0; i < b.N; i++ {
		slices.Contains(sli, target)
	}
}

func BenchmarkContainsFuncByMap(b *testing.B) {
	f := ContainsFuncByMap(sli)
	for i := 0; i < b.N; i++ {
		f(target)
	}
}

func BenchmarkContainsFuncBinarySearch(b *testing.B) {
	f := ContainsFuncBinarySearch(sli)
	for i := 0; i < b.N; i++ {
		f(target)
	}
}
➜  map-vs-slice git:(main) ✗ go test -test.bench=".*" -benchmem    (N is the length of slice, N=100)
goos: darwin
goarch: arm64
pkg: mapvsslice
BenchmarkContains-8                     35311144                33.32 ns/op            0 B/op          0 allocs/op
BenchmarkContainsFuncByMap-8            173432636                6.919 ns/op           0 B/op          0 allocs/op
BenchmarkContainsFuncBinarySearch-8     524108920                2.262 ns/op           0 B/op          0 allocs/op
PASS
ok      mapvsslice      5.750s
➜  map-vs-slice git:(main) ✗ go test -test.bench=".*" -benchmem     (N=10000)
goos: darwin
goarch: arm64
pkg: mapvsslice
BenchmarkContains-8                       373540              3171 ns/op               0 B/op          0 allocs/op
BenchmarkContainsFuncByMap-8            172989003                6.936 ns/op           0 B/op          0 allocs/op
BenchmarkContainsFuncBinarySearch-8     535104032                2.329 ns/op           0 B/op          0 allocs/op
PASS
ok      mapvsslice      5.650s
➜  map-vs-slice git:(main) ✗ go test -test.bench=".*" -benchmem  (N=1000000)
goos: darwin
goarch: arm64
pkg: mapvsslice
BenchmarkContains-8                         3715            322815 ns/op               0 B/op          0 allocs/op
BenchmarkContainsFuncByMap-8            160653786                7.422 ns/op           0 B/op          0 allocs/op
BenchmarkContainsFuncBinarySearch-8     505987872                2.249 ns/op           0 B/op          0 allocs/op
PASS
ok      mapvsslice      6.672s
➜  map-vs-slice git:(main) ✗ go test -test.bench=".*" -benchmem   (N=100000000)
goos: darwin
goarch: arm64
pkg: mapvsslice
BenchmarkContains-8                           92          12866297 ns/op               0 B/op          0 allocs/op
BenchmarkContainsFuncByMap-8                   1        12020805167 ns/op       3088798200 B/op  1518423 allocs/op
BenchmarkContainsFuncBinarySearch-8     529186568                2.314 ns/op           0 B/op          0 allocs/op
PASS
ok      mapvsslice      18.064s

Note: The above benchmark does not cover slice sorting or the cost of creating a map. The essence of this optimization is when it is necessary to determine whether M elements are in a slice of N elements multiple times (M and N are very large), the one-time overhead of slice sorting and map creation and maintenance is shared through the number of times. Therefore, depending on the size of N, M, and element type, there is a turning point. Only when this point is met, the effect of using this method Only then will it be absolutely improved

Final code, similar to the CL https://go-review.googlesource.com/c/go/+/568095

Suggestions and comments are welcome.

@gopherbot gopherbot added this to the Proposal milestone Feb 29, 2024
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/568095 mentions this issue: slices: add a reusable slices.Contains()

@jaloren
Copy link

jaloren commented Feb 29, 2024

Can’t this already be achieved by calling slices.Sort and then slices.BinarySearch? Binary search also assumes the slice is orderable while slices.Contains just requires comparability.

@cuishuang
Copy link
Contributor Author

Can’t this already be achieved by calling slices.Sort and then slices.BinarySearch? Binary search also assumes the slice is orderable while slices.Contains just requires comparability.

The key to the problem is reusability, which seems not to be possible using slices.BinarySearch.

@hishope
Copy link

hishope commented Feb 29, 2024

This approach seems practical. I frequently encounter a similar scenario: I have two substantial slices, denoted as 'a' and 'b'. The elements in 'a' are instances of the UserInfo structure, while 'b' consists of string names. I need to identify which elements in 'a' have a UserInfo.Name that matches any entry in 'b'. Initially, I employed a method involving two nested loops, utilizing slices.Contains() within the inner loop.

I noticed a significant performance drawback with this approach.Subsequently, I experimented with employing a map or binary search at the outer layer to enhance efficiency. While this alternative did enhance performance, it introduced additional code that lacked the same level of intuitiveness (more intricate than using Contains).

It would be great if there was a similar method in the standard library. But I feel that for different scenarios, it is difficult to ensure that its performance is better than using Contains multiple times in a loop.

@ianlancetaylor ianlancetaylor moved this to Incoming in Proposals Mar 1, 2024
@ianlancetaylor
Copy link
Member

This seems like it will have surprising performance characteristics. Copying a large slice, whether into another slice or a map, reuses a lot of memory. That does not seem implied by a name like ContainsUseManyTimes.

Also, of course, we can only use binary search on a slice for an ordered type, but Contains permits any comparable type.

In general it seems simpler to have a function that converts a slice to a map (as in the functions in #61899 and #61900) and then just use the map. That is explicit about what is happening and seems about as easy to use as this proposal.

@cuishuang
Copy link
Contributor Author

This approach seems practical. I frequently encounter a similar scenario: I have two substantial slices, denoted as 'a' and 'b'. The elements in 'a' are instances of the UserInfo structure, while 'b' consists of string names. I need to identify which elements in 'a' have a UserInfo.Name that matches any entry in 'b'. Initially, I employed a method involving two nested loops, utilizing slices.Contains() within the inner loop.

I noticed a significant performance drawback with this approach.Subsequently, I experimented with employing a map or binary search at the outer layer to enhance efficiency. While this alternative did enhance performance, it introduced additional code that lacked the same level of intuitiveness (more intricate than using Contains).

It would be great if there was a similar method in the standard library. But I feel that for different scenarios, it is difficult to ensure that its performance is better than using Contains multiple times in a loop.

Thanks for the comment. This situation does occur very frequently in the code. The existing method does not perform well enough or requires additional complexity. However, it is indeed difficult to accurately measure the performance

@cuishuang
Copy link
Contributor Author

cuishuang commented Mar 1, 2024

This seems like it will have surprising performance characteristics. Copying a large slice, whether into another slice or a map, reuses a lot of memory. That does not seem implied by a name like ContainsUseManyTimes.

Also, of course, we can only use binary search on a slice for an ordered type, but Contains permits any comparable type.

In general it seems simpler to have a function that converts a slice to a map (as in the functions in #61899 and #61900) and then just use the map. That is explicit about what is happening and seems about as easy to use as this proposal.

Thanks for the comments and guidance, I did initially consider using map instead of binary search because map has the smallest time complexity. But after benchmark, it seems that binary search performance is better and stable.

Maybe I should continue to locate the reason for the sudden deterioration of map performance and wait for the conclusions of #61899 and #61900.

@thepudds
Copy link
Contributor

thepudds commented Mar 1, 2024

Hi @cuishuang, also, if a generic set ends up landing in the standard library (#47331), that could be a simple and intuitive way to do this, including I would guess there would be an easy way to turn a slice into a set.

@cuishuang
Copy link
Contributor Author

Hi @cuishuang, also, if a generic set ends up landing in the standard library (#47331), that could be a simple and intuitive way to do this, including I would guess there would be an easy way to turn a slice into a set.

Thanks! I have been thinking about how to reduce the cost of writing elements in a slice to a map or set. This should be the biggest cost when using hash (but overall, it should be better than the cost of sorting the slice)

@jaloren
Copy link

jaloren commented Mar 1, 2024

For constant time performance, sets are typically implemented internally as a hash map. If that ends up being the case in go, then you’d end up with the same performance concerns.

@earthboundkid
Copy link
Contributor

Also there should be containers/heap/v2. If anyone has time, they can write up the proposal for that, which would basically be a similar API but generic and with a rangefunc iterator.

@cespare
Copy link
Contributor

cespare commented Mar 1, 2024

@earthboundkid that sounds like #47632, or a mutually exclusive alternative to it. I'm not convinced it would need a range func iterator, but we can discuss on that issue if you like.

@adonovan
Copy link
Member

adonovan commented Mar 4, 2024

If you care about performance and your algorithm is repeatedly searching a slice performing membership tests, you're using the wrong data structure. The solution is to use the right data structure (a map, as others have pointed out), not to make the slow operation on the wrong data structure more complicated.

@cuishuang
Copy link
Contributor Author

If you care about performance and your algorithm is repeatedly searching a slice performing membership tests, you're using the wrong data structure. The solution is to use the right data structure (a map, as others have pointed out), not to make the slow operation on the wrong data structure more complicated.

Thanks, I will change the binary search method to map method later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Incoming
Development

No branches or pull requests

9 participants