Refactor IntSets #20456

mbauman · 2017-02-04T22:22:35Z

* The element type of `IntSet` is now `Int`. IntSets now only support integers in the range `1:typemax(Int)`. This is technically a breaking change on 32-bit systems, where it was previously possible to store some Int64s larger than `typemax(Int32)`. Note that all documentation currently says that IntSet stores Ints, and 32-bit systems currently fail pretty spectacularly for integers larger than 2^36 (a 2GB data structure). So this breakage is pretty oblique.
* Complete deprecation of stored zeros.
* Complete deprecation of `complement`; removes all support for inverted IntSets.
* Refactor internals to rely on a BitVector, allowing the use of highly optimized `map` methods. `IntSet` is now immutable. This significantly improves performance across varying [densities](http://imgur.com/a/uqv8A) and [sizes](http://imgur.com/a/iEgcr). These are compared against a modified Base with deprecation warnings removed for a fairer comparison. Testing code [available here](https://github.com/mbauman/IntSets.jl/tree/b50a7c97abbe9786e33221f723e107e266f31fe4/test).
* Add more tests and organize into testsets.
* Improve hashing; `hash(IntSet([1]))` is now distinct from `hash(IntSet([65]))`.
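
To make the BitVector-backed design concrete, here is a minimal sketch of a union over two bit vectors of different lengths. It is not the code from this PR: `union_bits!` is a hypothetical helper, and it is written for current Julia (1.x) rather than the 0.6-era syntax in the diff.

```julia
# Hypothetical helper (not from the PR): OR the bits of `src` into `dest`,
# growing `dest` first when `src` is longer.
function union_bits!(dest::BitVector, src::BitVector)
    old = length(dest)
    if length(src) > old
        resize!(dest, length(src))
        dest[old+1:end] .= false      # clear the newly added tail explicitly
    end
    n = length(src)
    dest[1:n] .= dest[1:n] .| src     # bitwise OR over the shared prefix
    return dest
end

a = BitVector([true, false, true])                # represents {1, 3}
b = BitVector([false, true, false, false, true])  # represents {2, 5}
union_bits!(a, b)
findall(a)   # → [1, 2, 3, 5]
```

Each element `n` of the set corresponds to bit `n` being set, so set operations reduce to bitwise operations on the backing vectors.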

This is a continuation of #10065. Now that complements are fully removed, making IntSet immutable solves the performance issue. I am keeping the name the same within this PR as it vastly simplifies comparisons between the two implementations; the name can later be changed to IndexSet if still desired. The naming story is now a bit more complicated since we support offset indices, but a future change could perhaps allow wrapping any AbstractVector{Bool} and base the supported Ints on those indices. Very few methods depend upon BitArray internals.

I was surprised how easy it was to revive my IntSets package. The diffstat is deceiving:

$ git diff base | diffstat -s
 5 files changed, 183 insertions(+), 291 deletions(-)
$ git diff test | diffstat -s
 1 file changed, 268 insertions(+), 127 deletions(-)

ararslan · 2017-02-04T22:39:05Z

base/intset.jl

-    IntSet() = new(zeros(UInt32,256>>>5), 256, false)
+immutable IntSet <: AbstractSet{Int}
+    bits::BitVector
+    IntSet() = new(fill!(BitVector(256), false))


You should be able to just do new(falses(256)), as falses returns a BitVector.
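
A quick check of that suggestion, written for current Julia (1.x); the variable name `bits` is only for illustration.

```julia
bits = falses(256)          # all 256 bits start cleared
@assert bits isa BitVector  # no separate fill! is needed
@assert length(bits) == 256
@assert !any(bits)
```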

ararslan · 2017-02-04T22:40:40Z

base/intset.jl

+# An internal function for setting the inclusion bit for a given integer n >= 0
+@inline function _setint!(s::IntSet, idx::Integer, b::Bool)
+    if idx > length(s.bits)
+        !b && return s # setting a bit to zero outside the set's bits is a no-op


b || return s avoids the negation, though I'm sure that's so insignificant that it isn't even worth mentioning

ararslan · 2017-02-04T22:41:29Z

base/intset.jl

+
+# An internal function that takes a pure function `f` and maps across two BitArrays
+# allowing the lengths to be different and altering b1 with the result
+function _matched_map!{F}(f::F, b1::BitArray, b2::BitArray)


The F parameter shouldn't be necessary I would think?

We definitely want these functions to inline and specialize; the type parameter ensures that happens.

It does? I didn't think there would be a difference here. Is there anywhere I can read about this in the docs?

Sorry to butt in, this was just very surprising to me.

Hm. There is some sort of difference here that I thought I had measured before… but in this case we're just punting to map so it's totally overkill. I'll ditch the specialization.

Yeah I thought the type parameters in cases such as this used to make a difference but at some point something changed so that they didn't matter, but I don't have anything concrete to point to.

@iamed2 AFAIK the rule is that method specialization on function arguments happens only when 1) they are called from the body of the method, or 2) a type parameter is explicitly used. I'm not sure whether that's mentioned in the manual.

@nalimilan Thanks. It makes sense to leave certain compiler implementation details out of the manual if they're expected to change, but if this is a concern for performance it should maybe get a mention?

As I said, it might already be in the manual, so better check that first.

Ref. #19137 re. specialization on function arguments. Best!
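
For readers following this thread, a small sketch of the two cases in current Julia (1.x) syntax, where `{F}` on the method name is written as `where {F}`. The function names `passthrough` and `forced` are made up for illustration; both return the same result, and the difference is only in how the compiler may specialize.

```julia
# The function argument is only passed through to another call,
# so Julia may choose not to specialize this method on it.
passthrough(f, xs) = map(f, xs)

# An explicit type parameter asks for specialization on the
# concrete type of `f`.
forced(f::F, xs) where {F} = map(f, xs)

passthrough(abs, [-1, 2, -3])   # → [1, 2, 3]
forced(abs, [-1, 2, -3])        # → [1, 2, 3]
```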

ararslan · 2017-02-04T22:44:18Z

base/intset.jl

-
-"""
-    symdiff!(s1, s2)
+length(s::IntSet) = sum(s.bits)


countnz might be more efficient for BitVectors than sum, but I'm not entirely sure.

It's exactly the same, sum is just defined as countnz.
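
As a small illustration in current Julia (1.x), where `countnz` no longer exists and `count` is the closest equivalent, both calls give the number of set bits:

```julia
bits = falses(10)
bits[[2, 5, 7]] .= true     # set three bits
sum(bits), count(bits)      # → (3, 3)
```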

ararslan · 2017-02-04T22:45:32Z

base/intset.jl

+    end
+    while i > 0
+        h = hash(bc[i], h)
+	i -= 1


This is indented with a single tab but it should be a multiple of 4 spaces

ararslan · 2017-02-04T22:46:15Z

test/intset.jl

+    data_in = (1,5,100)
+    s = IntSet(data_in)
+    data_out = collect(s)
+    @test all(map(d->in(d,data_out), data_in))


all takes a function, so you can do all(d->in(d, data_out), data_in)
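
A minimal sketch of the suggested form, using a plain vector as a stand-in for `collect(s)` so that it runs on current Julia (1.x) without `IntSet`:

```julia
data_in = (1, 5, 100)
data_out = [1, 5, 100]                  # stand-in for collect(s)

all(d -> in(d, data_out), data_in)      # → true, without the intermediate result of map
all(in(data_out), data_in)              # → true, same check via the one-argument form of `in`
```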

carlobaldassi · 2017-02-04T22:31:24Z

base/inference.jl

-                min_pc = next(W, Int64(pc) + 1)[1]
-                if min_pc >= W.limit
+                min_pc = next(W, Int64(pc))[2]
+                if done(W, min_pc)


Is this change related to this PR?

Unfortunately yes. This code depends upon the internal iteration state.

carlobaldassi · 2017-02-04T22:32:08Z

base/intset.jl

-    IntSet() = new(zeros(UInt32,256>>>5), 256, false)
+immutable IntSet <: AbstractSet{Int}
+    bits::BitVector
+    IntSet() = new(fill!(BitVector(256), false))


falses(256)?

carlobaldassi · 2017-02-04T22:37:55Z

base/intset.jl

-    end
-    print(io, "])")
+copy(s1::IntSet) = copy!(IntSet(), s1)
+function copy!(to::IntSet, from::IntSet)


Throughout Base, the arguments are called dest and src. It may be a good time to use them in IntSet too.

carlobaldassi · 2017-02-04T22:41:28Z

base/intset.jl

    end
+    unsafe_setindex!(s.bits, b, idx) # Use @inbounds once available


@inbounds could be used now, right?

Much nicer!

carlobaldassi · 2017-02-04T22:44:03Z

base/intset.jl

+@inline function _resize0!(b::BitVector, newlen::Integer)
+    len = length(b)
+    resize!(b, newlen)
+    len < newlen && unsafe_setindex!(b, false, len+1:newlen) # resize! gives dirty memory


@inbounds could be used here too.
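
A sketch of what that might look like in current Julia (1.x). `resize_zeroed!` is a hypothetical name standing in for `_resize0!`; it clears the new tail explicitly rather than assuming `resize!` does so.

```julia
# Hypothetical counterpart of `_resize0!`: grow or shrink `b`,
# making sure any newly added bits are false.
function resize_zeroed!(b::BitVector, newlen::Integer)
    len = length(b)
    resize!(b, newlen)
    if len < newlen
        # the range len+1:newlen is valid after resize!, so skip bounds checks
        @inbounds b[len+1:newlen] .= false
    end
    return b
end

b = trues(3)
resize_zeroed!(b, 200)
count(b)    # → 3, only the original bits are set
```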

carlobaldassi · 2017-02-04T23:01:37Z

base/intset.jl

-        n = Int64(i)
+@inline function in(n::Integer, s::IntSet)
+    if 1 <= n <= length(s.bits)
+        unsafe_getindex(s.bits, n)


@inbounds return s.bits[n]?
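
Spelled out as a standalone sketch for current Julia (1.x), operating directly on a BitVector; `has_bit` is a hypothetical name, not the PR's `in` method:

```julia
# Membership test: the explicit range check makes the @inbounds access safe.
function has_bit(bits::BitVector, n::Integer)
    if 1 <= n <= length(bits)
        @inbounds return bits[n]
    end
    return false    # anything outside 1:length(bits) is not in the set
end

bits = BitVector([true, false, true])
has_bit(bits, 3), has_bit(bits, 2), has_bit(bits, 10)   # → (true, false, false)
```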

carlobaldassi · 2017-02-04T23:07:36Z

base/intset.jl

+    end
+    while i > 0
+        h = hash(bc[i], h)
+	i -= 1


There seems to be a tab instead of spaces here.

carlobaldassi · 2017-02-04T23:16:56Z

base/intset.jl

 end
+eltype(s::IntSet) = Int
+sizehint!(s::IntSet, n::Integer) = (_resize0!(s.bits, n); s)


Isn't this going to resize the set no matter what? I think it needs to resize to isempty(s) ? n : max(n, last(s)); we don't want to throw away elements larger than n.

Very good catch. Fixed and tested.
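
The rule from this thread can be sketched on a bare BitVector in current Julia (1.x). `safe_newlen` is a hypothetical helper for illustration, not the fix that was pushed.

```julia
# Hypothetical helper: the length to resize to, never dropping below
# the largest element currently stored.
function safe_newlen(bits::BitVector, n::Integer)
    last_set = findlast(bits)             # `nothing` when no bit is set
    return last_set === nothing ? n : max(n, last_set)
end

safe_newlen(BitVector([true, false, true, false]), 2)   # → 3, element 3 is kept
safe_newlen(falses(10), 4)                              # → 4, empty set shrinks freely
```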

mbauman · 2017-02-05T00:08:11Z

Thank you @carlobaldassi and @ararslan. I've pushed a change that addresses your comments and hopefully makes 32-bit CI a bit happier. Unfortunately, it does make this an obliquely breaking PR: the eltype is now Int32 on 32-bit systems, meaning that you can no longer store values bigger than typemax(Int32).

JeffBezanson · 2017-02-05T18:07:20Z

base/intset.jl

    else
-        n = Int64(ccall(:bitvector_next, UInt64, (Ptr{UInt32}, UInt64, UInt64), s.bits, i, s.limit))


I think most of the C functions used in here can be deleted from support/bitvector.* now.

I was hoping so myself, but I think it's used by flisp in the parser.

Some of the functions are used by flisp, but most aren't. I think everything after bitvector_get in bitvector.c can be deleted.

Ah, nice, thanks!

ararslan reviewed Feb 4, 2017

carlobaldassi reviewed Feb 4, 2017

Address review comments; use === where possible

bb3198a

mbauman added the "breaking" (This change will break code) and "collections" (Data structures holding multiple items, e.g. sets) labels Feb 5, 2017

JeffBezanson reviewed Feb 5, 2017

JeffBezanson added this to the 0.6.0 milestone Feb 5, 2017

mbauman added 2 commits February 5, 2017 22:28

Remove no-longer needed C bitvector functions

bb78c47

Remove TypeVar in _matched_map!

651d3df

JeffBezanson merged commit 40831ac into master Feb 7, 2017

mbauman deleted the mb/intset-refactor branch February 7, 2017 16:34

mbauman added a commit that referenced this pull request Feb 8, 2017

Rename IntSet to PositiveIntSet

800fc2e

Following #20456 this is a better and more accurate name.

mbauman mentioned this pull request Feb 8, 2017

Rename IntSet to PositiveIntSet #20512

Closed

mbauman mentioned this pull request Mar 3, 2017

Arraypocalypse Now and Then #13157

Closed

27 tasks

rfourquet mentioned this pull request Sep 2, 2017

fix IntSet tests invalid since #23138 #23559

Merged
