IntSet should be a wrapper around a BitArray #3039

StefanKarpinski · 2013-05-07T03:38:00Z

This will, in particular, allow the primes function to construct the sequence of integers as a BitArray and then return an IntSet, which is a more desirable API.

quinnj · 2014-08-08T11:01:30Z

IntSet already wraps an array of bits or were you thinking it should wrap a literal BitArray?

StefanKarpinski · 2014-08-08T14:58:05Z

Yes, the idea was to use a BitArray as the internal representation of IntSet.

ggggggggg · 2014-08-21T23:52:43Z

I took a swing at this trying to use only the public api (not accessing the fields) of BitArray. It was pretty easy, and reduces the amount of code used. However many operations are slower. I believe this is mostly due to the fact that operations like b[1:n] &= b2[1:n] with b and b2 as BitVectors make copies of parts of the underlying chunks vector in the BitVectors. Previously these were implemented as loops over a Vector{Uint32} with no allocation.

The paths forward would be to try to speed up b[a:b] &= b2[a:b] like operations for BitVector or to
reproduce the specialized loops in IntSet by accessing .chunks inside of BitArray. I'm pretty sure I could do the second one, the first sounds harder but potentially more generally useful. Thoughts?

Iteration is slightly faster, which is surprising given that in IntSet next is based on a ccall, and I had to add a few more if statements into next. That suggest the julia implementation of findnext(x::BitVector) may be faster than the c version of bitvector_next. Maybe because it works on Uint64 rather than Uint32?

Here is a gist with my code so far: https://gist.github.com/ggggggggg/46a7e57ccff05d3a1558

a = IntSetBitVector(rand(1:10000000,1000));
b = IntSetBitVector(rand(1:10000000,1000));
println("IntSetBitVector")
@time [a==b for j = 1:100];
@time [union(a,b) for j=1:100];
@time [for i in a end for j=1:1000];
c = IntSet(rand(1:10000000,1000));
d = IntSet(rand(1:10000000,1000));
println("IntSet")
@time [c==d for j = 1:100];
@time [union(c,d) for j=1:100];
@time [for i in c end for j=1:1000];

gives

IntSetBitVector
elapsed time: 0.343220277 seconds (334580096 bytes allocated, 18.23% gc time)
elapsed time: 1.219215981 seconds (1246033244 bytes allocated, 23.56% gc time)
elapsed time: 0.467787536 seconds (64006024 bytes allocated, 7.24% gc time)
IntSet
elapsed time: 0.015644148 seconds (369468 bytes allocated)
elapsed time: 0.389509454 seconds (209822064 bytes allocated, 13.41% gc time)
elapsed time: 0.653751078 seconds (64000048 bytes allocated)

ggggggggg · 2014-10-02T04:21:09Z

I decided to go ahead and directly access .chunks occasionally to avoid allocation. Now the BitVector based version is roughly 2x faster than the existing version in most cases. Oddly == is way faster, but I'm not sure I believe the speedup, since it seems like too much.
Here is my gist again: https://gist.github.com/ggggggggg/46a7e57ccff05d3a1558
Next step is to build Julia from source again and actually submit a pull request.

Timings for ==, union, for i in s end, intersect, and symdiff in order.

IntSetBitVector
elapsed time: 0.001024593 seconds (15904 bytes allocated)
elapsed time: 0.392070556 seconds (209824332 bytes allocated, 6.14% gc time)
elapsed time: 0.735553523 seconds (64000048 bytes allocated)
elapsed time: 0.505063664 seconds (210015040 bytes allocated, 26.61% gc time)
elapsed time: 0.376684838 seconds (209733696 bytes allocated, 5.82% gc time)
IntSet
elapsed time: 0.028398684 seconds (348796 bytes allocated)
elapsed time: 0.72030733 seconds (209820848 bytes allocated, 16.64% gc time)
elapsed time: 1.357721465 seconds (64010800 bytes allocated)
elapsed time: 0.766078242 seconds (210047836 bytes allocated, 20.69% gc time)
elapsed time: 0.602836014 seconds (209751312 bytes allocated)

carlobaldassi · 2014-10-02T17:25:29Z

@ggggggggg Sorry for not getting to this sooner.

It seems that in a number of places (union, ==, maybe others) you are doing loops over chunks using ranges like 1:div(lim, 64), and then comparing the non-overlapping end portions of the two sets. E.g.:

for i = 1:div(lim, 64)
    s.b.chunks[i] |= s2.b.chunks[i]
end
s2.fill1s && (s.b[lim+1:end] = true)

this skips the elements which may lay between div(lim,64)+1 and lim, i.e. the final, incomplete chunk of the BitArray. If you look into the bitarray.jl code, there are almost always specialized bits of code for this end-portion of the chunks (the ones which use the @_msk_end macro).

As you said in a previous message, it would probably be nice to have operators like &!(b1,b2) which do the equivalent of b1 &= b2. I sometimes needed those as well. Until we have Array Views, something like b2[1:lim] would still perform extra allocation, but we're on the verge of removing that problem, so when that happens we might go back to your previous, simpler version of IntSetBitVector.

ggggggggg · 2014-10-03T14:56:19Z

I agree that relying on BitArrayViews views would be nicer, and if that is in the medium term horizon it is probably the right choice. It is my understanding that ArrayViews do not imply BitArrayViews, but perhaps there is a plan to create BitArrayViews.

However I tried to avoid the problem with skipping limits between div(lim,64)+1 and lim by making sure those elements don't exist since lim = integer*64. I made sure the constructor and sizehint always maintain that condition, which I think is enough to ensure it is true.

The other piece of functionality that really belongs in BitVector is what I did in last which really should call the non-existant findprev(v::BitVector,n:Integer), which would be just like findnext but work backwards.

function sizehint(s::IntSetBitVector, top::Integer)
    if top >= length(s.b)
        lim = iceil(top/64)*64
        oldsize = length(s.b)
        if oldsize < lim
            assert(lim%64==0) # this is assumed to be true elsewhere
            resize!(s.b, lim)
            s.b[oldsize+1:end] = s.fill1s
        end
    end
    s
end

JeffBezanson · 2017-04-07T23:01:20Z

This has been done.

ivarne added the up for grabs label Aug 13, 2014

danluu mentioned this issue Oct 3, 2014

next / bitvector_next gets stuck with IntSet, causing infinite loop #8570

Closed

mbauman mentioned this issue Feb 4, 2015

WIP: Refactor IntSets to use BitVectors #10065

Closed

2 tasks

ViralBShah removed the feature label Feb 14, 2015

JeffBezanson closed this as completed Apr 7, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IntSet should be a wrapper around a BitArray #3039

IntSet should be a wrapper around a BitArray #3039

StefanKarpinski commented May 7, 2013

quinnj commented Aug 8, 2014

StefanKarpinski commented Aug 8, 2014

ggggggggg commented Aug 21, 2014

ggggggggg commented Oct 2, 2014

carlobaldassi commented Oct 2, 2014

ggggggggg commented Oct 3, 2014

JeffBezanson commented Apr 7, 2017

IntSet should be a wrapper around a BitArray #3039

IntSet should be a wrapper around a BitArray #3039

Comments

StefanKarpinski commented May 7, 2013

quinnj commented Aug 8, 2014

StefanKarpinski commented Aug 8, 2014

ggggggggg commented Aug 21, 2014

ggggggggg commented Oct 2, 2014

carlobaldassi commented Oct 2, 2014

ggggggggg commented Oct 3, 2014

JeffBezanson commented Apr 7, 2017