Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Refactor IntSets to use BitVectors #10065

Closed
wants to merge 3 commits into from
Closed

Conversation

mbauman
Copy link
Sponsor Member

@mbauman mbauman commented Feb 4, 2015

This WIP is to merge my IntSets.jl project into base. Here's what this entails:

  • Use BitVectors instead of a mismash of bitvector.c and ad-hoc Julia code.
  • IntSet is much better defined to only hold integers in 0:typemax(Int)-1. Before it wasn't consistently defined, particularly when it came to complement IntSets.
  • Invert the complement IntSet semantics. Instead of filling ones outside the range of the IntSet, simply reverse what a set bit means. This means that inverting an IntSet is O(1) instead of O(n) and fixes a few bugs.
  • For mathematical functions (intersect, union, etc.), use BitVector functions to work 64 bits at a time instead of working bit-by-bit.
  • Increase test-coverage to 100%
  • Would close IntSet should be a wrapper around a BitArray #3039 (with a new constructor)

There are still two outstanding issues here that I could use some help with: The only thing left is a performance regression:

  • Julia fails to launch with a segfault when the sysimg.{dll/dylib/so} is unavailable. It happens during the compilation of an inference function that makes heavy use of IntSets, but I can't tell why with my rudimentary LLDB/LLVM knowledge. My hunch is that I've missed some functions that are needed but load after inference.jl in the sysimg, but it successfully bootstraps… so I'm not sure. The new inference bootstrap separation made this a cinch to debug. The culprits were some redundant Base. module-qualified names (since Base doesn't exist yet).
  • While many functions are way faster, there's still a performance gap for a few functions; most notably ~2x for in. See these albums for a comparison against the current master with varying density and size. Some of this is due to the addition of an unnecessary GC frame (propagate effect_free information out of functions #9974), but I think most of it is from the extra dereference and perhaps some more-frequent cache misses. Particularly confusing to me is in — there are fewer branches and the LLVM/native code looks much better (especially once I manually inline the getindex call to eliminate the GC frame), but it still lags in performance. That said, it doesn't seem to have an effect on overall compilation times from looking at the linalg test suite.

@JeffBezanson
Copy link
Sponsor Member

Very nice. The thorough performance comparison is pretty impressive!

@tkelman
Copy link
Contributor

tkelman commented Feb 4, 2015

Remember to add intset to test/runtests.jl

@mbauman
Copy link
Sponsor Member Author

mbauman commented Apr 16, 2015

Interesting thought from #1032: If we can somehow manage to remove support for 0 in IntSet, we could almost implement find(::BitArray) as a very fast and lazy conversion to an IntSet.

I'm not sure how practical that would be since IntSet doesn't implement indexing (and cannot do so efficiently). And it wouldn't satisfy the current meaning of find because of that. But if we're clever, it might simplify and improve the performance of logical indexing.

@quinnj
Copy link
Member

quinnj commented Apr 16, 2015

Maybe we just create a new type IndexSet that has those behaviors?

@StefanKarpinski
Copy link
Sponsor Member

IndexSet is really a better name for this type anyway.

@mbauman mbauman force-pushed the mb/intsetrefactor branch 2 times, most recently from 304d113 to 5a53338 Compare May 20, 2015 19:42
* Use BitVectors instead of a mismash of bitvector.c and ad-hoc Julia code.
* Invert the complement IntSet semantics.  Instead of filling ones outside the range of the IntSet, simply reverse what a set bit means.
* For mathematical functions (intersect, union, etc.), use BitVector functions to work 64 bits at a time instead of working bit-by-bit.
* Increase test-coverage to 100%
@mbauman
Copy link
Sponsor Member Author

mbauman commented May 21, 2015

Bump! With inference decoupled from base, this became very easy to debug! All tests pass now. The only remaining issue is the performance gap on in.

@dpsanders
Copy link
Contributor

Bump... I spent time yesterday writing tests for IntSets, only to discover that @mbauman had already done it and completely rewritten everything! +1! (Maybe you can separate out the tests and get those merged first.)

@mbauman
Copy link
Sponsor Member Author

mbauman commented Jul 21, 2015

Shucks. Sorry for the duplicated effort. Don't let this long-languishing work in progress hold up your testing work! Feel free to push your own tests and/or grab tests from here.

If you do grab tests from here, you'll run into some complement IntSet bugs. I have a hunch that nobody is using complement IntSets very thoroughly.

Maybe we should deprecate the complement functionality? This is a core piece of functionality that's required for inference (otherwise I don't think it'd be in base). It could make sense to simplify IntSet and move this work into a package for richer IndexSet-like functionality.

@JeffBezanson
Copy link
Sponsor Member

+1 for removing the complement feature. It was written back in the days when we were bored and just looking for neat stuff to do.

@StefanKarpinski
Copy link
Sponsor Member

Agree with removing the complement stuff. I also suspect this type should be renamed to IndexSet since it is really not a generic IntSet but really a set of indices.

@dpsanders
Copy link
Contributor

My effort is at #12247.

The complement functionality is neat, but I can see that it doesn't really belong in Base.
My tests are quite focused around the particular current implementation.

@ScottPJones
Copy link
Contributor

I'd agree if this only supports 0 and above and is really only used for indices it should just be added as IndexSet, with no complement. IntSet to me implies it needs to allow any integer value in the set, not just >= 0.

@mbauman
Copy link
Sponsor Member Author

mbauman commented Jul 21, 2015

Perhaps we entirely deprecate IntSet and introduce IndexSet? That would allow us to gracefully remove the ability to store zero, too.

@ScottPJones
Copy link
Contributor

SGTM 👍

@StefanKarpinski
Copy link
Sponsor Member

I'm entirely convinced that we should remove the ability to store zero...

@mbauman
Copy link
Sponsor Member Author

mbauman commented Jul 21, 2015

Wonderful. Let's do it. It's actually not as bad as I had been thinking since we already branch for a bounds check. We can check for zero within that branch with no extra overhead to the other branch.

else
(s.bits[n>>5 + 1] & (UInt32(1)<<(n&31))) != 0
ifelse((idx <= 0) | (idx > typemax(Int)), false, s.inverse)
end
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't you missing the return statements here?

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would probably be more clear, yes, but this works just fine as it is.

The result of the last expression in a function body is used as the return value (if not explicitly specified). In this case, the last expression is the if block.

The result of an if block is similarly determined by the last expression of the traversed branch (or nothing if no branches were traversed).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I knew that the return value of the last expression was used, but I though that an if block would have returned nothing.

Sorry for the noise.

@mbauman
Copy link
Sponsor Member Author

mbauman commented Aug 9, 2015

Closing in favor of JuliaCollections/DataStructures.jl#114 (moving IntSet to DataStructures) and #12270 (deprecating complement and stored zeros, on the path to renaming IntSet → IndexSet).

@mbauman mbauman closed this Aug 9, 2015
mbauman added a commit that referenced this pull request Feb 4, 2017
* Complete deprecation of stored zeros; IntSets now only support integers in the range `1:typemax(Int)`
* Complete deprecation of `complement`; removes all support for inverted IntSets
* Refactor internals to rely on a BitVector, allowing the use of highly optimized `map` methods. `IntSet` is now immutable. This significantly improves performance across varying [densities](http://imgur.com/a/uqv8A) and [sizes](http://imgur.com/a/iEgcr). These are compared against a modified Base with deprecation warnings removed for a fairer comparison. Testing code [available here](https://github.com/mbauman/IntSets.jl/tree/b50a7c97abbe9786e33221f723e107e266f31fe4/test).
* Add more tests and organize into testsets.
* Improve hashing; `hash(IntSet([1]))` is now distinct from `hash(IntSet([65]))`

This is a continuation of #10065. Now that complements are fully removed, making IntSet immutable solves the performance issue. I am keeping the name the same within this PR as it vastly simplifies comparisons between the two implementations; the name can later be changed to `IndexSet` if still desired. The naming story is now a bit more complicated since we support offset indices, but a future change could perhaps allow wrapping any `AbstractVector{Bool}` and base the supported `Int`s on those indices. Very few methods depend upon BitArray internals.
@mbauman mbauman mentioned this pull request Feb 4, 2017
@mbauman mbauman deleted the mb/intsetrefactor branch January 11, 2021 14:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IntSet should be a wrapper around a BitArray
8 participants