Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize DisjointSet #37835

Merged
merged 6 commits into from
May 2, 2024
Merged

Optimize DisjointSet #37835

merged 6 commits into from
May 2, 2024

Conversation

gmou3
Copy link
Contributor

@gmou3 gmou3 commented Apr 20, 2024

The methods find and union of the class DisjointSet_of_hashables pass through an unnecessary layer of input checks that slow down the code. I avoided these checks, cythonized more of the code, and edited the docstrings.

I can attest to the fact that these changes improve speed, and in a follow-up PR (#37839) I mention a test that confirms this.

Copy link

github-actions bot commented Apr 20, 2024

Documentation preview for this PR (built with commit e24a492; changes) is ready! 🎉
This preview will update shortly after each push to this PR.

@mantepse
Copy link
Collaborator

mantepse commented Apr 21, 2024

Wouldn't it make sense to simply remove the input checks?

(I think not, but I cannot really substantiate my intuition. It may make sense to leave them in, because cython is essentially not debugable.)

@gmou3
Copy link
Contributor Author

gmou3 commented Apr 21, 2024

Yes, I would prefer to remove them altogether. However, union is a reserved word in C, so I was tempted to create these alternatives. Any suggestions welcome.

Copy link
Collaborator

@tscrim tscrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's some things to try and make it a bit faster.

src/sage/sets/disjoint_set.pyx Outdated Show resolved Hide resolved
src/sage/sets/disjoint_set.pyx Outdated Show resolved Hide resolved
src/sage/sets/disjoint_set.pyx Outdated Show resolved Hide resolved
src/sage/sets/disjoint_set.pyx Outdated Show resolved Hide resolved
src/sage/sets/disjoint_set.pyx Outdated Show resolved Hide resolved
@gmou3
Copy link
Contributor Author

gmou3 commented Apr 21, 2024

@mantepse I only kept the version with no input checks and renamed union to join. Note that the Disjoint Set data structure is also called a union-find data structure or merge-find data structure (is merge a better name? - see https://en.wikipedia.org/wiki/Disjoint-set_data_structure).

@tscrim Thanks for the suggestions; I want to make it as fast as possible. I did implement the recommended changes and I think it is a tiny bit faster. (It is not easy to come up with good tests.)

@gmou3 gmou3 force-pushed the disjointset_optimize branch 3 times, most recently from 5592d82 to ae06474 Compare April 21, 2024 22:29
@tscrim
Copy link
Collaborator

tscrim commented Apr 22, 2024

Thank you. Although I am -1 on renaming union to join. I believe union is a fairly standard name for this and a name conflict with a reserved C keyword is a poor argument IMO (as we are in Cython/Python and it is a method, which can have the same name as reserved keywords).

@mantepse
Copy link
Collaborator

Yes, please keep union-find.

Typecast `union` as `void` to a-void the "C struct/union cannot be declared cpdef" error.
@gmou3
Copy link
Contributor Author

gmou3 commented Apr 22, 2024

Ok, I managed a workaround and now union works.

@videlec
Copy link
Contributor

videlec commented Apr 22, 2024

Calling OP_find(OrbitPartition * p, int n) without being sure that n is in the range of p might result in a SEGFAULT. While this is fine for C code but it is not fine for pure Python code. Did you try something like

sage: e = DisjointSet(5)
sage: e.find(2**30)

@gmou3
Copy link
Contributor Author

gmou3 commented Apr 22, 2024

It causes a segmentation fault indeed. What should be done?

This avoids potential SEGFAULTs from Python calls
@gmou3
Copy link
Contributor Author

gmou3 commented Apr 22, 2024

@videlec I think you will be satisfied with the last changes.

@tscrim
Copy link
Collaborator

tscrim commented Apr 22, 2024

I believe @videlec's point is that user-facing code should not segfault on bad input (of course, internal code should not either, but there we can sanitize/control the input). So I believe your changes should have addressed this.

@gmou3
Copy link
Contributor Author

gmou3 commented Apr 23, 2024

Yes, I think it is good now.

@videlec
Copy link
Contributor

videlec commented Apr 23, 2024

Looking at it a bit more, I don't quite understand the purpose of this PR. If you want something fast(er), why don't you use directly the OrbitPartition C class? The DisjointSet provides one extra level of indirection which is a lot of overhead.

@gmou3
Copy link
Contributor Author

gmou3 commented Apr 23, 2024

Looking at it a bit more, I don't quite understand the purpose of this PR. If you want something fast(er), why don't you use directly the OrbitPartition C class? The DisjointSet provides one extra level of indirection which is a lot of overhead.

To some extent this is a fair point. However, the DisjointSet provides some extra functionality - e.g., the handling of hashables. Without any compromise in input checking, this PR still optimizes (at least) the union and find methods of the DisjointSet_of_hashables class. These methods performed unnecessary checks by passing through the union and find of DisjointSet_of_integers.

Given that now I call OP_find and OP_union directly from the find and union of DisjointSet_of_hashables, I wonder whether the _find and _union of DisjointSet_of_integers are still useful. Maybe if someone was to find them useful they should directly use the OrbitPartition class instead?

From `find` and `union` of `DisjointSet_of_hashables`, respectively
@videlec
Copy link
Contributor

videlec commented Apr 23, 2024

Looking at it a bit more, I don't quite understand the purpose of this PR. If you want something fast(er), why don't you use directly the OrbitPartition C class? The DisjointSet provides one extra level of indirection which is a lot of overhead.

To some extent this is a fair point. However, the DisjointSet provides some extra functionality - e.g., the handling of hashables. Without any compromise in input checking, this PR still optimizes (at least) the union and find methods of the DisjointSet_of_hashables class. These methods performed unnecessary checks by passing through the union and find of DisjointSet_of_integers.

Indeed, that looks like a good optimization!

Given that now I call OP_find and OP_union directly from the find and union of DisjointSet_of_hashables, I wonder whether the _find and _union of DisjointSet_of_integers are still useful. Maybe if someone was to find them useful they should directly use the OrbitPartition class instead?

I would not bother keeping them.

@mantepse
Copy link
Collaborator

Given that now I call OP_find and OP_union directly from the find and union of DisjointSet_of_hashables, I wonder whether the _find and _union of DisjointSet_of_integers are still useful. Maybe if someone was to find them useful they should directly use the OrbitPartition class instead?

I would not bother keeping them.

Maybe you could add a note in the docstring of find and union that these are the methods to use if no checking is needed?

@gmou3
Copy link
Contributor Author

gmou3 commented Apr 23, 2024

@videlec Cool!

@mantepse:

Given that now I call OP_find and OP_union directly from the find and union of DisjointSet_of_hashables, I wonder whether the _find and _union of DisjointSet_of_integers are still useful. Maybe if someone was to find them useful they should directly use the OrbitPartition class instead?

I would not bother keeping them.

Maybe you could add a note in the docstring of find and union that these are the methods to use if no checking is needed?

I removed them and referred to OP_find and OP_join. One should be able to use those directly and almost as easily.

Remove `_find` & `_union`.
Remove `DisjointSet_of_integers` `_d` attribute from `DisjointSet_of_hashables` (and adjust code).
Add notes recommending `OP_find` and `OP_union` if no input checking needed.
Touch-up docstrings.
@videlec
Copy link
Contributor

videlec commented Apr 26, 2024

Looks good to me. Thanks.

Copy link
Collaborator

@tscrim tscrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will take @videlec's comment to mean a positive review.

@vbraun vbraun merged commit ac9c1b0 into sagemath:develop May 2, 2024
13 of 14 checks passed
@mkoeppe mkoeppe added this to the sage-10.4 milestone May 2, 2024
vbraun pushed a commit to vbraun/sage that referenced this pull request May 11, 2024
    
A metamorphosis of `graphic_matroid.py` into `graphic_matroid.pyx` and
`graphic_matroid.pxd`.
Things should be somewhat faster; test, e.g., the `GraphicMatroid`'s
`_circuit` function. An indirect and diluted test that shows improvement
can be seen by running `%time matroids.CompleteGraphic(8).circuits()`
(~half the time).

`For the reviewer:` The changes of this PR are only inside the
`matroids` module. The rest come from sagemath#37835.

### ⌛ Dependencies

sagemath#37835: Deeper improvements of `DisjointSet`.
    
URL: sagemath#37839
Reported by: gmou3
Reviewer(s): gmou3, Kwankyu Lee, Travis Scrimshaw
vbraun pushed a commit to vbraun/sage that referenced this pull request May 12, 2024
    
A metamorphosis of `graphic_matroid.py` into `graphic_matroid.pyx` and
`graphic_matroid.pxd`.
Things should be somewhat faster; test, e.g., the `GraphicMatroid`'s
`_circuit` function. An indirect and diluted test that shows improvement
can be seen by running `%time matroids.CompleteGraphic(8).circuits()`
(~half the time).

`For the reviewer:` The changes of this PR are only inside the
`matroids` module. The rest come from sagemath#37835.

### ⌛ Dependencies

sagemath#37835: Deeper improvements of `DisjointSet`.
    
URL: sagemath#37839
Reported by: gmou3
Reviewer(s): gmou3, Kwankyu Lee, Travis Scrimshaw
vbraun pushed a commit to vbraun/sage that referenced this pull request May 12, 2024
    
A metamorphosis of `graphic_matroid.py` into `graphic_matroid.pyx` and
`graphic_matroid.pxd`.
Things should be somewhat faster; test, e.g., the `GraphicMatroid`'s
`_circuit` function. An indirect and diluted test that shows improvement
can be seen by running `%time matroids.CompleteGraphic(8).circuits()`
(~half the time).

`For the reviewer:` The changes of this PR are only inside the
`matroids` module. The rest come from sagemath#37835.

### ⌛ Dependencies

sagemath#37835: Deeper improvements of `DisjointSet`.
    
URL: sagemath#37839
Reported by: gmou3
Reviewer(s): gmou3, Kwankyu Lee, Travis Scrimshaw
@gmou3 gmou3 deleted the disjointset_optimize branch May 25, 2024 15:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants