RealSet: Faster operations by scan-line (merging) techniques #32181

mkoeppe · 2021-07-11T18:10:41Z

Adapt from:

Extend Union and Intersection from pairwise computation to multiple realsets with scan-line algorithm. In order to compare the performance of the scan-line and orginial methods, we use different dataset to test the CPU time the random generator and timing test code can find in this page:


Function	Dataset	Scan-line	original
union	1000 closed bounded disjoint interval	0.005281	0.00749
union	1000 open bounded disjoint interval	0.004339	0.006646
union	1000 mixed bounded disjoint interval	0.005025	0.006646
intersection	300 closed bounded disjoint interval	0.001182	4.376048
intersection	300 open bounded disjoint interval	0.001225	4.359314
intersection	300 mixed bounded interval	0.001537	4.407478

We used scan-line methods to replace the original normalize and also make normalize optionally. After replacing it, we test it with 500,000 random intervals, the original methods take 151 second to construct it; scan-line methods only took 18 seconds.
Replace difference,symmetric_difference and are_pairwise_disjoint with scan-line methods. We use 1000 realsets each realsets containing 1000 random intervals as the testing dataset. The original methods found the difference between two real sets took 11 seconds on average, and the new methods only took 0.021 seconds on average; symmetric_differnce also has a similar result.
Implement convex_hull and is_connected in RealSet.

Depends on #34252

CC: @yuan-zhou

Component: geometry

Author: Yueqi Li, Yuan Zhou

Branch/Commit: 1ef7dd1

Reviewer: Matthias Koeppe

Issue created by migration from https://trac.sagemath.org/ticket/32181

The text was updated successfully, but these errors were encountered:

NicoleYueqiLi · 2022-05-19T03:51:12Z

Branch: u/gh-NicoleYueqiLi/realset__faster_operations_by_scan_line__merging__techniques

sagetrac-git · 2022-07-18T21:42:13Z

Commit: 625ac58

sagetrac-git · 2022-07-18T21:42:13Z

Branch pushed to git repo; I updated commit sha1. New commits:

sagetrac-git · 2022-07-26T19:56:29Z

Changed commit from 625ac58 to 33385c8

sagetrac-git · 2022-07-26T19:56:29Z

Branch pushed to git repo; I updated commit sha1. New commits:

`3161339`	`1. fixed pycode test fail`
`33385c8`	Adding examples for helper funtion `_scan_line_union`, `_scan_line_intersection`, `_scan_line_difference`, `_scan_line_symmetric_difference`

mkoeppe · 2022-07-27T07:46:40Z

comment:8

If you merge the most current beta (9.7.beta6), you can use https://trac.sagemath.org/wiki/ReleaseTours/sage-9.7#Validatingdocstringwith.sage-tox-erst

sagetrac-git · 2022-07-28T19:54:35Z

Branch pushed to git repo; I updated commit sha1. New commits:

`61862a9`	`Change scan-line methods to tree structure, improve inner function's documentation`

sagetrac-git · 2022-07-28T19:54:35Z

Changed commit from 33385c8 to 61862a9

mkoeppe · 2022-07-28T20:17:56Z

comment:10

tox -e relint will tell you that EXAMPLE:: should be EXAMPLES:: according to our style guide (whether there is a single or there are multiple examples)

yuan-zhou · 2022-07-29T18:39:24Z

comment:11

The current code does not pass the doc-tests.
Also, please follow https://doc.sagemath.org/html/en/developer/coding_basics.html. For example, in the text part, convex_hull should be fully-referenced so that it is clickable after build, i.e. :meth:~sage.sets.real_set.RealSet.convex_hull

yuan-zhou · 2022-07-30T03:50:05Z

comment:12

I took a look through the code, but didn't run the code.
I have some first comments.

The patchbot currently shows 31 doctests failures and 4 lint errors. To be fixed!
In the docstrings of _scan_left_endpoint and _scan_right_endpoint: What does an event mean. Reformat the text after output. Describe the value of delta. Include example(s) where tag is given.
RealSet constructor with option normalized=True. What does this assume? Describe it in the docstring of __classcall__ or that of RealSet. Mention that is does not apply to the manifold case. Include doctests/tests where normalized=True is given, and examples where arguments are of type InternalRealInterval.
I have doubts about where if not normalized is treated. (But of course, this depends on how you define normalized=True). I thought if normalized=True and it's not of the manifold case (lines 1267-1283), one can just return UniqueRepresentation.__classcall__(cls, *intervals) before line 1199 intervals = [], which skips all treatments on args.
When normalized is not True, may simplify and insert the code in def normalize here, then remove the normalize function as it is only used once here. Alternatively, keep def normalize and move all treatment of args from __classcall__ to normalize.
RealSet.__classcall__: Line 1188-1190 change to just: normalized = kwds.pop('normalized', False). L1194: "... take no keyword arguments other than 'normalized'.""
RealSet._prep: L1710 elif. L1717 elif
RealSet doesn't need methods _scan_left_endpoint and _scan_right_endpoint. These are methods of the class InternalRealInterval.
RealSet._scan_interval Rename the function, to something like _scan, or _scan_intervals, or _generate_events. Rewrite the function: for i in self._intervals: yield i._scan_left_endpoint(tag); yield i._scan_right_endpoint(tag). No more merge/sorting needed.
Ideally, all scan methods are generators. Currently, some are generators, some return lists.
In _scan_line_union, lines 2104-2122, the two while loops look very suspicious to me. I think they do not belong here. Moreover, the lines 2118-2119 in the second while loop might be wrong. You claimed that these two while loops take care of the special cases RealSet(x=-infinity) and RealSet(x<=-infinity), but the input scan to this function should never be in such form. Such empty (nonsense) special cases should already been eliminated in around line 1209 in __classcall__
In union_of_realsets and other similar functions: Lines 2175 to 2178 can be simplified to scan = merge(RealSet(s)._scan() for s in real_set_collection)
_scan_line_intersection. Add more special examples to doctest to cover the cases where intersection is just one point, or just empty. For example: [1,2] ∩ [2,3], [2,3] ∩ (3,4), (3,4) ∩ [4,5), [4,5) ∩ (5,6). Lines 2252-2261 (and other similar places) can be simplified to

if now_on:
  (on_x, on_epsilon) = (x, epsilon) 
else: 
  if was_on and (on_x, on_epsilon) < (x, epsilon): 
    yield InternalRealInterval(...)
  (on_x, on_epsilon) = (None, None)

Questions about 'tag'. It is said to be a real number in the docstrings of many function, but is used as True or False for the difference function only. I think tag can be eliminated throughout the code (and do some special treatment for the difference.) Also, why is it called index at some places? That's not used either, so can be removed as well. On a related note, given a collection of real sets, after generating and merging the their scan, you can no longer match the list of events back to the original real sets, because index was not passed to tag in the events. I do not like the current design.
_scan_line_union, _scan_line_intersection, _scan_line_symmetric_difference (and part of are_pairwise_disjoint) all look redundant. They all can be replaced by a new method RealSet._scan_to_intervals(scan, indicator) that takes scan as a generator of sorted events and indicator as a number 1 or n (or perhaps as a lambda function for are_pairwise_disjoint), and outputs a tuple intervals of type InternalRealInterval.
According to your ticket description, the old union method is slower than the new one. Is this true for the union of two real sets? If so, why is the old one kept in the code? If not, please update the description of the ticket, and include comments in the code. In general, are there examples where the original method is faster? Please specify the examples, and change the code to choose strategically between the two versions according to the input.
def convex_hull: correct but wasteful. One does not to sort all intervals, only the left most and right most intervals of each real set matter.
def is_connected docstring: Return whether self is a connected set. OUTPUT: Boolean. True if self is a single interval.
def are_pairwise_disjoint: Revise.
Throughout the code, (for example in _scan_line_union and _scan_line_intersection and many other functions,) the docstrings need significant improvements! The current ones make little sense. They are often copy-pasted from other places which do not correspond to the function. Input and output descriptions are not exact, variable undefined, variable mistaken (eg. realsets v.s real_set_collection) type not specified, code-style incorrect. Often bad and incomplete examples. EXAMPLES:: plural. At the beginning of the file, "Yueqi Li (2022-06-27): ..." Rephrase to one or two lines. Some places in the docstrings ":meth:~sage.sets.real_set.RealSet.convex_hull"
Please make sure that your docstrings follow https://doc.sagemath.org/html/en/developer/coding_basics.html.
Throughout the code: clean up the commented-out code.
Throughout the code: :class:RealInterval should be :class:InternalRealInterval
New method: relative closure in a superset?

yuan-zhou · 2022-07-30T04:40:33Z

Changed branch from u/gh-NicoleYueqiLi/realset__faster_operations_by_scan_line__merging__techniques to u/yzh/realset__faster_operations_by_scan_line__merging__techniques

yuan-zhou · 2022-07-30T04:42:00Z

comment:14

Merge-in SageMath version to 9.7.beta6.

New commits:

`ed5c560`	`Merge branch 'develop' into t/32181/realset__faster_operations_by_scan_line__merging__techniques`

yuan-zhou · 2022-07-30T04:42:00Z

Changed commit from 61862a9 to ed5c560

sagetrac-git · 2022-07-31T06:52:48Z

Branch pushed to git repo; I updated commit sha1. New commits:

`cd9cc4c`	`trivial typo fix`
`f6b334f`	`remove outdated patch for the broken comparison between infinity and RLF (fixed in ticket 11506)`
`58aa1ef`	`add bug example regarding empty InternalRealInterval. Will not fix for this internal class`
`7a65f08`	`fix bug regarding empty real sets`

sagetrac-git · 2022-07-31T06:52:48Z

Changed commit from ed5c560 to 7a65f08

sagetrac-git · 2022-07-31T07:07:36Z

Branch pushed to git repo; I updated commit sha1. New commits:

`1bbb04f`	`acknowledge authors, reference`

sagetrac-git · 2022-07-31T07:07:36Z

Changed commit from 7a65f08 to 1bbb04f

sagetrac-git · 2022-07-31T13:26:46Z

Branch pushed to git repo; I updated commit sha1. New commits:

`dda8f6b`	`rewrite implementation of scan-line algo for real set methods, by adapting the code from https://github.com/mkoeppe/cutgeneratingfunctionology`

sagetrac-git · 2022-08-04T17:59:56Z

Changed commit from 523b49a to 3f6a0ac

mkoeppe · 2022-08-04T18:00:49Z

comment:41

Replying to @yuan-zhou:

Do you mean bool(i==1)? I don't know if it's always needed, but I thought there is nothing wrong to add it.

i is just an integer, right? Converting to bool is only needed in comparison when you are comparing SR elements, which have overloaded comparison operators for forming relational expressions.

mkoeppe · 2022-08-04T18:01:42Z

comment:42

If you want, you can add a deprecation for the normalize method

sagetrac-git · 2022-08-04T18:03:38Z

Changed commit from 3f6a0ac to 79f2014

sagetrac-git · 2022-08-04T18:03:38Z

Branch pushed to git repo; I updated commit sha1. New commits:

`79f2014`	`revise the docstring of RealSet.is_connected`

sagetrac-git · 2022-08-04T18:08:57Z

Branch pushed to git repo; I updated commit sha1. New commits:

`1ef7dd1`	`remove unnecessary bool()`

sagetrac-git · 2022-08-04T18:08:57Z

Changed commit from 79f2014 to 1ef7dd1

mkoeppe · 2022-08-04T19:04:41Z

comment:45

Here's a microbenchmark:

Before:

sage: from itertools import count
sage: c = count()
sage: %timeit RealSet((0, next(c)))
82.9 µs ± 3.75 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
sage: %timeit RealSet((0, next(c))).union([next(c), next(c)])
210 µs ± 7.86 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

With this ticket:

sage: from itertools import count
sage: c = count()
sage: %timeit RealSet((0, next(c)))
86.2 µs ± 3.43 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
sage: %timeit RealSet((0, next(c))).union([next(c), next(c)])
257 µs ± 18.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

yuan-zhou · 2022-08-04T19:39:31Z

comment:46

Does the overhead look acceptable on the above examples?

mkoeppe · 2022-08-04T19:40:01Z

comment:47

     def __init__(self, lower, lower_closed, upper, upper_closed, check=True):
         """
         Initialize ``self``.
[...]
             if (upper_closed and upper == infinity):
                 raise ValueError('interval cannot be closed at +oo')
+            # TODO: take care of the empty set case.

What do you have in mind there?

yuan-zhou · 2022-08-04T19:44:31Z

comment:49

Somewhere in the original code (mul ?), the empty interval is set to (0, 0)
Replying to @mkoeppe:

     def __init__(self, lower, lower_closed, upper, upper_closed, check=True):
         """
         Initialize ``self``.
[...]
             if (upper_closed and upper == infinity):
                 raise ValueError('interval cannot be closed at +oo')
+            # TODO: take care of the empty set case.

What do you have in mind there?

mkoeppe · 2022-08-04T21:23:49Z

comment:51

Is this syntax self.intersection(0,1) new? Is it really needed?

+        elif len(real_set_collection) == 2:
+            a, b = real_set_collection
+            # allow self.intersection(0,1) syntax
+            try:
+                a.n()
+                b.n()
+                sets.append(RealSet(a, b))
+            except (AttributeError, ValueError, TypeError):
+                sets.append(RealSet(a))
+                sets.append(RealSet(b))
+        else:

yuan-zhou · 2022-08-04T21:26:19Z

comment:52

I didn't like it either, since the new code spends a lot of effect to take care of that. I wanted to eliminate that. However, the original code allowed for it (contained doctests of that kind).
Replying to @mkoeppe:

Is this syntax self.intersection(0,1) new? Is it really needed?

+        elif len(real_set_collection) == 2:
+            a, b = real_set_collection
+            # allow self.intersection(0,1) syntax
+            try:
+                a.n()
+                b.n()
+                sets.append(RealSet(a, b))
+            except (AttributeError, ValueError, TypeError):
+                sets.append(RealSet(a))
+                sets.append(RealSet(b))
+        else:

mkoeppe · 2022-08-04T21:30:22Z

comment:53

OK, you are right, the old code did allow it. So we have to keep it.

mkoeppe · 2022-08-04T21:34:57Z

comment:54

-    def intersection(self, *other):
+    def intersection(self, *real_set_collection):
         """

I think you can simplify it further by making it

     def intersection(*real_set_collection)

When used as a bound method, RealSet(1, 3).intersection(RealSet(2, 4)), the self would just be the first element of real_set_collection.

But it could also be used as an unbound method RealSet.intersection(RealSet(1, 3), RealSet(2,4)), and you could also handle the case RealSet.intersection(), which should give the real line as the intersection over an empty family

mkoeppe · 2022-08-04T21:44:08Z

comment:55

(Just a suggestion, it's fine as it)

mkoeppe · 2022-08-04T21:51:41Z

comment:56

Replying to @yuan-zhou:

Does the overhead look acceptable on the above examples?

Yes, I think it's fine.

My guess is that it could be improved by a custom version of heapq with a fast path for near-trivial cases, or even cythonized.

Let's keep this for a possible follow-up ticket.

mkoeppe · 2022-08-04T21:53:22Z

Reviewer: Matthias Koeppe

yuan-zhou · 2022-08-04T21:58:05Z

comment:58

That's a good suggestion, but I incline to leave it as is on this ticket. The reason is the treatments for special cases such as RealSet.union(RealSet(0,1)), RealSet.union((0,1)), RealSet.union(0, 1), and RealSet(0,1).union(1, 3)==RealSet.union(RealSet(0,1), 1, 3) could be too complicated at the moment. Will discuss the deprecation of the use of RealSet(0,1).union(1, 3) on another ticket later.
Replying to @mkoeppe:

-    def intersection(self, *other):
+    def intersection(self, *real_set_collection):
         """
I think you can simplify it further by making it
     def intersection(*real_set_collection)
When used as a bound method, RealSet(1, 3).intersection(RealSet(2, 4)), the self would just be the first element of real_set_collection.

But it could also be used as an unbound method RealSet.intersection(RealSet(1, 3), RealSet(2,4)), and you could also handle the case RealSet.intersection(), which should give the real line as the intersection over an empty family

yuan-zhou · 2022-08-04T21:59:43Z

comment:59

Thank you! Replying to @mkoeppe:

yuan-zhou · 2022-08-05T16:27:22Z

comment:60

Should def normalize(intervals) be @staticmethod

yuan-zhou · 2022-08-05T16:51:51Z

comment:61

Never mind. Without @staticmethod, the following can apply.

sage: r = RealSet(RealSet(2, 6)[0], RealSet(4, 10)[0], normalized=True)
sage: r
(2, 6) ∪ (4, 10)
sage: r.normalize()
((2, 10),)

Replying to @yuan-zhou:

Should def normalize(intervals) be @staticmethod

vbraun · 2022-08-06T07:44:13Z

Changed branch from u/yzh/realset__faster_operations_by_scan_line__merging__techniques to 1ef7dd1

mkoeppe added this to the sage-9.4 milestone Jul 11, 2021

mkoeppe added c: geometry labels Jul 11, 2021

mkoeppe modified the milestones: sage-9.4, sage-9.5 Jul 19, 2021

mkoeppe modified the milestones: sage-9.5, sage-9.6 Dec 14, 2021

mkoeppe modified the milestones: sage-9.6, sage-9.7 Mar 5, 2022

This comment has been minimized.

Sign in to view

mkoeppe added s: positive review and removed s: needs review labels Aug 4, 2022

vbraun removed the s: positive review label Aug 6, 2022

vbraun closed this as completed in 855db58 Aug 6, 2022

yuan-zhou mentioned this issue Jan 7, 2023

RealSet: deprecate using a, b to construct the open interval (a, b) in union and intersection methods #34280

Open

RealSet: Faster operations by scan-line (merging) techniques #32181

RealSet: Faster operations by scan-line (merging) techniques #32181

Comments

mkoeppe commented Jul 11, 2021

NicoleYueqiLi mannequin commented May 19, 2022

sagetrac-git mannequin commented Jul 18, 2022

sagetrac-git mannequin commented Jul 18, 2022

This comment has been minimized.

sagetrac-git mannequin commented Jul 26, 2022

sagetrac-git mannequin commented Jul 26, 2022

mkoeppe commented Jul 27, 2022

sagetrac-git mannequin commented Jul 28, 2022

sagetrac-git mannequin commented Jul 28, 2022

mkoeppe commented Jul 28, 2022

yuan-zhou commented Jul 29, 2022

yuan-zhou commented Jul 30, 2022

yuan-zhou commented Jul 30, 2022

yuan-zhou commented Jul 30, 2022

yuan-zhou commented Jul 30, 2022

sagetrac-git mannequin commented Jul 31, 2022

sagetrac-git mannequin commented Jul 31, 2022

sagetrac-git mannequin commented Jul 31, 2022

sagetrac-git mannequin commented Jul 31, 2022

sagetrac-git mannequin commented Jul 31, 2022

sagetrac-git mannequin commented Aug 4, 2022

mkoeppe commented Aug 4, 2022

mkoeppe commented Aug 4, 2022

sagetrac-git mannequin commented Aug 4, 2022

sagetrac-git mannequin commented Aug 4, 2022

sagetrac-git mannequin commented Aug 4, 2022

sagetrac-git mannequin commented Aug 4, 2022

mkoeppe commented Aug 4, 2022

yuan-zhou commented Aug 4, 2022

mkoeppe commented Aug 4, 2022

This comment has been minimized.

yuan-zhou commented Aug 4, 2022

This comment has been minimized.

mkoeppe commented Aug 4, 2022

yuan-zhou commented Aug 4, 2022

mkoeppe commented Aug 4, 2022

mkoeppe commented Aug 4, 2022

mkoeppe commented Aug 4, 2022

mkoeppe commented Aug 4, 2022

mkoeppe commented Aug 4, 2022

yuan-zhou commented Aug 4, 2022

yuan-zhou commented Aug 4, 2022

yuan-zhou commented Aug 5, 2022

yuan-zhou commented Aug 5, 2022

vbraun commented Aug 6, 2022