-
Notifications
You must be signed in to change notification settings - Fork 156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Binary Predicates Introduction and Benchmark Notebook #1156
Binary Predicates Introduction and Benchmark Notebook #1156
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
@@ -0,0 +1,1157 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line #8. [display(widgets.HBox(row)) for row in rows]
There's not need to use list comprehension here. Just
for row in rows:display(widgets.HBox(row))
Reply via ReviewNB
@@ -0,0 +1,1157 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace "+" mark with "-", I think "Point-point relationship" is more typical in showing a binary relationship.
Reply via ReviewNB
@@ -0,0 +1,1157 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a broken output in this cell? Or it's just the reviewNB was not able to show it.
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah it looks like ReviewNB can't display the HBoxed output areas that are defined by the import ipywidgets as widgets
call. It shows a visualization of the geometries used in test.
@@ -0,0 +1,1157 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are only two relationships that a point can have with another point: equality or inequality.
I understand what you mean here. Although for those who don't have much context in binpreds, point-point can also be disjoint, or intersects. But they can all be reduced to equal/inequality. Perhaps rephrase the word here for general audience?
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave it another pass.
@@ -0,0 +1,1157 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are three relationships that a Point can have with a LineString: Disjoint, Point is equal to one of the LineString boundary points, and point falls between the LineString boundary points.
Is there a name for the latter 2 relationships?
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that I'm specifically aware of. I assume that you're looking for one of the binary predicate names since I used disjoint for the first name?
Before I go further, may I ask who's the intended audience of this notebook? The first section is a deep dive into the situations between each possible combination of geometries, and this section makes me feel that it reveals too much of the underlying implementation detail to GIS users. If this is for documentation purposes, it also doesn't seem like it's clear enough that this is how the predicate is implemented. |
@@ -0,0 +1,1157 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line #2. """Times the speed of producing the test data as well as the predicate execution.
Why should you time the data production? Benchmark should only time the kernel.
Reply via ReviewNB
@@ -0,0 +1,1157 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line #9. 'GeoPandas predicate time',
I have forgotten the difference between "predicate time" and "time". You said at one point you want to remove one of them?
Reply via ReviewNB
@@ -0,0 +1,1157 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The notebook is largely benchmark, not binpred introduction. |
Just providing a super high level review here - the notebook is incredibly valuable for us as a team, it can alert us to regressions, track improvements, and show where we should put our focus in the future assuming further acceleration is needed. I think the other comments are correct in that this isn't the best "story" for highlighting the binpreds. Team facing, it's truly awesome, I think we could just also benefit from you making a more story-focused version that highlights some but not all of the binpreds more like the geocoding notebook. Let's work as a team to figure it out a good story - you've done amazing work to get the binpreds into cuSpatial and I want as many people as possible to see/touch it and have the i get it 💡 moment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Chatted with @thomcom offline and walked through the changes in this PR. In addition to the notebook, there are multiple performance optimization in the binary predicates and the GeoSeries infrastructure that the code base can all benefits.
python/cuspatial/cuspatial/tests/binpreds/test_binpred_test_dispatch.py
Outdated
Show resolved
Hide resolved
.devcontainer/devcontainer.json
Outdated
@@ -86,4 +86,6 @@ | |||
} | |||
} | |||
} | |||
|
|||
"forwardPorts": [8080] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird, I use notebooks in the dev container often and didn't need this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was needed to use snakeviz
with it, for some reason.
I think CI might be failing on the notebooks step because it doesn't know to look in subdirs like this was moved to. Does it make sense for this to be in https://github.com/rapidsai/cuspatial/tree/branch-23.08/python/cuspatial/benchmarks/api? |
It might pass CI if I put it in that folder? :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this PR introduces a new benchmarking notebook.
Is this notebook intended to be run in CI?
I think historically we've skipped benchmark notebooks in CI. See cuml
and cugraph
for reference:
- https://github.com/rapidsai/cudf/blob/017beb0be18c917c70f60c947e48d29a6e6f96b7/ci/test_notebooks.sh#L37
- https://github.com/rapidsai/cuml/blob/6bf61ca83d364094eb2a075b5ae9e9d5cdc9d42b/ci/test_notebooks.sh#L42
cuspatial
has a similar variable to skip notebook below:
cuspatial/ci/test_notebooks.sh
Line 37 in a09cc0d
SKIPNBS=""
Oh, we should probably skip the cuproj_benchmark.ipynb too. |
6abf517
to
ff386c0
Compare
@ajschmidt8 your review might get this in for 23.08 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ray beat me to it.
approving anyway for good measure 👍
/merge |
Thanks @raydouglass @ajschmidt8 ! Congrats @thomcom ! |
Description
Closes #1138
Closes #1141 here
Depends on #1152
Depends on #1064
Please direct your attention to the notebook since the dependencies and delayed state of CI issues over this week have put a lot of files into this PR.
This notebook demonstrates cuSpatial's new binary predicates on large datasets, benchmarking and comparing against the host implementation on GeoPandas.
In order to support the large inputs for these comparisons I had to reactivate the
pairwise_point_in_polygon
functionality that I'd previously written off. This is because quadtree doesn't support large N for NxN operations, since it is many-to-many, and brute-force would require a huge number of iterations to support such large dataframes. There are some more optimizations that can be made to speed uppairwise_point_in_polygon
, but the algorithm itself isn't sufficiently fast. It is detailed fairly closely in the notebook.Please take a look and let's have some conversations about steps forward.
Checklist