optimize set_substation_ids using DBSCAN (750x time faster) 🚀 #799
Not sure why the notebook got added to the PR; I will remove it.
Great, @mnm-matin! I quite like the approach, and it is efficient (also in terms of lines of code :))! Regarding the notebook: could you please either create a new PR with only the intended changes, or force-push here only the intended changes, without the notebook? Would you be interested in adding the equivalent feature for grouping the substations? I think you could leverage a lot of the effort you made and improve the formulation quite a lot :)
Yeah, I accidentally pushed the notebook; I have removed it now. In case I don't have enough time to come back to this, the following code snippet should help with snapping points to lines (fix_overpassing_lines):

    import geopandas as gpd

    df_l = ng_lines.copy()
    df_p = ng_buses.copy()
    tolerance = 5000  # buffer radius, in the units of the CRS
    # Buffer points to create areas for the spatial join
    buffer_df = gpd.GeoDataFrame(geometry=df_p.buffer(tolerance))
    # Spatial join to find lines intersecting the point buffers
    # (`predicate` replaces the deprecated `op` keyword in newer geopandas)
    joined = gpd.sjoin(df_l, buffer_df, how="inner", predicate="intersects")
    # Group by bus index to find the lines close to each bus
    grouped = joined.groupby("index_right")

Basically, it adds the index of the buses (that are in close proximity) to the lines. This can be post-processed into the desired format. The important part is using spatial indices (sindex) and spatial joins (sjoin).
Great matin! :D So, I see two issues:
This PR is ready to go. Before merging, I'd wait for the other PRs, and in particular #804, to be finalized.
This PR addresses v1.0, and it would be nice to keep the PRs more organized.
Closing in favor of #845
optimizes set_substation_ids #445
For US buses it used to take 5 min; now it takes only 0.4 s thanks to DBSCAN from sklearn. This is possible because a tree structure (BallTree) is used as the spatial index.
The approach can be further optimized by tuning DBSCAN's parameters, such as the number of parallel jobs (n_jobs, maybe set to -1 to use all cores). Note that with min_samples=1 every bus ends up in a cluster; raising min_samples above 1 instead lets one identify outliers that have a low density of points around them (DBSCAN labels them -1).
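For readers who want to try the idea, here is a minimal sketch of the DBSCAN clustering described above. It is not the code from this PR: the function name `cluster_buses`, the 2000 m tolerance, and the haversine metric are assumptions chosen for illustration; only the use of sklearn's DBSCAN with a BallTree comes from the description.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, used to convert metres to radians


def cluster_buses(lat_deg, lon_deg, tol_m=2000.0, min_samples=1, n_jobs=None):
    """Return one cluster label per bus (hypothetical helper, not from the PR).

    Buses closer than tol_m, directly or through a chain of neighbours,
    receive the same label. tol_m=2000 is an assumed value.
    """
    # The haversine metric expects [latitude, longitude] in radians.
    coords = np.radians(np.column_stack([lat_deg, lon_deg]))
    model = DBSCAN(
        eps=tol_m / EARTH_RADIUS_M,  # search radius expressed in radians
        min_samples=min_samples,     # 1 means every point joins a cluster
        algorithm="ball_tree",       # BallTree as the spatial index
        metric="haversine",
        n_jobs=n_jobs,               # -1 uses all available cores
    )
    return model.fit_predict(coords)


# Two buses about 550 m apart and one bus about 111 km away.
lat = [40.000, 40.005, 41.000]
lon = [-75.0, -75.0, -75.0]

labels = cluster_buses(lat, lon)                   # the far bus gets its own cluster
outliers = cluster_buses(lat, lon, min_samples=2)  # the far bus is labelled -1 (noise)
```

With `min_samples=1` the labels can be used directly as substation ids; with a larger `min_samples`, the `-1` entries mark isolated buses that would need separate handling.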
Note: for the US only, the results differ slightly from those of the original set_substation_ids, but the number of unique clusters is actually lower (22k compared to 24k), so it should be better.
Also note: using DBSCAN for this task is a bit like using a rocket launcher to kill a bird.
Here is a less optimized version that is also much faster and does not use sklearn or DBSCAN. This approach can be extended to snapping buses to lines as well.
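The snippet referred to above is not preserved in this thread. As a rough illustration of what a dependency-free alternative could look like, the sketch below groups points with a grid hash and a union-find structure. The function name `group_points`, the use of projected coordinates, and the tolerance value are all assumptions; this is not the author's version.

```python
from collections import defaultdict
from math import floor, hypot


def group_points(points, tol):
    """Return one group id per point (hypothetical helper, not from the PR).

    Points within tol of each other, directly or through a chain, share an id.
    Coordinates are assumed to be projected (same units as tol).
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Hash each point into a square cell of side tol, so that any neighbour
    # within tol must lie in the same cell or one of the 8 adjacent cells.
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(floor(x / tol), floor(y / tol))].append(idx)

    for (cx, cy), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i < j:
                            dist = hypot(points[i][0] - points[j][0],
                                         points[i][1] - points[j][1])
                            if dist <= tol:
                                parent[find(i)] = find(j)  # merge the two groups

    # Relabel roots as consecutive ids in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]


groups = group_points([(0, 0), (3, 0), (6, 0), (100, 100)], tol=5)
print(groups)  # [0, 0, 0, 1]
```

The first three points are chained together even though the outer two are 6 units apart, which mirrors how DBSCAN with `min_samples=1` merges density-connected buses.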
thanks to @davide-f for providing the files
notebook I used for dev: https://github.com/mnm-matin/pypsa-africa/blob/substation_ids/on_pr.ipynb