optimize set_substation_ids using DBSCAN (750x time faster) 🚀 #799

mnm-matin · 2023-07-19T19:01:54Z

optimizes set_substation_ids #445

For US buses it would take 5 min, now it only takes 0.4 sec thanks to DBSCAN from sklearn. This is possible using a tree structure (BallTree) to represent the spatial index.
The approach can be further optimized by changing the params of DBSCAN such as the number of processes (maybe set it to -1). Setting min_samples to 0 can allow one to identify outliers that have a low density of points around them.

Note: only for US, the results are a little off from the results of the original set_substations_ids, but the number of unique cluster is actually lower so it should be better (22k compared to 24k).

Also Note: using dbscan for this task is a bit like using a rocket launcher to kill a bird

Here is a less optimized version that is also much faster and does not use sklearn or dbscan. This approach can be extended to the snapping buses to line as well

def set_substation_ids_eff(buses, distance_crs, tol=2000):
    """
    Function to set substations ids to buses, accounting for location
    tolerance.
    """
    buses = buses.copy()
    buses["station_id"] = -1

    # convert the geometry to the specified CRS for accurate distance calculation
    buses.geometry = buses.geometry.to_crs(distance_crs)

    # create a spatial index for efficient spatial queries
    spatial_index = buses.sindex

    station_id = 0
    for i, bus in buses.iterrows():
        if bus["station_id"] >= 0:
            continue

        # find the nearest buses within the specified tolerance
        possible_matches_index = list(spatial_index.intersection(bus.geometry.buffer(tol).bounds))
        possible_matches = buses.loc[possible_matches_index]
        precise_matches = possible_matches[possible_matches.distance(bus.geometry) <= tol]

        if precise_matches["station_id"].max() < 0:
            # when all station_ids are negative, then this is a new station id
            buses.loc[precise_matches.index, "station_id"] = station_id
            station_id += 1
        elif precise_matches["station_id"].min() < 0:
            # otherwise, when at least a station_id is non-negative, then pick the first value
            sub_id = precise_matches["station_id"].max()
            buses.loc[precise_matches.index, "station_id"] = sub_id

    return buses

thanks to @davide-f for providing the files

notebook I used for dev: https://github.com/mnm-matin/pypsa-africa/blob/substation_ids/on_pr.ipynb

mnm-matin · 2023-07-19T19:06:07Z

not sure why the notebook got added in the pr, will remove it

davide-f · 2023-07-19T21:17:44Z

Great @mnm-matin ! I quite like the approach and it is efficient (also in terms of lines of code :))!!!
Moreover we already have sklearn in the env so, it is quite smart, though, I'd make it explicit in the environment.

Regarding the notebook. Could you please either create a new PR with only the intended changes or force-push here only the intended changes without the notebook?
Otherwise the notebook remains in the history.

Would you be interested in adding the equivalent feature for grouping the substations? I think you could leverage a lot on the effort you made and improve the formulation quite a lot :)

for more information, see https://pre-commit.ci

mnm-matin · 2023-07-20T11:03:06Z

yeah, accidentally pushed the notebook, IU have removed it now.
Would be nice to merge it and see if results remain as expected.
I believe this pr is for grouping substations, do you mean set_lines_ids?

incase I don't have enough time to come back to this, the following code snippet should help with snapping point to lines (fix_overpassing_lines):

df_l = ng_lines.copy()
df_p = ng_buses.copy()

tolerance = 5000

    
# Buffer points to create areas for spatial join 
buffer_df = gpd.GeoDataFrame(geometry=df_p.buffer(tolerance))
    
# Spatial join to find lines intersecting point buffers
joined = gpd.sjoin(df_l, buffer_df, how="inner", op='intersects')
    
# Group by index to find 
grouped = joined.groupby('index_right')

basically it adds the index of the buses (that are in close proximity) to the lines. This can be post-processed to reach the desired formats. The important part is using spatial indeces (sindex) and using spatial joins (sjoin).

davide-f · 2023-07-21T11:26:09Z

Great matin! :D

So, I see two issues:

set_lines_ids that is actually a thorn and can be made more efficient, few ideas on that and
fix_overpassing_lines as you mentioned. That code seems highly efficient and very interesting to identify the overpassing lines; it is also important to track what are the buses that are related to that. However, for the current implementation all lines get selected: there is the need to ignore the ending and starting nodes by line, which is a bit tricky. However, the spatial join is interesting and to definitely consider

davide-f

This PR is ready to go, before merging, I'd wait for the other PRs and in particular #804 to be finalized.
This PR is addressing the v1.0 and it would be nice to keep the PRs more organized

davide-f · 2023-12-18T22:24:57Z

Closing in favor of #845

optimize set_substation_ids using DBSCAN

e4283bd

mnm-matin force-pushed the substation_ids branch from 57cad55 to e4283bd Compare July 20, 2023 11:01

[pre-commit.ci] auto fixes from pre-commit.com hooks

8aeb088

for more information, see https://pre-commit.ci

davide-f approved these changes Jul 23, 2023

View reviewed changes

mnm-matin changed the title ~~optimize set_substation_ids using DBSCAN~~ optimize set_substation_ids using DBSCAN (750x time faster) 🚀 Jul 25, 2023

davide-f mentioned this pull request Aug 16, 2023

re-write fix_overpassing_lines (800x faster) 🚀 #807

Closed

davide-f mentioned this pull request Aug 25, 2023

Improve computational performance of build_osm_network #845

Merged

8 tasks

davide-f closed this Dec 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

optimize set_substation_ids using DBSCAN (750x time faster) 🚀 #799

optimize set_substation_ids using DBSCAN (750x time faster) 🚀 #799

mnm-matin commented Jul 19, 2023 •

edited

Loading

mnm-matin commented Jul 19, 2023

davide-f commented Jul 19, 2023 •

edited

Loading

mnm-matin commented Jul 20, 2023 •

edited

Loading

davide-f commented Jul 21, 2023

davide-f left a comment

davide-f commented Dec 18, 2023

optimize set_substation_ids using DBSCAN (750x time faster) 🚀 #799

optimize set_substation_ids using DBSCAN (750x time faster) 🚀 #799

Conversation

mnm-matin commented Jul 19, 2023 • edited Loading

mnm-matin commented Jul 19, 2023

davide-f commented Jul 19, 2023 • edited Loading

mnm-matin commented Jul 20, 2023 • edited Loading

davide-f commented Jul 21, 2023

davide-f left a comment

Choose a reason for hiding this comment

davide-f commented Dec 18, 2023

mnm-matin commented Jul 19, 2023 •

edited

Loading

davide-f commented Jul 19, 2023 •

edited

Loading

mnm-matin commented Jul 20, 2023 •

edited

Loading