Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize set_substation_ids using DBSCAN (750x time faster) 🚀 #799

Closed
wants to merge 2 commits into from

Conversation

mnm-matin
Copy link
Member

@mnm-matin mnm-matin commented Jul 19, 2023

optimizes set_substation_ids #445

For US buses it would take 5 min, now it only takes 0.4 sec thanks to DBSCAN from sklearn. This is possible using a tree structure (BallTree) to represent the spatial index.
The approach can be further optimized by changing the params of DBSCAN such as the number of processes (maybe set it to -1). Setting min_samples to 0 can allow one to identify outliers that have a low density of points around them.

Note: only for US, the results are a little off from the results of the original set_substations_ids, but the number of unique cluster is actually lower so it should be better (22k compared to 24k).

Also Note: using dbscan for this task is a bit like using a rocket launcher to kill a bird

Here is a less optimized version that is also much faster and does not use sklearn or dbscan. This approach can be extended to the snapping buses to line as well

def set_substation_ids_eff(buses, distance_crs, tol=2000):
    """
    Function to set substations ids to buses, accounting for location
    tolerance.
    """
    buses = buses.copy()
    buses["station_id"] = -1

    # convert the geometry to the specified CRS for accurate distance calculation
    buses.geometry = buses.geometry.to_crs(distance_crs)

    # create a spatial index for efficient spatial queries
    spatial_index = buses.sindex

    station_id = 0
    for i, bus in buses.iterrows():
        if bus["station_id"] >= 0:
            continue

        # find the nearest buses within the specified tolerance
        possible_matches_index = list(spatial_index.intersection(bus.geometry.buffer(tol).bounds))
        possible_matches = buses.loc[possible_matches_index]
        precise_matches = possible_matches[possible_matches.distance(bus.geometry) <= tol]

        if precise_matches["station_id"].max() < 0:
            # when all station_ids are negative, then this is a new station id
            buses.loc[precise_matches.index, "station_id"] = station_id
            station_id += 1
        elif precise_matches["station_id"].min() < 0:
            # otherwise, when at least a station_id is non-negative, then pick the first value
            sub_id = precise_matches["station_id"].max()
            buses.loc[precise_matches.index, "station_id"] = sub_id

    return buses

thanks to @davide-f for providing the files

notebook I used for dev: https://github.com/mnm-matin/pypsa-africa/blob/substation_ids/on_pr.ipynb

@mnm-matin
Copy link
Member Author

not sure why the notebook got added in the pr, will remove it

@davide-f
Copy link
Member

davide-f commented Jul 19, 2023

Great @mnm-matin ! I quite like the approach and it is efficient (also in terms of lines of code :))!!!
Moreover we already have sklearn in the env so, it is quite smart, though, I'd make it explicit in the environment.

Regarding the notebook. Could you please either create a new PR with only the intended changes or force-push here only the intended changes without the notebook?
Otherwise the notebook remains in the history.

Would you be interested in adding the equivalent feature for grouping the substations? I think you could leverage a lot on the effort you made and improve the formulation quite a lot :)

@mnm-matin
Copy link
Member Author

mnm-matin commented Jul 20, 2023

yeah, accidentally pushed the notebook, IU have removed it now.
Would be nice to merge it and see if results remain as expected.
I believe this pr is for grouping substations, do you mean set_lines_ids?

incase I don't have enough time to come back to this, the following code snippet should help with snapping point to lines (fix_overpassing_lines):

df_l = ng_lines.copy()
df_p = ng_buses.copy()

tolerance = 5000

    
# Buffer points to create areas for spatial join 
buffer_df = gpd.GeoDataFrame(geometry=df_p.buffer(tolerance))
    
# Spatial join to find lines intersecting point buffers
joined = gpd.sjoin(df_l, buffer_df, how="inner", op='intersects')
    
# Group by index to find 
grouped = joined.groupby('index_right')

basically it adds the index of the buses (that are in close proximity) to the lines. This can be post-processed to reach the desired formats. The important part is using spatial indeces (sindex) and using spatial joins (sjoin).

@davide-f
Copy link
Member

Great matin! :D

So, I see two issues:

  1. set_lines_ids that is actually a thorn and can be made more efficient, few ideas on that and
  2. fix_overpassing_lines as you mentioned. That code seems highly efficient and very interesting to identify the overpassing lines; it is also important to track what are the buses that are related to that. However, for the current implementation all lines get selected: there is the need to ignore the ending and starting nodes by line, which is a bit tricky. However, the spatial join is interesting and to definitely consider

Copy link
Member

@davide-f davide-f left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is ready to go, before merging, I'd wait for the other PRs and in particular #804 to be finalized.
This PR is addressing the v1.0 and it would be nice to keep the PRs more organized

@mnm-matin mnm-matin changed the title optimize set_substation_ids using DBSCAN optimize set_substation_ids using DBSCAN (750x time faster) 🚀 Jul 25, 2023
@davide-f
Copy link
Member

Closing in favor of #845

@davide-f davide-f closed this Dec 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants