-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement query location method to search for place names #59
Conversation
@rth Ping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cbiehl!
Could you please merge the main branch in to enable CI and also add a few corresponding tests for this feature to test_pgeocode.py
?
A few comments below
pgeocode.py
Outdated
string containing place names to search for | ||
top_k: int | ||
maximum number of results (rows in DataFrame) to return | ||
fuzzy_threshold: int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make behavior not depend on installed libraries, let's make this None by default. And if it's None, no fuzzy search is done, and the thefuzz
is not used.
You could add a sentence above parameters in the description saying that to enable fuzzy search one need to provide the fuzzy_threshold parameter. And then in the description of this parameter mention what are good default values and link to the documentation of thefuzz because right it's not very clear how to interpret this value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adapted it, let me know if this works for you
pgeocode.py
Outdated
except ImportError: | ||
return pd.DataFrame(columns=self._data.columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here we should fail, with something like,
except ImportError: | |
return pd.DataFrame(columns=self._data.columns) | |
except ImportError as err: | |
raise ImportError( | |
"Cannot use fuzzy search without 'thefuzz' package. " | |
"It can be installed with: pip install thefuzz" | |
) from err |
pgeocode.py
Outdated
if max_score >= threshold: | ||
return self._data[fuzzy_scores == max_score] | ||
else: | ||
return pd.DataFrame(columns=self._data.columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand what is happening with the max_score there. I would have expected this to be something like,
if max_score >= threshold: | |
return self._data[fuzzy_scores == max_score] | |
else: | |
return pd.DataFrame(columns=self._data.columns) | |
mask = fuzzy_scores >= threshold | |
return self._data[mask] |
which should also work when no row matches. Though I haven't looked at thefuzz package in detail.
Otherwise threshold is never used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea was that as a user, you would like to obtain the best matching results according to the scores output by thefuzz
, not just all results above the threshold. But we can change it of course, then the end user is responsible for further filtering the output dataframe if there are many matches above the fuzzy threshold (or just setting a higher threshold).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adapted it as you suggested
pgeocode.py
Outdated
text_len = len(text) | ||
match_mask = candidates.str.lower().str.contains(text.lower()) | ||
if match_mask.sum() == 0: | ||
return pd.DataFrame(columns=self._data.columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what the lines below do,
Shouldn't this be just,
return self._data[match_mask]
including for an empty mask? But maybe I'm missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for the if statement is that the following three lines of code should only be executed if there are any rows containing the queried place name. The following three lines ensure that the match with the closest string length to the input query string is returned (this only makes sense if there are any matches at all). The reasoning was that an end user may want to best string match, not all rows containing the string. But we can change it of course, then the end user is responsible for any further filtering in case there are multiple rows in the return dataframe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adapted this as suggested
Also let's add,
to setup.py to make it an optional dependency, and,
|
Thanks for the review! I'll push an update later this week. |
Thanks a lot @cbiehl ! |
This implements the
Nominatim.query_location
method for searching for place names with a simple string contains search and (if users install thethefuzz
package) a fuzzy string search based on edit distance. Please let me know if you need any changes or unit tests to consider a merge.Closes #29