Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement query location method to search for place names #59

Merged
merged 10 commits into from
Dec 13, 2022

Conversation

cbiehl
Copy link
Contributor

@cbiehl cbiehl commented Mar 18, 2022

This implements the Nominatim.query_location method for searching for place names with a simple string contains search and (if users install the thefuzz package) a fuzzy string search based on edit distance. Please let me know if you need any changes or unit tests to consider a merge.

Closes #29

@cbiehl
Copy link
Contributor Author

cbiehl commented Apr 1, 2022

@rth Ping

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cbiehl!

Could you please merge the main branch in to enable CI and also add a few corresponding tests for this feature to test_pgeocode.py ?

A few comments below

pgeocode.py Outdated Show resolved Hide resolved
pgeocode.py Outdated
string containing place names to search for
top_k: int
maximum number of results (rows in DataFrame) to return
fuzzy_threshold: int
Copy link
Member

@rth rth Apr 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make behavior not depend on installed libraries, let's make this None by default. And if it's None, no fuzzy search is done, and the thefuzz is not used.

You could add a sentence above parameters in the description saying that to enable fuzzy search one need to provide the fuzzy_threshold parameter. And then in the description of this parameter mention what are good default values and link to the documentation of thefuzz because right it's not very clear how to interpret this value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adapted it, let me know if this works for you

pgeocode.py Outdated
Comment on lines 360 to 361
except ImportError:
return pd.DataFrame(columns=self._data.columns)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here we should fail, with something like,

Suggested change
except ImportError:
return pd.DataFrame(columns=self._data.columns)
except ImportError as err:
raise ImportError(
"Cannot use fuzzy search without 'thefuzz' package. "
"It can be installed with: pip install thefuzz"
) from err

pgeocode.py Outdated
Comment on lines 365 to 368
if max_score >= threshold:
return self._data[fuzzy_scores == max_score]
else:
return pd.DataFrame(columns=self._data.columns)
Copy link
Member

@rth rth Apr 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand what is happening with the max_score there. I would have expected this to be something like,

Suggested change
if max_score >= threshold:
return self._data[fuzzy_scores == max_score]
else:
return pd.DataFrame(columns=self._data.columns)
mask = fuzzy_scores >= threshold
return self._data[mask]

which should also work when no row matches. Though I haven't looked at thefuzz package in detail.

Otherwise threshold is never used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was that as a user, you would like to obtain the best matching results according to the scores output by thefuzz, not just all results above the threshold. But we can change it of course, then the end user is responsible for further filtering the output dataframe if there are many matches above the fuzzy threshold (or just setting a higher threshold).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adapted it as you suggested

pgeocode.py Outdated
text_len = len(text)
match_mask = candidates.str.lower().str.contains(text.lower())
if match_mask.sum() == 0:
return pd.DataFrame(columns=self._data.columns)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what the lines below do,

Shouldn't this be just,

return self._data[match_mask]

including for an empty mask? But maybe I'm missing something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for the if statement is that the following three lines of code should only be executed if there are any rows containing the queried place name. The following three lines ensure that the match with the closest string length to the input query string is returned (this only makes sense if there are any matches at all). The reasoning was that an end user may want to best string match, not all rows containing the string. But we can change it of course, then the end user is responsible for any further filtering in case there are multiple rows in the return dataframe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adapted this as suggested

@rth
Copy link
Member

rth commented Apr 2, 2022

Also let's add,

    extras_require={
        "fuzzy": ["thefuzz"],
    },

to setup.py to make it an optional dependency, and,

@cbiehl
Copy link
Contributor Author

cbiehl commented Apr 4, 2022

Thanks for the review! I'll push an update later this week.

@rth
Copy link
Member

rth commented Dec 13, 2022

Thanks a lot @cbiehl !

@rth rth merged commit 40301e1 into symerio:main Dec 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Query location not working at all.
2 participants