Implement query location method to search for place names #59

cbiehl · 2022-03-18T19:08:37Z

This implements the Nominatim.query_location method for searching for place names with a simple string contains search and (if users install the thefuzz package) a fuzzy string search based on edit distance. Please let me know if you need any changes or unit tests to consider a merge.

Closes #29

cbiehl · 2022-04-01T06:04:01Z

@rth Ping

rth

Thanks @cbiehl!

Could you please merge the main branch in to enable CI and also add a few corresponding tests for this feature to test_pgeocode.py ?

A few comments below

pgeocode.py

rth · 2022-04-02T15:22:35Z

pgeocode.py

+          string containing place names to search for
+        top_k: int
+          maximum number of results (rows in DataFrame) to return
+        fuzzy_threshold: int


So that behavior does not depend on which libraries are installed, let's make this None by default. If it's None, no fuzzy search is done and thefuzz is not used.

You could add a sentence above the parameters in the description saying that to enable fuzzy search one needs to provide the fuzzy_threshold parameter. Then, in the description of this parameter, mention what good default values are and link to the documentation of thefuzz, because right now it's not very clear how to interpret this value.

Adapted it, let me know if this works for you

rth · 2022-04-02T15:24:44Z

pgeocode.py

+        except ImportError:
+            return pd.DataFrame(columns=self._data.columns)


I think here we should fail, with something like,

Suggested change

-        except ImportError:
-            return pd.DataFrame(columns=self._data.columns)
+        except ImportError as err:
+            raise ImportError(
+                "Cannot use fuzzy search without 'thefuzz' package. "
+                "It can be installed with: pip install thefuzz"
+            ) from err
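To illustrate the pattern suggested above (fail loudly instead of silently returning an empty result when an optional dependency is missing), here is a minimal, standard-library-only sketch. The helper name `load_optional` and its parameters are hypothetical and are not part of pgeocode; the error message simply mirrors the wording proposed in the review.

```python
import importlib


def load_optional(module_name: str, feature: str):
    """Import an optional dependency, or raise a descriptive ImportError.

    Hypothetical helper for illustration only; not part of pgeocode.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        # Re-raise with an actionable message and keep the original
        # exception attached as __cause__ via "from err".
        raise ImportError(
            f"Cannot use {feature} without '{module_name}' package. "
            f"It can be installed with: pip install {module_name}"
        ) from err


if __name__ == "__main__":
    # A module that exists is returned as usual.
    print(load_optional("json", "JSON parsing").__name__)
    # A module that does not exist produces the descriptive error.
    try:
        load_optional("no_such_module_for_demo", "fuzzy search")
    except ImportError as exc:
        print(exc)
```

Chaining with `from err` keeps the original traceback visible, so the user can see both the install hint and the underlying import failure.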

rth · 2022-04-02T15:27:36Z

pgeocode.py

+        if max_score >= threshold:
+            return self._data[fuzzy_scores == max_score]
+        else:
+            return pd.DataFrame(columns=self._data.columns)


I don't really understand what is happening with the max_score there. I would have expected this to be something like,

Suggested change

-        if max_score >= threshold:
-            return self._data[fuzzy_scores == max_score]
-        else:
-            return pd.DataFrame(columns=self._data.columns)
+        mask = fuzzy_scores >= threshold
+        return self._data[mask]

which should also work when no row matches. Though I haven't looked at thefuzz package in detail.

Otherwise threshold is never used.

The idea was that, as a user, you would like to obtain the best-matching results according to the scores output by thefuzz, not just all results above the threshold. But we can change it, of course. The end user is then responsible for further filtering the output dataframe if there are many matches above the fuzzy threshold (or for just setting a higher threshold).

Adapted it as you suggested
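The difference between the two behaviors discussed in this thread can be sketched as follows. This is not the PR's code: to stay dependency-free it uses `difflib.SequenceMatcher` from the standard library as a stand-in scorer on a 0-100 scale (thefuzz may produce different scores), and plain lists instead of a DataFrame. The function names `best_only` and `all_above` are hypothetical.

```python
from difflib import SequenceMatcher


def score(query: str, candidate: str) -> int:
    """Similarity on a 0-100 scale (difflib stand-in, not thefuzz)."""
    ratio = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
    return round(100 * ratio)


def best_only(query, names, threshold):
    """Original behavior: only the top-scoring names, if they pass the threshold."""
    scores = [score(query, n) for n in names]
    max_score = max(scores, default=0)
    if max_score >= threshold:
        return [n for n, s in zip(names, scores) if s == max_score]
    return []


def all_above(query, names, threshold):
    """Suggested behavior: every name whose score reaches the threshold."""
    return [n for n in names if score(query, n) >= threshold]


if __name__ == "__main__":
    names = ["Berlin", "Bern", "Bernau", "Paris"]
    print(best_only("Berln", names, 80))
    print(all_above("Berln", names, 80))
```

With the threshold-only variant, the result can contain several rows, so callers who want a single best match need to sort or filter the output themselves, or raise the threshold.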

rth · 2022-04-02T15:29:11Z

pgeocode.py

+        text_len = len(text)
+        match_mask = candidates.str.lower().str.contains(text.lower())
+        if match_mask.sum() == 0:
+            return pd.DataFrame(columns=self._data.columns)


I don't understand what the lines below do,

Shouldn't this be just,

return self._data[match_mask]

including for an empty mask? But maybe I'm missing something.

The reason for the if statement is that the following three lines of code should only be executed if there are any rows containing the queried place name. Those three lines ensure that the match whose string length is closest to that of the input query string is returned (this only makes sense if there are any matches at all). The reasoning was that an end user may want the best string match, not all rows containing the string. But we can change it, of course. The end user is then responsible for any further filtering in case there are multiple rows in the returned dataframe.

Adapted this as suggested
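A small pandas sketch of the two variants discussed in this thread, under stated assumptions: the toy DataFrame, the function names `contains_search` and `closest_length`, and the `regex=False` / `na=False` arguments are illustrative choices made here, not the code from the PR. It shows that boolean indexing with an all-False mask already yields an empty frame with the same columns, and how a closest-length filter could be applied afterwards by the caller.

```python
import pandas as pd

# Toy data for illustration only.
data = pd.DataFrame(
    {
        "place_name": ["Frankfurt am Main", "Frankfurt (Oder)", "Berlin"],
        "state_name": ["Hessen", "Brandenburg", "Berlin"],
    }
)


def contains_search(df: pd.DataFrame, text: str, col: str = "place_name") -> pd.DataFrame:
    """Case-insensitive substring search; works for an empty mask too."""
    # regex=False treats the query literally; na=False maps missing names to False.
    match_mask = df[col].str.lower().str.contains(text.lower(), regex=False, na=False)
    return df[match_mask]


def closest_length(matches: pd.DataFrame, text: str, col: str = "place_name") -> pd.DataFrame:
    """Optional post-filter: keep rows whose name length is closest to the query length."""
    if matches.empty:
        return matches
    diff = (matches[col].str.len() - len(text)).abs()
    return matches[diff == diff.min()]


if __name__ == "__main__":
    hits = contains_search(data, "frankfurt")
    print(hits)
    print(closest_length(hits, "frankfurt"))
    print(contains_search(data, "xyz"))  # empty, but keeps the columns
```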

rth · 2022-04-02T15:37:04Z

Also let's add,

    extras_require={
        "fuzzy": ["thefuzz"],
    },

to setup.py to make it an optional dependency, and,

- it would be good to add an example of using this to the readme https://github.com/symerio/pgeocode#quickstart
- please add a changelog entry under the "[unreleased]" version on the top.

cbiehl · 2022-04-04T06:54:48Z

Thanks for the review! I'll push an update later this week.

rth · 2022-12-13T22:12:41Z

Thanks a lot @cbiehl !

Implement query location method to search for place names

bcd250b

rth reviewed Apr 2, 2022

cbiehl and others added 9 commits April 5, 2022 21:00

Adapt query location text search as requested

4c1592a

Add thefuzz to setup.py and add test cases

a5a90b7

Add changelog entry for query location

40d8681

Merge branch 'symerio:main' into main

c171a64

Add query location examples to readme

cba6476

Add thefuzz dependency to test execution

b704441

Reformat long lines

b7ded1e

Merge remote-tracking branch 'upstream/main' into cbiehl/main

878655f

Wrap to shorted docstring lines

763ebe2

rth merged commit 40301e1 into symerio:main Dec 13, 2022

This was referenced Dec 13, 2022

Fix Nominatim.query_location when query column contains NaN #67

Merged

City and State Lookup #31

Closed


