Better threshold metric for fuzzy_join #470
Hello, thanks for tagging me! In fact, there are a few things that I think are worth considering. Specifically, there are various issues involved in using a simple threshold that should be kept in mind, and that would justify using relative metrics (rather than, or in conjunction with, absolute metrics). In short, deciding the threshold a priori is hard, and a wrong choice can lead to low recall (threshold too high) or low precision (threshold too low). Consider for example a case like this (X1 and X2 are distances from a target point):
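(The original figure is not shown here; the following is a made-up numeric stand-in for the same situation, with all distances hypothetical.)

```python
import numpy as np

# Made-up distances from two target points to their candidate matches.
# In the X1 neighborhood every candidate is very close, including the
# false positives; in X2 the true match clearly stands out.
X1 = np.array([0.010, 0.011, 0.012])  # true match first, then false positives
X2 = np.array([0.010, 0.800, 0.900])

threshold = 0.05  # an "absolute" threshold chosen a priori
for name, dists in [("X1", X1), ("X2", X2)]:
    accepted = int((dists <= threshold).sum())
    print(f"{name}: accepted {accepted} of {len(dists)} candidates")
# X1: accepted 3 of 3 candidates -> the false positives slip through
# X2: accepted 1 of 3 candidates -> the threshold behaves as intended
```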
In the X1 case, the neighbors are all very close to the target point, even though they actually represent false positives. Here, even a high threshold would not be sufficient to distinguish between matches. It may be possible to mitigate the issue of having an "absolute threshold" somewhat by taking some of the nearest neighbors of a target point, averaging their distances, and "normalizing" each distance by said average. The threshold would then be relative to the neighborhood of the point. I wonder if experiments could be run to see whether a good value for this "normalized threshold" can be found. I think that the best way of implementing a relative metric is to introduce a bijective metric, which relies on the fact that the neighbors of a target point are ranked by their similarity to the point itself.
So, given a set of points, the bijective metric would iterate over all of them, looking for the highest-ranked neighbor of each and keeping only pairs whose relationship is reciprocal. Consider this script:

```python
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.neighbors import NearestNeighbors

X = np.array([
    [1.1, 0.5],
    [0.2, 0.5],
    [0.1, 0.1],
])
X_df = pd.DataFrame(X).reset_index()
X_df.columns = ["point", "x", "y"]

ax = sns.scatterplot(data=X_df, x="x", y="y", hue="point", palette="tab10")
ax.set_xlim([0, 1.5])
ax.set_ylim([0, 1.5])

neigh = NearestNeighbors(n_neighbors=2)
neigh.fit(X)
# The first neighbor of each point is the point itself, so the actual
# nearest neighbor is in the second column.
distance, neighbors = neigh.kneighbors(X, return_distance=True)
for target, closest_points in enumerate(neighbors):
    _, closest = closest_points
    print(f"This is point {target}")
    # closest to point
    c2p = closest
    print(f"The closest point to point {target} is {c2p}")
    # closest to closest
    c2c = neighbors[closest, 1]
    print(f"The closest point to {c2p} is {c2c}")
    if c2c == target:
        print(f"There is a bijective relationship between {c2p} and {c2c}")
    print("##")
```

This bijective metric is very effective at increasing the matching precision, and it does not penalize recall much because it is completely independent of any threshold: there is no risk of false negatives caused by a threshold set too high. The main drawback of this metric is that it works well only for 1-to-1 matches, since matching multiple entities requires increasing the pool of candidates to mark as "matches". This was a very long-winded way of saying that selecting the threshold is not a simple issue, and that using only a single value can lead to unforeseen consequences. In any case, I'd love to continue the discussion later, and possibly look into implementing some of these solutions in the code!
Hi, @rcap107, thanks for this explanation!
The closest thing that came to my mind is the Local Outlier Factor in scikit-learn. It does the opposite of what you're looking for: it finds outliers by computing the local density from their nearest neighbors. You may find this section of the code helpful. One property of LOF is that it doesn't rely on a threshold; instead, it uses a relative score derived from comparing each point's local density to that of its neighbors. I wonder if there is a way to transfer these concepts to fuzzy matching. WDYT? cc @GaelVaroquaux, who might help with better understanding LOF.
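For reference, a minimal LOF example on synthetic data, showing the threshold-free score:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, size=(50, 2)),  # a dense cluster
               [[2.0, 2.0]]])                     # one isolated point

lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)            # -1 = outlier, 1 = inlier
scores = lof.negative_outlier_factor_  # relative, density-based score

print(labels[-1], scores[-1])  # the isolated point gets a strongly negative score
```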
Following up on the skrub meeting discussion, and relevant for #821:

### Parametrizing the fuzzy join threshold

In fuzzy join, each row of the main table is matched to its nearest neighbor in the auxiliary table, and the match is accepted or rejected by comparing its distance to a threshold.
The question is how we specify that threshold. In some cases the vectorized rows may be in a space where distances have a meaningful unit. In other cases the distances don't have a meaningful scale, so we want to rescale them to make it easier for users to select a grid of possible thresholds.
Depending on that choice, we also have the option, as is currently done now, of defining a score as (1 - rescaled distance) and providing the threshold as a min score rather than a max distance.

### Questions

I would love some opinions on these points (discussed as i.–iv. in the replies below).
Note that the score/distance parametrization does not change the ordering of matched pairs, so if the threshold is well set with a grid search the result should be the same -- the question is which parametrization makes it easy to choose a good grid.
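To make the two parametrizations concrete, a small sketch; using the max observed distance as the rescaling reference is only a placeholder, since picking the reference is exactly the open question here:

```python
import numpy as np

distances = np.array([0.2, 0.5, 1.3, 4.0])  # nearest-neighbor distances

# Rescale by some reference so that thresholds live on a unitless scale.
reference = distances.max()  # placeholder reference
rescaled = distances / reference
scores = 1 - rescaled        # the score parametrization

max_distance = 0.5  # threshold as a max rescaled distance...
min_score = 0.5     # ...or equivalently as a min score
assert np.array_equal(rescaled <= max_distance, scores >= min_score)
```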
Thank you for putting all these thoughts together, @jeromedockes.
- i. I'm +1 for sampling random rows and computing a sufficient statistic to get the reference distance. This would represent the "global distance" between the left row and the right auxiliary dataset, and we are assured that this reference distance is not close to 0.
If a candidate's distance is higher than the reference, it is safe to discard this candidate since it is worse than the global distance, so a threshold of 1 would mean "accepting a candidate's distance equal to the global distance", which is very permissive.
I also think this one is probably a good choice. As it is a bit more complex than the others and will, for example, require choosing a number of pairs to sample, we may want to leave it for another PR?
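A sketch of what option i could look like; the function name, the median as the sufficient statistic, and the number of sampled pairs are all assumptions:

```python
import numpy as np

def reference_distance(left_vecs, right_vecs, n_pairs=1000, seed=0):
    """Estimate a 'global' distance by sampling random (left, right) pairs."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(left_vecs), n_pairs)
    j = rng.integers(0, len(right_vecs), n_pairs)
    dists = np.linalg.norm(left_vecs[i] - right_vecs[j], axis=1)
    return np.median(dists)  # robust and cheap sufficient statistic

# A candidate whose distance reaches this reference is no better than a
# random pairing, so rescaled = d / reference with a threshold of 1 is
# very permissive, as noted above.
```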
- ii and iv. I'm +1 to keep the "fast path" and maybe even to extend it using a parameter `joining_kind` (or `kind`), whose options are "fuzzy" (default), "exact" (which calls pandas `merge`) and "asof" (which calls pandas `merge_asof`).
This would allow grid search in a pipeline and better emphasize that the fuzzy join is a generalization of the merge operation.
This would also address the use case of joining on numerical or datetime values whose distances have units.
The distance/score parameter would then only apply to the fuzzy case, making the API of `fuzzy_join` easier to grasp (a rough sketch of this hypothetical API follows).
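A purely hypothetical sketch of such an API -- not skrub's actual signature:

```python
import pandas as pd

def fuzzy_join(left, right, on, kind="fuzzy", similarity_threshold=None):
    # Hypothetical signature, for discussion only.
    if kind == "exact":
        return left.merge(right, on=on)  # plain pandas merge
    if kind == "asof":
        return pd.merge_asof(left.sort_values(on), right.sort_values(on), on=on)
    # kind == "fuzzy": vectorize the join columns, find nearest
    # neighbors, and keep pairs above similarity_threshold.
    raise NotImplementedError
```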
As they do something different and thus need different parameters, wouldn't it be simpler to have different estimators, one for each kind of join, in the same way that pandas and polars have separate `join` and `join_asof`?
- iii. Following my previous suggestion of introducing a `joining_kind` parameter, I'm +1 for a score rather than a distance. "Score" in itself sounds too vague, and I guess `similarity_threshold` is more explicit for the data scientist.
`similarity_threshold` sounds good, still with the caveat that it will be awkward for reference distances (such as the nearest-neighbor distance within the aux table, as in autofuzzyjoin) that aren't guaranteed to be larger than the actual distance (i.e., they could yield negative scores).
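A quick numeric illustration of that caveat:

```python
reference = 0.5  # e.g. nearest-neighbor distance within the aux table
distance = 0.8   # actual distance of the matched pair

score = 1 - distance / reference
print(score)  # -0.6: a "similarity" below zero, which is awkward to threshold
```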
In any case, I think it is critical to emphasize this in the documentation of `fuzzy_join` and `Joiner`, to better convey what skrub offers. Maybe even go as far as listing the use cases and giving our preferred way of joining.
I agree, especially as the interpretation of fuzzy_join with multiple columns is not really obvious: we have a kind of compound score that mixes similarities between different things.
My take is that we should have two options, one ~absolute and one relative. This way we can default to the absolute metric, and let people choose the relative metric if they expect roughly one good match.
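A sketch of what the two options could look like; using the best-to-second-best distance ratio as the relative metric is an assumption here, in line with expecting roughly one good match:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def match_scores(left_vecs, right_vecs, metric="absolute"):
    nn = NearestNeighbors(n_neighbors=2).fit(right_vecs)
    dists, _ = nn.kneighbors(left_vecs)
    if metric == "absolute":
        return dists[:, 0]  # raw nearest-neighbor distance
    # relative: how much the best match stands out from the second best
    return dists[:, 0] / (dists[:, 1] + 1e-12)
```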
I agree with you after reading merge_asof more carefully. However, I'm still advocating for a parameter like `joining_kind` (or `kind`) to cover the exact-matching case.
Thanks for the work and for putting together the notes! I agree with @LeoGrin on having both relative and absolute metrics: a relative metric should work better if the user does not know much about the data (here I am assuming there is some relative distance implemented), whereas an absolute metric should allow "real world" metrics to be used (e.g. join if the distance is < 1 km, or if the time difference is < 1 min). This also means I'm in favor of disabling rescaling if necessary and useful. On the score vs distance discussion:
I am strongly against any kind of similarity metric where a threshold of 1.0 does not mean "exact matching". I find this would be extremely confusing for users (and I place myself in that group). In a similar vein, if I were to set the distance threshold to 0 (similarity to 1), then I would expect the code to perform exact matching only, though in this case one might make the argument of "why are you joining with fuzzy_join in the first place" 🤔. Either way, I agree with @Vincent-Maladiere that there should be a parameter to perform exact matching in those cases.
Thanks @jeromedockes for this great summary. Here is my opinion on this:

i. and ii. I like the

iii. No strong opinion on this, and I agree with @rcap107 and @Vincent-Maladiere that the most important thing is that it's user-friendly. Either way you see it, we might have 1 as perfect similarity / no distance and 0 as no similarity / some distance.

iv. If possible yes, but I think this is for later: it depends on the code structure that will emerge from i.–iii.
I think this has been addressed in #821.
Improve the metric we use to threshold matches in `fuzzy_join`, to make it easier for the user to tune and more correlated with actual matches. Right now, the `match_score` we use is directly the distance between a category and its nearest neighbor, which is hard to interpret and does not depend on other matches.

Tagging @rcap107, who suggested using relative metrics, and @jovan-stojanovic, who may have some ideas!