
Better threshold metric for fuzzy_join #470

Closed
LeoGrin opened this issue Jan 18, 2023 · 10 comments
Labels
enhancement New feature or request

Comments

@LeoGrin
Contributor

LeoGrin commented Jan 18, 2023

Improve the metric we use to threshold matches in fuzzy_join, to make it easier for the user to tune and better correlated with actual matches. Right now, the match_score we use is directly the distance between a category and its nearest neighbor, which is hard to interpret and does not take the other matches into account.

Tagging @rcap107, who suggested using relative metrics, and @jovan-stojanovic, who may have some ideas!

@LeoGrin LeoGrin added the enhancement New feature or request label Jan 18, 2023
@rcap107
Contributor

rcap107 commented Jan 18, 2023

Hello, thanks for tagging me!

In fact, there are a few things that I think are worth considering. Specifically, there are various issues involved in using a simple threshold that should be kept in mind, and that would justify using relative metrics (rather than, or in conjunction with, absolute metrics).

In short, deciding the threshold a priori is hard, and a wrong choice can lead to low recall (threshold too high) or low precision (threshold too low). Consider for example a case like this (X1 and X2 are similarities between a target point and its three nearest neighbors):

X1 = [
    0.81, 
    0.80,
    0.79
]

X2 = [
    0.45,
    0.20,
    0.05
]

In the X1 case, the neighbors are all very close to the target point, even though they actually represent false positives. Here, even a high threshold would not be sufficient to distinguish between matches.
In the X2 case, the absolute values of the similarities are much lower; however, the 2nd and 3rd closest values are clearly far less likely to be false positives. Here, a lower threshold might have led to higher (and somewhat more reliable) recall.
I have experienced both cases in practice.

It may be possible to somewhat mitigate the issue of having an "absolute threshold" by taking some of the nearest neighbors of a target point, averaging their distances, and "normalizing" them by said average. Then, the threshold would be relative to the neighborhood of the point. I wonder if it is possible to run experiments to find a good value for this "normalized threshold".
Either way, this does not prevent the two cases I mentioned above from happening, and there is still the issue of selecting the threshold itself (normalized or not).
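
To make the idea concrete, here is a minimal sketch of such a neighborhood-normalized threshold, using scikit-learn's NearestNeighbors on made-up arrays (main_vecs, aux_vecs and the threshold value are hypothetical placeholders, not skrub code):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical vectorized join columns: one row per entry in each table.
rng = np.random.default_rng(0)
main_vecs = rng.normal(size=(4, 3))
aux_vecs = rng.normal(size=(10, 3))

# Take the k nearest neighbors of each target point in the aux table.
k = 5
nn = NearestNeighbors(n_neighbors=k).fit(aux_vecs)
distances, indices = nn.kneighbors(main_vecs)

# Normalize the distance to the best candidate by the average distance over
# its neighborhood, so the threshold becomes relative to that neighborhood.
normalized = distances[:, 0] / distances.mean(axis=1)

normalized_threshold = 0.5  # value that would have to be found experimentally
accepted = normalized <= normalized_threshold
print(indices[:, 0], normalized.round(2), accepted)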

I think that the best way of implementing a relative metric is to introduce a bijective metric that relies on the fact that the neighbors of a target point are ranked by their similarity to the point itself.

So, if we have points a, b and c, with the rankings

closest(a) = [b, c]
closest(b) = [c, a]
closest(c) = [b, a]

The bijective metric would iterate over all points, looking for the highest-ranked neighbor of each. In the case of point b, the highest-ranked neighbor is c. The metric function then looks at closest(c) and observes that the highest-ranked neighbor of c is b: there is a bijective relationship. This is a very strong indication that b and c are close to each other, and that b is closer to c than it is to a.

Consider this script:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import NearestNeighbors

X = [
    [1.1, 0.5],
    [0.2, 0.5],
    [0.1, 0.1]
]
X = np.array(X)
X_df = pd.DataFrame(X).reset_index()
X_df.columns = ["point", "x", "y"]

ax = sns.scatterplot(data=X_df, x="x", y="y", hue="point", palette="tab10")
ax.set_xlim([0, 1.5])
ax.set_ylim([0, 1.5])
plt.show()

# With n_neighbors=2 and querying the training points themselves, the first
# neighbor of each point is the point itself, so column 1 holds the actual
# nearest (other) point.
neigh = NearestNeighbors(n_neighbors=2)
neigh.fit(X)
distance, neighbors = neigh.kneighbors(X, return_distance=True)

for target, closest_points in enumerate(neighbors):
    _, closest = closest_points
    print(f"This is point {target}")
    # nearest neighbor of the target point
    c2p = closest
    print(f"The closest point to point {target} is {c2p}")
    # nearest neighbor of that closest point
    c2c = neighbors[closest, 1]
    print(f"The closest point to {c2p} is {c2c}")
    if c2c == target:
        print(f"There is a bijective relationship between {c2p} and {c2c}")
    print("##")

This bijective metric is very effective at increasing the matching precision, and it does not penalize recall much because it is completely independent of any threshold: there is no risk of producing false negatives because a threshold was set too high.

The main drawback of this metric is that it works well only for 1-to-1 matches, since in order to match multiple entities it becomes necessary to increase the pool of candidates to mark as "matches". For example, if both a and c are viable matches for b, but a isn't a match for c, in order to match both a and c to b, we'd also have to force a match between a and c, even though this match is incorrect.

This was a very long-winded way of saying that selecting the threshold is not a simple issue, and that using only a single value can lead to unforeseen consequences. In any case, I'd love to continue the discussion later and possibly look into implementing some of these solutions in the code!

@Vincent-Maladiere
Member

Hi, @rcap107, thanks for this explanation!

> It may be possible to somewhat mitigate the issue of having an "absolute threshold" by taking some of the nearest neighbors of a target point, averaging their distances, and "normalizing" them by said average. Then, the threshold would be relative to the neighborhood of the point. I wonder if it is possible to run experiments to find a good value for this "normalized threshold".

The closest thing that came to my mind is the Local Outlier Factor in scikit-learn. It does the opposite of what you're looking for: it finds outliers by computing the local density from their nearest neighbors. You may find this section of the code helpful.

One property of LOF is that it doesn't rely on a distance threshold. Instead, it exposes a parameter called contamination that the user can set to specify the expected ratio of outliers in their dataset.
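
For reference, here is a minimal sketch of how contamination is used with LocalOutlierFactor in scikit-learn (the toy data is made up):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy 2D data with one obvious outlier.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])

# contamination sets the expected proportion of outliers in the data,
# instead of asking the user for a raw distance threshold.
lof = LocalOutlierFactor(n_neighbors=2, contamination=0.25)
labels = lof.fit_predict(X)          # -1 for outliers, 1 for inliers
print(labels)
print(lof.negative_outlier_factor_)  # lower means more isolated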

I wonder if there is a way to transfer these concepts to fuzzy matching. WDYT?

cc @GaelVaroquaux who might help with better understanding LOF.

@jeromedockes
Member

Following up on the skrub meeting discussion (also relevant for #821):

Parametrizing the fuzzy join threshold

In fuzzy join,

  • we vectorize the join attributes
  • we pair each row in main_table with its nearest neighbor from aux_table
  • we compare the distance that separates the pair to a threshold, either accepting the match or declaring that no match was found for that particular row

The question is how we specify that threshold.
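
For context, here is a minimal sketch of those three steps on made-up data, using a character n-gram TF-IDF vectorizer and a raw distance threshold as stand-ins for skrub's actual implementation:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical join columns from main_table and aux_table.
main_keys = ["Los Angles", "New-York", "Chcago"]
aux_keys = ["Los Angeles", "New York", "Chicago", "Houston"]

# 1. Vectorize the join attributes.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit(aux_keys)
main_vecs = vectorizer.transform(main_keys)
aux_vecs = vectorizer.transform(aux_keys)

# 2. Pair each main row with its nearest neighbor from the aux table.
nn = NearestNeighbors(n_neighbors=1).fit(aux_vecs)
distances, indices = nn.kneighbors(main_vecs)

# 3. Compare the pair's distance to a threshold to accept or reject the match.
threshold = 1.0  # a raw distance, hard to interpret: the subject of this issue
for key, dist, idx in zip(main_keys, distances[:, 0], indices[:, 0]):
    match = aux_keys[idx] if dist <= threshold else None
    print(f"{key!r} -> {match!r} (distance {dist:.2f})")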

In some cases the vectorized rows may be in a space where distances have a meaningful unit.
For example, if we are joining on a single datetime column (and do not rescale the vectorized column), the threshold could be provided in hours: reject a match if the main row and the aux row are more than 24 hours apart.
However, the main use case of fuzzy join is joining on string columns, and in the case of a single numeric column we can probably do something better than fitting a NearestNeighbors anyway (e.g. for dates one can use pandas merge_asof).
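
To illustrate the datetime case, a plain merge_asof already lets the threshold be expressed directly in time units (the tables below are made up):

import pandas as pd

main = pd.DataFrame(
    {"time": pd.to_datetime(["2023-01-01 10:00", "2023-01-05 09:00"]), "event": ["a", "b"]}
)
aux = pd.DataFrame(
    {"time": pd.to_datetime(["2023-01-01 18:00", "2023-02-01 00:00"]), "value": [1, 2]}
)

# Reject a match if the main row and the aux row are more than 24 hours apart.
joined = pd.merge_asof(
    main.sort_values("time"),
    aux.sort_values("time"),
    on="time",
    direction="nearest",
    tolerance=pd.Timedelta("24h"),
)
print(joined)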

In other cases the distances don't have a meaningful scale, so we want to rescale them to make it easier for users to select a grid of possible thresholds.
To do so, we rescale our distance by dividing it by some reference distance.
For example, we can rescale it by the distance to the second nearest neighbor.
In that case, a threshold of 0.25 would mean "accept the match if the second nearest neighbor is at least 4 times farther away than the match candidate".
In that example, useful thresholds are between 0 (vectorized representations are equal) and 1 (accept all matches).
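
A minimal sketch of that particular rescaling, on hypothetical main_vecs/aux_vecs arrays standing in for the vectorized tables:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
main_vecs = rng.normal(size=(5, 3))   # vectorized rows of main_table (made up)
aux_vecs = rng.normal(size=(8, 3))    # vectorized rows of aux_table (made up)

nn = NearestNeighbors(n_neighbors=2).fit(aux_vecs)
distances, indices = nn.kneighbors(main_vecs)

# Rescale the distance to the best candidate by the distance to the
# second nearest neighbor (assumed non-zero here).
rescaled = distances[:, 0] / distances[:, 1]

threshold = 0.25  # accept only if the runner-up is at least 4 times farther away
accepted = rescaled <= threshold
print(indices[:, 0], rescaled.round(2), accepted)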
Other possible choices for the reference distance are

  • distance between the auxiliary row and its nearest neighbor in the auxiliary table (the choice made by autofuzzyjoin)
  • distance between the rows of the worst matched pair, i.e. max_{m in main rows} min_{a in aux rows} d(m, a): what is currently used in the main branch of skrub
  • median distance between randomly sampled rows
  • etc.

depending on that choice,

  • the reference distance may sometimes be equal to 0 (rescaling then results in a division by 0)
  • the reference distance may sometimes be smaller than the actual distance (this can happen, for example, with the first choice above), in which case thresholds higher than 1.0 may make sense and a threshold of 1.0 does not always mean "accept all matches"

We also have the option, as is currently done, to define a score as (1 - rescaled distance) and provide the threshold as a minimum score rather than a maximum distance.

Questions

I would love some opinions on

    1. which of the reference distances do we want to support, and if we want more than one, which should be the default
    2. should we allow disabling rescaling (not just of distances, but also standard scaling of the columns themselves during vectorization) to enable the "distance in seconds" use case
    3. do we prefer scores (higher is a better match, 1 is equal) or distances (lower is a better match, 0 is equal)

Note that the score/distance parametrization does not change the ordering of matched pairs, so if the threshold is well set with a grid search the result should be the same -- the question is which parametrization makes it easy to choose a good grid.

    4. in another issue we mentioned a fast path that uses a regular pandas/polars join when the user sets min_score=1.0. However, they are not the same: the normal equijoin requires the input strings to be equal, whereas score=1 requires their vectorized representations to be equal. Do we want to ignore that difference, to have another parameter, to have a different (much lighter) class for doing usual non-fuzzy joins, or to forgo the fast path?

@Vincent-Maladiere
Member

Vincent-Maladiere commented Nov 15, 2023

Thank you for putting all these thoughts together, @jeromedockes.

  • i. I'm +1 for sampling random rows and computing a summary statistic to get the reference distance. This would represent the "global distance" between the left row and the right auxiliary dataset, and we are assured that this reference distance is not close to 0.

    If a candidate's distance is higher than the reference, it is safe to discard this candidate since it is worse than the global distance, so a threshold of 1 would mean "accepting a candidate's distance equal to the global distance", which is very permissive.

  • ii and iv. I'm +1 to keep the "fast path" and maybe even to extend it using a parameter joining_kind (or kind), whose options are "fuzzy" (default), "exact" (which calls pandas merge) and "asof" (which calls pandas merge_asof).

    This would allow grid search in a pipeline and better emphasize that the fuzzy join is a generalization of the merge operation.

    This would also address the use case of joining on numerical or datetime values whose distances have units.

    The distance/score parameter would then only apply to the fuzzy case, making the fuzzy_join API easier to grasp.

  • iii. Following my previous suggestion of introducing a joining_kind parameter, I'm +1 for a score rather than a distance. "Score" in itself sounds too vague, though, and I guess "similarity_threshold", between 0 (most permissive) and 1 (most strict), is more explicit for the data scientist.

> However, the main use case of fuzzy join is joining on string columns, and in the case of a single numeric column we can probably do something better than fitting a NearestNeighbors anyway (e.g. for dates one can use pandas merge_asof).

In any case, I think it is critical to emphasize this in the documentation of fuzzy_join and Joiner to better convey what skrub offers. Maybe even go as far as listing the use cases and giving our preferred way of joining.

@jeromedockes
Member

jeromedockes commented Nov 15, 2023 via email

@LeoGrin
Contributor Author

LeoGrin commented Nov 16, 2023

> I would love some opinions on
> which of the reference distances do we want to support, and if we want more than 1 which should be the default

My take is we should have two options, one ~absolute and one relative:

  • for the relative metric, I would push "rescale it by the distance to the second nearest neighbor" as it guarantees that the denominator is larger than the numerator. (Maybe we can consider the second non-identical nearest neighbor to avoid being super strict if there are duplicates?) Another idea would be "rescale it by the median distance from this row to randomly sampled rows in the aux table", but I prefer the former.
  • for the absolute distance, I don't have a strong opinion but "median distance between randomly sampled rows" seems good.

This way we can default to the absolute metric, and let people choose the relative metric if they expect roughly one good match.

@Vincent-Maladiere
Member

> as they do something different and thus need different parameters wouldn't it be simpler to have different estimators, one for each kind of join? in the same way that pandas and polars have separate join and join_asof

I agree with you after reading merge_asof more carefully. The tolerance and direction parameters wouldn't make sense for fuzzy_join (or only in the direction="nearest" case).

However, I'm still advocating for a parameter like exact_match=True for cross-val and performance purposes.

@rcap107
Contributor

rcap107 commented Nov 17, 2023

Thanks for the work and for putting together the notes!

I agree with @LeoGrin on having both relative and absolute metrics: a relative metric should work better if the user does not know much about the data (here I am assuming there is some relative distance implemented), whereas an absolute metric should allow "real world" metrics to be used (e.g. join if the distance is < 1km, or if the time difference is <1min).

This also means I'm in favor of disabling rescaling if necessary and useful.

On the score vs distance discussion:

  • I really dislike "score" as a term because I find it ambiguous: it does not necessarily imply "higher is closer" the way "similarity" would.
  • Between distance and similarity, I prefer "distance", as it can be applied in more cases and a distance of 0 conveys the same meaning no matter the actual metric/rescaling. A "similarity" must be rescaled/altered in some way to be in the range (-inf, 1].

I am strongly against any kind of similarity metric where a threshold of 1.0 does not mean "exact matching". I find this would be extremely confusing for users (and I place myself in that group).

In a similar vein, if I were to set the distance threshold to 0 (similarity to 1), then I would expect the code to perform exact matching only, though in this case one might ask "why are you joining with fuzzy_join in the first place?" 🤔 Either way, I agree with @Vincent-Maladiere that there should be a parameter to perform exact matching in those cases.

@jovan-stojanovic
Member

Thanks @jeromedockes for this great summary. Here is my opinion on this:

i. and ii. I like the median distance between randomly sampled rows for a relative metric.
For absolute metrics, I see two options:

  • have a parameter join_type = {date, numerical, string, mix} or something similar that would allow absolute metrics when joins are only performed on dates, for instance, and then adapt the score/distance parameter to this. The parameters may become too complex in this case, which could be a problem, but if you manage to simplify them, it could work.
  • As @jeromedockes pointed out, an option is to have different joiners for each data type, each with adapted absolute metrics (join_dates, join_nums, etc.), and keep fuzzy_join as a string-focused joiner which accepts other types but joins them imperfectly.

iii. No strong opinion on this, and I agree with @rcap107 and @Vincent-Maladiere that the most important thing is that it's user friendly. So either way we see it, we might have 1 as perfect similarity (i.e. no distance) and 0 as no similarity (i.e. some distance).

iv. If possible, yes, but I think this is for later: it depends on the code structure that will emerge from i.-iii.

@jeromedockes
Member

I think this has been addressed in #821
