[FEAT] Interaction term between two correlated comparisons #2413

Open
V-Lamp opened this issue Sep 18, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@V-Lamp

V-Lamp commented Sep 18, 2024

Is your proposal related to a problem?

Fields that are compared independently, e.g. postcode and city, can be highly correlated. Modelling them as two independent comparisons overestimates the match score when both match and underestimates it when only one matches.

Describe the solution you'd like

Some mechanism to score the interaction between two comparisons (usually negatively, like in term frequency).

Describe alternatives you've considered

So far I have put city as a lower comparison level inside the postcode comparison, but I expect this problem of correlated comparisons to be more general.
The ordering of levels is also very sensitive to the precision of postcodes (e.g. UK postcode vs 5-digit US zip code), so an interaction term would make the model depend less on hand-tuned level ordering.

Additional context

Creating an interaction term is a common mechanism in dealing with correlation in Machine Learning, e.g.
if x1 and x2 are correlated, you can add a term x1*x2 in your model, e.g. y = a*x1+b*x2+c*x1*x2+d
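
As a rough illustration of the same idea in match-weight terms: write the weight for "both fields agree" as the two single-field weights plus an interaction term, and the interaction comes out negative when the fields are correlated. The m and u probabilities below are invented for the example, not estimated by Splink.

```python
import math


def match_weight(m: float, u: float) -> float:
    """Match weight: log2 of the Bayes factor m / u for an agreement pattern."""
    return math.log2(m / u)


# Hypothetical agreement probabilities among true matches (m) and non-matches (u).
m_postcode, u_postcode = 0.90, 0.001
m_city, u_city = 0.95, 0.01
# Joint agreement on both fields. City is largely determined by postcode, so
# among non-matches "both agree" is almost as likely as "postcode agrees".
m_both, u_both = 0.89, 0.00095

w_postcode = match_weight(m_postcode, u_postcode)
w_city = match_weight(m_city, u_city)
w_both = match_weight(m_both, u_both)

# Treating the comparisons as independent simply adds the two weights.
naive_weight = w_postcode + w_city
# The interaction term is whatever has to be added to recover the joint weight.
interaction = w_both - naive_weight

print(f"independent sum: {naive_weight:.2f}")
print(f"joint weight:    {w_both:.2f}")
print(f"interaction:     {interaction:.2f}")
```

With these made-up numbers the independent sum is well above the joint weight, which is the overestimation described above.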

@V-Lamp V-Lamp added the enhancement New feature or request label Sep 18, 2024
@V-Lamp V-Lamp changed the title [FEAT] Interaction term between two comparisons [FEAT] Interaction term between two comparisons for correlated comparisons Sep 18, 2024
@V-Lamp V-Lamp changed the title [FEAT] Interaction term between two comparisons for correlated comparisons [FEAT] Interaction term between two correlated comparisons Sep 18, 2024
@RobinL
Member

RobinL commented Sep 18, 2024

I think you can do this already using this kind of syntax:

import splink.comparison_level_library as cll
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

df = splink_datasets.fake_1000

df["postcode"] = df["email"].str.slice(0, 3)

# Define custom comparison for postcode and city
postcode_city_comparison = cl.CustomComparison(
    output_column_name="postcode_city",
    comparison_levels=[
        cll.And(cll.NullLevel("postcode"), cll.NullLevel("city")),
        {
            "sql_condition": "postcode_l = postcode_r AND city_l = city_r",
            "label_for_charts": "Exact match on both postcode and city",
        },
        cll.ExactMatchLevel("postcode").configure(label_for_charts="Different city, exact match on postcode"),
        cll.ExactMatchLevel("city").configure(label_for_charts="Different postcode, exact match on city"),
        cll.ElseLevel(),
    ],
)

# Define settings
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        postcode_city_comparison,
    ],
    max_iterations=5,
)


linker = Linker(df, settings, db_api=DuckDBAPI())

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)


linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "surname")
)
linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))

linker.visualisations.match_weights_chart()

@zmbc
Contributor

zmbc commented Sep 20, 2024

@RobinL's solution is equivalent to method 1 in S4 of the appendix of the fastLink paper, and works great for some use-cases, but doesn't allow e.g. specifying c1 * c2 and c2 * c3 without c1 * c2 * c3. For that, the second method in that appendix based on log-linear models is the solution, and there is a discussion of adding it to Splink here: #1310
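
To make the distinction concrete, here is a small counting sketch for three binary comparisons c1, c2 and c3. A single combined comparison needs one level per joint agreement pattern, which implicitly includes the three-way term, whereas a log-linear style score only contains the terms you list. The coefficients are invented, and this is an illustration of the structure only, not the fastLink or Splink implementation.

```python
from itertools import combinations, product

comparisons = ["c1", "c2", "c3"]

# Combined-comparison approach: one level for every joint agreement pattern.
joint_patterns = list(product([0, 1], repeat=len(comparisons)))

# A saturated model has a term for every non-empty subset, including c1*c2*c3.
saturated_terms = [
    subset
    for size in range(1, len(comparisons) + 1)
    for subset in combinations(comparisons, size)
]

# Chosen terms: main effects plus c1*c2 and c2*c3, with no three-way term.
# The coefficient values are hypothetical.
coefficients = {
    ("c1",): 2.0,
    ("c2",): 3.0,
    ("c3",): 1.5,
    ("c1", "c2"): -1.0,
    ("c2", "c3"): -0.5,
}


def log_linear_score(pattern, coefficients):
    """Sum the coefficient of every term whose comparisons all agree."""
    agreeing = {name for name, bit in zip(comparisons, pattern) if bit}
    return sum(coef for term, coef in coefficients.items() if set(term) <= agreeing)


print(len(joint_patterns), "joint patterns,", len(saturated_terms), "saturated terms")
print(len(coefficients), "terms in the restricted model")
print(log_linear_score((1, 1, 1), coefficients))
```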

@V-Lamp
Author

V-Lamp commented Oct 13, 2024

Thank you for the response about using AND; I think I can incorporate it. However, one complexity that comes up in practice is that comparisons usually have more than one comparison level.

For example:

postcode_comparison = [postcode_exact_match, postcode_area_match, postcode_sector_match]
address_comparison = [address_exact_match, street_name_match, address_fuzzy_match]
location_comparison = ???

I would need to make 9 levels (3 * 3) with AND, plus the other 3 + 3 levels, to define a location_comparison (so 3*3 + 3 + 3 = 15 levels in total). The challenge then is to find the right order for these 15 comparison levels, since ordering has a very significant effect. In my case I actually have more than 3 levels per comparison, more like 6-8.
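
One way to avoid writing the 15 levels by hand is to generate them from the two per-field lists. The sketch below is only a starting point: the column names and SQL conditions are hypothetical, each list is assumed to run from most to least precise, and the ordering heuristic (sum of the two ranks, with "no agreement" ranked last) is an assumption rather than anything Splink prescribes.

```python
from itertools import product

# Hypothetical (label, SQL condition) pairs, most precise first within each field.
postcode_levels = [
    ("postcode exact", "postcode_l = postcode_r"),
    ("postcode sector", "postcode_sector_l = postcode_sector_r"),
    ("postcode area", "postcode_area_l = postcode_area_r"),
]
address_levels = [
    ("address exact", "address_l = address_r"),
    ("street name", "street_name_l = street_name_r"),
    ("address fuzzy", "jaro_winkler_similarity(address_l, address_r) >= 0.9"),
]


def build_levels(field_a, field_b):
    """Return combined and single-field levels as dicts, most precise first."""
    # A (None, None) placeholder means "this field does not agree", so the
    # single-field levels fall out of the same grid as the AND levels.
    a_options = field_a + [(None, None)]
    b_options = field_b + [(None, None)]
    ranked = []
    for (i, a), (j, b) in product(enumerate(a_options), enumerate(b_options)):
        parts = [(label, sql) for label, sql in (a, b) if sql is not None]
        if not parts:
            continue  # neither field agrees: left to the final else level
        level = {
            "sql_condition": " AND ".join(f"({sql})" for _, sql in parts),
            "label_for_charts": " + ".join(label for label, _ in parts),
        }
        ranked.append((i + j, i, level))
    # Lower rank sum first, so a stricter combination is always tested before
    # any looser level it would also satisfy.
    ranked.sort(key=lambda item: (item[0], item[1]))
    return [level for _, _, level in ranked]


location_levels = build_levels(postcode_levels, address_levels)
for level in location_levels:
    print(level["label_for_charts"])
```

The resulting dicts use the same `sql_condition` / `label_for_charts` keys as the custom comparison earlier in the thread, so they could be placed between a null level and `cll.ElseLevel()`. The generated order will likely still need adjusting by hand.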

Have you found yourself in this combinatorial explosion and then further ordering problem?

@RobinL
Member

RobinL commented Oct 14, 2024

Yeah - you're right to highlight these challenges. It's typically best to try and order in terms of 'better matches higher' - start with the most precise matches and work your way down. Although I appreciate it's not always obvious in practice; you may need some trial and error.

I agree that the combinatorial explosion problem is real, but on a large dataset having (say) 9 comparison levels is totally fine. Ultimately, each one corresponds to two parameters to estimate, so 18 parameters is not very many at all (compared to, say, other ML approaches which can have thousands).

In our production models we tend to have between about 2 and 10 comparison levels per comparison.
