Unclear error if EM training blocking rule creates empty link table #852

Closed
ADBond opened this issue Oct 18, 2022 · 1 comment · Fixed by #934
Labels
bug Something isn't working

Comments

ADBond commented Oct 18, 2022

If the blocking rule used to train m values via expectation maximisation (EM) accidentally creates no candidate pairs, you get the error BinderException: Binder Error: Referenced column "nan" not found in FROM clause!

The cause seems to be that, because __splink__df_blocked is empty, m_count and u_count (in __splink__m_u_counts) for _probability_two_random_records_match are both NaN. When __splink__df_predict is created, the NaN is rendered as the string nan, which is then interpreted as a column name.
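The mechanism described above can be reproduced outside Splink. This is a minimal sketch, not Splink's actual SQL generation: it assumes only that a Python float is interpolated into a SQL string, and it uses the standard-library sqlite3 module instead of DuckDB so that it runs without extra dependencies. The alias in the query is borrowed from the issue purely for illustration.

```python
import sqlite3

# A NaN such as the one produced when m_count and u_count are empty.
nan_value = float("nan")

# Interpolating the float into SQL yields the bare token `nan`.
sql = f"SELECT {nan_value} AS probability_two_random_records_match"
print(sql)

conn = sqlite3.connect(":memory:")
try:
    conn.execute(sql)
    error_message = None
except sqlite3.OperationalError as exc:
    # The engine parses `nan` as an identifier, i.e. a column reference.
    error_message = str(exc)

print(error_message)
```

SQLite words the failure differently from DuckDB, but the cause is the same: the unquoted token nan is resolved as a column name rather than a numeric literal.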

Perhaps a clearer error message could be raised straight after the blocking step if the blocked frame is empty, which would also make the failure occur earlier.
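As a rough illustration of such an early check, the sketch below counts the candidate pairs a blocking rule produces and fails with an explicit message when there are none. Everything here is hypothetical and is not part of Splink: the EmptyBlockingError class, the count_candidate_pairs and require_candidate_pairs helpers, and the people table are made up for the example, and sqlite3 stands in for the real backend.

```python
import sqlite3


class EmptyBlockingError(ValueError):
    """Hypothetical error for a blocking rule that yields no pairs."""


def count_candidate_pairs(conn, table, blocking_rule, id_col="id"):
    # Self-join the table on the blocking rule, keeping each pair once.
    sql = (
        f"SELECT COUNT(*) FROM {table} AS l "
        f"JOIN {table} AS r ON {blocking_rule} "
        f"WHERE l.{id_col} < r.{id_col}"
    )
    return conn.execute(sql).fetchone()[0]


def require_candidate_pairs(conn, table, blocking_rule):
    n_pairs = count_candidate_pairs(conn, table, blocking_rule)
    if n_pairs == 0:
        raise EmptyBlockingError(
            f"Blocking rule '{blocking_rule}' produced no candidate pairs; "
            "EM training cannot estimate m values from an empty table."
        )
    return n_pairs


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT, height REAL)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(0, "grelt", 150.5), (1, "grelt", 171.25), (2, "splong", 163.0)],
)

# Two rows share a name, so this rule yields one pair.
print(require_candidate_pairs(conn, "people", "l.name = r.name"))

# No two rows share a height, so this rule fails early and clearly.
try:
    require_candidate_pairs(conn, "people", "l.height = r.height")
except EmptyBlockingError as exc:
    print(exc)
```

Run before training, a check of this kind would surface the empty blocking result directly instead of the later BinderException.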

I also thought this worth flagging in case there are other situations where NaNs could be wrongly interpreted as column names in ways that cause more subtle issues.
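One general way to guard against that class of problem is to validate numbers before they are written into SQL text. The helper below is a hypothetical sketch (sql_float_literal is not a Splink function) that refuses NaN and infinite values rather than emitting them as bare tokens.

```python
import math


def sql_float_literal(value, name="value"):
    """Return a SQL-safe literal for a finite float, or raise."""
    number = float(value)
    if not math.isfinite(number):
        # Fail here rather than emit `nan` or `inf` into the query text.
        raise ValueError(f"{name} is {number!r}; refusing to write it into SQL")
    return repr(number)


print(sql_float_literal(0.25, name="m_probability"))

try:
    sql_float_literal(float("nan"), name="m_probability")
except ValueError as exc:
    print(exc)
```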

Reprex:

import pandas as pd
import numpy as np

from splink.duckdb.duckdb_linker import DuckDBLinker
import splink.duckdb.duckdb_comparison_level_library as cll
from splink import __version__

print(__version__)

np.random.seed(128)

names = ("grelt", "splong", "mithakshoff", "fringley", "blangolting")
n_rows = 200
df = pd.DataFrame(
    {
        "id": range(0, n_rows),
        "name": np.array(list(map(lambda x: np.random.choice(names, 1)[0], range(0, n_rows)))),
        "height": np.array(list(map(lambda x: np.random.uniform(100, 200, 1)[0], range(0, n_rows)))),
    }
)

print(df.head())

settings = {
    "unique_id_column_name": "id",
    "link_type": "dedupe_only",
    "comparisons": [
        {
            "output_column_name": "name",
            "comparison_levels": [
                cll.null_level("name"),
                {
                    "sql_condition": f"name_l = name_r",
                    "label_for_charts": "Name match"
                },
                cll.else_level()
            ]
        },
    ],
    "retain_intermediate_calculation_columns": True,
    "retain_matching_columns": True,
}

linker = DuckDBLinker(df, settings)

linker.debug_mode = True
linker.estimate_u_using_random_sampling(1e7)
# no values will satisfy this blocking condition:
linker.estimate_parameters_using_expectation_maximisation("l.height = r.height")

lamaeldo commented Jul 1, 2024

I'm not certain this is resolved; I think I hit the same issue today. I applied the blocking rule for EM training that I use on large datasets to a subset of those datasets, which created 0 comparisons. The EM training loop ran for 1 iteration and I then got BinderException: Binder Error: Referenced column "nan" not found
Happy to provide additional details if needed.
