Unclear error if EM training blocking rule creates empty link table #852

ADBond · 2022-10-18T14:37:56Z

If the blocking rule used to train m values with expectation maximisation accidentally creates no candidate pairings, you end up with the error BinderException: Binder Error: Referenced column "nan" not found in FROM clause!.

The cause seems to be as follows: because __splink__df_blocked is empty, the m_count and u_count values for _probability_two_random_records_match (in __splink__m_u_counts) are both NaN. When __splink__df_predict is created, these values are written into the SQL as the string nan, which is then interpreted as a column name.
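
The mechanism can be reproduced in isolation. The sketch below is not Splink's code: it uses the standard-library sqlite3 module as a stand-in for DuckDB, and the render_sql helper and the query text are illustrative assumptions. It shows that formatting a Python NaN into a SQL string yields the bare token nan, which the SQL engine then tries to resolve as a column.

```python
import sqlite3


def render_sql(m_probability: float) -> str:
    # Hypothetical helper: interpolates a trained value directly into SQL text.
    return f"SELECT {m_probability} * 2 AS weighted"


def run(sql: str) -> str:
    # Execute the statement against an empty in-memory database and
    # report either the single result value or the engine's error message.
    con = sqlite3.connect(":memory:")
    try:
        return f"ok: {con.execute(sql).fetchone()[0]}"
    except sqlite3.OperationalError as exc:
        return f"error: {exc}"
    finally:
        con.close()


good_sql = render_sql(0.5)
bad_sql = render_sql(float("nan"))  # NaN is formatted as the bare word nan

good_result = run(good_sql)
bad_result = run(bad_sql)  # the engine treats nan as an identifier

print(bad_sql)
print(bad_result)
```

The failing statement contains no numeric literal at all, only an unquoted identifier, which is why the error mentions a missing column rather than an invalid number.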

Perhaps a clearer error message could be raised straight after the blocking step if the blocked frame is empty, which would also make the failure happen earlier.
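
A minimal sketch of such a check, assuming a dedupe-style self-join and again using sqlite3 rather than DuckDB. EmptyTrainingBlockError, count_blocked_pairs and check_training_block are hypothetical names, not part of Splink; the point is only that counting the blocked pairs first allows a descriptive error before any m or u values are computed.

```python
import sqlite3


class EmptyTrainingBlockError(ValueError):
    """Hypothetical error for a training blocking rule that yields no pairs."""


def count_blocked_pairs(con, table: str, blocking_rule: str) -> int:
    # Self-join the input table; l.id < r.id avoids counting a pair twice.
    sql = (
        f"SELECT COUNT(*) FROM {table} AS l "
        f"JOIN {table} AS r ON l.id < r.id AND ({blocking_rule})"
    )
    return con.execute(sql).fetchone()[0]


def check_training_block(con, table: str, blocking_rule: str) -> int:
    # Fail early, with the offending rule in the message, if nothing is blocked.
    n_pairs = count_blocked_pairs(con, table, blocking_rule)
    if n_pairs == 0:
        raise EmptyTrainingBlockError(
            f"Blocking rule {blocking_rule!r} produced no record comparisons, "
            "so no parameters can be trained from it."
        )
    return n_pairs


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (id INTEGER, name TEXT, height REAL)")
con.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(0, "grelt", 150.2), (1, "grelt", 171.9), (2, "splong", 133.4)],
)

# Two rows share a name, so this rule blocks one pair.
name_pairs = check_training_block(con, "people", "l.name = r.name")

# No two rows share a height, so this rule blocks nothing and is rejected.
height_error = None
try:
    check_training_block(con, "people", "l.height = r.height")
except EmptyTrainingBlockError as exc:
    height_error = str(exc)

print(name_pairs)
print(height_error)
```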

I also thought it worth flagging: are there other situations where NaNs might be wrongly interpreted as column names, in ways that cause more subtle issues?
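
One possible defence against that wider class of problem, sketched under the assumption that numeric values reach the SQL through string formatting: validate each value at the point where it is rendered. sql_float_literal is a hypothetical helper, not a Splink function.

```python
import math


def sql_float_literal(value: float, name: str) -> str:
    """Render a float as a SQL literal, refusing NaN and infinities."""
    if not math.isfinite(value):
        # Naming the parameter makes the failure traceable to its source.
        raise ValueError(f"{name} is {value!r}; refusing to write it into SQL")
    return repr(float(value))


ok_literal = sql_float_literal(0.25, "m_probability")

nan_error = None
try:
    sql_float_literal(float("nan"), "m_probability")
except ValueError as exc:
    nan_error = str(exc)

print(ok_literal)
print(nan_error)
```

With a check like this, a NaN would surface as a Python error naming the parameter, rather than as a SQL statement that silently refers to a column called nan.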

Reprex:

import pandas as pd
import numpy as np

from splink.duckdb.duckdb_linker import DuckDBLinker
import splink.duckdb.duckdb_comparison_level_library as cll
from splink import __version__

print(__version__)

np.random.seed(128)

names = ("grelt", "splong", "mithakshoff", "fringley", "blangolting")
n_rows = 200
df = pd.DataFrame(
    {
        "id": range(0, n_rows),
        "name": np.array(list(map(lambda x: np.random.choice(names, 1)[0], range(0, n_rows)))),
        "height": np.array(list(map(lambda x: np.random.uniform(100, 200, 1)[0], range(0, n_rows)))),
    }
)

print(df.head())

settings = {
    "unique_id_column_name": "id",
    "link_type": "dedupe_only",
    "comparisons": [
        {
            "output_column_name": "name",
            "comparison_levels": [
                cll.null_level("name"),
                {
                    "sql_condition": "name_l = name_r",
                    "label_for_charts": "Name match"
                },
                cll.else_level()
            ]
        },
    ],
    "retain_intermediate_calculation_columns": True,
    "retain_matching_columns": True,
}

linker = DuckDBLinker(df, settings)

linker.debug_mode = True
linker.estimate_u_using_random_sampling(1e7)
# no values will satisfy this blocking condition:
linker.estimate_parameters_using_expectation_maximisation("l.height = r.height")


lamaeldo · 2024-07-01T14:17:44Z

I'm not certain this is resolved; I think I experienced the same issue today. I applied the blocking rule for EM training that I use on large datasets to a subset of those datasets, which created 0 comparisons. The EM training loop ran for 1 iteration and I then got BinderException: Binder Error: Referenced column "nan" not found
Happy to provide additional details if needed.

ADBond added the bug label Oct 18, 2022

ADBond mentioned this issue Oct 31, 2022

Unclear error if estimate_probability_two_random_records_match blocking rule generates no matches #870

Closed

This was referenced Dec 9, 2022

Training blocking rule produces error #929

Closed

Empty training block error #934

Merged

ADBond closed this as completed in #934 Dec 13, 2022

ADBond mentioned this issue Aug 14, 2024

NaN trained values can break predict() #2334

Open
