Expand EM approach to more ML algorithms? #1310

mxkus · 2023-06-09T06:47:58Z

Hey there.

I was curious with regard to the naive bayes EM approach whether this is just a self-supervised learning approach which also could be implemented iteratively with any other algorithm like decision trees (E: train the tree with labels, M: reassign labels based on the predictions). Have I overlooked something?

If yes, there are also refinements to this approach where the retraining is only done on quite "certain" predictions (e.g. probability > 0.9). What do you think about this?

Best,
Max
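
A minimal sketch of the loop described in this post, assuming binary comparison vectors and a depth-1 decision tree (a "stump") as the model. All names here (`fit_stump`, `predict_proba`, `self_train`) and the toy data are hypothetical illustrations, not Splink code. In EM terms, the relabelling step plays the role of the E-step and the retraining plays the role of the M-step.

```python
def fit_stump(X, y):
    """Depth-1 tree: pick the binary feature whose split makes the fewest errors."""
    best = None
    for j in range(len(X[0])):
        leaves = {}
        for v in (0, 1):
            labels = [yi for xi, yi in zip(X, y) if xi[j] == v]
            # Laplace-smoothed share of matches in this leaf
            leaves[v] = (sum(labels) + 1) / (len(labels) + 2)
        errors = sum((leaves[xi[j]] >= 0.5) != bool(yi) for xi, yi in zip(X, y))
        if best is None or errors < best[0]:
            best = (errors, j, leaves)
    return best[1], best[2]


def predict_proba(model, x):
    j, leaves = model
    return leaves[x[j]]


def self_train(X, seed, threshold=0.9, max_iter=10):
    """seed maps record-pair index -> 0/1 for the initially labelled pairs."""
    labels = dict(seed)
    for _ in range(max_iter):
        idx = sorted(labels)
        # Retrain on the current labels (analogous to the M-step)
        model = fit_stump([X[i] for i in idx], [labels[i] for i in idx])
        # Relabel, keeping only confident predictions (analogous to the E-step)
        new_labels = dict(seed)
        for i, x in enumerate(X):
            p = predict_proba(model, x)
            if p > threshold:
                new_labels[i] = 1
            elif p < 1 - threshold:
                new_labels[i] = 0
        if new_labels == labels:  # labels stopped changing
            break
        labels = new_labels
    return model, labels


# Toy comparison vectors: 1 = field agrees, 0 = field disagrees
X = [(1, 1, 1)] * 12 + [(0, 0, 0)] * 12 + [(1, 1, 0)] * 5 + [(0, 0, 1)] * 5
# Seed labels: all fields agree -> match, no fields agree -> non-match
seed = {i: int(sum(x) == 3) for i, x in enumerate(X) if sum(x) in (0, 3)}
model, labels = self_train(X, seed)
```

On this toy data the stump splits on the first field, so the loop labels every pair by that field alone and then stops changing. That illustrates one risk of the approach: the model can simply reinforce its own early decisions.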

RobinL · 2023-06-09T09:27:08Z

RobinL
Jun 9, 2023
Maintainer

Hi Max - great question!

I'll preface this by saying I'm not an expert in self-supervised ML approaches outside those used by Splink - so take what I say with a grain of salt!

With that said, I have always wondered whether there may be some mileage in applying alternative ML techniques to the results of Splink.

A while back I tried to run a basic logistic regression using the Splink comparison vector (comparison level values) as features, and the Splink prediction as the 'label'. The idea was that it may be a way around the 'conditional independence' assumption, which is usually violated in practice (specifically, conditional independence of comparison columns given match status), because multiple regression techniques allow for correlations between features.
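
For reference, the conditional independence assumption can be written as follows. The notation is an assumption borrowed from the usual Fellegi-Sunter presentation, not taken from this thread: γ_k is the comparison level for column k, M and U are the match and non-match classes, λ is the prior match probability, and m_k, u_k are the per-column level probabilities under M and U.

```latex
% Conditional independence of comparison columns given match status
P(\gamma \mid M) = \prod_{k=1}^{K} P(\gamma_k \mid M),
\qquad
P(\gamma \mid U) = \prod_{k=1}^{K} P(\gamma_k \mid U)

% Consequence: the posterior log-odds are a sum of per-column terms
\log \frac{P(M \mid \gamma)}{P(U \mid \gamma)}
  = \log \frac{\lambda}{1 - \lambda}
  + \sum_{k=1}^{K} \log \frac{m_k(\gamma_k)}{u_k(\gamma_k)}
```

Because the log-odds are additive across columns, the contribution of one column does not depend on the levels observed in the others. A regression with interaction terms relaxes exactly that restriction.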

I didn't try that hard, but my memory is that logistic regression didn't help much*. However:

I didn't try that hard
I didn't at the time have a very good labelled dataset to test properly whether it had worked well (e.g. AUC, ROC etc.)
I didn't try to use it iteratively in the way you suggest, it was just a one shot

I like the idea of using an ML model (e.g. decision trees) iteratively - it had never occurred to me to try, and it's a good idea. It certainly sounds worth experimenting with, e.g. retraining on only the p > 0.9 results. I guess whatever works best!

Finally, the biggest challenge we've faced with this sort of thing is producing robust evidence that it works, due to the absence of high quality labelled data to measure accuracy. We do have a couple of synthetic datasets relating to persons, derived from Wikidata, that I could potentially share if they may be of use.

* If I remember correctly - and bear in mind it was a long time ago - there was an oddity with logistic regression that the naive Bayes parameter estimates essentially 'force' the logistic regression to give the same answer. Maybe this makes sense because naive Bayes literally calculates the match_probability in a kind of 'ceteris paribus' way - e.g. the effect of a match on first_name is constant irrespective of the values taken by the other parameters. Or maybe I'm talking rubbish. I didn't understand it at the time, but I probably needed to include interaction effects in the regression (which I did not do).
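
A small numeric check of that oddity, as a sketch under stated assumptions: the m, u and lambda values below are made up, the fields are binary (agree/disagree), and the regression is fitted by plain gradient descent on the naive Bayes probabilities as soft labels. None of this is Splink code.

```python
import math
from itertools import product


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical Fellegi-Sunter parameters (assumptions, not estimates from real data)
lam = 0.4                 # prior match probability
m = [0.70, 0.65, 0.60]    # P(field agrees | match)
u = [0.30, 0.35, 0.40]    # P(field agrees | non-match)


def nb_match_probability(x):
    pm = lam * math.prod(mj if xj else 1 - mj for mj, xj in zip(m, x))
    pu = (1 - lam) * math.prod(uj if xj else 1 - uj for uj, xj in zip(u, x))
    return pm / (pm + pu)


# Every possible comparison vector, with the naive Bayes prediction as its 'label'
X = list(product([0, 1], repeat=3))
targets = [nb_match_probability(x) for x in X]

# Logistic regression without interaction terms:
# gradient descent (step size 1) on the mean cross-entropy against the soft labels
bias, w = 0.0, [0.0, 0.0, 0.0]
for _ in range(5000):
    residuals = [
        sigmoid(bias + sum(wj * xj for wj, xj in zip(w, x))) - t
        for x, t in zip(X, targets)
    ]
    bias -= sum(residuals) / len(X)
    w = [
        wj - sum(r * x[j] for r, x in zip(residuals, X)) / len(X)
        for j, wj in enumerate(w)
    ]

# Coefficients implied directly by the naive Bayes parameters
implied_w = [
    math.log(mj / uj) - math.log((1 - mj) / (1 - uj)) for mj, uj in zip(m, u)
]
implied_bias = math.log(lam / (1 - lam)) + sum(
    math.log((1 - mj) / (1 - uj)) for mj, uj in zip(m, u)
)
```

The fitted `w` and `bias` agree with `implied_w` and `implied_bias` to within numerical tolerance. The naive Bayes log-odds are already a linear function of the agreement indicators, so a regression without interaction terms can reproduce them exactly and has nothing left to correct.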

2 replies

RobinL Jun 9, 2023
Maintainer

One other thought:

I'm not sure about convergence. With the naive Bayes (Fellegi-Sunter) approach, there is a proof that the log likelihood will always improve each iteration (strictly speaking, never worsen) (see this paper). That generally means:

convergence occurs, and
empirically speaking, convergence is usually fairly fast, and the change in log likelihood generally gets smaller each iteration.

I'm not sure whether the same would be the case with other ML models. Not saying they wouldn't 'converge', I just don't know (and I guess the definition of convergence is not always as straightforward as in the naive bayes case where you have a likelihood function)
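
The never-worsening property in the naive Bayes case can be checked numerically. Below is a minimal sketch of EM for a two-class model with binary agree/disagree fields. The function name `em_fellegi_sunter`, the starting values and the pattern counts are illustrative assumptions, not Splink's implementation.

```python
import math


def em_fellegi_sunter(patterns, lam=0.3, m=None, u=None, n_iter=15):
    """patterns maps a tuple of 0/1 agreement flags to the number of pairs with it."""
    k = len(next(iter(patterns)))
    m = list(m or [0.8] * k)   # P(field agrees | match), starting guess
    u = list(u or [0.2] * k)   # P(field agrees | non-match), starting guess
    total = sum(patterns.values())
    history = []               # log likelihood at the start of each iteration
    for _ in range(n_iter):
        # E-step: posterior match probability for each comparison pattern
        post, loglik = {}, 0.0
        for x, n in patterns.items():
            pm = lam * math.prod(m[j] if x[j] else 1 - m[j] for j in range(k))
            pu = (1 - lam) * math.prod(u[j] if x[j] else 1 - u[j] for j in range(k))
            post[x] = pm / (pm + pu)
            loglik += n * math.log(pm + pu)
        history.append(loglik)
        # M-step: weighted maximum likelihood estimates
        w_match = sum(n * post[x] for x, n in patterns.items())
        lam = w_match / total
        for j in range(k):
            m[j] = sum(n * post[x] * x[j] for x, n in patterns.items()) / w_match
            u[j] = sum(n * (1 - post[x]) * x[j] for x, n in patterns.items()) / (
                total - w_match
            )
    return lam, m, u, history


# Made-up counts of record pairs per comparison pattern
patterns = {
    (1, 1, 1): 90, (1, 1, 0): 20, (1, 0, 1): 15, (0, 1, 1): 15,
    (1, 0, 0): 60, (0, 1, 0): 70, (0, 0, 1): 80, (0, 0, 0): 650,
}
lam, m, u, history = em_fellegi_sunter(patterns)
non_decreasing = all(b >= a - 1e-9 for a, b in zip(history, history[1:]))
```

Here `non_decreasing` is `True`, as the EM guarantee predicts. A loop that retrains an arbitrary classifier on its own predictions has no equivalent objective to track unless one is defined explicitly.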

zmbc Aug 29, 2024

My interpretation of your experiment results (with the caveat that you haven't described them in great detail) is that you ran a logistic regression without interaction terms and found it was nearly equivalent to F-S. I think that is expected -- F-S is known to be very closely related to log-linear regression, and that has been used instead of Naive Bayes. See for example Winkler 1993 and Larsen and Rubin 2001. See the latter for some statements about convergence.

I think this is more well-trodden ground than you realized, and well worth putting into Splink without caveats about it being experimental. The ability to include interaction terms is where the approach really shines, which I am guessing you did not do, or did not do on an example that really needed it. I've experienced substantial CI issues in linking pseudopeople data with parsed addresses, and could potentially put together an example where it really helps if you could share code you used for this!

Lastly, I wanted to call your attention to more recent work from the computer science literature called ZeroER. It is also an extension of F-S that includes feature dependence, though I haven't yet exactly worked out how it relates to the log-linear approach. There is a Python implementation here.
