Blocking
This page gives an overview of the blockers provided by WInte.r. Blocking (also known as indexing) reduces the number of records or attributes that have to be compared during a matching operation.
All blockers are defined in the de.uni_mannheim.informatik.dws.winter.matching.blockers package and implement either the SingleDataSetBlocker or the CrossDataSetBlocker interface.
A SingleDataSetBlocker uses one input dataset and is used, for example, in duplicate detection.
A CrossDataSetBlocker uses two input datasets and is used for identity resolution and schema matching.
In cases where no blocker should be used, the NoBlocker class can be used for identity resolution and the NoSchemaBlocker class for schema matching. These blockers simply generate all possible pairs of records.
A standard blocker uses blocking keys, which are generated from the records. Each record is assigned to one or more blocks based on the blocking key. Then, only records in the same block are compared during matching.
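The idea of key-based blocking can be sketched in plain Java. This is only an illustration of the principle, not WInte.r's StandardBlocker: the class KeyBlockingSketch, its methods, and the first-letter blocking key are hypothetical, and for simplicity each record gets exactly one key rather than several.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustrative key-based blocking; hypothetical names, not WInte.r's StandardBlocker. */
final class KeyBlockingSketch {

    /** Groups records into blocks by the key that keyFunction assigns to each record. */
    static <T> Map<String, List<T>> buildBlocks(List<T> records, Function<T, String> keyFunction) {
        Map<String, List<T>> blocks = new HashMap<>();
        for (T record : records) {
            blocks.computeIfAbsent(keyFunction.apply(record), k -> new ArrayList<>()).add(record);
        }
        return blocks;
    }

    /** Returns candidate pairs: one record from each dataset, both with the same blocking key. */
    static <T> List<List<T>> candidatePairs(List<T> left, List<T> right, Function<T, String> keyFunction) {
        Map<String, List<T>> rightBlocks = buildBlocks(right, keyFunction);
        List<List<T>> pairs = new ArrayList<>();
        for (T l : left) {
            // Only records in the block with the same key are paired with l.
            for (T r : rightBlocks.getOrDefault(keyFunction.apply(l), List.of())) {
                pairs.add(List.of(l, r));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Assumed blocking key for illustration: first character of the title, lower-cased.
        Function<String, String> firstLetter = title -> title.substring(0, 1).toLowerCase();
        List<String> left = List.of("Spirited Away", "Alien", "Seven");
        List<String> right = List.of("Spirited Away", "Aliens", "Brazil");
        System.out.println(candidatePairs(left, right, firstLetter));
    }
}
```

With these made-up titles, only pairs whose titles start with the same letter are generated, instead of all nine combinations.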
The implementation of the standard blocker is provided by the StandardBlocker class. It implements both the SingleDataSetBlocker and the CrossDataSetBlocker interface.
For convenience, the StandardRecordBlocker and StandardSchemaBlocker can be used for identity resolution and schema matching, respectively.
Both extend the StandardBlocker but have fewer type parameters, which makes them easier to use.
An example can be seen in the use case de.uni_mannheim.informatik.dws.winter.usecase.movies.Movies_IdentityResolution_Main.
The standard blocker can receive additional correspondences, which are used to generate the blocking key or are forwarded to the matching rule. A typical scenario is that the result of schema matching, i.e., a set of schema correspondences, is used for identity resolution. In this case, the schema correspondences are added to every generated pair as causal correspondences. These correspondences are then available in the matching rule that is executed after the blocker.
The value-based blocker is a variation of the standard blocker. It assumes that the blocking keys are the values from the records, using the MatchableValue class.
The implementation of the value-based blocker is provided by the ValueBasedBlocker class. It implements both the SingleDataSetBlocker and the CrossDataSetBlocker interface.
When generating pairs, causal correspondences are added for all matching values, with the number of matches for each value as the similarity score.
For convenience, the InstanceBasedRecordBlocker and InstanceBasedSchemaBlocker can be used for identity resolution and schema matching, respectively.
Both extend the ValueBasedBlocker but have fewer type parameters, which makes them easier to use.
An example can be seen in the use case de.uni_mannheim.informatik.dws.winter.usecase.movies.Movies_SimpleIdentityResolution.
In this example, the following record exists in both datasets:
"A","Spirited Away",,,"0.0","0.0","Hayao Miyazaki","2001-01-01T00:00:00.000+01:00"
"B","Spirited Away",,,"0.0","0.0","Hayao Miyazaki","2001-01-01T00:00:00.000+01:00"
The value-based blocker creates a pair for the two versions of the record with the following causal correspondences:
Hayao Miyazaki <-> Hayao Miyazaki (1,0)
Spirited Away <-> Spirited Away (1,0)
2001-01-01T00:00:00.000+01:00 <-> 2001-01-01T00:00:00.000+01:00 (1,0)
0.0 <-> 0.0 (2,0)
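The counts in this example can be reproduced with a small stand-alone sketch. It is not WInte.r's ValueBasedBlocker: the class ValueOverlapSketch is hypothetical, and it assumes that empty values are ignored and that a shared value is counted min(occurrences in A, occurrences in B) times. Those assumptions were chosen because they yield the counts shown above; the library's actual counting rule may differ.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative value-overlap count; hypothetical names, not WInte.r's ValueBasedBlocker. */
final class ValueOverlapSketch {

    /** Counts how often each non-empty value occurs in a record. */
    static Map<String, Integer> countValues(List<String> record) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : record) {
            if (value != null && !value.isEmpty()) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    /**
     * Returns the values shared by both records, each with a match count.
     * Assumption: a value matches min(occurrences in a, occurrences in b) times.
     */
    static Map<String, Integer> sharedValues(List<String> a, List<String> b) {
        Map<String, Integer> countsB = countValues(b);
        Map<String, Integer> shared = new LinkedHashMap<>();
        countValues(a).forEach((value, countA) -> {
            Integer countB = countsB.get(value);
            if (countB != null) {
                shared.put(value, Math.min(countA, countB));
            }
        });
        return shared;
    }

    public static void main(String[] args) {
        List<String> a = List.of("A", "Spirited Away", "", "", "0.0", "0.0",
                "Hayao Miyazaki", "2001-01-01T00:00:00.000+01:00");
        List<String> b = List.of("B", "Spirited Away", "", "", "0.0", "0.0",
                "Hayao Miyazaki", "2001-01-01T00:00:00.000+01:00");
        // Print each shared value with its match count.
        sharedValues(a, b).forEach((value, n) ->
                System.out.println(value + " <-> " + value + " (" + n + ")"));
    }
}
```

Under these assumptions, the value 0.0 is counted twice and the other three shared values once, while the differing identifiers A and B produce no correspondence.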
The sorted neighbourhood method sorts all records by their blocking key and then compares only those records that lie within a window of a specified size in the sorted order.
The implementation of the sorted neighbourhood method is provided by the SortedNeighbourhoodBlocker class. It implements both the SingleDataSetBlocker and the CrossDataSetBlocker interface.
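The sliding-window idea behind sorted neighbourhood blocking can be sketched as follows for a single dataset. This is a minimal illustration under stated assumptions, not WInte.r's SortedNeighbourhoodBlocker: the class SortedNeighbourhoodSketch and its method are hypothetical, and the blocking key used here (the lower-cased title) is chosen only for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

/** Illustrative sorted neighbourhood blocking; hypothetical names, not WInte.r's implementation. */
final class SortedNeighbourhoodSketch {

    /** Sorts records by blocking key and pairs each record with its successors inside the window. */
    static <T> List<List<T>> candidatePairs(List<T> records, Function<T, String> keyFunction, int windowSize) {
        List<T> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparing(keyFunction));
        List<List<T>> pairs = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            // Pair record i with at most (windowSize - 1) records that follow it in sort order.
            for (int j = i + 1; j < Math.min(i + windowSize, sorted.size()); j++) {
                pairs.add(List.of(sorted.get(i), sorted.get(j)));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> titles = List.of("Seven", "Alien", "Spirited Away", "Aliens", "Brazil");
        // Assumed blocking key for illustration: the lower-cased title.
        System.out.println(candidatePairs(titles, String::toLowerCase, 2));
    }
}
```

A larger window generates more candidate pairs and is therefore less likely to miss true matches, at the cost of more comparisons.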
The two most important measures for evaluating and comparing blocking methods are reduction ratio and pairs completeness. The reduction ratio is the fraction of all possible pairs that the blocking method filters out as non-matches. The pairs completeness is the fraction of the true matches that are still present after blocking. Pairs completeness thus indicates whether the blocker incorrectly filters out true matches.
WInte.r writes the reduction ratio to the console when a blocker is executed. To measure pairs completeness, you need a gold standard containing a set of true matches. Using this gold standard, you can measure how many of its matches are contained in the pairs generated by the blocker and use the resulting value as an estimate of pairs completeness. Measuring pairs completeness thus involves two steps: executing the blocker, and evaluating the pairs produced by the blocker against a gold standard. This process is implemented as follows:
- Execute the blocker by calling the runBlocking method of the MatchingEngine.
// Initialize the matching engine
MatchingEngine<Movie, Attribute> engine = new MatchingEngine<>();
// Execute blocking
Processable<Correspondence<Movie, Attribute>> correspondences = engine.runBlocking(
        dataAcademyAwards, dataActors, null, blocker);
- Evaluate the completeness of the blocked pairs.
The blocked pairs, i.e. the correspondences, are now evaluated against a gold standard using the MatchingEvaluator. To learn more about how to construct such a gold standard for matching, please refer to the article on general evaluation.
// Evaluate result
MatchingEvaluator<Movie, Attribute> evaluator = new MatchingEvaluator<Movie, Attribute>();
Performance perfTest = evaluator.evaluateMatching(correspondences.get(), gsTest);
The results of the evaluation are printed to the console.
Academy Awards <-> Actors
Precision: 0.0114
Recall: 0.8085
F1: 0.0225
The recall value is the pairs completeness you are looking for.
The precision can be interpreted as pairs quality, which is a secondary, less relevant measure.
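The two measures defined above can also be written down as a short stand-alone computation. The following is a minimal sketch, not part of WInte.r: the class BlockingQualitySketch, the encoding of a pair as an "id|id" string, and all numbers are made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative blocking quality measures; hypothetical names and made-up data. */
final class BlockingQualitySketch {

    /** Fraction of all possible pairs that blocking removed: 1 - candidatePairs / allPairs. */
    static double reductionRatio(long candidatePairs, long allPairs) {
        return 1.0 - (double) candidatePairs / allPairs;
    }

    /** Fraction of the gold-standard matches that are still among the candidate pairs. */
    static double pairsCompleteness(Set<String> candidatePairs, Set<String> trueMatches) {
        Set<String> found = new HashSet<>(trueMatches);
        found.retainAll(candidatePairs);
        return (double) found.size() / trueMatches.size();
    }

    public static void main(String[] args) {
        // Made-up sizes: datasets with 100 and 50 records, blocker keeps 500 of the 5000 pairs.
        System.out.println("Reduction ratio: " + reductionRatio(500, 100L * 50));

        // Made-up pairs, each encoded as "leftId|rightId".
        Set<String> candidates = Set.of("a1|b1", "a2|b2", "a3|b9");
        Set<String> gold = Set.of("a1|b1", "a2|b2", "a4|b4", "a5|b5");
        System.out.println("Pairs completeness: " + pairsCompleteness(candidates, gold));
    }
}
```

In this made-up example the blocker removes 90% of all pairs but retains only two of the four true matches, which illustrates the trade-off between the two measures.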