URI matching

As an evaluation platform for tasks like D2KB, GERBIL has to check whether the URIs returned by an annotator match the URIs in the gold standard. This is not trivial, since in principle many different URIs can point to the same entity. The problem is usually avoided by requiring all annotators to use a single central knowledge base (KB) for their annotations, e.g., the Wikipedia IDs used by the BAT-framework [1].

One of our main goals is to use the strengths of the Semantic Web to keep GERBIL KB-agnostic. We therefore handle this problem differently, based on URI sets, sameAs link retrieval and URI classification.

URIs represent entities, and two entities are either exactly the same or not. We assume that two URIs pointing to the same entity are connected by an owl:sameAs property and that no other connection between them is relevant for the matching.

URI sets

In GERBIL an entity can have not just a single URI but a set of URIs assigned to it. If two URIs are in the same set, they are assumed to be connected by an owl:sameAs property.

sameAs link retrieval

After an entity has been read from a gold standard or the answer of an annotator, its URI set is extended by retrieving additional URIs for this entity. This can be done by

using owl:sameAs triples defined inside the gold standard or the annotator's response,
dereferencing the URI and searching for triples containing owl:sameAs as property, or
using existing mappings between established KBs, e.g., entity mappings between DBpedia and YAGO (possible, but not implemented yet).
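
The first option in the list above can be sketched as follows. This is only an illustration under stated assumptions, not GERBIL's actual implementation: the function name extend_uri_set, the representation of triples as Python tuples, the example.org URI and the choice to follow sameAs links transitively and in both directions are all hypothetical. No URI is dereferenced; the triples are given in memory.

```python
from collections import deque

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def extend_uri_set(uris, triples):
    """Return the given URIs plus every URI reachable via owl:sameAs triples.

    Hypothetical helper: `triples` is an iterable of (subject, predicate,
    object) string tuples, e.g. taken from a gold standard file.
    """
    # owl:sameAs is symmetric, so index every link in both directions.
    neighbours = {}
    for subject, predicate, obj in triples:
        if predicate == OWL_SAME_AS:
            neighbours.setdefault(subject, set()).add(obj)
            neighbours.setdefault(obj, set()).add(subject)

    # Breadth-first traversal over the sameAs links (assumption: links are
    # followed transitively).
    extended = set(uris)
    queue = deque(extended)
    while queue:
        current = queue.popleft()
        for linked in neighbours.get(current, ()):
            if linked not in extended:
                extended.add(linked)
                queue.append(linked)
    return extended


triples = [
    ("http://dbpedia.org/resource/Berlin", OWL_SAME_AS,
     "http://www.wikidata.org/entity/Q64"),
    ("http://www.wikidata.org/entity/Q64", OWL_SAME_AS,
     "http://example.org/kb/Berlin"),
    # Triples with other predicates are ignored.
    ("http://dbpedia.org/resource/Berlin",
     "http://www.w3.org/2000/01/rdf-schema#label", "Berlin"),
]
berlin = extend_uri_set({"http://dbpedia.org/resource/Berlin"}, triples)
print(sorted(berlin))
```

Starting from the single DBpedia URI, the sketch collects all three URIs into one set, while the rdfs:label triple contributes nothing.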

URI classification

Most of the tasks implemented in GERBIL aim at linking entities to a KB. For those tasks the entities can be divided into two groups. The first group comprises all entities that are present in the KB; at least one URI of the KB can be assigned to the URI set of each of these entities. The second group contains all entities that are unknown to the KB; no URI of the KB can be assigned to their URI sets.

Every URI set can be classified as either known or unknown to the KB depending on whether the set contains a URI that is part of the KB or not.
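
A minimal sketch of this classification is shown below. It assumes that "part of the KB" can be decided by a namespace prefix check; the text above does not say how GERBIL decides this, so the KB_NAMESPACES constant and the function name is_known_to_kb are hypothetical.

```python
# Assumption: a URI belongs to the KB if it starts with one of these
# namespaces. DBpedia is used because it is the KB in the examples on this page.
KB_NAMESPACES = ("http://dbpedia.org/resource/",)


def is_known_to_kb(uri_set, kb_namespaces=KB_NAMESPACES):
    """Classify a URI set: True if at least one URI is part of the KB."""
    # str.startswith accepts a tuple of prefixes.
    return any(uri.startswith(kb_namespaces) for uri in uri_set)


known = is_known_to_kb({
    "http://dbpedia.org/resource/Berlin",
    "http://www.wikidata.org/entity/Q64",
})
unknown = is_known_to_kb({"http://aksw.org/notInWiki/Berlin"})
print(known, unknown)
```

A single KB URI is enough to classify the whole set as known, regardless of how many other URIs the set contains.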

Matching definition

Two URI sets A and B match

if both sets are classified as known to the KB and they overlap, i.e., share at least one URI,
or if both sets are classified as unknown to the KB.
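
The definition can be written down as a short predicate. The sketch below is illustrative only; the namespace-based classification, the KB_NAMESPACES constant and both function names are assumptions rather than GERBIL's API.

```python
# Assumption: membership in the KB is decided by a namespace prefix check.
KB_NAMESPACES = ("http://dbpedia.org/resource/",)


def is_known_to_kb(uri_set):
    """True if the set contains at least one URI of the KB."""
    return any(uri.startswith(KB_NAMESPACES) for uri in uri_set)


def uri_sets_match(a, b):
    """Apply the matching definition to two URI sets (Python sets of strings)."""
    a_known = is_known_to_kb(a)
    b_known = is_known_to_kb(b)
    if a_known and b_known:
        # Both known to the KB: they match only if they share a URI.
        return not a.isdisjoint(b)
    # Otherwise they match only if both are unknown to the KB.
    return not a_known and not b_known


same_entity = uri_sets_match(
    {"http://dbpedia.org/resource/Berlin", "http://www.wikidata.org/entity/Q64"},
    {"http://dbpedia.org/resource/Berlin"},
)
both_unknown = uri_sets_match(
    {"http://aksw.org/notInWiki/Berlin"},
    {"http://example.org/unknown/Berlin_X"},
)
print(same_entity, both_unknown)
```

Note that two unknown sets match even when they have no URI in common, because the definition only asks whether both entities are outside the KB.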

Consequences

This matching has consequences for the quality measures computed by GERBIL. Most of these measures are based on counts of true positives (tp), false positives (fp) and false negatives (fn). The following table contains one example for each scenario that can be distinguished in such a matching (in a D2KB task).

ID	gold standard	annotator	counted as
1	http://dbpedia.org/resource/Berlin	http://dbpedia.org/resource/Berlin	tp
2	http://dbpedia.org/resource/Berlin	http://dbpedia.org/resource/Berlin_2	fp, fn
3	http://dbpedia.org/resource/Berlin	http://aksw.org/notInWiki/Berlin	fp, fn
4	http://dbpedia.org/resource/Berlin	null	fn
5	http://aksw.org/notInWiki/Berlin	http://dbpedia.org/resource/Berlin	fp, fn
6	http://aksw.org/notInWiki/Berlin	http://aksw.org/notInWiki/Berlin	tp
7	http://aksw.org/notInWiki/Berlin	http://example.org/unknown/Berlin_X	tp
8	http://aksw.org/notInWiki/Berlin	null	fn

The URI sets in both columns (gold standard and annotator) could contain many more URIs. For these examples, however, we concentrate on sets with a single URI. The examples can be described as follows:

In the first 4 examples the gold standard contains an entity that can be assigned a URI of the KB (DBpedia).
1. The result of the annotator contains the correct URI.
2. The result of the annotator contains a different URI that is part of the KB, too.
3. The result of the annotator contains a URI from an unknown KB.
4. The result of the annotator contains no URI for this entity.
In the last 4 examples the gold standard contains an entity that is not known to the KB and has an artificially generated URI.
5. The result of the annotator contains the URI of an entity known to the KB.
6. The result of the annotator contains exactly the same URI.
7. The result of the annotator contains a different URI which is classified as unknown to the KB.
8. The result of the annotator contains no URI for this entity.
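
The eight rows of the table can be reproduced with a small sketch. It is meant as a plausibility check of the rules described above, not as GERBIL's code: the namespace-based classification, the function names and the use of None for a missing annotator answer are assumptions.

```python
# Assumption: DBpedia is the KB and membership is decided by namespace prefix.
KB_NAMESPACES = ("http://dbpedia.org/resource/",)


def is_known_to_kb(uri_set):
    return any(uri.startswith(KB_NAMESPACES) for uri in uri_set)


def uri_sets_match(a, b):
    a_known, b_known = is_known_to_kb(a), is_known_to_kb(b)
    if a_known and b_known:
        return not a.isdisjoint(b)
    return not a_known and not b_known


def count_outcome(gold, answer):
    """Return the counts caused by one gold entity and the annotator's answer."""
    if answer is None:
        return ("fn",)        # nothing returned for this entity
    if uri_sets_match(gold, answer):
        return ("tp",)
    return ("fp", "fn")       # wrong answer and missed gold entity


DBP = "http://dbpedia.org/resource/"
NIW = "http://aksw.org/notInWiki/"
rows = [
    (1, DBP + "Berlin", DBP + "Berlin"),
    (2, DBP + "Berlin", DBP + "Berlin_2"),
    (3, DBP + "Berlin", NIW + "Berlin"),
    (4, DBP + "Berlin", None),
    (5, NIW + "Berlin", DBP + "Berlin"),
    (6, NIW + "Berlin", NIW + "Berlin"),
    (7, NIW + "Berlin", "http://example.org/unknown/Berlin_X"),
    (8, NIW + "Berlin", None),
]

results = {}
for row_id, gold_uri, answer_uri in rows:
    answer = None if answer_uri is None else {answer_uri}
    results[row_id] = count_outcome({gold_uri}, answer)
    print(row_id, ", ".join(results[row_id]))
```

Row 7 is the case that differs most from plain string comparison: the two URIs are different, but both sets are unknown to the KB, so the pair is counted as a true positive.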

A practical consequence is that the evaluation of an annotator needs more processing time and memory because GERBIL tries to dereference every entity URI of the gold standard and the annotator.

References

[1] Marco Cornolti, Paolo Ferragina, Massimiliano Ciaramita. A Framework for Benchmarking Entity-Annotation Systems. In Proceedings of the International World Wide Web Conference (WWW) (Practice & Experience Track), ACM (2013).
