Wikidata identifier: use pre-processing mechanism to remove duplicate signatures? #215
-
Would this mean that if multiple formats share a signature, one of them would keep it, or would the signature be removed from all of the formats?
-
In terms of implementation, this would go most naturally into https://github.com/richardlehane/siegfried/blob/main/internal/identifier/parseable.go. You'd just create a new Parseable type that embeds the base and overrides the Signatures() method with a new version that trims duplicates. You could then add a config setting that would be picked up by the ApplyConfig() function in the same file. As a config setting it would potentially be available to all Identifier types, but could be made the default for Wikidata.
Here's some pseudo code (the isDuplicate function isn't optimised, but maybe this doesn't matter, as for anything that happens in roy I never really care about performance too much)...
type NoDuplicates struct{ Parseable }
// isDuplicate reports whether sig equals any signature in prev or succ.
func isDuplicate(sig frames.Signature, prev, succ []frames.Signature) bool {
	for _, sig1 := range prev {
		if sig.Equals(sig1) {
			return true
		}
	}
	for _, sig2 := range succ {
		if sig.Equals(sig2) {
			return true
		}
	}
	return false
}

// Signatures returns a signature set with corresponding IDs for the bytematcher.
// Any signature that occurs more than once is dropped, along with its ID.
func (nd NoDuplicates) Signatures() ([]frames.Signature, []string, error) {
	sigs, ids, err := nd.Parseable.Signatures()
	if err != nil {
		return sigs, ids, err
	}
	rsigs := make([]frames.Signature, 0, len(sigs))
	rids := make([]string, 0, len(sigs))
	for i, v := range sigs {
		// Compare v with everything before and after it. Slicing at either
		// end of sigs yields an empty slice, so no bounds checks are needed.
		if isDuplicate(v, sigs[:i], sigs[i+1:]) {
			continue
		}
		rsigs = append(rsigs, v)
		rids = append(rids, ids[i])
	}
	return rsigs, rids, nil
}
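As a runnable illustration of what the pseudo code above does to shared signatures, here is a minimal sketch. It is not siegfried code: the `sig` type is a hypothetical stand-in for frames.Signature, the IDs are made up, and a counting map is swapped in for the pairwise Equals comparison. A real version would need some comparable key derived from each signature for the map to work.

```go
package main

import "fmt"

// sig is a hypothetical stand-in for frames.Signature: just the byte
// sequence written as a hex string, so equality is plain ==.
type sig string

// dropShared returns only the signatures that occur exactly once in
// sigs, together with their IDs. Every copy of a shared signature is
// removed, so no format keeps it.
func dropShared(sigs []sig, ids []string) ([]sig, []string) {
	counts := make(map[sig]int, len(sigs))
	for _, s := range sigs {
		counts[s]++
	}
	rsigs := make([]sig, 0, len(sigs))
	rids := make([]string, 0, len(ids))
	for i, s := range sigs {
		if counts[s] > 1 {
			continue
		}
		rsigs = append(rsigs, s)
		rids = append(rids, ids[i])
	}
	return rsigs, rids
}

func main() {
	// Made-up data: id-A and id-C share the same byte sequence.
	sigs := []sig{"504B0304", "25504446", "504B0304"}
	ids := []string{"id-A", "id-B", "id-C"}
	rsigs, rids := dropShared(sigs, ids)
	fmt.Println(rsigs, rids) // prints "[25504446] [id-B]"
}
```

Because each signature is checked against both its predecessors and its successors, the shared signature is removed from all of the formats rather than kept by one of them. Keeping the first occurrence instead would mean checking only the preceding signatures.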
-
For reference ffdev-info/wikidp-issues#32.
The Wikidata identifier includes a -wikidatadebug flag which outputs some of the tool's pre-processing messages. Using the pre-processing capabilities of the Wikidata implementation in Roy, we can also identify duplicate BOF (beginning-of-file) sequences and remove these from the dataset. This should improve the accuracy of Siegfried's results, at the expense of some files not being identified (by less-specific signatures).
Why would we ever do this?
We would be following a model adopted by PRONOM, which tries to avoid returning multiple identifications. In Wikidata
What are the benefits?
What are the drawbacks?
What are the alternatives?
e.g.
basis : 'byte match at 0, 5 (Wikidata reference is empty) (DUPLICATE SIGNATURE PATTERN)'
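One way to read the example basis above is that duplicates are flagged in the output rather than removed. The following is only a sketch of that idea under stated assumptions: the function names are hypothetical, patterns are plain hex strings rather than real signature values, and nothing here reflects how siegfried actually builds its basis strings. Only the marker text and the sample basis are taken from the example.

```go
package main

import "fmt"

// duplicateNote is the marker text taken from the example basis above.
const duplicateNote = "(DUPLICATE SIGNATURE PATTERN)"

// duplicated reports, per index, whether the pattern at that index
// also appears elsewhere in patterns. Patterns are plain strings here,
// a hypothetical stand-in for real signature values.
func duplicated(patterns []string) []bool {
	counts := make(map[string]int, len(patterns))
	for _, p := range patterns {
		counts[p]++
	}
	flags := make([]bool, len(patterns))
	for i, p := range patterns {
		flags[i] = counts[p] > 1
	}
	return flags
}

// annotate appends the marker to a basis string when dup is true and
// returns the basis unchanged otherwise.
func annotate(basis string, dup bool) string {
	if !dup {
		return basis
	}
	return basis + " " + duplicateNote
}

func main() {
	// Made-up data: the first and third patterns are identical.
	patterns := []string{"255044462D", "0011223344", "255044462D"}
	flags := duplicated(patterns)
	for i := range patterns {
		fmt.Println(annotate("byte match at 0, 5 (Wikidata reference is empty)", flags[i]))
	}
}
```

With this approach every file is still identified, and the reader of the results decides how much to trust a match that rests on a shared pattern.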