Deduplication v2 #10
Following review, we still observe too many "duplicates" for this to be sensible in GBIF. Can we please add a filter that will exclude records that are missing a |
Despite our best effort, this doesn't seem to be enough to remove all the "duplicates". Here are some examples of records that are very suspicious:
There are also records that correspond to the same specimens. For example: https://www.gbif-uat.org/occurrence/search?catalog_number=MSB:Mamm:250248&dataset_key=10628730-87d4-42f5-b593-bd438185517f&taxon_key=5218864&advanced=1
Proposed solution
Deduplication of records at such scale is like renovation: one can only approach perfection asymptotically. Marie's suggestion above sounds like a pragmatic way forward, and if constructive feedback comes in we can revisit this.
@ManonGros what is |
Yes
Thank you. This approach actually complicates things, since we write record by record
Only one should be left? Or exclude all of them?
Hm, this is not easy. As Marie says, 1000 is an artificial cut-off. We use it as a sign of suspected duplication. Total exclusion at import sounds too harsh, but flattening 1000+ -> 1 will have a funny effect on the 999- counts.
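To make the two options being weighed here concrete, below is a minimal Python sketch, not the adapter's actual code. The function name `apply_cutoff`, the `policy` values and the `key` callable are all hypothetical; in particular, which field(s) the records are grouped on is an assumption left to the caller. Only the cut-off of 1000 comes from this thread, and "1000+" is read here as "1000 or more".

```python
from collections import Counter

CUTOFF = 1000  # the artificial cut-off discussed in this thread


def apply_cutoff(records, key, policy, cutoff=CUTOFF):
    """Handle groups of records that share a key and reach the cut-off.

    policy="exclude": drop every record of an oversized group.
    policy="flatten": keep only the first record of an oversized group.
    Groups below the cut-off are always kept untouched.
    """
    if policy not in ("exclude", "flatten"):
        raise ValueError("policy must be 'exclude' or 'flatten'")
    records = list(records)  # we need two passes: count, then filter
    counts = Counter(key(r) for r in records)
    kept = []
    already_flattened = set()
    for r in records:
        k = key(r)
        if counts[k] < cutoff:
            kept.append(r)
        elif policy == "flatten" and k not in already_flattened:
            already_flattened.add(k)
            kept.append(r)
    return kept
```

Note that both policies need the group sizes before any record can be written, which is the complication with writing record by record raised in this thread. With "flatten", a group of 1000 collapses to a single record while a group of 999 keeps all of them, which is the odd effect on counts mentioned here.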
OK so after checking and debating, here is what we want:
Step 1: use the deduplication method created by EMBL:
Step 2: remove records that are missing information
Record has a
Step 3: Aggregate the records per
The only field that could be stored in an extension is |
Additions to the existing processing of the sequence data
Add sample_accession to fields in the query. See: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md#example-queries
For sample_accession, you should check if you have already seen the same sample_accession AND the scientific_name in a previous record. Could probably just be a HashMap (you don't need to store the records, just the combination of sample_accession AND scientific_name). If the combination was seen before, skip the record.
Additional data / New query
Use the same query as for result=sequence but set result=wgs_set
Remove sequence_md5 from fields (not present in this result format)
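The check on sample_accession and scientific_name described above could look roughly like the following Python sketch. This is an illustration only, not the adapter's code: the function name `skip_seen` is hypothetical, records are assumed to be dicts keyed by field name, and passing through records that have no sample_accession is an assumption, since the thread does not say how those should be handled. A Python set plays the role of the suggested HashMap.

```python
def skip_seen(records):
    """Yield records, skipping any whose (sample_accession, scientific_name)
    combination was already seen in a previous record.

    Only the combinations are stored, not the records themselves.
    """
    seen = set()
    for record in records:
        accession = record.get("sample_accession")
        if not accession:
            # Assumption: without a sample_accession there is nothing
            # to compare on, so the record is passed through unchanged.
            yield record
            continue
        combination = (accession, record.get("scientific_name"))
        if combination in seen:
            continue  # combination seen before: skip the record
        seen.add(combination)
        yield record
```

Because this is a generator that keeps only the set of combinations in memory, it fits a pipeline that writes record by record: each record is either emitted or skipped as soon as it is read.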