
Deduplication v2 #10

Closed
thomasstjerne opened this issue May 5, 2021 · 10 comments

@thomasstjerne
Contributor

Additions to the existing processing of the sequence data

  1. When querying sequence data from the EMBL API, add a filter to exclude CONTIGs, e.g. query=country="*"%20AND%20dataclass!="CON" to get data where country exists but CONTIGs are excluded.
  2. Add sample_accession to fields in the query. See: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md#example-queries
  3. You need to add a "seen before" filter when you iterate through the data. If a record has a sample_accession, you should check whether you have already seen the same sample_accession AND scientific_name in a previous record. This could probably just be a HashMap (you don't need to store the records, just the combination of sample_accession AND scientific_name). If the combination was seen before, skip the record.
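The "seen before" filter in step 3 can be sketched as follows. This is a minimal illustration, assuming records arrive as dicts keyed by the EMBL field names used in this thread; records without a sample_accession simply pass through.

```python
def dedupe(records):
    """Yield records, skipping any whose (sample_accession, scientific_name)
    combination was already seen in an earlier record."""
    seen = set()  # stores only the key combinations, not the records
    for record in records:
        sample = record.get("sample_accession")
        if sample:
            key = (sample, record.get("scientific_name"))
            if key in seen:
                continue  # combination seen before: skip the record
            seen.add(key)
        yield record
```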

Additional data / New query

  1. Use the same API queries as with result=sequence but set result=wgs_set
  2. You have to remove sequence_md5 from fields (not present in this result format)
  3. This data should be added to the above/existing data (replacing what was filtered out by the CON filter)
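The two query variants (result=sequence with the CON filter, and result=wgs_set without sequence_md5) can be sketched together. The parameter names follow the example queries linked above; the field list here is illustrative, not the adapter's actual configuration.

```python
from urllib.parse import urlencode

# Illustrative field list; the real adapter requests more columns.
SEQUENCE_FIELDS = ["accession", "sample_accession", "scientific_name", "sequence_md5"]

def build_query(result_type):
    """Build the query string for either result=sequence or result=wgs_set."""
    fields = list(SEQUENCE_FIELDS)
    query = 'country="*"'
    if result_type == "sequence":
        query += ' AND dataclass!="CON"'  # exclude CONTIGs
    else:
        fields.remove("sequence_md5")  # not present in the wgs_set result format
    return urlencode({"result": result_type, "query": query,
                      "fields": ",".join(fields)})
```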
@mike-podolskiy90 mike-podolskiy90 self-assigned this May 6, 2021
mike-podolskiy90 added a commit that referenced this issue May 10, 2021
mike-podolskiy90 added a commit that referenced this issue May 11, 2021
@timrobertson100
Member

timrobertson100 commented May 21, 2021

Following review, we still observe too many "duplicates" for this to be sensible in GBIF. Can we please add a filter that will exclude records that are missing a specimen_voucher (i.e. no specimen in a museum) AND are missing a date OR location?
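Read together with the rules spelled out later in this thread, the filter amounts to: keep a record if it has a specimen_voucher, or if it has both a date and a location. A sketch, assuming EMBL column names and treating either coordinates or country as "location":

```python
def keep_record(record):
    """Keep records with a specimen_voucher, or with both a date and a location.
    Everything else is excluded as a likely duplicate."""
    if record.get("specimen_voucher"):
        return True  # a specimen in a museum backs this record
    has_date = bool(record.get("collection_date"))
    # Assumption: location means coordinates and/or country
    has_location = bool(record.get("lat")) or bool(record.get("country"))
    return has_date and has_location
```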

@ManonGros
Contributor

Despite our best efforts, this doesn't seem to be enough to remove all the "duplicates". Here are some examples of very suspicious records:

There are also records that correspond to the same specimens. For example: https://www.gbif-uat.org/occurrence/search?catalog_number=MSB:Mamm:250248&dataset_key=10628730-87d4-42f5-b593-bd438185517f&taxon_key=5218864&advanced=1

Proposed solution

  1. Aggregate the records that have the same catalogue number (specimen_voucher)
  2. Check the number of records grouped by species/date/location. If we have more than 1,000 records for a given species at a given date and location, I would suggest to exclude them from the import.

This number (1,000 records) is completely arbitrary. That said, it seems to me that a research project able to sequence more than 1,000 individuals of the same species at the same place and date is probably quite rare. There could be big colonies of very small species living in the same place, with every individual of the colony sequenced, but I have never heard of such a case. Please correct me if I am wrong.

@dschigel

Deduplication of records at such scale is like renovation, one can only approach perfection asymptotically. Marie's suggestion above sounds like a pragmatic way forward, and if constructive feedback comes we would be able to revisit this.

@mike-podolskiy90
Contributor

Check the number of records grouped by species/date/location. If we have more than 1,000 records for a given species at a given date and location, I would suggest to exclude them from the import.

@ManonGros what is species field here please? tax_id ?

@ManonGros
Contributor

Yes tax_id or scientific_name, I think that they should give the same result in the end.

@mike-podolskiy90
Contributor

Thank you. This approach actually complicates things, since we write record by record.

@mike-podolskiy90
Contributor

mike-podolskiy90 commented May 31, 2021

I would suggest to exclude them from the import

Only one should be left? Or exclude all of them?

@dschigel

Hm, this is not easy. As Marie says, 1,000 is an artificial cut-off; we use it as a sign of suspected duplication. Total exclusion at import sounds too harsh, but flattening 1000+ -> 1 would have a funny effect on the 999- counts.

@ManonGros
Contributor

ManonGros commented Jun 7, 2021

OK so after checking and debating, here is what we want:

Step 1: use the deduplication method created by EMBL:

#10 (comment)

Step 2: remove records that are missing information

Record has a specimen_voucher = KEEP
Record has a date and location (coordinates and/or country) = KEEP
Everything else = exclude

Step 3: Aggregate the records per specimen_voucher and scientific_name

We should have 1 record per specimen_voucher and scientific_name.
I assume that we should use an extension to encompass the information corresponding to each EMBL entry. @thomasstjerne what would you recommend?
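The aggregation in Step 3 can be sketched like this. It is illustrative only: it keeps the first record per (specimen_voucher, scientific_name) pair and merely counts the collapsed EMBL entries, whereas the eventual implementation would move them into an extension.

```python
def aggregate_by_voucher(records):
    """Collapse records to one per (specimen_voucher, scientific_name) pair.

    Keeping the first record and counting the rest is an illustrative choice;
    the collapsed entries would really go into an extension.
    """
    kept = {}
    for record in records:
        key = (record.get("specimen_voucher"), record.get("scientific_name"))
        if key not in kept:
            kept[key] = dict(record, n_entries=1)
        else:
            kept[key]["n_entries"] += 1
    return list(kept.values())
```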

Step 4: Group the records left and exclude some of the groups based on number of records

  1. Group the records per scientific_name, collection_date, location, country, identified_by, collected_by, sample_accession
  2. If the number of records in a given group is above a threshold, exclude all records belonging to that group.

Given the distribution of the data, and in an attempt not to exclude too many records, we decided on a threshold of 50. See plot below:
[plot: distribution_records_embl]
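Step 4 above can be sketched with a two-pass count-then-filter, assuming the records are dicts with the EMBL field names listed in step 4.1:

```python
from collections import Counter

GROUP_FIELDS = ("scientific_name", "collection_date", "location",
                "country", "identified_by", "collected_by", "sample_accession")

def apply_group_threshold(records, threshold=50):
    """Drop every record belonging to a group with more than `threshold` members."""
    key = lambda r: tuple(r.get(f) for f in GROUP_FIELDS)
    counts = Counter(key(r) for r in records)          # pass 1: count each group
    return [r for r in records if counts[key(r)] <= threshold]  # pass 2: filter
```

Note this needs all records in memory (or a prior counting pass), which touches the record-by-record concern raised above.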

@thomasstjerne
Contributor Author

The only field that could be stored in an extension is sample_accession. Instead, I added it to the grouping in step 4.1, which will produce occurrences for each unique materialSampleID.
We can introduce an extension at a later stage, as this would require an additional API call per sequence.
