
Deduplication v2 #10

Closed
thomasstjerne opened this issue May 5, 2021 · 10 comments

@thomasstjerne
Contributor

Additions to the existing processing of the sequence data

  1. When querying sequence data from the EMBL API, add a filter to exclude CONTIGs, e.g. query=country="*"%20AND%20dataclass!="CON" to get data where country exists but CONTIGs are excluded.
  2. Add sample_accession to fields in the query. See: https://github.com/gbif/embl-adapter/blob/master/DATAMAPPING.md#example-queries
  3. You need to add a "seen before" filter when you iterate through the data. If a record has a sample_accession, you should check whether you have already seen the same sample_accession AND scientific_name in a previous record. This could probably just be a HashMap (you don't need to store the records, just the combination of sample_accession AND scientific_name). If the combination was seen before, skip the record.
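The "seen before" filter in step 3 can be sketched as follows. This is a minimal illustration, assuming records arrive as dicts keyed by the EMBL field names used in this thread; records without a sample_accession simply pass through.

```python
def dedupe(records):
    """Yield records, skipping any whose (sample_accession, scientific_name)
    combination was already seen in an earlier record."""
    seen = set()  # stores only the key combinations, not the records
    for record in records:
        sample = record.get("sample_accession")
        if sample:
            key = (sample, record.get("scientific_name"))
            if key in seen:
                continue  # combination seen before: skip the record
            seen.add(key)
        yield record
```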

Additional data / New query

  1. Use the same API queries as with result=sequence but set result=wgs_set
  2. You have to remove sequence_md5 from fields (not present in this result format)
  3. This data should be added to the above/existing data (replacing what was filtered out by the CON filter)
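The two query variants (result=sequence with the CON filter, and result=wgs_set without sequence_md5) can be sketched together. The parameter names follow the example queries linked above; the field list here is illustrative, not the adapter's actual configuration.

```python
from urllib.parse import urlencode

# Illustrative field list; the real adapter requests more columns.
SEQUENCE_FIELDS = ["accession", "sample_accession", "scientific_name", "sequence_md5"]

def build_query(result_type):
    """Build the query string for either result=sequence or result=wgs_set."""
    fields = list(SEQUENCE_FIELDS)
    query = 'country="*"'
    if result_type == "sequence":
        query += ' AND dataclass!="CON"'  # exclude CONTIGs
    else:
        fields.remove("sequence_md5")  # not present in the wgs_set result format
    return urlencode({"result": result_type, "query": query,
                      "fields": ",".join(fields)})
```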
@mike-podolskiy90 mike-podolskiy90 self-assigned this May 6, 2021
mike-podolskiy90 added a commit that referenced this issue May 10, 2021
mike-podolskiy90 added a commit that referenced this issue May 11, 2021
@timrobertson100
Member

timrobertson100 commented May 21, 2021

Following review, we still observe too many "duplicates" for this to be sensible in GBIF. Can we please add a filter that will exclude records that are missing a specimen_voucher (i.e. no specimen in a museum) AND are missing a date OR location?
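Read together with the rules spelled out later in this thread, the filter amounts to: keep a record if it has a specimen_voucher, or if it has both a date and a location. A sketch, assuming EMBL column names and treating either coordinates or country as "location":

```python
def keep_record(record):
    """Keep records with a specimen_voucher, or with both a date and a location.
    Everything else is excluded as a likely duplicate."""
    if record.get("specimen_voucher"):
        return True  # a specimen in a museum backs this record
    has_date = bool(record.get("collection_date"))
    # Assumption: location means coordinates and/or country
    has_location = bool(record.get("lat")) or bool(record.get("country"))
    return has_date and has_location
```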

@ManonGros
Contributor

Despite our best efforts, this doesn't seem to be enough to remove all the "duplicates". Here are some examples of very suspicious records:

There are also records that correspond to the same specimens. For example: https://www.gbif-uat.org/occurrence/search?catalog_number=MSB:Mamm:250248&dataset_key=10628730-87d4-42f5-b593-bd438185517f&taxon_key=5218864&advanced=1

Proposed solution

  1. Aggregate the records that have the same catalogue number (specimen_voucher)
  2. Check the number of records grouped by species/date/location. If we have more than 1,000 records for a given species at a given date and location, I would suggest to exclude them from the import.

This number (1,000 records) is completely arbitrary. That said, it seems to me that a research project able to sequence more than 1,000 individuals of the same species at the same place and date is probably quite rare. There could be big colonies of very small species living in the same place, with every individual of the colony sequenced, but I have never heard of such a case. Please correct me if I am wrong.

@dschigel

Deduplication of records at such scale is like renovation, one can only approach perfection asymptotically. Marie's suggestion above sounds like a pragmatic way forward, and if constructive feedback comes we would be able to revisit this.

@mike-podolskiy90
Contributor

Check the number of records grouped by species/date/location. If we have more than 1,000 records for a given species at a given date and location, I would suggest to exclude them from the import.

@ManonGros what is species field here please? tax_id ?

@ManonGros
Contributor

Yes tax_id or scientific_name, I think that they should give the same result in the end.

@mike-podolskiy90
Contributor

Thank you. This approach actually complicates things, since we write record by record.

@mike-podolskiy90
Contributor

mike-podolskiy90 commented May 31, 2021

I would suggest to exclude them from the import

Only one should be left? Or exclude all of them?

@dschigel

Hm, this is not easy. As Marie says, 1,000 is an artificial cut-off; we use it as a sign of suspected duplication. Total exclusion at import sounds too harsh, but flattening 1000+ -> 1 would have a funny effect on the 999- counts.

@ManonGros
Contributor

ManonGros commented Jun 7, 2021

OK so after checking and debating, here is what we want:

Step 1: use the deduplication method created by EMBL:

#10 (comment)

Step 2: remove records that are missing information

Record has a specimen_voucher = KEEP
Record has a date and location (coordinates and/or country) = KEEP
Everything else = exclude

Step 3: Aggregate the records per specimen_voucher and scientific_name

We should have 1 record per specimen_voucher and scientific_name.
I assume that we should use an extension to encompass the information corresponding to each EMBL entry. @thomasstjerne what would you recommend?
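The aggregation in Step 3 can be sketched like this. It is illustrative only: it keeps the first record per (specimen_voucher, scientific_name) pair and merely counts the collapsed EMBL entries, whereas the eventual implementation would move them into an extension.

```python
def aggregate_by_voucher(records):
    """Collapse records to one per (specimen_voucher, scientific_name) pair.

    Keeping the first record and counting the rest is an illustrative choice;
    the collapsed entries would really go into an extension.
    """
    kept = {}
    for record in records:
        key = (record.get("specimen_voucher"), record.get("scientific_name"))
        if key not in kept:
            kept[key] = dict(record, n_entries=1)
        else:
            kept[key]["n_entries"] += 1
    return list(kept.values())
```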

Step 4: Group the records left and exclude some of the groups based on number of records

  1. Group the records per scientific_name, collection_date, location, country, identified_by, collected_by, sample_accession
  2. If the number of records in a given group is above a threshold, exclude all records belonging to that group.

Given the distribution of the data, and in an attempt not to exclude too many records, we decided on a threshold of 50. See plot below:
[plot: distribution_records_embl]
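Step 4 above can be sketched with a two-pass count-then-filter, assuming the records are dicts with the EMBL field names listed in step 4.1:

```python
from collections import Counter

GROUP_FIELDS = ("scientific_name", "collection_date", "location",
                "country", "identified_by", "collected_by", "sample_accession")

def apply_group_threshold(records, threshold=50):
    """Drop every record belonging to a group with more than `threshold` members."""
    key = lambda r: tuple(r.get(f) for f in GROUP_FIELDS)
    counts = Counter(key(r) for r in records)          # pass 1: count each group
    return [r for r in records if counts[key(r)] <= threshold]  # pass 2: filter
```

Note this needs all records in memory (or a prior counting pass), which touches the record-by-record concern raised above.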

@thomasstjerne
Contributor Author

The only field that could be stored in an extension is sample_accession. Instead, I added it to the grouping in step 4.1, which will produce occurrences for each unique materialSampleID.
We can introduce an extension at a later stage, as this would require an additional API call per sequence.
