Implement service to merge duplicate GRSciColl entities #255

Closed
marcos-lg opened this issue Nov 12, 2020 · 4 comments
@marcos-lg
Contributor

marcos-lg commented Nov 12, 2020

Because of how GRSciColl was assembled, duplicate records exist. When duplicates are discovered, data managers need to be able to address them easily. A service should be made that allows a data manager to logically delete a record while indicating that it is to be considered a duplicate of another entity.

This should do the following:

  1. Mark the record as deleted
  2. Set a field (to be defined) that indicates the record is a duplicate of another
  3. Copy the identifiers from the record onto the new record so things resolve to the preferred record
  4. Copy the iDigBio machine tags from the record onto the new record so things resolve to the preferred record
  5. Add the UUID of the entity to delete as an identifier on the entity to keep (with the type "UUID").
  6. The code(s) of the deleted entity will be added as "alternative code" to the replacement. Alternative codes of the deleted entity will be migrated to the entity to keep too
  7. Update the explicit mappings to link occurrence records to GRSciColl entities
  8. Copy over information that would be lost when removing these duplicates (such as the description, when available). For fields that are lists, we merge the lists from both entities. Tags and comments are not migrated
  9. For a merge of institutions, move the collections to the entity to keep
  10. Update the primary institution and primary collection of the affected persons
  11. Merge contacts
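
A minimal sketch of steps 1, 2, 3, 5, 6 and 8 above, written in Python for illustration only; it is not the registry implementation. The `Entity` class, its field names and `merge_into` are hypothetical, and `replaced_by` is an assumed name for the field the issue leaves "to be defined".

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Entity:
    # Hypothetical, simplified model of an institution or collection.
    key: str
    code: str
    description: Optional[str] = None
    alternative_codes: List[str] = field(default_factory=list)
    identifiers: List[Tuple[str, str]] = field(default_factory=list)  # (type, value)
    deleted: bool = False
    replaced_by: Optional[str] = None  # assumed field name


def _append_missing(target: list, items) -> None:
    """Append the items that are not already present, preserving order."""
    for item in items:
        if item not in target:
            target.append(item)


def merge_into(duplicate: Entity, keep: Entity) -> None:
    """Merge `duplicate` into `keep`, mutating both in place."""
    # Steps 3 and 5: copy identifiers and record the old UUID as an identifier.
    _append_missing(keep.identifiers, duplicate.identifiers)
    _append_missing(keep.identifiers, [("UUID", duplicate.key)])
    # Step 6: codes of the duplicate become alternative codes of the kept entity.
    old_codes = [duplicate.code] + duplicate.alternative_codes
    _append_missing(keep.alternative_codes, [c for c in old_codes if c != keep.code])
    # Step 8: keep information that would otherwise be lost.
    if keep.description is None:
        keep.description = duplicate.description
    # Steps 1 and 2: logical delete plus a pointer to the replacement.
    duplicate.deleted = True
    duplicate.replaced_by = keep.key
```

The duplicate is never physically removed here, so it remains available as a record of what was merged.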

There are some preconditions that have to be met in order to do the merge:

  • If both entities have an IH_IRN identifier the service will return an error. This is because we wouldn't know how to sync them with IH: if we move the identifier to the replacement this entity will be synced with 2 IH entities and the second sync will overwrite the first one; if we don't move it, then the next IH sync will create a new entity for that IRN, hence the replacement would be useless.
  • For a merge of collections only: both collections have to belong to the same institution.
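
The two preconditions above could be checked along these lines. This is a sketch under assumptions: `check_merge_preconditions` and its parameters are hypothetical names, identifiers are modelled as `(type, value)` tuples, and the real service may report errors differently.

```python
from typing import List, Optional, Tuple

Identifier = Tuple[str, str]  # (type, value), a simplified stand-in


def has_ih_irn(identifiers: List[Identifier]) -> bool:
    """True if any identifier has the type IH_IRN."""
    return any(id_type == "IH_IRN" for id_type, _ in identifiers)


def check_merge_preconditions(
    duplicate_identifiers: List[Identifier],
    replacement_identifiers: List[Identifier],
    is_collection: bool = False,
    duplicate_institution_key: Optional[str] = None,
    replacement_institution_key: Optional[str] = None,
) -> None:
    """Raise ValueError if the merge has to be rejected."""
    # Two IH_IRN identifiers cannot be reconciled with the IH sync.
    if has_ih_irn(duplicate_identifiers) and has_ih_irn(replacement_identifiers):
        raise ValueError("both entities have an IH_IRN identifier")
    # Collections can only be merged within the same institution.
    if is_collection and duplicate_institution_key != replacement_institution_key:
        raise ValueError("collections belong to different institutions")
```
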
marcos-lg added the "GRSciColl" label (Issues related to institutions, collections and staff) Nov 12, 2020
marcos-lg self-assigned this Nov 12, 2020
timrobertson100 changed the title from "Add replacedBy field to GRSciColl entities" to "Implement service to merge duplicate GRSciColl entities" Nov 12, 2020
marcos-lg added a commit that referenced this issue Nov 27, 2020
@marcos-lg
Contributor Author

marcos-lg commented Dec 1, 2020

The endpoint is available now in UAT to receive POST requests at:

http://api.gbif-uat.org/v1/grscicoll/institution/{key}/merge

In the body we need to pass the replacement entity key in JSON format:

{
    "replacementEntityKey": "dd155a13-33da-46be-9f6e-07809d2ab5ab"
}

Authentication is required and the user needs to be a GRSciColl admin or editor.

Curl request example:

curl -u username:password -X POST 'api.gbif-uat.org/v1/grscicoll/institution/b779903b-f02b-45f2-91ea-b3b28a0c408e/merge' \
--header 'Content-Type: application/json' \
--data-raw '{
    "replacementEntityKey": "dd155a13-33da-46be-9f6e-07809d2ab5ab"
}'
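
For scripted use, the same request could be built with Python's standard library. This is a sketch, not an official client: `build_merge_request` is a hypothetical helper, the credentials are placeholders, and the request is only constructed here (passing it to `urllib.request.urlopen` would send it).

```python
import base64
import json
import urllib.request


def build_merge_request(base_url, entity_key, replacement_key, username, password):
    """Build (but do not send) the POST request for the institution merge endpoint."""
    body = json.dumps({"replacementEntityKey": replacement_key}).encode("utf-8")
    # HTTP Basic authentication, equivalent to curl's -u username:password.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/grscicoll/institution/{entity_key}/merge",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


request = build_merge_request(
    "http://api.gbif-uat.org/v1",
    "b779903b-f02b-45f2-91ea-b3b28a0c408e",
    "dd155a13-33da-46be-9f6e-07809d2ab5ab",
    "username",
    "password",
)
```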

@marcos-lg
Contributor Author

marcos-lg commented Dec 8, 2020

To be changed:

  • Don't move the identifiers, just duplicate them and filter the deleted ones in the lookup and in the identifier resolver

@ManonGros
Contributor

Some additional reasoning and thoughts about keeping the IDs associated with the old entry (from Skype conversation):

  1. the only reason was to keep the original record intact, so that we have a view of what it looked like at the time it was replaced / merged (a snapshot of how the past looked),
  2. the LSID, GRSCICOLL etc identifiers would be copied across to the new entity (the UUID of the old entry is added as an ID on the new entry),
  3. all the lookup services would add the equivalent of "AND replaced_by IS NULL" to not return replaced entities,
  4. If someone used a UUID in an occurrence record that pointed to a replaced GRSciColl entity, we'd overwrite that with the latest UUID. For now, the best practice would be to check whether an entry is linked to records before merging or deleting it, and to contact the data publisher (we can use this type of query: http://api.gbif.org/v1/occurrence/search?institutionKey=75f50140-830d-4630-a290-3d6e951a7c29&facet=datasetKey&limit=0&facetLimit=50). We don't expect to have many of these use cases (if any); we will consider adding a flag in the future if needed.
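
To illustrate point 3, here is a small sketch of a lookup that skips replaced entities, using an in-memory SQLite database. The `institution` table, its columns and the function names are hypothetical and do not reflect the real registry schema; only the `replaced_by IS NULL` condition comes from the discussion above.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a throwaway database with one replaced and one current entity."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE institution (entity_key TEXT PRIMARY KEY, code TEXT, replaced_by TEXT)"
    )
    conn.executemany(
        "INSERT INTO institution VALUES (?, ?, ?)",
        [("ABC", "HERB", "XYZ"), ("XYZ", "HERB", None)],
    )
    return conn


def lookup_by_code(conn: sqlite3.Connection, code: str) -> list:
    """Return the keys of entities matching `code`, ignoring replaced ones."""
    rows = conn.execute(
        "SELECT entity_key FROM institution "
        "WHERE code = ? AND replaced_by IS NULL ORDER BY entity_key",
        (code,),
    )
    return [row[0] for row in rows]
```

With this filter both rows stay in the table, but only the replacement is ever returned by a lookup.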

The type of use case we are trying to address:

Monday: Entity ABC exists, records are connected through collectionID=ABC

Tuesday: ABC is merged into XYZ

  • Record1 is reprocessed, and has collectionID=XYZ ("altered" shown on webpage)
  • gbif.org/grscicoll/ABC shows "this is merged into XYZ"
  • gbif.org/grscicoll/XYZ has latest metadata
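
The reprocessing step in this use case could look roughly like the sketch below. `resolve_current_key` is a hypothetical helper, the `replaced_by` mapping stands in for whatever the registry stores, and following a chain of several merges is an assumption that the thread does not discuss.

```python
from typing import Dict, Tuple


def resolve_current_key(key: str, replaced_by: Dict[str, str]) -> Tuple[str, bool]:
    """Return (current_key, altered) for a key found on an occurrence record.

    `replaced_by` maps the key of a merged entity to the key of its replacement.
    `altered` is True when the key on the record had to be rewritten.
    """
    current = key
    seen = {current}
    while current in replaced_by:
        current = replaced_by[current]
        if current in seen:
            # Guard against inconsistent data; a merge should never loop back.
            raise ValueError(f"cycle detected while resolving {key}")
        seen.add(current)
    return current, current != key


# Tuesday's state: ABC has been merged into XYZ.
merges = {"ABC": "XYZ"}
record_key, altered = resolve_current_key("ABC", merges)
```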

marcos-lg added a commit that referenced this issue Dec 10, 2020
marcos-lg added a commit that referenced this issue Dec 11, 2020
@marcos-lg
Contributor Author

I also disallowed merging two iDigBio entities and restricted the endpoint to GRSCICOLL_ADMINS users only.
