Implement service to merge duplicate GRSciColl entities #255

marcos-lg · 2020-11-12T11:15:43Z

Because of how GRSciColl was assembled, duplicate records exist. When duplicates are discovered, data managers need to be able to address the issue easily. A service should be made that allows a data manager to logically delete a record, indicating that it is to be considered a duplicate of another entity.

This should do the following:

Mark the record as deleted
Set a field (to be defined) that indicates the record is a duplicate of another
Copy the identifiers from the record onto the replacement record so things resolve to the preferred record
Copy the iDigBio machine tags from the record onto the replacement record so things resolve to the preferred record
Add the UUID of the entities to delete as an identifier to the entities to keep (with the type "UUID").
Code(s) of the deleted entity will be added as "alternative code" to the replacement. Alternative codes of the deleted entity will be migrated to the entity to keep too
Update the explicit mappings to link occurrence records to GRSciColl entities
Copy over information that would be lost when removing these duplicates (such as description when available). For the fields that are lists we merge the lists from both entities. Tags and comments are not migrated
For a merge of institutions, move the collections to the entity to keep
Update the primary institution and primary collection of the affected persons
Merge contacts
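The core of the steps above can be sketched roughly as follows. This is only an illustration, not the registry implementation: entities are modelled as plain dicts, and the field names (`key`, `code`, `alternative_codes`, `identifiers`, `machine_tags`, `replaced_by`, `deleted`) are assumptions made up for the example. It copies all machine tags rather than only the iDigBio ones, and leaves out occurrence mappings, collections, persons and contacts.

```python
from copy import deepcopy


def merge_entities(duplicate, replacement):
    """Return updated copies of (duplicate, replacement); inputs are not modified."""
    dup, keep = deepcopy(duplicate), deepcopy(replacement)

    # Logically delete the duplicate and record which entity replaced it.
    dup["deleted"] = True
    dup["replaced_by"] = keep["key"]

    # Copy identifiers and machine tags so they also resolve to the kept entity.
    # (Simplification: every machine tag is copied, not only the iDigBio ones.)
    for field in ("identifiers", "machine_tags"):
        target = keep.setdefault(field, [])
        for item in dup.get(field, []):
            if item not in target:
                target.append(item)

    # The UUID of the duplicate becomes an identifier of type "UUID".
    uuid_identifier = {"type": "UUID", "value": dup["key"]}
    if uuid_identifier not in keep["identifiers"]:
        keep["identifiers"].append(uuid_identifier)

    # Code and alternative codes of the duplicate become alternative codes.
    alternative_codes = keep.setdefault("alternative_codes", [])
    for code in [dup.get("code"), *dup.get("alternative_codes", [])]:
        if code and code != keep.get("code") and code not in alternative_codes:
            alternative_codes.append(code)

    # Keep information that would otherwise be lost, e.g. the description.
    if not keep.get("description") and dup.get("description"):
        keep["description"] = dup["description"]

    return dup, keep
```

Working on copies keeps the sketch free of side effects, so the result can be compared with the original entities before anything is persisted.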

There are some preconditions that have to be met in order to do the merge:

If both entities have an IH_IRN identifier the service will return an error. This is because we wouldn't know how to sync them with IH: if we move the identifier to the replacement this entity will be synced with 2 IH entities and the second sync will overwrite the first one; if we don't move it, then the next IH sync will create a new entity for that IRN, hence the replacement would be useless.
For a merge of collections only: both collections have to belong to the same institution.
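A minimal sketch of how these two preconditions could be checked, again on plain dicts. The function name, the `institution_key` field and the use of `ValueError` are assumptions for illustration; the real service may report the error differently.

```python
def check_merge_preconditions(duplicate, replacement, entity_type):
    """Raise ValueError when the merge has to be rejected."""

    def has_ih_irn(entity):
        return any(
            identifier.get("type") == "IH_IRN"
            for identifier in entity.get("identifiers", [])
        )

    # Two IH_IRN identifiers cannot be reconciled with the IH sync.
    if has_ih_irn(duplicate) and has_ih_irn(replacement):
        raise ValueError("both entities have an IH_IRN identifier")

    # Collections can only be merged within the same institution.
    if entity_type == "collection" and (
        duplicate.get("institution_key") != replacement.get("institution_key")
    ):
        raise ValueError("collections belong to different institutions")
```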

marcos-lg · 2020-12-01T12:36:18Z

The endpoint is available now in UAT to receive POST requests at:

http://api.gbif-uat.org/v1/grscicoll/institution/{key}/merge

In the body we need to pass the replacement entity key in JSON format:

{
    "replacementEntityKey": "dd155a13-33da-46be-9f6e-07809d2ab5ab"
}

Authentication is required and the user needs to be a GRSciColl admin or editor.

Curl request example:

curl -u username:password -X POST 'api.gbif-uat.org/v1/grscicoll/institution/b779903b-f02b-45f2-91ea-b3b28a0c408e/merge' \
--header 'Content-Type: application/json' \
--data-raw '{
    "replacementEntityKey": "dd155a13-33da-46be-9f6e-07809d2ab5ab"
}'
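For scripted use, the same request could be built with the Python standard library along these lines. The helper name `build_merge_request` is made up for this example; it assumes HTTP Basic authentication, which is what `curl -u` sends, and it only constructs the request without sending it.

```python
import base64
import json
import urllib.request


def build_merge_request(entity_key, replacement_key, username, password,
                        base_url="http://api.gbif-uat.org/v1/grscicoll"):
    """Build (but do not send) the POST request for merging an institution."""
    body = json.dumps({"replacementEntityKey": replacement_key}).encode("utf-8")
    credentials = base64.b64encode(
        f"{username}:{password}".encode("utf-8")
    ).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/institution/{entity_key}/merge",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would actually perform the merge, so that should only be done against UAT with keys that are meant to be merged.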

marcos-lg · 2020-12-08T12:17:38Z

To be changed:

Don't move the identifiers, just duplicate them and filter the deleted ones in the lookup and in the identifier resolver

ManonGros · 2020-12-09T13:15:02Z

Some additional reasoning and thoughts about keeping the IDs associated with the old entry (from Skype conversation):

the only reason was to keep the original record intact so we had a view of what it looked like at the time it was replaced / merged (it is a snapshot of how the past looked),
the LSID, GRSCICOLL etc identifiers would be copied across to the new entity (the UUID of the old entry is added as an ID on the new entry),
all the lookup services would add the equivalent of "AND replaced_by IS NULL" to not return replaced entities,
If someone used a UUID in an occurrence record that pointed to a replaced GRSciColl entity, we'd overwrite that with the latest UUID. For now the best practise would be to check if an entry is linked to records before merging/deleting it and contact the data publisher (we can use this type of query: http://api.gbif.org/v1/occurrence/search?institutionKey=75f50140-830d-4630-a290-3d6e951a7c29&facet=datasetKey&limit=0&facetLimit=50). We don't expect to have many of these use cases (if any), we will consider adding a flag in the future if needed.
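The "AND replaced_by IS NULL" filter mentioned above can be illustrated with an in-memory SQLite table. The table and column names here (`collection`, `entity_key`, `code`, `replaced_by`) are invented for the example and are not the registry schema.

```python
import sqlite3


def create_demo_db():
    """Create an in-memory table with one replaced and one current collection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE collection ("
        "entity_key TEXT PRIMARY KEY, code TEXT, replaced_by TEXT)"
    )
    conn.executemany(
        "INSERT INTO collection VALUES (?, ?, ?)",
        [("ABC", "HERB", "XYZ"), ("XYZ", "HERB", None)],
    )
    return conn


def lookup_by_code(conn, code):
    """Return keys of collections matching the code, skipping replaced ones."""
    rows = conn.execute(
        "SELECT entity_key FROM collection "
        "WHERE code = ? AND replaced_by IS NULL ORDER BY entity_key",
        (code,),
    )
    return [row[0] for row in rows]
```

With the demo data, a lookup for the shared code returns only the entity that has not been replaced, even though the identifiers stay on both records.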

The type of use case we are trying to address:

Monday: Entity ABC exists, records are connected through collectionID=ABC

Tuesday: ABC is merged into XYZ

Record1 is reprocessed, and has collectionID=XYZ ("altered" shown on webpage)

gbif.org/grscicoll/ABC shows "this is merged into XYZ"

gbif.org/grscicoll/XYZ has latest metadata
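The reprocessing step in this use case amounts to resolving a key to the entity that replaced it. A small sketch, assuming the replacements are available as a mapping from replaced key to replacement key (the function and the mapping are hypothetical, and following chains of replacements is an assumption not stated in this thread):

```python
def resolve_current_key(key, replaced_by):
    """Follow replaced_by links until an entity that was not replaced is found."""
    seen = set()
    while replaced_by.get(key) is not None:
        if key in seen:
            raise ValueError("cycle in replaced_by links")
        seen.add(key)
        key = replaced_by[key]
    return key
```

For the example above, `resolve_current_key("ABC", {"ABC": "XYZ"})` gives `"XYZ"`, while a key that was never replaced is returned unchanged.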

marcos-lg · 2020-12-11T11:39:36Z

I also disallowed merging two iDigBio entities and restricted the endpoint to GRSCICOLL_ADMINS users only.

marcos-lg added the GRSciColl Issues related to institutions, collections and staff label Nov 12, 2020

marcos-lg self-assigned this Nov 12, 2020

timrobertson100 changed the title ~~Add replacedBy field to GRSciColl entities~~ Implement service to merge duplicate GRSciColl entities Nov 12, 2020

ManonGros mentioned this issue Nov 19, 2020

Implement a service to deal with Institutions that should be collections #259

Closed

marcos-lg added a commit that referenced this issue Nov 27, 2020

#255 service to merge 2 grscicoll entities

156065c

marcos-lg added a commit that referenced this issue Nov 27, 2020

#255 added some preconditions

b495e80

marcos-lg added a commit that referenced this issue Dec 1, 2020

#255 #259 changed parameter names to make it clearer

e59d754

marcos-lg added a commit that referenced this issue Dec 1, 2020

#255 #259 fix content type + check extra condition in merge

067c007

marcos-lg added a commit that referenced this issue Dec 3, 2020

#255 #259 set modified and modifiedBy + moving collections when trans…

4bc1d84

…forming an institution to a collection

ManonGros mentioned this issue Dec 4, 2020

GrSciColl admin console - deduplication gbif/registry-console#353

Closed

marcos-lg added a commit that referenced this issue Dec 10, 2020

#255 keeping identifiers, machine tags and occurrence mappings in the…

db02361

… replaced entity

marcos-lg added a commit that referenced this issue Dec 11, 2020

#255 disallow to merge or convert idigbio entities + only admins can …

54c27cb

…use these endpoints

marcos-lg added a commit that referenced this issue Dec 15, 2020

#255 service to merge 2 grscicoll entities

391d349

marcos-lg added a commit that referenced this issue Dec 15, 2020

#255 added some preconditions

511f045

marcos-lg added a commit that referenced this issue Dec 15, 2020

#255 #259 changed parameter names to make it clearer

a510366

marcos-lg added a commit that referenced this issue Dec 15, 2020

#255 #259 fix content type + check extra condition in merge

9706292

marcos-lg added a commit that referenced this issue Dec 15, 2020

#255 #259 set modified and modifiedBy + moving collections when trans…

e596e99

…forming an institution to a collection

marcos-lg added a commit that referenced this issue Dec 15, 2020

#255 keeping identifiers, machine tags and occurrence mappings in the…

2619e4e

… replaced entity

marcos-lg added a commit that referenced this issue Dec 15, 2020

#255 disallow to merge or convert idigbio entities + only admins can …

710f9ae

…use these endpoints

marcos-lg closed this as completed Dec 17, 2020

ManonGros mentioned this issue Jan 19, 2021

Nested institutions in GRSciColl #285

Open

ManonGros mentioned this issue Feb 11, 2021

Updated version of GrSciColl permissions - roles and scopes #310

Closed
