Define and implement the GRSciColl master data management solution #319

ManonGros · 2021-02-25T07:30:49Z

There are potentially multiple sources of truth for the metadata in the catalogue, and these need to be reconciled; a problem known as master data management. For example, we may have information available in a dataset metadata description, an existing GRSciColl entry and an Index Herbariorum record.

Define, implement and document the approach taken by the catalogue for handling differing views of metadata.

An approach could be as follows:

For each institution and collection entry in the catalogue, a single source of truth is identified for the key metadata (title, description, etc.). This may be one of:
- An entry from Index Herbariorum, or another system that is automatically integrated through harvesting (see also Explore synchronisation with the NCBI BioCollections #307 and Enable CETAF as a master source of information #322)
- Metadata for a dataset registered in GBIF, i.e. an EML file (see Allow Master record to be eml file for collection #305)
- An entry made directly into the catalogue through the user interface, or pushed through the API by an application (e.g. a collection management system)

The core metadata is never changed in GRSciColl for externally sourced entities; edits must be applied in the system providing the master record.

The entries in GRSciColl may be enriched with the following fields:
- Additional identifiers

ManonGros · 2021-08-24T11:16:19Z

What we want:

Every GRSciColl entry, whether institution or collection, will have a source of truth (master record) with a type.
There will be four types (maybe more at some point): GRSciColl (meaning the entry is maintained in GRSciColl), GBIF Registry (the information comes from a dataset metadata or a publisher page), IH and CETAF.
The type will be associated with an identifier or some way to retrieve the information needed to update the record.
MachineTags will most likely be used to capture that information (source and type of source of information).
NB: Users won’t handle the machineTags directly, we will need a wrapper.
It should be clear to users what the source of truth is.
The UI should allow editors, mediators, etc. to select a source of truth.
When the source of truth is chosen, the UI should show what can still be edited in the GRSciColl registry and what will be overwritten by future synchronization. Ideally, this information (which fields can or cannot be edited) will be captured in the backend.
We would presumably need a "Create a collection" based on a source. Something along the lines of "create a collection using this dataset".
For datasets as sources of truth, dataset ingestion should trigger a GRSciColl update. When working on CETAF, we will need some sort of crawler.
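
To make the requirements above concrete, here is a minimal sketch in Python of a typed source of truth and the "which fields can still be edited" check. All names (`MasterSourceType`, `MasterSource`, `LOCKED_FIELDS`, `editable_fields`) and the locked-field sets are hypothetical illustrations; the thread does not specify the backend model or which fields each source overwrites.

```python
from dataclasses import dataclass
from enum import Enum


class MasterSourceType(Enum):
    """The four source-of-truth types listed above."""
    GRSCICOLL = "GRSCICOLL"
    GBIF_REGISTRY = "GBIF_REGISTRY"
    IH = "IH"
    CETAF = "CETAF"


@dataclass(frozen=True)
class MasterSource:
    """Links a GRSciColl entry to its source of truth."""
    source_type: MasterSourceType
    source_id: str  # identifier used to retrieve the master record again


# Hypothetical example values: fields that a synchronisation would overwrite.
_SYNCED = frozenset({"name", "description", "homepage"})
LOCKED_FIELDS = {
    MasterSourceType.GRSCICOLL: frozenset(),  # maintained in GRSciColl itself
    MasterSourceType.GBIF_REGISTRY: _SYNCED,
    MasterSourceType.IH: _SYNCED,
    MasterSourceType.CETAF: _SYNCED,
}


def editable_fields(all_fields, master):
    """Return the fields a user may still edit for the given master source."""
    locked = LOCKED_FIELDS[master.source_type]
    return sorted(set(all_fields) - locked)
```

With this shape, a UI could call `editable_fields` to grey out the fields that the next synchronisation would overwrite.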

What we don’t want or don’t need:

Right now IH synch generates new entries when a new institution is added to IH. We don’t want to do the same for the other sources of truth. We briefly discussed making creation suggestions but given the low number of records in CETAF, it would just be easier if someone manually created entries in GRSciColl (from CETAF).
No need to work on NCBI BioCollections synch for now. The most requested sources have been GBIF datasets/publisher and CETAF. We will focus on that.
Given how complicated mapping the sources to GRSciColl is (might require transformation), we cannot have a configuration mapping file. But it would be nice to have the mapping available or documented somewhere that I can check.

Where we start:

We will first only focus on GBIF datasets and publisher links. This will allow us to iron out the details in a system we know. CETAF will come later.
Tim and Marcos will figure out how to set up the backend for this to happen.

ManonGros · 2021-08-24T11:47:06Z

Attempt at mapping fields:

| Collection fields | Dataset metadata fields |
| --- | --- |
| name | title |
| description | description |
| homepage | ~~dataset DOI?~~ `homepage` |
| catalogueURL | ~~link to occurrences?~~ |
| apiURL | ~~link GBIF API call to occurrences?~~ |
| preservationTypes | specimenPreservationMethod in `collections` (although we will need to map the terms too) |
| taxonomicCoverage | taxonomicCoverages (find a way to aggregate the data) or inferred from occurrences |
| geography | geographicCoverages (only the description part most likely) or inferred from occurrences |
| incorporatedCollections | name in `collections` |
| Active | default: True |
| identifier | ~~identifier in `collections`~~ + datasetDOI? |
| address | `publishingOrganization` address |
| city | `publishingOrganization` city |
| province | `publishingOrganization` province |
| postalCode | `publishingOrganization` postalCode |
| country | `publishingOrganization` country |
| contacts | contacts (we should probably refine the mapping here) |

NB: the Institution and Code, which are mandatory fields, cannot be inferred from the EML. The users will have to fill in those fields. ~~Perhaps we should also encourage the users to add a physical address?~~ We could infer the address from the publisher as Marcos mentioned below.
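
As an illustration of the collection mapping, here is a hedged Python sketch. The function name `collection_from_dataset`, its parameters, the `institutionKey` key and the dict shapes of `dataset` and `publisher` are assumptions made for the example, not the actual registry code. It follows the table: the institution and code must be supplied by the user, the URL fields are left empty, and the address comes from the publishing organization.

```python
ADDRESS_FIELDS = ("address", "city", "province", "postalCode", "country")


def collection_from_dataset(dataset, publisher, code, institution_key):
    """Build a draft collection (a plain dict) from dataset metadata.

    `code` and `institution_key` cannot be inferred from the EML,
    so the caller (ultimately the user) has to provide them.
    """
    if not code or not institution_key:
        raise ValueError("code and institution_key must be supplied by the user")

    sub_collections = dataset.get("collections") or []
    collection = {
        "name": dataset.get("title"),
        "description": dataset.get("description"),
        "homepage": dataset.get("homepage"),
        "catalogueURL": None,  # struck out in the mapping: left empty
        "apiURL": None,        # struck out in the mapping: left empty
        "incorporatedCollections": [
            c["name"] for c in sub_collections if c.get("name")
        ],
        "active": True,        # default from the mapping table
        "code": code,
        "institutionKey": institution_key,
    }
    # Address fields are taken from the publishing organization.
    for field in ADDRESS_FIELDS:
        collection[field] = publisher.get(field)
    return collection
```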

| Institution fields | Organization fields |
| --- | --- |
| name | title |
| description | description |
| homepage | homepage |
| phone | phone |
| email | email |
| catalogueURL | ~~link to occurrences?~~ |
| apiURL | ~~link GBIF API call to occurrences?~~ |
| latitude | latitude |
| longitude | longitude |
| logoUrl | logoUrl |
| address | address |
| city | city |
| province | province |
| postalCode | postalCode |
| country | country |
| Active | default: True |
| contacts | contacts (we should probably refine the mapping here) |

NB: Same comment about codes as for collection.

MortenHofft · 2021-08-25T04:40:09Z

collection homepage
Perhaps just use the dataset homepage (the field homepage)

collection identifiers
I'm not sure we can use the collections.identifiers for much. At least some curation would be needed. Below is a sample of how they are used.

{
"key": "FMB",
"doc_count": 46
},
{
"key": "IAvH-CT",
"doc_count": 37
},
{
"key": "IAvH-A",
"doc_count": 29
},
{
"key": "IAvH-E",
"doc_count": 24
},
{
"key": "4ec2b246-f5fa-4b90-9a8d-ddafc2a3f970",
"doc_count": 21
},
{
"key": "Registro Nacional de Colecciones Biológicas: 207",
"doc_count": 19
},
{
"key": "Registro Nacional de Colecciones Biológicas: 3",
"doc_count": 19
},
{
"key": "IAvH-Am",
"doc_count": 18
},
{
"key": "IAvH-R",
"doc_count": 18
},
{
"key": "Registro Nacional de Colecciones Biológicas: 158",
"doc_count": 17
}

MortenHofft · 2021-08-25T05:09:53Z

Also, should it perhaps be possible to map dataset => institution?
E.g. https://www.gbif.org/dataset/288e1f4c-7c09-4604-ad19-920a61c55462 seems to be an institution. They talk about their collectionS in plural.

And they list their collections
https://api.gbif.org/v1/dataset/288e1f4c-7c09-4604-ad19-920a61c55462

UPDATE: in this case the publisher would be natural to use I guess. So perhaps no need after all :)
https://www.gbif.org/publisher/748bb006-8e16-4703-9936-8be1286aac30

MortenHofft · 2021-08-25T05:12:01Z

> taxonomicCoverage taxonomicCoverages (find a way to aggregate the data)

Perhaps we could fall back to occurrence metrics when/if it isn't filled?
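
A rough Python sketch of that fallback, under assumptions: `taxonomic_coverage` is a hypothetical helper, the nested `taxonomicCoverages` / `coverages` / `scientificName` shape is assumed for the dataset metadata, and `occurrence_taxa` stands for whatever list of names an occurrence metrics query would return.

```python
def taxonomic_coverage(dataset, occurrence_taxa):
    """Aggregate the dataset's taxonomic coverages into one string.

    Falls back to names derived from occurrences when the dataset
    metadata does not fill in any coverage.
    """
    names = []
    for coverage in dataset.get("taxonomicCoverages") or []:
        for entry in coverage.get("coverages") or []:
            name = entry.get("scientificName")
            if name and name not in names:  # keep first occurrence, drop repeats
                names.append(name)
    if names:
        return ", ".join(names)
    # Fallback: nothing usable in the metadata, use occurrence-derived names.
    return ", ".join(sorted(set(occurrence_taxa)))
```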

marcos-lg · 2021-08-25T10:30:08Z

For the collection-dataset mapping:

homepage -> I'd also use the dataset homepage
catalogueUrl and apiURL: shouldn't they point to the collection page and collection api instead of the occurrences?
maybe we can take the address from the publisher organization?

For the institution-organization mapping:

we could use the abbreviation field as code perhaps?
catalogueUrl and apiURL: same as for the other mapping

For both mappings, for the contacts I think we could check if the person exists in grscicoll and create a new person otherwise. It's not ideal since we'll be kind of duplicating people. If the person changes in the organization or the dataset, should we update it in grscicoll too? Or if it's deleted, do we still keep this person in grscicoll? With the current model that we have for persons I don't think there is a good solution unless we improve the model first.

ManonGros · 2021-08-25T11:32:48Z

The problem with inferring a collection's parent institution from a dataset title (or publisher) is that it might generate duplicates if the spellings are different than what we have in GRSciColl. Plus, what if there are several institutions in GRSciColl matching the same name? I think someone will have to check manually which institution should be the parent one, it cannot really happen automatically.

Concerning using occurrences and publisher to infer some content:
It depends on whether we will be using the EML only for synch or the published dataset (which will have the publisher).

Using the EML only would mean that we can have people link data from IPTs that aren't necessarily published on GBIF (like OBIS for example).
But using the GBIF dataset would probably be easier. Plus, we could infer:
- the taxonomic and geographic coverages from the occurrences
- infer the address from the publisher address

I don't think the abbreviation field is part of the become a publisher form so I doubt it will be filled very often. We probably cannot count on it very much.

I agree that we should first check if the contact exists in GRSciColl before creating a new one.
We should at least be able to update changes in contacts ("this person is now in charge of that" type of changes). Ideally, we should probably also update changes in email addresses, phones, etc. But I know that many datasets have the same contacts, so I can imagine some conflicts if the person is not updated everywhere. What would be possible?

The definition of the catalogueURL field we wrote is "If your specimens are digitized and available online, you can put here the link to access them".
For the apiURL, it is "Same as Catalogue URL, if your institution exposes its records via an API (relevant mainly for iDigBio entries)."
That's why I was suggesting to put the links to occurrences. Does it make sense? We can also leave those fields empty.

marcos-lg · 2021-08-25T13:45:48Z

> But I know that many datasets have the same contacts, I can imagine some conflicts if the person is not updated everywhere. What would be possible?

Yes, that can happen. This complicates things. If we don't want to have conflicts we'd have to "duplicate" all the contacts and keep a link between them so we know for sure to what grscicoll person they refer.
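
One way to picture the "check first, then create, and keep a link" idea is the Python sketch below. It is only an illustration: `sync_contact`, the `source_links` key and matching on a lower-cased email are assumptions for the example, not the GRSciColl person model.

```python
def sync_contact(persons, contact, source_link):
    """Find or create the GRSciColl person for a source contact.

    `persons` is a list of dicts and is modified in place.
    `source_link` identifies the contact in the source system, so that
    later synchronisations know which person it refers to.
    """
    email = (contact.get("email") or "").strip().lower()

    # 1. Already linked to this source contact: update the person in place.
    for person in persons:
        if source_link in person["source_links"]:
            person.update(name=contact.get("name"), email=email)
            return person

    # 2. Same email as an existing person: reuse it and record the link.
    if email:
        for person in persons:
            if person["email"] == email:
                person["source_links"].append(source_link)
                return person

    # 3. No match: create a new person linked to the source contact.
    person = {"name": contact.get("name"), "email": email,
              "source_links": [source_link]}
    persons.append(person)
    return person
```

Because the link is checked before the email, a change of email address in the source updates the linked person instead of creating a duplicate.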

> The definition of the catalogueURL field we wrote is "If your specimens are digitized and available online, you can put here the link to access them".
> For the apiURL, it is "Same as Catalogue URL, if your institution exposes its records via an API (relevant mainly for iDigBio entries)."
> That's why I was suggesting to put the links to occurrences. Does it make sense? We can also leave those fields empty.

I'm not sure. I guess some collections might have records in multiple datasets. We could have a link to the occurrences in the institution/collection page.

ManonGros · 2021-08-26T06:26:00Z

> I'm not sure. I guess some collections might have records in multiple datasets. We could have a link to the occurrences in the institution/collection page.

You are right, it gets a bit complicated. I think we should leave those empty by default and the users can always fill them in.

marcos-lg · 2021-10-18T13:12:05Z

As agreed with the others, we'll map the specimenPreservationMethod in the collections field of the dataset to the preservationTypes of a grscicoll collection like this:

| specimenPreservationMethod | preservationTypes |
| --- | --- |
| NO_TREATMENT | empty |
| ALCOHOL | SAMPLE_FLUID_PRESERVED |
| DEEP_FROZEN | STORAGE_FROZEN_BETWEEN_MINUS_132_AND_MINUS_196 |
| DRIED | SAMPLE_DRIED |
| DRIED_AND_PRESSED | SAMPLE_DRIED, SAMPLE_PRESSED |
| FORMALIN | SAMPLE_FLUID_PRESERVED |
| REFRIGERATED | STORAGE_REFRIGERATED |
| FREEZE_DRIED | SAMPLE_FREEZE_DRYING |
| GLYCERIN | SAMPLE_FLUID_PRESERVED |
| GUM_ARABIC | SAMPLE_FLUID_PRESERVED |
| MICROSCOPIC_PREPARATION | SAMPLE_SLIDE_MOUNT |
| MOUNTED | SAMPLE_OTHER |
| PINNED | SAMPLE_PINNED |
| OTHER | STORAGE_OTHER |
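
For reference, the mapping above can be expressed as a simple lookup. This is a sketch in Python rather than the deployed code; `PRESERVATION_MAPPING` and `to_preservation_types` are hypothetical names, and silently skipping unknown methods is an assumption of the example.

```python
# One dataset preservation method can map to zero, one or two types.
PRESERVATION_MAPPING = {
    "NO_TREATMENT": [],
    "ALCOHOL": ["SAMPLE_FLUID_PRESERVED"],
    "DEEP_FROZEN": ["STORAGE_FROZEN_BETWEEN_MINUS_132_AND_MINUS_196"],
    "DRIED": ["SAMPLE_DRIED"],
    "DRIED_AND_PRESSED": ["SAMPLE_DRIED", "SAMPLE_PRESSED"],
    "FORMALIN": ["SAMPLE_FLUID_PRESERVED"],
    "REFRIGERATED": ["STORAGE_REFRIGERATED"],
    "FREEZE_DRIED": ["SAMPLE_FREEZE_DRYING"],
    "GLYCERIN": ["SAMPLE_FLUID_PRESERVED"],
    "GUM_ARABIC": ["SAMPLE_FLUID_PRESERVED"],
    "MICROSCOPIC_PREPARATION": ["SAMPLE_SLIDE_MOUNT"],
    "MOUNTED": ["SAMPLE_OTHER"],
    "PINNED": ["SAMPLE_PINNED"],
    "OTHER": ["STORAGE_OTHER"],
}


def to_preservation_types(methods):
    """Map dataset preservation methods to collection preservation types.

    Duplicates are removed while keeping first-seen order; methods that
    are not in the table are skipped (an assumption of this sketch).
    """
    result = []
    for method in methods:
        for preservation_type in PRESERVATION_MAPPING.get(method, []):
            if preservation_type not in result:
                result.append(preservation_type)
    return result
```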

marcos-lg · 2022-01-12T17:06:33Z

Deployed to PROD.

ManonGros added the GRSciColl Issues related to institutions, collections and staff label Feb 25, 2021

ManonGros mentioned this issue May 17, 2021

GRSciColl - use machineTags instead of identifiers for IH synch #342

Closed

ManonGros mentioned this issue Aug 30, 2021

Changing the GRSciColl model for staff members and contacts #379

Closed

marcos-lg added a commit that referenced this issue Nov 10, 2021

Merge branch 'dev' into #319-master-data-management

0a501c8

marcos-lg added a commit that referenced this issue Nov 30, 2021

Merge branch 'dev' into #319-master-data-management

b7cff5f

marcos-lg added a commit that referenced this issue Nov 30, 2021

Merge branch 'dev' into #319-master-data-management

f93cce3

marcos-lg closed this as completed Jan 12, 2022

ManonGros mentioned this issue Aug 9, 2023

Update GBIF's EML profile (EML 2.2.0) gbif/eml-profile#5

Closed

ManonGros mentioned this issue Oct 18, 2023

grscicoll - Should we add institutionID to the EML or onto the record level? #531

Closed
