Data Citation Plan #103

Merged
merged 31 commits into from
Oct 4, 2019
6bd82e4
Create 0000-data-citation-plan.md
theathorn Aug 3, 2019
4cb5fe8
Tidy up prior to first review
theathorn Aug 9, 2019
f7d3352
Rephrased Summary
theathorn Sep 9, 2019
7881310
Revised Summary
theathorn Sep 9, 2019
17f7a30
Updated Phase 1
theathorn Sep 9, 2019
11a0596
Update Phase 2
theathorn Sep 9, 2019
6e5dc08
Update Phase 3
theathorn Sep 10, 2019
33fc6a9
Update external DOI website
theathorn Sep 10, 2019
2fe5890
Clarify meaning of DOI repositories
theathorn Sep 10, 2019
4800a11
Minor updates to DOI section
theathorn Sep 12, 2019
58c1037
Added Acceptance Criteria
theathorn Sep 12, 2019
0ee5be0
Stable URLs for projects only
theathorn Sep 13, 2019
b250e2a
Updated Unresolved Questions
theathorn Sep 13, 2019
098ab9f
Clarify "Release View" question
theathorn Sep 13, 2019
3dda07e
Minor grammar fixes
theathorn Sep 13, 2019
9f677f0
Added Shepherd
theathorn Sep 20, 2019
b2a0c38
Updates rarely affect primary data
theathorn Sep 28, 2019
f6b9a5d
Clarify decision process for DOI assigning entity
theathorn Sep 28, 2019
4931804
Add summary titles to the 3 phases
theathorn Sep 28, 2019
3b790ea
Clarify use of matrix output files
theathorn Sep 28, 2019
f4fae2a
Update User Stories and Acceptance Criteria
theathorn Sep 28, 2019
95f7071
Clarify Ingest DOI update suggestion
theathorn Sep 28, 2019
03f9647
Delete unused optional sections
theathorn Sep 28, 2019
da7512c
Remove unnecessary implementation note
theathorn Sep 28, 2019
f46165f
Change Data Release to Data Distribution
theathorn Oct 1, 2019
d6419fb
Define Authorized User
theathorn Oct 1, 2019
24fe761
Clarify "cited data" in Phase 2
theathorn Oct 1, 2019
95162c5
Remove dependencies on Data Distribution features
theathorn Oct 2, 2019
7ac7962
Remove all references to Data Dsitribution
theathorn Oct 3, 2019
c304fdb
Spelling corrections
theathorn Oct 3, 2019
dc4a19c
Approved as rfc14
theathorn Oct 4, 2019
148 changes: 148 additions & 0 deletions rfcs/text/0000-data-citation-plan.md
@@ -0,0 +1,148 @@
### DCP PR:

***Leave this blank until the RFC is approved** then the **Author(s)** must create a link between the assigned RFC number and this pull request in the format:*

`[dcp-community/rfc#](https://github.com/HumanCellAtlas/dcp-community/pull/<PR#>)`

# HCA DCP Data Citation Plan

## Summary

Data Contributors need to be able to cite data sets that they have contributed to the DCP.
Data Consumers need to be able to cite the DCP data that they have used in their research projects.
Providing citable records for the DCP will enable all contributors and consumers to reference the data stored in the DCP that they have used in their scientific publications.

## Author(s)

[Trevor Heathorn](mailto:theathor@ucsc.edu)

## Shepherd

[Trevor Heathorn](mailto:theathor@ucsc.edu)

## Motivation

There is currently no clear and agreed upon definition of the requirements for Data Citation in the DCP.
Key issues that this RFC seeks to resolve are:
- What is the minimum feature set required for a first release (Phase 1)?
- What is the discrete set of features that makes sense for subsequent releases (e.g. Phase 2, Phase 3, etc.)?
- In which phase, if any, is support for a formal digital object identifier ([DOI](https://en.wikipedia.org/wiki/Digital_object_identifier)) required?

### User Stories

1. As a data contributor (e.g. researcher with a pipette), it is essential to have a unique way for my project (consisting of primary and secondary analysis data files and associated metadata files) to be identified so that others can properly cite my work when using my data, and that this citation identifier is available in the DCP Data Portal. Note: This will discourage but not prevent others from using my data without attribution.
2. As a data consumer (e.g. researcher with a keyboard), I want to be able to view and share a unique citation identifier so that a reader of my manuscript can obtain the data needed to reproduce my results. Anyone can use the citation identifier to view and download all the original cited data and metadata files for a project from the DCP.
3. As a data consumer, I need a simple way to reference a project in the DCP so that I can fulfill the requirements of a Creative Commons attribution license (CC-BY).
Member

This is a side task rather than something relevant to the RFC, but I don't think we make the CC-BY license very obvious for our data. Most people won't know that the data is licensed like that.

4. As a data contributor or consumer, I need a way to use the citation identifier to access the output produced by the DCP Matrix Service for the data being cited.
5. As a data contributor or consumer, I want to be able to update my project and provide versioned views of the data being cited.

Member

Some of the design and acceptance criteria are framed around the idea of data releases created by the data operations team, or around data consumers creating arbitrary sets of data to support their work.

Neither of these concepts is reflected in the user stories.

Also, why would a data consumer need to be able to update a project? Someone who is solely a consumer won't own any projects, so won't have the rights to update anything.

Contributor Author

Updated User Stories #5 (Data Operations team) and #6 (User collections).
An (authorized) data consumer will be able to define their own "research project" as an arbitrary collection of data they have selected (e.g. "brain data from all projects"). They create a citation for that data. Later on, more brain data is added to the DCP and the data consumer creates a new version of the citation for their research project.

## Scientific "guardrails" [optional]

*Describe recommended or mandated review from HCA Science governance to ensure that the RFC addresses the needs of the scientific community.*

## Detailed Design

It is proposed to split the initial implementation into three phases:

Member

The phase descriptions are written assuming the reader can already see the main goal of each phase from the text above. For understandability it would be nice to give the three phases tl;dr-style titles or summary sentences stating the main goal; otherwise it leads to a lot of flitting backwards and forwards to understand what is being referenced.

If I have understood correctly I would suggest

Phase 1 - stable non versioned citable URLs
Phase 2 - versioned urls and support for data release citation
Phase 3 - Enabling citation of user defined data collections

Contributor

Am I correct that Phase 3 would be the fast path for a producer who wants to cite their own data?

Contributor Author

Added summary titles.
Phase 3 would enable producers and consumers to cite an arbitrary set of versioned data.

### Phase 1
This is designed to satisfy the minimal set of requirements for User Stories #1 through #4 by providing only "per-project" citations.

Are you sure a list of links is enough for cc-by attribution?

Contributor Author

@gabsie Can you comment on this as you raised the original requirement? Is the intent that a data consumer licenses their published work with CC-BY and includes URLs to the DCP in that publication? We aren't currently licensing DCP content with a CC-BY license, so what do we need to do in the DCP to satisfy this requirement?

Contributor Author

@lauraclarke Where is this CC-BY attribution going and who is creating it? Is a publication author putting the CC-BY license in their publication and linking to the DCP project page? Or is the DCP going to attach CC-BY licenses automatically to each project or to the site as a whole?

Member

So the cc-by attribution should likely be at the project level the same way as DOIs will be

If you look at figshare, https://figshare.com/articles/Malignant_Cancer_Cell_Nucleus/9751670
Zenodo https://zenodo.org/record/3363060#.XWoe8ZNKjOQ they both put it on the individual study/project pages

Please note this cc-by license is different from if we as the DCP chose to license the static content of our browser, that sounds like something we should discuss but not here

Contributor Author

Updated the Summary. I didn't alter the CC-BY User Story, assuming that scientific authors will include such a license in their publications. Should we be including a CC-BY license on each Project Detail page of the Data Browser? i.e. Are we saying the data files for each DCP project are licensed under CC-BY?

Contributor Author

@lauraclarke Can you answer the question in my previous comment? I think I may be misunderstanding something here.

Member

So I think we need to review how we support people understanding that the data is licensed using cc-by

I think the citation widget gives people a way to meet cc-by attribution needs

They will need to be able to add attribution whenever they reuse something; the project level seems a good starting point. Ultimately we might want a way for someone to give us any identifier and get an appropriate attribution text for that identifier in our system.

Member

A paper which might be useful in considering solutions https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0213090

The Data Browser project details page will add a "To cite this project please copy this link" item.
This "stable non-versioned project URL" will link back to the production site project page using the project's UUID (e.g. https://data.humancellatlas.org/explore/projects/cc95ff89-2e68-4a08-a234-480eca21ce79).
The URL refers to the "live" view of the project and is therefore subject to additions and updates (e.g. corrections) of data and metadata. However, these are expected to be infrequent and should not affect existing primary data.
A stable URL applies only to a project as a whole; there is no facility to provide separate citations for individual bundles or files within a project.
If an existing project is deleted and re-ingested then the cited project UUID would become invalid. If such re-ingestion occurs then a means will be provided to redirect the original "stable project URL" to the new version of the project (e.g. by providing a landing page which states something similar to "This project has been updated with corrected data and/or analyses. For the current version of this project click *here*", where *here* is a link to the new project page).
Note: Scientists are *already* citing such project based URLs in publications.
A formal DOI is *not* required for Phase 1.
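
As a minimal sketch of the Phase 1 behaviour described above, the following Python shows how a stable project URL could be derived from a project UUID, and how the UUID of a deleted and re-ingested project could be mapped to its replacement. The base URL is taken from the example above; the function names and the `REDIRECTS` table are hypothetical and are not part of any existing DCP API.

```python
import uuid

# Base of the project details page, taken from the example URL above.
PROJECT_BASE_URL = "https://data.humancellatlas.org/explore/projects"

# Hypothetical redirect table: old project UUID -> UUID assigned after re-ingestion.
REDIRECTS = {}


def citation_url(project_uuid: str) -> str:
    """Return the stable, non-versioned citation URL for a project."""
    # uuid.UUID raises ValueError for malformed input, so bad UUIDs fail early.
    canonical = str(uuid.UUID(project_uuid))
    return f"{PROJECT_BASE_URL}/{canonical}"


def resolve(project_uuid: str) -> str:
    """Follow re-ingestion redirects and return the current citation URL."""
    current = str(uuid.UUID(project_uuid))
    seen = set()
    while current in REDIRECTS:
        if current in seen:
            raise ValueError("redirect loop detected")
        seen.add(current)
        current = REDIRECTS[current]
    return citation_url(current)
```

In this sketch, a landing page for a retired UUID would display the notice quoted above and link to the URL returned by `resolve`.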

### Phase 2
This is designed to satisfy the data contributor requirements for User Story #5.
Member

The framing of user story 5 doesn't provide a strong connection to this phase description

Contributor Author

Added Data Release User Story.

This allows the Data Operations team to make a “Data Release” (i.e. a curated data set) in which the specific versions of projects within the Data Release itself are citable.
A Data Release must be immutable.
The Data Browser must provide users with access to each Data Release (i.e. immutable version of data) in addition to the “live/latest” view.
The Data Browser must be able to provide users with a means to download the data and metadata associated with a Data Release.
Since the Matrix Service does not support the ability to process older versions of input files, the per-project output files from the Matrix Service *must* be stored as part of the immutable data set.
Member

Do we know if this is an immovable characteristic of the Matrix Service? Never supporting older files seems a bold decision this early in the project, given there must be a risk of these files changing.

That said, having all processed outputs stored associated with an immutable release seems like a good idea even if the matrix service could still produce them

Contributor Author

Clarified Matrix output requirements.

The Data Store "collections" API provides a suitable means for recording the contents of a Data Release.
A formal DOI is required for this phase as it provides a standardized method for identifying specific versions of data objects.
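
One way to picture the immutability requirement is a release record that pins each project to a specific version and cannot be modified after creation. The sketch below is illustrative only: the `DataRelease` class, its fields, and the version strings are assumptions, and it does not call the Data Store "collections" API.

```python
from dataclasses import dataclass, FrozenInstanceError
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class DataRelease:
    """Hypothetical immutable record of the project versions in a Data Release."""
    name: str
    # Maps project UUID -> pinned project version (the version format is an assumption).
    project_versions: Mapping[str, str]

    def __post_init__(self):
        # Copy into a read-only mapping so later changes to the caller's dict
        # cannot alter the release.
        object.__setattr__(
            self, "project_versions", MappingProxyType(dict(self.project_versions))
        )

    def version_of(self, project_uuid: str) -> str:
        """Return the version of a project that this release pins."""
        return self.project_versions[project_uuid]
```

A later release would be a new `DataRelease` instance rather than an edit of an existing one.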

### Phase 3
This is designed to satisfy the data consumer requirements for User Story #5.
This also fully satisfies User Story #2 by allowing a data consumer to cite an arbitrary set of data which spans multiple projects.
The data consumer must be able to create a discrete collection of their selected data with the ability to update that collection (i.e. create a new version of the collection).
The Data Browser must provide users access to each version of their collections.
A data consumer must be able to share a citable DOI link to any of the collection versions that they have created.

### Implementation notes for DOI support
A DOI provides a link of the form https://doi.org/xxxx which resolves via the hosting [Registration Agency](https://www.doi.org/registration_agencies.html) or [open-access repository](https://en.wikipedia.org/wiki/Open-access_repository).
A DOI is the most commonly accepted means for citing documents and data in scientific publications.

It is an Unresolved Question as to whether a DOI may resolve to an external website (e.g. BioStudies, Zenodo, etc). Such an external web page could then provide a URL to the project in the Data Browser as well as the ability to store point-in-time copies (i.e. versions) of relevant files such as the download manifest, metadata tsv, matrix output file, etc.
The stored manifest file could then be used in the HCA CLI to download the exact versions of the data and metadata for that version of the project. It is a requirement that the DCP never deletes older versions of cited data files, except for the special case of retraction of unconsented data.

#### Possible DOI Registration Agencies and open-access repositories that provide the ability to assign DOIs:
- [BioStudies](https://www.ebi.ac.uk/biostudies/)
- [Zenodo](https://en.wikipedia.org/wiki/Zenodo)
- [Figshare](https://en.wikipedia.org/wiki/Figshare)
- [Crossref](https://en.wikipedia.org/wiki/Crossref)
- HCA DCP assigns its own DOIs by becoming a member of the International DOI Foundation (IDF)

#### DOI Versioning
Most open access repositories that provide DOI minting services appear to provide support for versioning (i.e. multiple versions of a single citation available e.g. via a drop-down selection), the ability to store accompanying files in the repository (this would be useful for storing versions of manifest and metadata tsv files, etc.), and the ability to store URL references (useful for linking back to a page in the Data Browser).
Examining how both Figshare and Zenodo implement DOIs, as the group minting DOIs controls everything after the prefix, it is possible to have a main DOI which points to the most recent version and then versioned URLs which point to specific versions.
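
The base-plus-version arrangement can be sketched as a small registry in which the base DOI always resolves to the newest version. The prefix `10.9999`, the `.vN` suffix scheme, and the `DoiRegistry` class are invented for illustration; since the minting group controls everything after the prefix, a real repository would apply its own suffix format.

```python
class DoiRegistry:
    """Toy model of a base DOI that tracks the latest of several versioned DOIs."""

    def __init__(self, prefix: str):
        self.prefix = prefix
        self._versions = {}  # base suffix -> list of target URLs, oldest first

    def mint(self, suffix: str, target_url: str) -> str:
        """Register a new version and return its versioned DOI."""
        targets = self._versions.setdefault(suffix, [])
        targets.append(target_url)
        return f"{self.prefix}/{suffix}.v{len(targets)}"

    def resolve(self, doi: str) -> str:
        """Resolve a base DOI to the latest version, or a versioned DOI to its own target."""
        prefix, _, suffix = doi.partition("/")
        if prefix != self.prefix:
            raise KeyError(doi)
        if suffix in self._versions:
            return self._versions[suffix][-1]
        base, sep, number = suffix.rpartition(".v")
        if sep and base in self._versions and number.isdigit():
            index = int(number) - 1
            if 0 <= index < len(self._versions[base]):
                return self._versions[base][index]
        raise KeyError(doi)
```

A publication that needs reproducibility would cite the versioned DOI, while the Data Browser could display the base DOI.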

#### Data Browser support for DOIs
The Data Browser could contain a reference to the base DOI for each project. Clicking on the DOI link would then redirect the user to the (external) DOI repository where the user could see all the versions of that project and download the associated manifest and metadata.tsv files for a specific version.

#### DOI Creation/Update Process
Embedding a DOI in the metadata (e.g. biostudies_doi) during ingest would provide per-project citability.
Provided the DOI repository supports versioning, the metadata DOI field could be updated automatically by Ingest to a new version whenever a project update is processed.
Using a DOI that links to an external repository that can store a manifest for each version of a project may be the simplest way to provide access to versions of the data for a project.

The creation/update process would perform the following steps:
- Ingest creates a new DOI (new project) or a new version of an existing DOI (updated project) when the submission is deemed “complete”.
Member

The word "complete" here is unclear. What is meant by complete? There are many ways to define that, more specific details here would be useful

Contributor Author

Attempted to clarify - OK now?

- Upload the corresponding metadata tsv and manifest file for this version of the project to the DOI repository (need to wait for these to be generated).
- Upload the matrix output file(s) for this version of the project to the DOI repository.
- Upload a project description to the DOI repository (e.g. what’s in this version?).
- Create a link in the DOI repository back to the project details page in the Data Browser.
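
The steps above can be sketched as a single function that runs them in order. Everything here is hypothetical: `InMemoryDoiRepository` is a stand-in for whichever external repository is chosen, the `10.9999` prefix is a placeholder, and no real repository or Ingest API is being called.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class InMemoryDoiRepository:
    """Stand-in for an external DOI repository; every method here is hypothetical."""
    records: Dict[str, List[dict]] = field(default_factory=dict)

    def create_version(self, doi: Optional[str], project_uuid: str) -> str:
        # New project: mint a placeholder DOI. Updated project: reuse the existing one.
        doi = doi or f"10.9999/{project_uuid}"
        self.records.setdefault(doi, []).append(
            {"files": {}, "description": "", "link": ""}
        )
        return doi

    def latest(self, doi: str) -> dict:
        return self.records[doi][-1]


def publish_project_version(repo, project_uuid, files, description, browser_url, doi=None):
    """Run the creation/update steps, in order, for one project submission."""
    # Step 1: create a DOI (new project) or a new version of an existing DOI.
    doi = repo.create_version(doi, project_uuid)
    record = repo.latest(doi)
    # Steps 2 and 3: store the metadata tsv, manifest, and matrix output files.
    record["files"].update(files)
    # Step 4: describe what is in this version.
    record["description"] = description
    # Step 5: link back to the project details page in the Data Browser.
    record["link"] = browser_url
    return doi
```

Passing the DOI returned by the first call into later calls models a project update: the DOI stays the same and a new version record is appended.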

### Acceptance Criteria [optional]

Member

Please add acceptance criteria for the three phases and success metrics that we can use to see if our solutions actually work for the community

Contributor Author

Added acceptance criteria for each phase - please review.

Contributor Author

@lauraclarke Are the acceptance criteria, err, acceptable?

Member

Thanks for the acceptance criteria, these look good but the phase 3 criteria seem disconnected from the user stories as they are written now. The user stories don't mention a data release

Contributor Author

Added Data Release User Story.

#### Phase 1
- A citation link is provided in the Data Browser for each project
- The citation link remains valid for the lifetime of the DCP
- The citation link is a URL which resolves to the latest version of the project details page for the specified project
- The citation URL is used in publications by scientists to cite DCP project data
Member

This is a great statement but it is a success metric (which we should track) rather than an acceptance criterion. This won't be true when the feature is released but we should make sure we measure how frequently people cite us so we can understand the success of the feature

Contributor Author

Removed "used in publications" from acceptance criteria.

#### Phase 2 (in addition to Phase 1 criteria)
- The Data Operations team is able to create an immutable citation reference for each project in a Data Release
- The citation reference consists of a versioned DOI
- The Data Operations team can update the version of the project's DOI in a later Data Release
Member

Do you mean project or release here? Why would the Data Operations team need to update an individual project's DOI?

Contributor Author

Clarified to explain that a project may be subject to updates between Data Releases.

- The Data Browser provides a means of downloading the data associated with a specific version of a project from a Data Release
Contributor

@jahilton Sep 30, 2019

This seems beyond the scope of Data Citation and into the realm of Data Distribution access. Would you consider removing this criterion from the current proposal? Or am I missing the connection?

Contributor Author

Clarified to state that users must be able to download the cited data that makes up a Data Distribution. Isn't it necessary that users can download cited data so they can reproduce experimental results? Just trying to make that explicit here.

Contributor

Yes, users must be able to download data from a Data Distribution, but it seems to be an unnecessary step when a simpler criterion would be "The Data Browser provides a means of downloading the data associated with a specific version of a project" (no matter where that version is mentioned).
I worry that adding unnecessary complexity is going to make implementation and acceptance more difficult to achieve. For instance, the current criterion relies on some functionality of Data Distribution, which has not been fully laid out yet.

Contributor Author

Removed most dependencies on Data Distribution features.

#### Phase 3 (in addition to Phase 1 and 2 criteria)
- An authorized user may create a citation for an arbitrary set of versioned data selected via the Data Browser
- An authorized user may update a previously versioned set of data to a new version and update the version of the corresponding citation
- An authorized user may share a citation which they have created with other users who can then access the specific version of the data referenced by the citation

### Unresolved Questions

Must a data citation that provides a DOI provide an *immutable* view of all the data and metadata associated with a project? Or is it acceptable for any of the following to undergo additions, updates or deletions over time?
- a project's primary data
- a project's secondary analysis outputs
- a project's expression matrix outputs, generated by the Matrix Service
- a project's metadata

For Phase 2 & 3 does the Data Browser need to provide a view of the cited data on which further faceted searches can be performed?
Currently the Data Browser provides only a view of the latest data and metadata. It may be desirable to add another "facet" such as "Release Version" which would enable the user to view a particular point-in-time Data Release and then perform further faceted searches within that view.

Can a DOI for the DCP resolve to an external website? The author(s) will work with the UX team to come up with pros and cons of using an external authority vs setting up the DCP as a DOI assigning entity and then, if that effort doesn't give us a clear answer, we will ask the Oversight Committee for their view.

### Drawbacks and Limitations [optional]

*Why should this RFC **not** be implemented?*

### Prior Art [optional]

Some emphasis on the use cases of programmatic consumers, who will likely be your highest-frequency users, would be good. Through the lens of current best practice, you have only indirectly (through BioStudies) mentioned compact identifiers, but these are critical and would help at all phases of delivery. Consideration of best practice should go into the prior art section.

DOI and URL have both been mentioned. These are relevant to publishers and a user with a keyboard. But compact identifiers are most frequently used to consume, map, link and share data and to my mind represent best practice. They are also the most useful way to discuss a project via word of mouth or via slack which is essential for developers, wranglers and data ops. Saying 'prof x's latest organ y dataset' is not scalable and uuids are not human readable or memorable, which increases the likelihood of miscommunication (missing use case). Within the EBI, these identifiers are used to make data interoperable but all major biological data institutions also embrace them. As a specific example the successful INSDC (sharing genomic data between NCBI, EBI and DDBJ) works by mapping these identifiers.

You will have come across identifiers that look like this in our metadata SRAxxxx, GEOxxxx, PMCxxxx, PRJxxxx... We use them to reference sources of data. This is the same use case as your downstream users. This single short identifier can be resolved similar to a DOI through identifiers.org, communicated in metadata easily and thanks to the resource specific prefix bakes in the source of the data. Bioinformaticians are used to dealing with these identifiers in this format.

As it stands uuids are not HCA specific. Thankfully no-one else (in our realm) seems to use them so we (at SCEA) guess these identifiers refer to a HCA dataset. It would be preferable to have an ID HCAxxxx to confirm this.

At the moment our metadata would look like this:
"data source" : ["ERP114453", "PRJEB31843", "c4077b3c-5c98-4d26-a614-246d12c2e5d7"]
Spot the odd one out? How do I resolve this if anyone other than the HCA starts to use uuids?

Other resources also support these identifiers for different granularities of the data. For example ENA provide prefixes to describe studies, experiments, samples and runs. This is very useful for downstream users like us who want to split and splice datasets but still need to refer to the provenance of the data.

So compact identifiers give you a familiar way to communicate granularity and source of data while easing programmatic or manual access. These identifiers are often used in the literature to refer to data rather than DOIs. If the plan is for BioStudies to mint these identifiers for HCA projects, I would ask the following:

  • This functionality is added ASAP. Once a referencing scheme is established it is difficult and costly to change. If uuids/URLs are referenced in upcoming papers the trend will quickly stick.
  • This identifier is in the metadata before a dataset is released so that it can always be relied upon downstream.
  • The prefix maintains provenance so it is clear the dataset is HCA
  • HCA is able to provide references to more granular components of a project (sample, assay etc) in the future.

Contributor Author

The project metadata does contain compact identifiers in the form of "project.insdc_project_accessions" (e.g. SRP078321), "project.insdc_study_accessions" (e.g. PRJNA328774), "project.array_express_accessions" (e.g. E-GEOD-84133), "project.geo_series_accessions" (e.g. GSE84133), and "project.biostudies_accessions".
@lauraclarke Are you on board with adding HCA-specific accessions (e.g. HCAxxxxx) to the project metadata and possibly to other objects (e.g. samples, assays)? Is biostudies_accessions ever populated (I couldn't find an example)? Noting that the INSDC/GEO/ArrayExpress accessions are not always populated for a project and don't provide the direct reference to data in the DCP that would be provided by an HCA compact identifier.

@hewgreen for programmatic consumers, are you expecting additions to an existing API or proposing a new service, and what functions would be needed?

Contributor

I hate to cite a paper I am myself an author on, but it is worth reading this article before deciding whether or not to create new accessions: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.2001414. This article also references many other useful papers on identifying resources in the life sciences.

Member

@tburdett beat me to it but to agree with him. I think we need to think through the consequences of adding another namespace to the identifiers within bioinformatics. There are upsides to do it but there are also costs and we need to ensure we understand them before we embark down that route.

@hewgreen Sep 24, 2019

@theathorn As you rightfully point out the compact identifiers you refer to are to secondary resources. They do not reflect HCA data (which can be novel, analysed or truncated). They are not always present and cannot be relied upon. Therefore, these do not address the requirements I laid out.

@tburdett the paper outlines the considerations perfectly. One interesting idea is human readable labels. Maybe these could solve some issues (not user provided titles). But overall I think it suggests we should accession in some cases and think carefully about identifier design. UUIDs maybe not being great. I recommend the paper to others. My interpretation of the 'lessons' are as follows:

1. Credit any derived content using its original identifier
We should create IDs for HCA's new data and new analysis because it's novel stuff. Source IDs are important.

2. Help local IDs travel well: Document prefix and patterns
Prefixes are super important. aka a uuid alone isn't useful.

3. Opt for simple, durable web resolution
We should get a resolution provider. That is why I mention identifiers.org, although DOIs sort of work for this, except that you have to follow them to work out what they are. Our URL prefixes are already nice and simple but the local IDs are long, cumbersome, and not human-repeatable.

4. Avoid embedding meaning or relying on it for uniqueness:
We don't want to pack meaning into our IDs. Fair point. I don't think anyone wants this but we should be careful not to.

5. Design new identifiers for diverse uses by others
The IDs need to be designed to be useful. UUIDs cannot be described as user friendly. We could design something much nicer.

8. Make URIs clear and findable & 10. Reference and display responsibly
Our prefixes aren't in the metadata as passed from the DSS. Should they be if they are the only way downstream consumers know a uuid belongs to the HCA?

The following lessons I think we have learned and are working towards already:
6. Implement a version-management policy
7. Do not reassign or delete identifiers
9. Document the identifiers you issue and use

Contributor

I have proposed a F2F session on Accessions, fyi, with a proposal to put them in use on HCA data.

*Share references to prior art to deepen community understanding of the RFC, such as learnings, adaptations from earlier designs, or community standards.*

### Alternatives [optional]

*Highlight other possible approaches to delivering the value proposed in this RFC.
What other designs were explored? What were their advantages? What was the rationale for rejecting alternatives?*