Convert DOI URLs in related_publications to related resources #1417

Merged
merged 1 commit into from
Mar 25, 2024
25 changes: 25 additions & 0 deletions dandi/metadata/util.py
@@ -11,6 +11,7 @@
 from dandischema import models
 import requests
 import tenacity
+from yarl import URL
 
 from .. import __version__
 from ..utils import ensure_datetime
@@ -583,6 +584,29 @@
return None


+def extract_related_resource(metadata: dict) -> list[models.Resource] | None:
+    pubs = metadata.get("related_publications")
+    if not isinstance(pubs, (list, tuple)):
+        return None
+    related = []
+    for v in pubs:
+        if not isinstance(v, str):
+            continue

Codecov / codecov/patch warning on line 594 in dandi/metadata/util.py: added line #L594 was not covered by tests
+        try:
+            u = URL(v)
+        except ValueError:
+            continue

Codecov / codecov/patch warning on line 598 in dandi/metadata/util.py: added lines #L597 - L598 were not covered by tests
+        if u.scheme not in ("http", "https") or u.host != "doi.org":
+            continue
+        related.append(
+            models.Resource(
+                identifier=v,
+                relation=models.RelationType.IsDescribedBy,
Member

I think this one is too specific: it assumes that any reference is the description of the data we find in the file.
All we can say is that the file references that publication, so I would rather go with

Suggested change
- relation=models.RelationType.IsDescribedBy,
+ relation=models.RelationType.References,

as the default. WDYT @bendichter @satra ?

Member

I see your point that we cannot necessarily infer the relationship between the data and the paper.

The DataCite definition for "References" is "indicates B is used as a source of information for A." Here, this would mean that the paper is used as a source of information for the Dandiset. I don't know if this really fits here.

I have been wondering what the best way is to associate papers with datasets using these relations. What are the different ways a paper and a dataset can be associated? So far, I have been using IsDescribedBy (indicates that A is described by B) for everything, assuming that the publication is the primary publication in which the data is introduced and described. Another type of relationship might be that a paper reuses a dataset. In this case, IsSupplementTo (indicates that A is a supplement to B) might be a better choice, since it applies to both. Are there other relationship types that describe how a paper and a Dandiset might be related? What are the different use cases here?

This is challenging in part because there is currently no aspect of the dandi schema that allows us to describe what the resource is (see issue here), so that needs to be inferred at least somewhat by the relation type.

Member Author

@bendichter Now that dandi/dandi-schema#231 has been resolved, how can we move this PR forwards?

Member

If I got it right, we still have not made up our minds about the default type of the relationship here. The most logical approach is: if we do not know the relationship, do not fill it in. But ATM our model does not seem to allow relation to be absent. Should we change that in the model? Or do you still feel comfortable using IsDescribedBy here, @bendichter?

Member

Yes, exactly. My preference would be to change the schema to make this optional. Now that we have a resource type I am much less reliant on this particular field, though I do appreciate the caveat that optional fields are very rarely populated. I would also be happy with defaulting to IsDescribedBy or IsSupplementTo.
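
For illustration only, the "leave it empty when unknown" option discussed here could look roughly like the sketch below. It uses a plain dataclass rather than the real dandischema model, so `RelationSketch`, `ResourceSketch` and `resource_for` are hypothetical names, and the enum values other than "dcite:IsDescribedBy" are assumed to follow the same "dcite:" pattern seen in the test JSON of this PR. An actual schema change may well look different.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RelationSketch(Enum):
    """Hypothetical subset of relation types mentioned in this thread."""

    IsDescribedBy = "dcite:IsDescribedBy"
    IsSupplementTo = "dcite:IsSupplementTo"  # assumed value, same prefix pattern
    References = "dcite:References"  # assumed value, same prefix pattern


@dataclass
class ResourceSketch:
    """Hypothetical stand-in for a related resource with an optional relation."""

    identifier: str
    # None means "the relationship is not known", instead of a guessed default.
    relation: Optional[RelationSketch] = None


def resource_for(
    url: str, relation: Optional[RelationSketch] = None
) -> ResourceSketch:
    """Build a resource, recording a relation only when the caller knows it."""
    return ResourceSketch(identifier=url, relation=relation)
```

With such a model, an extractor that cannot infer the relationship would call `resource_for(url)` and leave the relation unset, while a curator who knows the paper introduces the data could pass `RelationSketch.IsDescribedBy` explicitly.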

Member

ok, to expedite, let's proceed with IsDescribedBy and see where it takes us ;-)

+            )
+        )
+    return related
+
+
 FIELD_EXTRACTORS: dict[str, Callable[[dict], Any]] = {
     "wasDerivedFrom": extract_wasDerivedFrom,
     "wasAttributedTo": extract_wasAttributedTo,
@@ -595,6 +619,7 @@
     "anatomy": extract_anatomy,
     "digest": extract_digest,
     "species": extract_species,
+    "relatedResource": extract_related_resource,
 }
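
To make the filtering rules of `extract_related_resource` concrete, here is a rough standalone sketch. It is not the code from this PR: it swaps `yarl.URL` for the standard library's `urllib.parse.urlsplit`, returns plain dicts instead of `models.Resource`, and the name `doi_urls_to_resources` is made up for illustration. Edge-case behaviour of the two URL parsers may differ.

```python
from __future__ import annotations

from urllib.parse import urlsplit


def doi_urls_to_resources(metadata: dict) -> list[dict] | None:
    """Hypothetical stand-in mirroring the control flow of the PR's extractor."""
    pubs = metadata.get("related_publications")
    # A bare string (or a missing key) is not treated as a list of publications.
    if not isinstance(pubs, (list, tuple)):
        return None
    related = []
    for v in pubs:
        # Skip non-string entries.
        if not isinstance(v, str):
            continue
        try:
            u = urlsplit(v)
        except ValueError:
            # Skip strings that cannot be parsed as a URL.
            continue
        # Keep only http(s) URLs whose host is doi.org.
        if u.scheme not in ("http", "https") or u.hostname != "doi.org":
            continue
        related.append({"identifier": v, "relation": "dcite:IsDescribedBy"})
    return related
```

Under these assumptions, a free-text value such as "A Brief History of Test Cases" yields `None`, a list without any DOI URL yields an empty list, and only entries like "https://doi.org/10.48324/dandi.000027/0.210831.2033" become related resources.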


7 changes: 7 additions & 0 deletions dandi/tests/data/metadata/metadata2asset_3.json
@@ -92,5 +92,12 @@
                 "name": "Cyperus bulbosus"
             }
         }
+    ],
+    "relatedResource": [
+        {
+            "schemaKey": "Resource",
+            "identifier": "https://doi.org/10.48324/dandi.000027/0.210831.2033",
+            "relation": "dcite:IsDescribedBy"
+        }
     ]
 }
3 changes: 2 additions & 1 deletion dandi/tests/data/metadata/metadata2asset_simple1.json
@@ -42,5 +42,6 @@
             "schemaKey": "Participant",
             "identifier": "sub-01"
         }
-    ]
+    ],
+    "relatedResource": []
 }
6 changes: 5 additions & 1 deletion dandi/tests/test_metadata.py
@@ -323,7 +323,9 @@ def test_timedelta2duration(td: timedelta, duration: str) -> None:
                 "institution": "University College",
                 "keywords": ["test", "sample", "example", "test-case"],
                 "lab": "Retriever Laboratory",
-                "related_publications": "A Brief History of Test Cases",
+                "related_publications": [
+                    "https://doi.org/10.48324/dandi.000027/0.210831.2033"
+                ],
                 "session_description": "Some test data",
                 "session_id": "XYZ789",
                 "session_start_time": "2020-08-31T15:58:28-04:00",
@@ -860,6 +862,7 @@ def test_nwb2asset(simple2_nwb: Path) -> None:
         variableMeasured=[],
         measurementTechnique=[],
         approach=[],
+        relatedResource=[],
     )


@@ -939,4 +942,5 @@ def test_nwb2asset_remote_asset(nwb_dandiset: SampleDandiset) -> None:
         variableMeasured=[],
         measurementTechnique=[],
         approach=[],
+        relatedResource=[],
     )