provide automated DOI registration #699

alee · 2024-03-14T22:52:47Z

add support for automated metadata translation from our object model into the DataCite schema with eventual DOI publication via the datacite Python library

refs https://github.com/comses/planning/issues/146

sgfost · 2024-03-15T21:04:56Z

django/library/models.py

+        some fields DataCite do not want or have.
+        """
+        metadata = {}
+        codemeta = release.codemeta.metadata


Do we want to go through codemeta at all for any of this? Seems like it may be easier to follow if we just transform straight from the release, and use shared methods for any instances where fields may have the same transformation as codemeta

originally we thought that all the info from codemeta would go straight into the datacite doi but found out that's not the case. i can refactor to use the release info as much as possible instead of the codemeta. @alee, ok?

@sgfost - can you elaborate on the "shared methods" and how this is done in python syntax? for example CodeMeta convert_authors method, how to share that with the DataCiteMetadata?

@asuworks and I had a conversation about this yesterday - there's two ways we could think about this, one is that codemeta represents our "Intermediate Representation (IR)" metadata that we use to translate to all other metadata. That would be fine if there's a non-lossy transformation between our object model metadata -> codemeta -> datacite

However, the issue is if datacite has additional / richer metadata fields than codemeta provides: we could fill them, but the data isn't available through codemeta, so we'd need to use our object model metadata directly. In that case it might be better to have a direct transformation from our object model metadata -> datacite

I think we did some analysis on this in the requirements spec in the metadata crosswalk table but don't have it clear in mind at the moment, we should look at that and respond definitively here if this is the case

to answer your question though we could think of "shared methods" as functions that take our object model and convert it into a fully materialized in-memory intermediate representation (Python dictionary based or otherwise) that can be:

- used to more easily transform into codemeta or datacite
- easily tested without having to deal with making additional database queries, etc
- maybe other benefits
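A minimal sketch of that intermediate-representation idea, for illustration only. `ReleaseSnapshot`, `to_intermediate`, `to_codemeta`, `to_datacite`, and the dictionary keys are hypothetical names rather than anything in the codebase, and the real crosswalks would cover many more fields.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ReleaseSnapshot:
    """Hypothetical in-memory stand-in for a CodebaseRelease (no DB access)."""
    title: str
    description: str
    version: str
    authors: list = field(default_factory=list)  # e.g. [{"given": ..., "family": ...}]


def to_intermediate(release):
    """Materialize everything the crosswalks need into a plain dict."""
    return {
        "title": release.title,
        "description": release.description,
        "version": release.version,
        "authors": [dict(a) for a in release.authors],
    }


def to_codemeta(ir):
    """Illustrative codemeta-style output built only from the IR."""
    return {
        "@context": "http://schema.org",
        "name": ir["title"],
        "description": ir["description"],
        "version": ir["version"],
        "author": [
            {"@type": "Person", "givenName": a["given"], "familyName": a["family"]}
            for a in ir["authors"]
        ],
    }


def to_datacite(ir):
    """Illustrative datacite-style output built only from the IR."""
    return {
        "titles": [{"title": ir["title"]}],
        "descriptions": [
            {"description": ir["description"], "descriptionType": "Abstract"}
        ],
        "version": ir["version"],
        "creators": [
            {
                "name": f'{a["family"]}, {a["given"]}',
                "givenName": a["given"],
                "familyName": a["family"],
            }
            for a in ir["authors"]
        ],
    }
```

Both output functions take only the dict, so they can be tested with hand-written input and no database fixtures.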

I like that, I think the Transformer naming makes sense. Any reason to not keep the build()/transform() as a @classmethod for the child classes? It makes the interface slightly simpler, not sure if there is any downside

Also, it may make more sense to directly call methods that the base transformer class provides, rather than using its transform() which is sorta just an arbitrary collection of fields that wouldn't mean much if we, say, needed a 3rd metadata crosswalk for something

The reason to use a base class transform() is based on the assumption that it would generate an object which would cover (mostly) all the required fields in an acceptable format.
In a child class we would just need to "fix" the wrong or missing attributes.
@sgfost can you provide a quick code sample of your idea?

since i'm new to our code and python, i'll defer the decision to Allen and Scott. should i hold off on my refactor and work on the fabrica api until a decision is made or should i refactor using the CodebaseRelease info?

something to this effect @asuworks, similar strategy just calling the methods from the base directly rather than building everything and picking out what we want from a dict

    class SharedMetadataTransformer:
        def __init__(self, release):
            self.release = release

        def convert_contrib_to_authors(self):
            # something
            pass


    class CodemetaTransformer(SharedMetadataTransformer):
        INITIAL_DATA = {
            "@context": "http://schema.org",
            # ...
        }

        def transform(self):
            self.metadata = self.INITIAL_DATA.copy()
            self.metadata.update(
                # ...
                codemeta_specific="something",
                authors=self.convert_contrib_to_authors(),
                # ...
            )
            return self.metadata

@monaw We'll still need to identify any shared transforms and remove the datacite class's dependency on codemeta regardless of how exactly it ends up looking. Focusing on the api client in the meantime isn't a bad idea though

Nice ideas! I could see the use of having transform() generate a Python dictionary and return it every time, but we could also still do something like DataCiteMetadata.build() or CodeMeta.build() and use the transformer within it. One thing that might be good to do is to cache the Python dictionary that gets built up by the transformer, to have a more stateful object (though I certainly appreciate the functional programmingness of this design).

I tend to prefer composition over inheritance so might suggest that DataCiteMetadata and CodeMeta have a MetadataTransformer instance that can convert a CodebaseRelease into a pure in-memory data structure that can then be used as faithful source material for further CodeMeta or DataCite transformations. The MetadataTransformer can then be mocked and tested without db dependencies etc.

Other thoughts?
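A rough sketch of the composition-plus-caching variant described above, under stated assumptions: the `MetadataTransformer` body, the `to_dict()` method, and the single `title` field are placeholders for illustration, not the actual implementation.

```python
from unittest.mock import Mock


class MetadataTransformer:
    """Hypothetical: converts a release into a plain dict, building it once."""

    def __init__(self, release):
        self.release = release
        self._cache = None

    def transform(self):
        # Build the dictionary on first use and reuse it afterwards.
        if self._cache is None:
            self._cache = {"title": self.release.title}
        return self._cache


class DataCiteMetadata:
    def __init__(self, transformer):
        self.transformer = transformer  # has-a transformer, not is-a

    @classmethod
    def build(cls, release):
        return cls(MetadataTransformer(release))

    def to_dict(self):
        common = self.transformer.transform()
        return {"titles": [{"title": common["title"]}]}


# In a test, the transformer can be swapped out without touching the database.
fake_transformer = Mock()
fake_transformer.transform.return_value = {"title": "Example model"}
assert DataCiteMetadata(fake_transformer).to_dict() == {
    "titles": [{"title": "Example model"}]
}
```

`build()` stays a classmethod entry point, while tests can pass in any object that has a `transform()` method.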

…done, author and contributors remaining to do as well as testing; team decided to cache metadata dictionary so will hold off on refactor for now (comses.net/comses#699)

sgfost

Looks good, very thorough. Some thoughts from a first pass over everything.

also, I haven't fully combed through the code in all of the tasks in doi.py but I wonder if there is any duplication between them that can be cleaned up?

sgfost · 2024-05-03T15:54:09Z

django/library/models.py

+        # FIXME: what is the difference between
+        # CodebaseRelease.objects.filter(codebase=r.codebase).order_by("-version_number").all()
+        # and
+        # ordered_codebase_releases: List[CodebaseRelease] = codebase.ordered_releases()


codebase.ordered_releases(has_change_perm=True) should be equivalent

sgfost · 2024-05-04T01:05:52Z

django/library/doi.py

+            http_status = 204
+            message = str(dc_nce)
+
+        except DataCiteBadRequestError as cd_bre:


just a typo: `cd_bre` here vs. `dc_nce` just above, so presumably `dc_bre` was intended

sgfost · 2024-05-04T01:38:00Z

django/library/models.py

+        return ReleaseContributor.objects.authors(self.codebase_release)
+
+
+class CodeMetaMetadata:


These metadata transformers still feel a bit unwieldy or difficult to read through, especially the huge build..() methods. One idea is dataclasses instead of the metadata = {} and consistently using property methods instead of a mix of classmethod getters, property/cached_property, and direct assignment in build().

Though this might be a refactor that can happen after this merge. I'll be implementing something like a GithubMetadata rather soon so that may be a good time to revisit
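For illustration, a small sketch of what the dataclass-plus-properties shape could look like. The class and field names (`Creator`, `ReleaseMetadata`, `citation_authors`) are made up for this example and are not taken from models.py.

```python
from dataclasses import asdict, dataclass
from functools import cached_property


@dataclass
class Creator:
    given_name: str
    family_name: str

    @property
    def name(self):
        # Derived value exposed consistently as a property
        return f"{self.family_name}, {self.given_name}"


@dataclass
class ReleaseMetadata:
    title: str
    version: str
    creators: list

    @cached_property
    def citation_authors(self):
        # Computed once per instance, then stored on the instance
        return "; ".join(c.name for c in self.creators)

    def to_dict(self):
        data = asdict(self)  # declared fields only, nested dataclasses become dicts
        data["citation_authors"] = self.citation_authors
        return data
```

The declared fields document the expected shape up front, which a bare `metadata = {}` built up inside a long `build()` method does not.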

sgfost · 2024-05-04T02:50:16Z

django/library/doi.py

+
+"""
+RECURRENT TASKS
+"""


all of these should be available to be invoked from the management command. I also think the input() waits should be made optional so that we can schedule them as cron jobs for example
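A sketch of making the `input()` waits opt-in, using plain `argparse` so it runs on its own. The flag names mirror `--interactive` and `--dry-run`; `run()`, `confirm()`, and the returned strings are hypothetical. In a Django management command the same flags would be declared in `add_arguments()` and read in `handle()`.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Mint DOIs (illustrative sketch)")
    parser.add_argument(
        "--interactive",
        action="store_true",
        help="pause for confirmation before proceeding",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="report what would happen without calling DataCite",
    )
    return parser


def confirm(message, interactive, ask=input):
    """Only block on user input when explicitly requested."""
    if not interactive:
        return True
    return ask(f"{message} [y/N] ").strip().lower() == "y"


def run(argv, ask=input):
    options = build_parser().parse_args(argv)
    if not confirm("Mint DOIs for all eligible releases?", options.interactive, ask):
        return "aborted"
    # Placeholder for the real work; only the control flow is shown here.
    return "dry run" if options.dry_run else "minted"
```

With no flags the function never prompts, so it can be scheduled unattended. Passing `ask` explicitly also lets tests simulate a user response.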

…, default publication year to this year if none, updated identifier and creators fields (comses/comses.net/comses#699)

…ode notes (issue comses/comses.net/comses#699)

monaw · 2024-07-18T16:01:53Z

the issue that i was working on was the datacite python package schema43.validate() test was failing. one thing i noticed was that the creators metadata was empty. the test code does publish() the release but yet the creators were empty. oddly during debugging, sometimes the creators will have 2 test_user entries but i didn't figure out why sometimes it was empty and sometimes it had 2...perhaps there is something about the compute_contributors() caching that i don't understand. that's as far as i got before my time ran out. i'm really sorry to leave this issue unsolved! since DataCite requires the creators metadata, i can see why schema43.validate() failed but there may be other additional reasons why the validation is failing. for more info, see DOI feature documentation
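One way to narrow this down while debugging is a cheap pre-check that reports empty required fields before handing the dictionary to `schema43.validate()`. The sketch below assumes the field list shown (a subset of DataCite's mandatory properties); `missing_required` is a hypothetical helper, not part of the datacite package.

```python
# Assumed subset of mandatory DataCite properties, for illustration only
REQUIRED_FIELDS = ("creators", "titles", "publisher", "publicationYear")


def missing_required(metadata):
    """Return the required keys that are absent or empty in the metadata dict."""
    return [key for key in REQUIRED_FIELDS if not metadata.get(key)]


example = {
    "titles": [{"title": "Example model"}],
    "publisher": "Example Publisher",
    "publicationYear": "2024",
    "creators": [],  # the failure mode described above
}
assert missing_required(example) == ["creators"]
```

Asserting on this in the test would separate the empty-creators problem from any other reasons the schema validation fails.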

sgfost

The core doi creation logic seems alright to me, the only real improvement I can think of other than minor cleanup is doing metadata updates on post_save signals rather than the complicated-seeming comparison. This should probably be done with celery though..

Otherwise, the only thing I'm confused by is the remove_dois_from_not_peer_reviewed_releases command. If the DOI exists and points to the release, is there harm in keeping it stored, even though it's sorta 'legacy' data?

sgfost · 2024-08-13T18:27:45Z

django/library/models.py

@@ -2438,14 +2510,90 @@ def __str__(self):
        return f"[peer review] {invitation.candidate_reviewer} submitted? {self.reviewer_submitted}, recommendation: {self.get_recommendation_display()}"


-class CodeMeta:
+class CommonMetadata:


it seems like this should be a collection of helper functions/static methods instead of a built dictionary, is this a worthwhile change?

One of the initial goals was to also provide an easier way to mock out the metadata tests and improve testing speed by not needing a DB but the design didn't end up that way. That's part of the reason that the CodeMeta and DataCite metadata classes were initialized with a bare dictionary but without validation it's kind of a mess. Let's discuss at or after tomorrow's dev meeting?

sgfost · 2024-08-13T18:29:29Z

django/library/models.py


    @classmethod
-    def convert_platforms(cls, codebase_release: CodebaseRelease):
-        return [tag.name for tag in codebase_release.platform_tags.all()]
+    def from_codebase(cls, codebase: Codebase):


would it be better if ReleaseDataciteMetadata and CodebaseDataciteMetadata were 2 separate sibling classes that inherit from the same schema or at least the base fields that are currently duplicated?

That sounds like a good way to separate the logic, I'll work on that while continuing to clean up the way their dictionaries are built up
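A possible shape for the sibling classes, sketched under assumptions: the base class name, constructor arguments, and the use of `HasVersion` / `IsVersionOf` related identifiers are illustrative choices, and the DOI strings in any usage would be placeholders.

```python
class DataCiteMetadataBase:
    """Hypothetical base: fields that both kinds of DOI record share."""

    def __init__(self, title, creators, publication_year):
        self.title = title
        self.creators = creators
        self.publication_year = publication_year

    def base_fields(self):
        return {
            "titles": [{"title": self.title}],
            "creators": list(self.creators),
            "publicationYear": str(self.publication_year),
            "types": {"resourceTypeGeneral": "Software"},
        }

    def extra_fields(self):
        raise NotImplementedError

    def to_dict(self):
        return {**self.base_fields(), **self.extra_fields()}


def _related(doi, relation_type):
    return {
        "relatedIdentifier": doi,
        "relatedIdentifierType": "DOI",
        "relationType": relation_type,
    }


class CodebaseDataciteMetadata(DataCiteMetadataBase):
    def __init__(self, title, creators, publication_year, release_dois=()):
        super().__init__(title, creators, publication_year)
        self.release_dois = release_dois

    def extra_fields(self):
        # The parent codebase record points at each of its releases
        return {
            "relatedIdentifiers": [_related(d, "HasVersion") for d in self.release_dois]
        }


class ReleaseDataciteMetadata(DataCiteMetadataBase):
    def __init__(self, title, creators, publication_year, version, codebase_doi=None):
        super().__init__(title, creators, publication_year)
        self.version = version
        self.codebase_doi = codebase_doi

    def extra_fields(self):
        fields = {"version": self.version}
        if self.codebase_doi:
            fields["relatedIdentifiers"] = [_related(self.codebase_doi, "IsVersionOf")]
        return fields
```

Each sibling only states what differs from the base, so the duplicated base fields live in one place.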

alee · 2024-08-14T02:13:26Z

> The core doi creation logic seems alright to me, the only real improvement I can think of other than minor cleanup is doing metadata updates on post_save signals rather than the complicated-seeming comparison. This should probably be done with celery though..

Good call, though I think celery is probably overkill for this, a cron job should be fine since these aren't long-running or frequent processes. I don't think I would do it on post_save, and instead batch it nightly.
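As one possible shape for a nightly batch, the sketch below detects changed metadata by comparing a content hash against the value stored at the last sync. This hashing approach is an assumption swapped in for illustration, not what the PR implements. `fingerprint`, `sync_batch`, and the `update_remote` callback are hypothetical names.

```python
import hashlib
import json


def fingerprint(metadata):
    """Stable hash of a metadata dict, independent of key order."""
    payload = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def sync_batch(records, update_remote, dry_run=True):
    """Push only the records whose metadata changed since the last sync.

    records: iterable of (doi, current_metadata, last_synced_fingerprint)
    update_remote: callable(doi, metadata) that performs the actual update
    Returns the list of DOIs that were (or, in a dry run, would be) updated.
    """
    updated = []
    for doi, metadata, last_fingerprint in records:
        if fingerprint(metadata) == last_fingerprint:
            continue  # unchanged since the last sync
        if not dry_run:
            update_remote(doi, metadata)
        updated.append(doi)
    return updated
```

A cron-scheduled management command could call something like this once per night, with `dry_run` defaulting to the safe option.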

> Otherwise, the only thing I'm confused by is the remove_dois_from_not_peer_reviewed_releases command. If the DOI exists and points to the release, is there harm in keeping it stored, even though it's sorta 'legacy' data?

I think the goal is to re-mint old handle.net and legacy PIDs to be uniform though @asuworks would probably be better to clarify here...

asuworks · 2024-08-14T05:26:55Z

We wanted to get rid of legacy DOIs for consistency. There were not so many legacy DOIs in prod, as far as I remember...
We would need to handle them separately. Do we want this?

remove codemeta tests entirely, will add them back in comses#699

- use datacite python client to mint DOIs
- add basic scaffolding to settings / config.ini and a get_datacite_client() method to doi module
- add custom setup and cleanup for property testing

Co-authored-by: Anton Suharev <asuworks@users.noreply.github.com>
Co-authored-by: Scott Foster <sgfost@users.noreply.github.com>
Co-authored-by: Mona Wong <monaw@users.noreply.github.com>

- add DATACITE_DRY_RUN (default=true) to django settings and .env.template
- check with `hasattr` before deleting cached values
- make keywords case insensitive
- add __init__ method to CodeMetaValidationTest to setup instance variables properly
- default publication year to this year if empty for codebase release (comses/planning#146)
- start to move DOI tasks to management commands

Recurring Tasks:

./manage.py mint_dois --interactive --dry-run
./manage.py sync_doi_metadata --interactive --dry-run

- skip parameters (interactive, dry-run), to run in production
- push datacite sandbox into default settings
- DataCite creator givenName, familyName, and name all must be set explicitly, the DataCite fabrica form performs an ORCID lookup to populate those fields
- rename to CodeMetaSchema/DataCiteSchema, replace "Metadata" with "Schema" for something slightly shorter and better reading than CodeMetaMetadata
- add two subtype classes to handle creating a DataCiteSchema from a Codebase or a CodebaseRelease. Should consider a pydantic data model in the future
- quiet down exceptions for degenerate codebases w/o Licenses and return partially consistent proxy objects if the codebase is not yet published
- remove ContributorAffiliation tags prefetch
- remove spuriously additional build_aip from archive creation
- start to refactor various Factory test mock classes
- add a flag defer_fs currently only used by tests but in the future could support creating the published archive asynchronously as a scheduled task (huey / temporal / etc)
- hypothesis tests were generating inconsistent results due to issues with our state generation for Codebases, CodebaseRelease, Users, etc. switching to get_or_create for now, this may have downstream effects but probably shouldn't
- minor logger tuning to remove unnecessary messaging

- prefix all one-off destructive DOI commands with `doi_`
- add reset_staging to mint new DOIs on staging using the datacite sandbox, doi_reset_staging -> step 3, doi_mint_parent_codebase_dois
- bump deps for datacite schema 4.5 and django cve

alee force-pushed the datacite_doi_registration branch from 35a20e6 to e5786f4 Compare March 14, 2024 23:00

sgfost reviewed Mar 15, 2024

asuworks self-requested a review March 27, 2024 01:01

asuworks marked this pull request as ready for review March 27, 2024 03:48

alee force-pushed the datacite_doi_registration branch 5 times, most recently from a2b559f to f403211 Compare April 3, 2024 17:45

alee force-pushed the datacite_doi_registration branch from a859111 to ed5f198 Compare April 9, 2024 22:16

asuworks mentioned this pull request May 1, 2024

fix: failing test_codemeta #233 #715

Closed

sgfost reviewed May 4, 2024

monaw added a commit to alee/comses.net that referenced this pull request May 25, 2024

updated DataCiteMetadata.build_codebase_metadata(): added description…

f0d0bea

…, default publication year to this year if none, updated identifier and creators fields (comses/comses.net/comses#699)

alee pushed a commit to alee/comses.net that referenced this pull request Jun 19, 2024

updated DataCiteMetadata.build_codebase_metadata(): added description…

8a81b3e

…, default publication year to this year if none, updated identifier and creators fields (comses/comses.net/comses#699)

alee force-pushed the datacite_doi_registration branch 2 times, most recently from 8a81b3e to ca941ec Compare June 19, 2024 17:57

monaw added a commit to alee/comses.net that referenced this pull request Jul 18, 2024

updated tests and it will pass but needs lot more work; see doc and c…

0e29fcb

…ode notes (issue comses/comses.net/comses#699)

alee force-pushed the datacite_doi_registration branch from 0e29fcb to b35f00c Compare August 12, 2024 21:12

alee pushed a commit to alee/comses.net that referenced this pull request Aug 12, 2024

updated tests and it will pass but needs lot more work; see doc and c…

29027ea

…ode notes (issue comses/comses.net/comses#699)

alee force-pushed the datacite_doi_registration branch from 393075c to 63877cb Compare August 13, 2024 03:04

sgfost reviewed Aug 13, 2024

alee force-pushed the datacite_doi_registration branch 2 times, most recently from d9f7c5d to e73617a Compare August 16, 2024 06:09

alee added a commit to sgfost/comses.net that referenced this pull request Aug 30, 2024

content: minor improvements to wording

780a846

remove codemeta tests entirely, will add them back in comses#699

alee mentioned this pull request Aug 30, 2024

add note about frozen files on peer review completion and remove duplicated review reminders content #757

Merged

alee force-pushed the datacite_doi_registration branch 3 times, most recently from e45893c to 401cdc6 Compare November 19, 2024 00:38

alee force-pushed the datacite_doi_registration branch 4 times, most recently from 297ff46 to 543030f Compare November 27, 2024 09:27

alee and others added 4 commits November 27, 2024 02:29

fix: add datacite secrets to build workflow

a3a5640

deps: bump to postgres 16 to match client

abfa27c

alee force-pushed the datacite_doi_registration branch 14 times, most recently from ad9aec9 to 83a0bed Compare December 4, 2024 07:16

alee force-pushed the datacite_doi_registration branch from 83a0bed to 9754634 Compare December 5, 2024 03:46

refactor: prefix doi management commands

6c2725b

- prefix all one-off destructive DOI commands with `doi_`
- add reset_staging to mint new DOIs on staging using the datacite sandbox, doi_reset_staging -> step 3, doi_mint_parent_codebase_dois
- bump deps for datacite schema 4.5 and django cve

alee force-pushed the datacite_doi_registration branch from 9754634 to 6c2725b Compare December 5, 2024 04:46

alee merged commit 9d96cfc into comses:main Dec 5, 2024
6 of 7 checks passed

alee deleted the datacite_doi_registration branch December 12, 2024 22:57
