feat: cff to doi parser #107

Merged
merged 7 commits into from
Jan 30, 2024

Conversation

cmdoret
Member

@cmdoret cmdoret commented Jan 30, 2024

Adds a parser to extract the DOI from a CITATION.cff file into schema:citation
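For readers who want a feel for the idea, here is a minimal sketch of root-level DOI extraction. It is not the code merged in this PR: it uses a regular expression instead of a YAML parser so that it runs with the standard library alone, and the names `extract_root_doi` and `doi_to_url` are hypothetical helpers introduced for illustration.

```python
import re
from typing import Optional

# Matches a `doi:` key only at column 0, i.e. at the root of the YAML document.
# Indented keys (for example a `doi:` nested under `references`) do not match.
ROOT_DOI = re.compile(r"^doi:[ \t]*(?P<value>\S.*)$", re.MULTILINE)


def extract_root_doi(cff_text: str) -> Optional[str]:
    """Return the root-level DOI from CITATION.cff text, or None if absent."""
    match = ROOT_DOI.search(cff_text)
    if match is None:
        return None
    # Drop surrounding whitespace and optional YAML quotes.
    return match.group("value").strip().strip("\"'")


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI into an https://doi.org/ URL (hypothetical helper)."""
    return doi if doi.startswith("http") else f"https://doi.org/{doi}"
```

The resulting URL is what would be attached to the repository through schema:citation. A real implementation would more likely load the file with a YAML library and read the `doi` key from the resulting mapping.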

@cmdoret cmdoret requested a review from rmfranken January 30, 2024 11:01
@cmdoret cmdoret self-assigned this Jan 30, 2024
@cmdoret
Member Author

cmdoret commented Jan 30, 2024

With more recent versions of the format, doi can also be represented as:

identifiers:
  - description: This is the collection of archived snapshots of all versions of My Research Software
    type: doi
    value: "10.5281/zenodo.123456"

Need to add support for this
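A rough sketch of what that support could look like, assuming the file has already been loaded into a Python mapping by a YAML library. The function name `dois_from_identifiers` is made up for this example and is not part of gimie.

```python
from typing import Any, Dict, List


def dois_from_identifiers(cff: Dict[str, Any]) -> List[str]:
    """Collect the values of all `identifiers` entries whose type is `doi`."""
    dois = []
    # `identifiers` is optional, so fall back to an empty list.
    for entry in cff.get("identifiers") or []:
        if isinstance(entry, dict) and entry.get("type") == "doi" and entry.get("value"):
            dois.append(str(entry["value"]))
    return dois


# The mapping a YAML loader would be expected to produce for the snippet above.
parsed = {
    "identifiers": [
        {
            "description": (
                "This is the collection of archived snapshots "
                "of all versions of My Research Software"
            ),
            "type": "doi",
            "value": "10.5281/zenodo.123456",
        }
    ]
}
print(dois_from_identifiers(parsed))  # ['10.5281/zenodo.123456']
```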

@cmdoret
Member Author

cmdoret commented Jan 30, 2024

We also need to define the behaviour when multiple DOIs are present in the file. Example from the spec repo:

identifiers:
  - type: doi
    value: 10.5281/zenodo.1003149
    description: The concept DOI for the collection containing all versions of the Citation File Format.
  - type: doi
    value: 10.5281/zenodo.5171937
    description: The versioned DOI for the version 1.2.0 of the Citation File Format.
date-released: "2021-08-09"
keywords:
  - citation file format
  - CFF
  - citation files
  - software citation
  - file format
  - YAML
  - software sustainability
  - research software
  - credit
license: "CC-BY-4.0"
doi: 10.5281/zenodo.5171937

Do we want to generate multiple triples in that case?

<https://github.com/citation-file-format/citation-file-format> a schema:SoftwareSourceCode ;
    schema:citation <10.5281/zenodo.5171937> ;
    schema:citation <10.5281/zenodo.1003149> .
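To make the "multiple triples" option concrete, here is a small sketch that renders one schema:citation triple per distinct DOI as Turtle text. The function name `citation_triples` is hypothetical, and the sketch assumes DOIs are expanded to https://doi.org/ URLs, since a bare `<10.5281/...>` would be read as a relative IRI in Turtle. Like the snippet above, it omits the `@prefix` declaration for `schema:`.

```python
from typing import Iterable


def citation_triples(repo_url: str, dois: Iterable[str]) -> str:
    """Render one schema:citation triple per distinct DOI, in Turtle syntax."""
    unique = list(dict.fromkeys(dois))  # deduplicate while keeping order
    lines = [f"<{repo_url}> a schema:SoftwareSourceCode"]
    lines += [f"    schema:citation <https://doi.org/{doi}>" for doi in unique]
    return " ;\n".join(lines) + " ."


print(
    citation_triples(
        "https://github.com/citation-file-format/citation-file-format",
        ["10.5281/zenodo.5171937", "10.5281/zenodo.1003149", "10.5281/zenodo.5171937"],
    )
)
```

Deduplicating first matters for the example file, where the versioned DOI appears both under `identifiers` and as the root `doi:`.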

@rmfranken
Member

> Do we want to generate multiple triples in that case?

Yes, I think so. But in the example you sketch, I would only capture the value of "doi:", and not the values of the identifiers that happen to be DOIs, as they refer to different abstractions ('is part of', 'specific versions'). If multiple values for "doi:" are possible for a repository/tool in a CFF file, then our tool should reflect that, regardless of how dumb it is to have two identifiers representing the same "thing".

Note: I see that the example also contains DOIs for conference proceedings etc. Above, I am referring purely to the "doi:" at the root of the file, not to those of objects that the file references, e.g.

conference:
  name: "Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE5.1)"
doi: 10.6084/m9.figshare.3827058
title: "Track 2 Lightning Talk: Should CITATION files be standardized?"

In my opinion, those should not be part of the written RDF output. I would not even make them "seeAlso", since they don't seem central to what identifies a given repository.

Member

@rmfranken rmfranken left a comment


Code looks nice. Very clear. You gave me a new reason to chase down the parser abstraction, I will need the same kind of logic for my "rdfifyer" script. With the small comment improvement I think this is ready to go!

gimie/parsers/cff.py Outdated
gimie/parsers/__init__.py Outdated
@rmfranken
Member

Edit: on re-reading the spec, I see I was wrong. "doi:" is not a mandatory root property of a CFF file. I will read a bit deeper and form an opinion about the multi-DOI scenario.

@cmdoret
Member Author

cmdoret commented Jan 30, 2024

> > Do we want to generate multiple triples in that case?
>
> Yes, I think so. But in the example you sketch, I would only capture the value of "doi:", and not the values of the identifiers that happen to be DOIs, as they refer to different abstractions ('is part of', 'specific versions'). If multiple values for "doi:" are possible for a repository/tool in a CFF file, then our tool should reflect that, regardless of how dumb it is to have two identifiers representing the same "thing".
>
> Note: I see that the example also contains DOIs for conference proceedings etc. Above, I am referring purely to the "doi:" at the root of the file, not to those of objects that the file references, e.g.
>
> conference:
>   name: "Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE5.1)"
> doi: 10.6084/m9.figshare.3827058
> title: "Track 2 Lightning Talk: Should CITATION files be standardized?"
>
> In my opinion, those should not be part of the written RDF output. I would not even make them "seeAlso", since they don't seem central to what identifies a given repository.

But if we take the other example (https://github.com/opencv/cvat/blob/develop/CITATION.cff), there is no doi: at the root; the DOI is only defined through identifiers. It looks like we have to choose between missing citations and including unrelated DOIs. Shall we go with the former and keep things as they are now (only parse doi:)?

EDIT: sorry, just saw your last comment

@cmdoret cmdoret requested a review from rmfranken January 30, 2024 13:48
@rmfranken
Member

Yes, I think the former is simpler and safer. I find objectifying the DOI into an identifier more confusing, and hopefully most people just have a single DOI for their tool. That being said, we can always change it if we see we are missing out on a lot of DOIs...

@cmdoret cmdoret merged commit 23c75dd into main Jan 30, 2024
8 checks passed