Revised data/code indicators
vtraag committed Dec 13, 2024
1 parent 7a8e5e8 commit 7f6bec4
Showing 4 changed files with 101 additions and 30 deletions.
PLOS also provides API's to search its database. This [page](https://api.plos.or

### Level of FAIRness of data

Metrics on the level of FAIRness of data (sources) can help establish the prevalence of open/FAIR data practices. This metric attempts to show in a more nuanced manner where FAIR data practices are used and, in some cases, even to what extent they are used. Assessing at a glance whether a data source follows the FAIR principles is not trivial, but some initiatives have developed methodologies that help to determine this for (a large number of) data sources.

#### Measurement

##### Existing methodologies

###### Research Data Alliance

The Research Data Alliance developed a FAIR Data Maturity Model [@group_fair_2020] that can help to assess whether data adheres to the FAIR principles. The document is not meant as a normative model, but provides guidelines for informed assessment.

The FAIR Data Maturity Model includes a set of indicators for each of the four FAIR principles that can be used to assess whether the principles are met. Each indicator is described in detail and its relevance is annotated (essential, important or useful). The model recommends evaluating the maturity of each indicator with the following set of maturity categories:

0. Not applicable
1. Not being considered yet
2. Under consideration or in planning phase
3. In implementation phase
4. Fully implemented

By following this methodology, one could assess to what extent the FAIR data practices are adhered to and create comprehensive overviews, for instance by showing the scores in radar charts.
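As a sketch of how such an assessment could be summarised, one could average the maturity levels per FAIR principle and plot the result on a radar chart. The indicator names and scores below are hypothetical, purely for illustration:

```python
# Hypothetical maturity assessment: each FAIR principle maps to a list of
# (indicator, relevance, maturity) tuples, with maturity on the 0-4 scale above.
assessment = {
    "Findable":      [("F1: persistent identifier", "essential", 4),
                      ("F4: indexed in searchable resource", "important", 3)],
    "Accessible":    [("A1: retrievable via standard protocol", "essential", 4)],
    "Interoperable": [("I1: formal knowledge representation", "important", 2)],
    "Reusable":      [("R1: plurality of relevant attributes", "essential", 1)],
}

def principle_scores(assessment):
    """Average maturity per principle, ignoring 'not applicable' (0) indicators."""
    scores = {}
    for principle, indicators in assessment.items():
        applicable = [maturity for _, _, maturity in indicators if maturity > 0]
        scores[principle] = sum(applicable) / len(applicable) if applicable else None
    return scores

print(principle_scores(assessment))
# The four resulting scores form the axes of a radar chart.
```

Note that the maturity model itself does not prescribe this aggregation; weighting indicators by their relevance (essential/important/useful) would be an equally defensible choice.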

###### Data life cycle assessment

Determining the level of FAIR data practices can involve assessing how well data adheres to the FAIR principles at each stage of the data lifecycle, from creation to sharing and reuse [@jacob2019].

Evaluate adherence to FAIR principles at each stage: For each stage of the data

Determine the overall level of FAIR data practices: Once the scores for each principle and stage have been assigned, determine the overall level of FAIR data practices. This can be done by using a summary score that takes into account the scores for each principle and stage, or by assigning a level of FAIR data practices based on the average score across the principles and stages.
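A minimal sketch of such a summary score follows; the stage names and scores are hypothetical, and the unweighted mean is only one possible choice, not prescribed by the methodology:

```python
# Hypothetical per-stage scores (0-4 scale) for each FAIR principle
# across the data lifecycle.
stage_scores = {
    "creation": {"F": 3, "A": 2, "I": 2, "R": 1},
    "sharing":  {"F": 4, "A": 3, "I": 2, "R": 2},
    "reuse":    {"F": 4, "A": 4, "I": 3, "R": 2},
}

def overall_fair_level(stage_scores):
    """Overall level as the mean score across all stages and principles."""
    values = [score for stage in stage_scores.values() for score in stage.values()]
    return sum(values) / len(values)

print(round(overall_fair_level(stage_scores), 2))
```

A weighted variant (e.g. giving later lifecycle stages more weight) would only require replacing the mean with a weighted sum.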

###### Automated detection of FAIRness

There are some attempts to establish the FAIRness of data automatically. One such tool, F-UJI, is available from <https://www.f-uji.net>, developed by @devaraju_f-uji_2024. The accuracy of the tool is not reported.

### Availability of data statement

A data availability statement in a publication describes how the reader can access the data underlying the research. Having such a statement in place improves transparency on data availability and can be considered an Open Data practice. However, a data availability statement does not necessarily imply that the data is openly available or that it is more likely that the data can be shared [@gabelica2022]. Nevertheless, a description of how to access an Open Data repository, how to request data access, or an explanation of why some data cannot be shared due to ethical considerations are all examples of Open Data practices that make data reuse more accessible and transparent [@federer2018]. Indeed, even if the data itself cannot be shared, metadata typically can be.

#### Measurement

All PLOS journals require publications to include a data availability statement.
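As a sketch, the PLOS Search API mentioned earlier could be used to find publications that mention data availability. The `everything` full-text field and the other parameters below are assumptions to verify against the API documentation at <https://api.plos.org/>:

```python
import urllib.parse

# Sketch: build a query URL for the PLOS Search API.
# Field and parameter names are assumptions; check https://api.plos.org/.
def plos_query_url(phrase, rows=10):
    params = {
        "q": f'everything:"{phrase}"',  # full-text phrase search (assumed field)
        "fl": "id,title",               # return only DOI and title
        "wt": "json",
        "rows": rows,
    }
    return "https://api.plos.org/search?" + urllib.parse.urlencode(params)

print(plos_query_url("data availability statement"))
# Fetching this URL (e.g. with urllib.request) returns a JSON response whose
# response["response"]["docs"] lists matching publications.
```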

## Known correlates

Some research suggests that openly sharing data is positively related to the citation rate of publications [@piwowar2007; @piwowar2013].

## Description

Many, if not most, scientific analyses involve the use of code or software in one way or another. Code and software can be used for data handling, statistical estimation, visualisation, or various other tasks. Both open-source and closed-source software may be used for research. For instance, MATLAB and Mathematica are two commercial software packages that may be used in research, whereas Octave and SageMath are open-source alternatives. We here try to provide metrics that can serve as an indicator of the use of code in research, where "code" refers to any type of software (e.g. computer library, tool, package) or any set of computer instructions (e.g. like an R or Python script) used in the research cycle.

One challenge is that we are typically interested in the use of "research software", not in all software per se. Defining what this encompasses is not straightforward. [@gruenpeter2021] defines it as code "that \[was\] created during the research process or for a research purpose. Software components (e.g., operating systems, libraries, dependencies, packages, scripts, etc.) that are used for research but were not created during or with a clear research intent should be considered software in research and not Research Software" [@gruenpeter2021, p. 16]. As this clarifies, research might also involve the creation of new software that is released for other researchers to work with. However, this is not considered in this indicator, but in the indicator on open code. Almost any code depends on other code to work properly. Some of these dependencies might constitute research software themselves, but this is not necessarily the case. Instead of trying to classify software as "research software" or not, we take a more agnostic approach in the description of this indicator, and simply try to describe approaches to uncover the use of some code in research, regardless of whether it constitutes "research software" or not.

Sometimes a distinction is made between "reuse" and "use", where "reuse" refers explicitly to the use of openly released software, whereas "use" refers to the use of software more generally. We do not make such a distinction here.

This indicator can be useful to provide a more comprehensive view of the impact of the contributions by researchers. Some researchers might be more involved in publishing, whereas others might be more involved in developing and maintaining research software (and possibly a myriad other activities).

## Metrics

Most research software is not properly indexed. There are initiatives to have research software properly indexed and identified, such as the [Research Software Directory](https://research-software-directory.org/), but these are far from comprehensive at the moment, and this is the topic of ongoing research [@malviya-thakur_scicat_2023]. Many repositories support uploading research software. For instance, Zenodo currently holds about 116,000 records of research software. However, there are also reports of the absence of support for including research software in repositories [@carlin2023].
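As a sketch, Zenodo's REST API can be queried for software records; the `type=software` filter used below is an assumption to verify against the Zenodo API documentation:

```python
import urllib.parse

# Sketch: build a query URL for Zenodo's records API
# (see https://developers.zenodo.org/). The "type" filter parameter
# is an assumption; check the current API documentation.
def zenodo_software_url(query="", size=1):
    params = {"q": query, "type": "software", "size": size}
    return "https://zenodo.org/api/records?" + urllib.parse.urlencode(params)

print(zenodo_software_url())
# The JSON response contains a total-hits field, from which the number of
# software records (about 116,000 at the time of writing) can be read.
```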

### Number of times code is cited/mentioned in scientific publications

The biggest limitation is that not all researchers report all research software

In addition, software might not be cited explicitly, and instead the paper associated with the software might be cited. The association between papers and software can be retrieved in various ways. Sometimes, software repositories are mentioned in papers, while vice-versa, the software repository may include citation information. This may take various forms, such as a [`CITATION.cff`](https://citation-file-format.github.io/) file in a GitHub repository, or a [`CITATION`](https://stat.ethz.ch/R-manual/R-devel/library/utils/html/citation.html) file in an R package. The association between papers and code is also being tracked by <https://paperswithcode.com/>. However, it is difficult to distinguish between citations to a publication for the software it introduced, or other advances made in the paper. Nonetheless, it might be relevant to combine citations statistics to the paper with explicit citations or mentions of the research software.
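For illustration, a minimal hypothetical `CITATION.cff` file (all names and identifiers below are invented placeholders) looks like this:

```yaml
# Minimal hypothetical CITATION.cff; see https://citation-file-format.github.io/
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "Example Analysis Toolkit"
authors:
  - family-names: "Doe"
    given-names: "Jane"
version: "1.0.0"
doi: "10.5281/zenodo.0000000"  # placeholder DOI
```

Tooling on GitHub and elsewhere can read such a file to generate citation strings, which makes explicit software citation easier to track.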

#### Measurement

##### Existing datasources

###### Bibliometric databases

Not all bibliometric databases actively track research software, and therefore n

###### Extract software mentions from full text

Especially because of the limited explicit references to software, it is important to also explore other possibilities to track the use of code in research. One possibility is to try to extract the mentions of a software package or tool from the full-text. This is done by [@istrate] who have trained a machine learning model to extract references to software from full-text. They rely on the manual annotation of software mentions in PDFs by [@du2021]. The resulting dataset of software mentions is made available publicly [@istrate_cz_2022].

Although the dataset of software mentions might provide a useful resource, it is a static dataset, and at the moment there do not yet seem to be initiatives to continuously monitor and scan the full text of publications. Additionally, its coverage is limited to mostly biomedical literature. For that reason, it might be necessary to run the proposed machine learning algorithm itself. The code is available from <https://github.com/chanzuckerberg/software-mention-extraction>.

A common "gold standard" dataset for training software mention extraction from full text is the so-called SoftCite dataset [@howison_softcite_2023].

### Repository statistics (# Forks/Clones/Stars/Downloads/Views)

Much (open-source) software is shared in version control repositories in online platforms. Various types of usage statistics can be derived from these online platforms, that somehow relate to the general level of interest in the software. These metrics vary from how many other users have copies of those repositories (often called forks), to how many people downloaded a particular release from this platform.
There are some clear limitations to this approach. Firstly, not all research sof

The most common version control system at the moment is [Git](https://git-scm.com/), which itself is open-source. There are other version control systems, such as Subversion or Mercurial, but these are less popular. The most common platform on which Git repositories are shared is GitHub, which is not open-source itself. There are also other repository platforms, such as [CodeBerg](https://codeberg.org/) (built on [Forgejo](https://forgejo.org/)) and [GitLab](https://gitlab.com/), which are themselves open-source, but they have not yet managed to reach the popularity of GitHub. We therefore limit ourselves to describing GitHub, although we might extend this in the future.

#### Measurement

We propose three concrete metrics based on the GitHub API: the number of forks, the number of stars and the number of downloads of releases. There are additional metrics about traffic available from the [GitHub API metrics](https://docs.github.com/en/rest/metrics), but these unfortunately require permissions on the specific repository.

##### Existing methodologies

###### Forks/Stars (GitHub API)

On GitHub, people can make a personal copy of a repository, which is called a fork. In addition, they can "star" a repository, in order to "save" it in their list of "favourite" repositories. The number of forks of a repository hence provides a metric of how many people have made personal copies of a repository, and the number of stars provides a metric of how many people have marked it as a "favourite".

The calculation of the number of forks and the number of stars is straightforward. For a particular `repo` from a particular `owner`, one can get the count from <https://api.github.com/repos/owner/repo>. For instance, for the repository `openalex-guts` from `ourresearch`, one can get the information from <https://api.github.com/repos/ourresearch/openalex-guts>. The number of forks is then listed in the field `forks_count` and the number of stars in `stargazers_count`. See the API documentation for more details.
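A sketch of extracting these counts from the API response follows; the sample response is illustrative, not real data:

```python
import json

# Sample of the JSON returned by https://api.github.com/repos/{owner}/{repo};
# the counts here are illustrative, not real.
sample_response = json.loads("""
{"full_name": "ourresearch/openalex-guts",
 "forks_count": 25,
 "stargazers_count": 130}
""")

def repo_metrics(repo_json):
    """Extract fork and star counts from a GitHub repository API response."""
    return {"forks": repo_json["forks_count"],
            "stars": repo_json["stargazers_count"]}

print(repo_metrics(sample_response))
# In practice, the JSON would be fetched from the API first, e.g. with
# urllib.request.urlopen("https://api.github.com/repos/ourresearch/openalex-guts").
```

Note that unauthenticated GitHub API requests are rate limited, so collecting these metrics at scale requires authentication.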
