-
Notifications
You must be signed in to change notification settings - Fork 0
/
preserve.qmd
172 lines (91 loc) · 12.1 KB
/
preserve.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
---
title: "Archiving and preserving your data"
---
As you finalize your project, an important task is to archive your data in a publicly available repository (pending sensitivity and by non-disclosure agreement exceptions). There are a few important steps to ensure that your data can be reused by others and thus make your work more reproducible.
Your general philosophy when preparing the preservation of your scientific products should be: ***Document what you used and preserve what you produced***
<hr>
## What scientific products to preserve?
Often the first question that comes to mind when starting to preserve your work is: What should I include in my data archive? Generally speaking, you want to preserve **your** work. This means capturing the methodology you used, the raw data you collected, any data cleaning you did, and any data and output (figure, report, etc.) you generated. Okay… so you mean everything!? Well, yes and no. Everything that was relevant to help you to come to the findings and conclusions discussed in your project report or any other publications and deliverables. Let’s break this down!
### Raw data
Here are a few questions to ask yourself to determine if you should refer in your documentation to the raw data you used or also include them in your data archive.
1. The **raw data is already publicly accessible**, and the hosting solution (website, FTP server, etc.) seems well maintained (ideally providing a recommended citation)
=> *Document the website or process you used to collect the data and when you accessed/downloaded the data you used. Try also to determine if a specific version number is associated with the data you used.*
2. The raw data is **not** publicly accessible
Note that we are not talking about data under a non-disclosure agreement (NDA) here but more about data with an unclear reuse status or obtained by interactions with a person or an institution. For example, if the data you used were sent to you privately, then we recommend that you:
- inquire with your person of contact about the status of licensing and if they would be willing to let you share those data publicly. You might face resistance at first, so take the time to explain why you think it is valuable to your work to also share those data sets.
- if, in the end, it is not possible to share the data, please still describe the data in your documentation and list the contact information (person or institution) to inquire about this data set.
### Intermediate data
This is data you generated either while cleaning or analyzing the raw data. You should preserve it if:
- it was not directly generated by a script (otherwise, preserve the code instead)
- it has reusable value. For example, cleaned-up versions of raw data can be very valuable for others to reuse!
### Code
Scripting your analytical workflow from the raw data to the end products is a great way to make your work more reproducible and more reusable by others. We thus strongly encourage your team to develop code to process and analyze data. Cloud-based code repository services, such as GitHub, GitLab, BitBucket, and more, are a great way to both manage and preserve your code.
Those services are often well-integrated with data repositories that link your code repository with your data archive. They also offer a way to tag a specific version of your code to ensure it is the exact code you used for a specific analysis.
### Final products
We recommend including any data set used to produce statistics, figures maps, and other visualizations that were used in your work, in this case, even if generated by scripts.
<hr>
## Choosing a data repository
OK, we know what we want to archive. Now let’s decide where we want to preserve things!
UCSB's institutional data repository [Dryad](https://datadryad.org/stash) will be your default data repository. However, we encourage you to discuss with your Faculty Advisor to determine if other data repositories might suit your targeted audience/community better.
If you would like to research on your own which data repository could be best for your project, the [Registry of Research Data Repositories](https://www.re3data.org) is a great resource to do so.
### Dryad
[Dryad](https://datadryad.org/stash) is free of use for any affiliated researcher. Here is an overview of the process of submitting data to DRYAD:
![DRYAD data submission overview](images/dryad_workflow.png)
Before you can start this submission process, we will describe the information you need to prepare for your data archive submission.
## Sharing your work
### Use open file formats
We recommend using open and text-based file formats as much as possible to make your data more accessible. It removes the need to buy specific software to open your data and ensures that this format will remain readable in the longer term.
### Choosing a license
This can quickly become an overwhelming subject. In a nutshell, [Creative Commons licenses](https://creativecommons.org/share-your-work/) are the ones that are favored to license data.
![Foter, CC BY-SA 3.0 <https://creativecommons.org/licenses/by-sa/3.0>, via Wikimedia Commons](https://upload.wikimedia.org/wikipedia/commons/a/a5/CC_License_Freedom_Scale_Chart.png)
Note that data repositories often support a certain number of licenses, so this can be a parameter to help you choose which data repository you want to use. For example, ***DRAYD is licensing all its content under [CC0](https://creativecommons.org/publicdomain/zero/1.0/), an equivalent to saying its content is in the public domain***.
If you want to know more on how to best license your data, click [here](https://www.library.ucsb.edu/sites/default/files/dls-n10-2021-licensing-navy_0.pdf)
## Documenting your work
To make your archiving process the most efficient, it is key to document your work as you progress throughout your project. If you do so, archiving your data will consist of collecting existing information about the various parts of your project rather than developing it from scratch a few months after you have generated this specific data set.
Add an image about the power of README
### Metadata
Metadata aims at describing your data with enough information that should let you be able to reuse this data even if you know nothing about this specific data set. It is sometimes defined as data about data. So what should you include? Here are some pointers:
- Describe the contents of data files. If you are using complex jargon or concepts make sure you refer to external vocabulary or clearly define these terms as used in your project
- Keep data entry consistent
- Define the parameters and the units on the parameter
- Explain the formats for date, time, geographic coordinates, and other parameters
- Define any coded values or acronyms used (in the data or headers)
- Describe quality flags or qualifying values
- Define missing values
- Always use an established metadata standard
Note that there are existing metadata standards that can help you to structure your data in a systematic way. Some data repositories offer web forms that let you enter metadata information in a way that is compatible with those standards.
### README
README files are not new; they have been around computer-based projects since the early days and have proven to be very valuable in describing content. For your main project README we recommend compiling the following information about your data:
- Data sources: where did you get your raw data from?
- Provenance: what data set you used/combined to generate your new data
- License & permissions: state under which license your data sets are released so others know what they can do with your data
- Methodology: describe the methodology used for your projects
- Data listing: list all your data and their potential relationships
- Metadata: add the metadata you developed earlier for each of your data set
Note that you should not feel limited to one README only. It is a great practice to also have READMEs in any subfolder/section of your project with complex organization and explanations on how to navigate the file structure or in which order files should be combined.
Our team has prepared a [README template](https://github.com/UCSB-Library-Research-Data-Services/project-data-management/blob/main/_templates/README-Template-EDS213.md) for your submission to Dryad. Note that some sections might not be relevant to your project, and you should simply delete those as you move through filling out this template.
## Submitting to DRYAD
You are now ready to start the submission process!! Here we are going to use DRYAD as an example.
- Before logging into Dryad, you must create a free [ORCID](https://info.orcid.org/what-is-orcid/). ORCIDs are free, unique identifiers for yourself. Think of it as a Social Security Number for Researchers that, this time, you can share. Create your account [here](https://orcid.org/register).
- Then you can [login](https://datadryad.org/stash/sessions/choose_login) and should land on a page as below. Click on the `Start new dataset` button
![](images/dryad_landing-page.png)
- You will now have to fill out the web form to provide project-level information, starting with the _Preliminary information_ section. Select `other or not applicable` as your submission is not related to a journal publication.
![](images/dryad_preliminary.png)
- Fill the _Basic information_ section
+ _Title_: Note it can be different than your project a describing as much as possible your data archive content
+ _Authors_: enter all your teammates
+ _Research domain_: Choose the more relevant to your project
+ _Funding_: check the `No Funding received box` or enter any external funding you received for this project
+ _Abstract_: enter a short description of your work
- _Data Description_ section:
+ _Keywords_: add relevant keywords and please make sure to add the Bren project keyword `BrenMESMProject`
+ _Methods_: describe the methods you used for your analysis
+ _Usage notes_: you can add any software you used to conduct your project here. please provide the version number along with the software name
- _Related work_ section:
+ This is where you can link your data repository on GitHub with your data archive. Select `Software` as _Work type_ and paste the URL to your repository. Do not forget to also provide the necessary information on the README on your code repository. Once your data archive is published on DRYAD, do not forget to add the link to README on GitHub, linking the resources both ways.
![](images/dryad_related-work.png)
***DRYAD full Submission and publication guidelines: <https://datadryad.org/stash/submission_process>***
## Data curation process
OK, you have submitted your data and its documentation to the DRYAD (or others) data repository, what is next!? Most of the data repository will have some level of data curation process, mainly a series of checks that will ensure that you provided all the necessary documentation to make your research work understandable to others. Some of those checks might be done automatically (leveraging metadata standards, for example), some others will be done by data curators associated with the data repository. In the coming days of your submission to DRYAD, a curator will reach out to you with a list of items to improve. For example replacing and Excel spreadsheets (proprietary software) with csv files that can be more easily open, especially in several years from now. Adding more information about the raw data you used or clarifying their licensing, provide more details about your methodology. Once this feedback received, you will have to iterate through those recommendations as needed and resubmit your updated documentation and data. You can expect the curation process to take several days to potentially a few weeks. Therefore plan accordingly!!
## Publication
Once the curation process is over, your data will be publish and made publicly available. Most of the data repository provide a [DOI](https://www.doi.org/the-identifier/what-is-a-doi/) to reference your data archive. Use this DOI to share and cite your data archive in your project report and publications!!