Welcome to the pipeline repository for the SPI-Birds Network and Database. Follow the links above to visit our website, contact us via e-mail or Twitter. This README contains all the details required to work with the pipelines, including workflow guidelines for developers.
Table of Contents (general user)
Table of Contents (developer guidelines)
Welcome to the SPI-Birds pipeline package. This section of the README will give you an introduction on how to load the package, how to find out details about each pipeline, and how to use the package for creating bird data following the SPI-Birds community data standard and generating standard quality checks.
The pipeline package can be installed in R with the remotes package using the following code:
remotes::install_github("SPI-Birds/pipelines")
library(pipelines)
This will install all pipelines and quality check code on your computer and attach our pipeline package to your R session. Individual pipelines are built as separate functions for each data owner (where one data owner can administer multiple populations). Each function is given the name format_X(), where X is the letter code for the data owner. The codes for different data owners and corresponding populations are described in the SPI-Birds standard protocol. Note that in cases where a data owner administers a single population, the unique 3-letter population ID code and the data owner code are identical.
To process each set of primary data into the structure described in the SPI-Birds standard protocol it is often necessary to make assumptions about how each variable is interpreted. All assumptions made during the pipeline process are described in the help documentation for a given function, which can be accessed using ? in R. For example, to read about the assumptions made when processing data from the NIOO, you can use the code:
?format_NIOO
Each set of primary data is in a slightly different format. Therefore, to run all pipelines successfully, your system will require additional software and drivers (in addition to R).
Pipelines for some populations require additional software and drivers. The setup instructions below describe how to install the required software for both a Windows 10 64-bit operating system and Mac OSX. The setup procedure should be similar for other Windows 64-bit systems. If you are unsure which version of Windows is running on your computer, check 'System Type' in 'System Information'. To run the pipelines for all populations, a user's system must have:
- Microsoft Access Driver (*.mdb, *.accdb) (Windows only)
- Python 3
- Python libraries pandas and pypxlib
Note Users running Mac OSX will not be able to run pipelines with primary data stored in Microsoft Access format without purchasing paid drivers.
Firstly, you must check that you are running a 64-bit version of R. Open an R session and see whether you have a 64-bit or a 32-bit installation.
If you do not have a 64-bit version you will need to install one here.
Once you have a 64-bit version of R, search for 'ODBC' in the Windows taskbar. There will be two versions (32-bit and 64-bit); select the 64-bit version. This will open the 'ODBC Data Source Administrator (64-bit)' window.
In the new window check for 'Microsoft Access Driver'. If this already exists you can skip to the Python stage.
If 'Microsoft Access Driver' does not exist click 'Add' to install a new driver on your system.
Select 'Microsoft Access Driver (*.mdb, *.accdb)' and click finish.
If 'Microsoft Access Driver (*.mdb, *.accdb)' does not appear, you will need to download the 64-bit driver here.
In the next window, you must add a 'Data Source Name'. Everything else can be left blank.
Check that this driver is installed and recognised by R using the function odbcListDrivers() from the odbc package. Note that you will need to open a new session of R before the driver will appear.
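For example, a quick check in a fresh session (the exact driver name may differ slightly depending on your Office installation):
# List the ODBC drivers that R can see; the Access driver should be among them
drivers <- odbc::odbcListDrivers()
unique(drivers$name)
# "Microsoft Access Driver (*.mdb, *.accdb)" should appear in the output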
To install Python, we recommend using the Anaconda distribution. Make sure to download the 3.X version of Python. The Anaconda distribution comes with some libraries (including pandas) pre-loaded.
Once installed, open the 'Anaconda prompt' and type:
pip install pypxlib
This will install the pypxlib library on your system.
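If you use reticulate to bridge R and Python (as in the Mac setup below), you can confirm from R that the library is visible; this quick check assumes reticulate is linked to your Anaconda Python:
# Should return TRUE once pypxlib has been installed into the linked Python environment
reticulate::py_module_available("pypxlib")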
Restart your computer before running the pipelines.
To generate the pdf quality check report on Windows you will need to have installed MiKTeX. If MiKTeX is not installed, only the html version of the quality check report can be created. An alternative LaTeX distribution that works well in R is TinyTeX.
At present, no free Microsoft Access Driver is available for Mac. As a consequence, the pipelines package currently does not run pipelines requiring a Microsoft Access Driver on Mac OSX (the affected pipelines are skipped and an information message is displayed when attempting to run on Mac).
The following notes detail how to set up the Python environment on Mac OSX, including the necessary libraries:
- Install Anaconda 3.X (this was last tested with 3.8).
- Check your default Python version by opening the terminal and typing python3 --version (this should return Python 3.X.Y).
- Check that pip is available by typing pip3 --version in the terminal.
- Update pip by typing python3 -m pip install --user --upgrade pip in the terminal (you have to use the --user argument as permission may be denied otherwise).
- Open RStudio and load the reticulate package: library(reticulate).
- Check which version of Python reticulate is linked to: py_config() (the required Python libraries need to be installed into this virtual environment).
- Install the pandas library from within R: py_install("pandas").
- Install the pypxlib library from within R: py_install("pypxlib", pip = TRUE) (since this is an external library hosted on GitHub, you need to specify installation via pip).
- Check that both libraries are now available:
reticulate::py_module_available("pandas")
reticulate::py_module_available("pypxlib")
(Both commands should return TRUE.)
With this setup, Python should be good to go for extracting Paradox databases. (Note that when you install Anaconda, the r-reticulate environment should already be present. If that is not the case, you may have to first generate the environment and link it to RStudio.)
At present, the pipelines package does not create pdf outputs when run on a Mac. This is work in progress and will be changed in the future.
If you are still unable to run the pipelines following these setup instructions, try these troubleshooting techniques:
- Restart your computer before running pipelines to ensure R recognises the newly installed software and drivers.
- If R on Windows does not recognise Python's pandas module, try installing it using reticulate::py_install("pandas").
- Download the newest version of R here.
- Update all R packages.
Once your computer is set up and the primary data follow the correct naming protocol, you can run the pipeline function. R will ask you to select the folder where the primary data are stored. You can decide on the output created by the pipeline using the argument output_type, which can be either "csv" (separate .csv files, the default) or "R" (an R object). For example:
format_NIOO(output_type = "R")
If you want to run multiple pipelines at once, you can use the run_pipelines() function instead.
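For example, a sketch (we assume here that run_pipelines() accepts the same PopID and output_type arguments as the individual format_X() functions; the population codes shown are purely illustrative):
# Run the pipelines for a chosen set of populations and return the results as R objects
run_pipelines(PopID = c("AAA", "BBB"), output_type = "R")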
All files relevant to SPI-Birds are stored in the N drive on the NIOO server (N:\Dep.AnE\SPI_Birds\data). This data folder contains separate folders for every data owner in the format X_Name_Country, where X is the data owner code, Name is the name of the data owner, and Country is the country where the data owner is based. For example, the NIOO data are stored in the folder:
NIOO_NetherlandsInstituteOfEcology_Netherlands
The folder for each data owner will contain all relevant information for all populations administered by the data owner. This will usually include:
- Primary data
- Meta data
- Archive meta data
- The archive folder
The naming convention of each of these files is described below.
In addition to folders for each data owner, the data folder contains the most recent output of all pipelines in the standard format, including an archiving folder. When a data request is made, this version of the standard format can be filtered to meet a given data request (see Data requests below). This is more efficient than re-running the pipelines for each data request.
All files used to run pipelines and store data should follow the standard naming convention.
Primary data should be named with the format X_PrimaryData_Y, where X is the data owner code (described above) and Y is additional information used to distinguish between multiple primary data files. For example, a data owner ABC may have separate primary data files for great and blue tits. These files might then be named:
ABC_PrimaryData_Greattit.csv
ABC_PrimaryData_Bluetit.csv
All data owners should also provide meta-data about their population(s) in an .xlsx file with the format X_MetaData.xlsx, where X is the data owner code.
The folder of each data owner will also include an archive meta data .txt file (the archiving process is explained in more detail below). This file will be in the format X_ArchiveMetaData.txt, where X is the data owner code.
The data owner may provide other files (e.g. field work protocols, relevant papers). The possible types of files here are unrestricted, so the naming convention must be more flexible. Files can contain any information and be of any file type; however, all files should start with the data owner code. For example, the field protocol for data owner ABC may be stored as:
ABC_FieldProtocol.docx
The code for all pipelines is stored in the /R folder of the pipelines repository. Every pipeline file should follow the naming convention format_X.R, where X is the data owner code. More details on the structure of pipeline code can be found below.
Below we describe the workflow that any developer should follow when building a new pipeline.
- Contact the data owner and let them know you have started to work on their data. At this point, it is usually helpful to ask about any changes or additions that may have been made to the primary data since it was first included in the SPI-Birds database.
- Update the SPI-Birds Google Sheet and list the pipeline as 'in progress'.
- Pull the newest version of the master branch (git pull).
- Create a new branch from the master where you will build your pipeline (git checkout -b new_branch_name). Make sure the branch name is clear and concise.
- As you work, you should stage (git add format_X.R) and commit (git commit -m 'commit header' -m 'commit details') your work regularly.
Note Commits should ideally be distinct blocks of changes with a concise header and detailed description. See some commit best practices here.
- To make commits more easily readable/searchable, you should include an emoji at the start of each commit message following these dev guidelines. For example, if you fix some typos in the code, your commit message would be ':pencil2: Fix typo in format_XXX() function'.
- When you have finished working for a day, push your branch to the remote (git push -u origin new_branch_name the first time; git push afterwards).
- In your new branch, create the new file format_X.R in the /R folder, where X is the data owner code.
- This file should contain one parent function (format_X()) and at least 4 internal functions, one for each of the four tables in the standard format: create_brood_X(), create_capture_X(), create_individual_X(), create_location_X(). A minimal sketch of this structure is shown after the note below.
- format_X() should always take 4 standard arguments:
  - path: The location of the folder where primary data are stored. Can be left blank, in which case R will prompt you to find the folder.
  - PopID: The population code(s) for the populations where you want to run pipelines. Relevant for data owners that administer multiple populations.
  - Species: The species code(s) for the species you want to use (e.g. PARMAJ for great tit). See the SPI-Birds standard protocol for all species codes.
  - output_type: Should the data be output in R or as separate .csv files?
- These arguments are all documented under pipeline_params in the zzz.R file within /R.
- Every function should be documented using roxygen2. The 'Details' section of the documentation should be used to record any decisions that were made during the pipeline construction that may be relevant for the data owner, users, or other developers.
- Once a pipeline is finished, add information about the pipeline to pop_codes.csv and pop_species_combos.csv in the /inst/extdata folder.
- If your pipeline works with a new species, also include this species in species_codes.csv in the /inst/extdata folder.
Note: We recommend you look at other pipelines as a guide.
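As a rough guide, here is a minimal (and heavily simplified) sketch of the expected file structure for a hypothetical data owner ABC. The internal helpers are stubs and the file name ABC_PrimaryData.csv is illustrative; a real pipeline will contain far more processing and full roxygen2 documentation.
# Stub internal helpers; each would transform the primary data into the
# corresponding table of the standard format
create_brood_ABC      <- function(data) data
create_capture_ABC    <- function(data) data
create_individual_ABC <- function(data) data
create_location_ABC   <- function(data) data

# Parent function with the 4 standard arguments
format_ABC <- function(path = NULL, PopID = "ABC", Species = NULL, output_type = "csv") {

  # Read the primary data (illustrative file name following the naming convention)
  primary_data <- utils::read.csv(file.path(path, "ABC_PrimaryData.csv"))

  # Build the four standard-format tables
  output <- list(Brood_data      = create_brood_ABC(primary_data),
                 Capture_data    = create_capture_ABC(primary_data),
                 Individual_data = create_individual_ABC(primary_data),
                 Location_data   = create_location_ABC(primary_data))

  # Return an R object or write separate .csv files, depending on output_type
  if (output_type == "R") {
    return(output)
  }

  for (tbl in names(output)) {
    utils::write.csv(output[[tbl]],
                     file = file.path(path, paste0(tbl, "_ABC.csv")),
                     row.names = FALSE)
  }
  invisible(NULL)
}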
Every pipeline should have a set of unit tests in the /tests/testthat folder using the testthat package.
- The unit testing code for each pipeline should be stored in a file test-XXX.R, where XXX is the data owner code. The file should start with an option to skip if the data path is missing. It should then run the corresponding format_XXX() function for the pipeline, followed by the required tests. Unit tests should ensure that the primary data have been properly converted to the standard format. This will usually involve comparing the values for a range of different individuals in the standard format (e.g. different species, different sexes) to those that would be expected from the primary data. In other words, these tests require some manual inspection of the primary data to determine the correct output expected for each individual. A hedged sketch of such a test file is shown after this list.
- Each pipeline should undergo five sets of unit tests:
  - Test standard format structure: have the four tables Brood_data, Capture_data, Individual_data, and Location_data been created?
  - Test brood data.
  - Test capture data.
  - Test individual data.
  - Test location data.
- See examples from completed pipelines to better understand the structure of unit testing code.
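For illustration, a hedged sketch of what tests/testthat/test-ABC.R might look like for a hypothetical data owner ABC; the way the data path is located and the expected values are assumptions, not the package's exact conventions:
library(testthat)

# Hypothetical location of the primary data; skip the tests if it is not set
data_path <- Sys.getenv("ABC_DATA_PATH")

test_that("ABC pipeline produces the standard format", {

  skip_if(data_path == "", "Primary data not available; skipping ABC tests")

  pipeline_output <- format_ABC(path = data_path, output_type = "R")

  # 1. Standard format structure: all four tables exist
  expect_true(all(c("Brood_data", "Capture_data",
                    "Individual_data", "Location_data") %in% names(pipeline_output)))

  # 2-5. Table-specific checks compare values for known individuals/broods against
  # what manual inspection of the primary data says they should be, e.g.:
  # expect_equal(subset(pipeline_output$Brood_data, BroodID == "ABC-2012-001")$ClutchSize, 9)
})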
Once you have finished the pipeline and written relevant unit tests you should make sure these tests pass.
- Firstly, run the unit tests just for your new pipeline: in the console type test_pipeline(filter = "XXX"), where XXX is the data owner code of your new pipeline.
- Once your pipeline passes the relevant tests, run the unit tests for all existing pipelines by removing the filter argument: test_pipeline(). This can be time consuming and a bit annoying, but it is important to regularly test all pipelines in case old code has broken due to e.g. package updates.
- If one or more tests fail, you can troubleshoot and fix them using the filter argument as shown above. To test more than one pipeline simultaneously use test_pipeline(filter = "XXX|YYY"), where XXX and YYY are two different data owner codes.
Once your branch is passing all unit tests, you should next check the package structure. This more broadly checks things like the documentation, looks for unreadable characters, and ensures that all the code can be loaded. It will not re-run the pipeline unit tests, which are skipped at this stage.
- You can check the code using devtools::check() or Ctrl/Cmd + Shift + E to run the checks in the build window.
- Currently, the output of devtools::check() should include 2 known notes:
Imports includes 27 non-default packages.
Importing from so many packages makes the package vulnerable to any of
them becoming unavailable. Move as many as possible to Suggests and
use conditionally.
Package dependencies are discussed in more detail below.
- Any other ERRORS, WARNINGS, or NOTES must be fixed before continuing! Pull requests will not be accepted otherwise.
A NOTE about 'no visible binding for global variable' will often occur when working with dplyr code. All references to columns in a data frame should be prefixed by .data$.
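A small illustration of the fix (the example data are made up):
# Without the .data$ prefix, R CMD check reports 'no visible binding for global
# variable' for PopID and ClutchSize; with the prefix the NOTE disappears.
# In package code, the .data pronoun is imported from rlang (e.g. @importFrom rlang .data).
library(dplyr)
library(rlang)

example_broods <- tibble::tibble(PopID = c("ABC", "ABC", "DEF"),
                                 ClutchSize = c(8, 9, 7))

example_broods %>%
  dplyr::filter(.data$PopID == "ABC") %>%
  dplyr::summarise(MeanClutchSize = mean(.data$ClutchSize))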
All functions except those in the base package should have the package namespace explicitly stated (e.g. stats::na.omit()).
The function XXX includes an argument that is not documented with @param in the roxygen2 documentation code. Check for spelling mistakes!
The function XXX includes documentation for an argument in @param in the roxygen2 documentation code that does not appear in the function. Check for spelling mistakes!
Packages can only include ASCII characters. You can check the character types being used in a line of code with stringi::stri_enc_mark(). For example:
#Will return ASCII
stringi::stri_enc_mark("ABC")
#Will return UTF-8
stringi::stri_enc_mark("是")
Watch out for cases where slanted quotation marks are used (‘’) instead of straight ones ('')! Slanted quotation marks can often be introduced when copying text from outside R, but they are NOT ASCII.
If a non-ASCII character must be used, it can be encoded with unicode \uxxxx.
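For example (the variable name is made up):
# The escape \u00fc represents u-umlaut, so this source line contains only ASCII characters
station_name <- "M\u00fcnster"
stringi::stri_enc_mark(station_name)  # the resulting string is UTF-8, but the source code is ASCII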
Once your pipeline is stable and passes all tests and checks it should be reviewed by other developers.
- Push your finished pipeline to the remote (git push).
- Visit the pipelines repository and open a pull request.
- Request a reviewer. It is also good to let the person know directly so they don't miss it.
Note One key aspect of the code review should also be to test the pipelines on both Mac OSX and Windows.
Note The pull request should not be merged until after the data owner confirmation.
- Once the pipeline is stable, it can be updated to 'finished' in the Google Sheet.
The code review should ensure that there are no major bugs or oversights. At this point, we can contact the data owner to discuss the pipeline.
- Explain all decisions that were made in the pipeline.
- Confirm any columns/data that were unclear/uncertain.
- Ask about any other data that appear to be missing (e.g. capture data, nest box coordinates).
- At this point, some changes may be needed to incorporate data owner input. If changes are made to the pipeline code, it is important that unit tests and checks are run on the code again.
- Record data owner approval in the Google Sheet.
- Once a pipeline has approval from both the data owner and the code reviewer, the pull request can be merged.
- At this point the working branch can be deleted from the remote and local.
Note Remember to pull the newest version of the master branch at this point; it will include the new pipeline.
- Run quality_check() on the standard format output from the pipeline. Send the quality check report and the standard format data to the data owner to help them improve data integrity. See more details on the quality check below.
- Contact Antica to update the populations as 'available' on the website.
- Every time a new pipeline is finished (or an old pipeline updated) we should update and archive the .standard_format folder on the N drive. More about archiving below.
- A data request will specify the PopIDs and Species of interest. We can return the relevant data in the standard format by running subset_datarqst() on the most recent version of the standard format in the .standard_format folder.
- You can choose to include or exclude individuals where the species is uncertain using the include_conflicting argument (FALSE by default).
- Run quality_check() on the subset of the standard format.
- Provide the user with the subset of the standard format and the quality check report.
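A hedged sketch of this workflow is shown below; apart from include_conflicting, the argument names of subset_datarqst() and the way the subset is passed to quality_check() are assumptions, so check the function documentation for the exact interface:
# Subset the most recent standard format for the requested population(s) and species
request_output <- subset_datarqst(PopID = "AAA", Species = "PARMAJ",
                                  include_conflicting = FALSE)

# Run the quality check on the subset before sending it to the requester
quality_check(request_output)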
- Create a new folder in N:\Dep.AnE\SPI_Birds\data. It should follow the syntax <OWNERID>_<PopName>_<Country>.
- Rename files.
  - Primary data should follow the syntax <OWNERID>_PrimaryData. If there are multiple primary data files, provide different suffixes to differentiate them (e.g. <OWNERID>_PrimaryData_GTData).
  - Population meta-data should follow the syntax <OWNERID>_MetaData.
  - All other data that are not meta-data or primary data can be named in any way, but should always start with <OWNERID>_.
- Create the initial archive. The code below will generate an ArchiveMetaData.txt file and an archive folder for the new population. Important: Make sure you specify that this is the initial archive with initial = TRUE.
archive(data_folder = "N:\Dep.AnE\SPI_Birds\data", OwnerID = <OWNERID>, new_data_date = <DATE WHEN DATA WERE RECEIVED>, initial = TRUE)
- Rename the new files to match the existing data files (i.e. with the syntax <OWNERID>_PrimaryData). Important: Files should have exactly the same names, otherwise the pipelines may break. If you do need to use new file names (and rework the pipeline) you will be given a prompt to continue.
- Decide if we are dealing with a 'minor' update (e.g. fixed typos) or a 'major' update (e.g. a new year of data).
- Run the archiving code:
archive(data_folder = "N:\Dep.AnE\SPI_Birds\data", OwnerID = <OWNERID>, update_type = <"major"/"minor">,
new_data_path = <LOCATION OF NEW FILES. Consider using choose.files()>,
new_data_date = <DATE WHEN DATA WERE RECEIVED>, initial = FALSE)
THIS IS STILL DONE MANUALLY AND NEEDS TO BE UPDATED. EVERY TIME A NEW PIPELINE IS FINISHED WE SHOULD ADD THE NEWEST VERSION OF THE STANDARD FORMAT IN .standard_format AND ALSO IN A NEW FOLDER .standard_format/archive/<YYYY_MM_DD>.
Note: the quality check is built for pipelines tailored to version 1.0.0 and 1.1.0 of the standard format. Updating the quality checks to match pipelines tailored to version 2.0.0 of the standard format is in progress.
The quality_check() function is a wrapper function that combines 4 dataset-specific wrapper functions:
brood_check()
capture_check()
individual_check()
location_check()
Each of the dataset-specific functions contains a series of individual quality check functions. These individual quality check functions should be named 'check_' or 'compare_' followed by a short description of the check, and come with a CheckID (e.g. B2 is the second individual check within the brood_check() wrapper).
All individual checks should function on rows and flag records as 'warning' (unlikely values) or 'potential error' (impossible values).
Approve-listed records (i.e. flagged records that are subsequently verified by data owners) should not be flagged by the checks.
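As a rough illustration (the function name, CheckID, thresholds, and return structure below are all assumptions rather than the package's internal API), an individual check might look something like this:
# Hypothetical brood-level check: flag impossible clutch sizes as potential errors
# and unusually large ones as warnings, skipping approve-listed records
check_clutchsize_range <- function(Brood_data, approved_broods = character(0)) {

  # Skip records that the data owner has already verified (approve-listed)
  candidates <- Brood_data[!Brood_data$BroodID %in% approved_broods, ]

  list(CheckID = "B2",  # illustrative ID only
       Warning = candidates$BroodID[which(candidates$ClutchSize > 20)],
       Error   = candidates$BroodID[which(candidates$ClutchSize < 0)])
}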
If the data owner verifies any records flagged by the quality check (i.e. classifies them as legitimate values), add them to brood_approved_list.csv, capture_approved_list.csv, individual_approved_list.csv, or location_approved_list.csv.
The quality check is run on data in the standard format using quality_check().
The output of the quality check includes:
- A summary table of which checks resulted in warnings and potential errors
- The pipeline output, where each table of the standard format includes two additional columns (Warning and Error) marking the records that resulted in warnings and potential errors
- A report (in html and/or pdf) with a description of all checks and a list of all warnings and potential errors that have been flagged in the pipeline output.
If you have any issues with running the quality check, try these troubleshooting tips:
- Often pipelines make use of several grouping structures (inserted by dplyr::group_by() or dplyr::rowwise()). Removing these groups (with dplyr::ungroup() or dplyr::summarise(..., .groups = "drop")) reduces the run time of the quality check considerably.
- If you have trouble creating the pdf, try setting the LaTeX engine to LuaLaTeX (i.e. quality_check(..., latex_engine = "lualatex")).