Quality check and pipeline code fixes #169

StefanVriend · 2021-04-30T16:13:38Z

We are collecting feedback from the advisory council on the quality check procedure, report and protocol document.
Before we send them the documents, I am fixing bugs in the quality check and pipeline code that the quality check procedure revealed.

The finished pipelines of advisory council members are: NIOO, UAN, WYT, MON and PFN.
Quality check will be run on subsets of the pipeline outputs (approximately 5 years) so that the quality check reports are not terrifyingly large.

The quality check protocol document is here: https://github.com/SPI-Birds/documentation/blob/master/quality_check/SPI-Birds_quality-check-protocol_v1.0.pdf

Fix empty strings ("") for FemaleID/MaleID in Brood_data of WYT pipeline.
Rewrite create_individual_UAN() to use Capture_data instead of unprocessed capture information.

StefanVriend · 2021-05-04T15:59:21Z

After some quality check and pipeline fixes, I've run the quality check on subsets (years: 2005-2015) of the datasets, resulting in relatively small reports:

UAN (PopID: BOS): 11 pages of errors; 393 pages total
MON (PopID: ROU): 4 pages of errors; 50 pages total
PFN (PopID: EDM): 7 pages of errors; 28 pages total
NIOO* (PopID: HOG): 18 pages of errors; 276 pages total
WYT: 16 pages of errors; 305 pages total

*Note that we haven't fixed the issue in the NIOO pipeline (the one related to individuals in Individual_data missing in Capture_data) yet. This subset seems to be unaffected (check I6 flags no missing records), but it might still be worth fixing before sending the documents to Marcel. @LiamDBailey - what do you think?

StefanVriend added 9 commits April 28, 2021 14:48

Fix bugs in C2 and C5 🐛

db723a3

First mutate then filter, so that mutate no longer throws a warning when filter has resulted in zero records. Also add PopID as grouping variable to C5. Quality check should be able to run on single-population as well as multi-population pipeline outputs.
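The commit does not show the expression that triggered the warning, so the following is only a minimal sketch of the ordering it describes. The function `flag_old_chicks()`, the `max_age` threshold, the `TooOld` helper column and the example values are all hypothetical; only the column names `PopID`, `IndvID` and `ChickAge` come from the pipeline outputs.

```r
library(dplyr)

# Made-up capture records for illustration only.
capture_data <- tibble(
  PopID    = c("BOS", "BOS", "PEE"),
  IndvID   = c("a1", "a2", "b1"),
  ChickAge = c(10L, 45L, NA)
)

# Hypothetical check: flag chicks older than `max_age` days.
flag_old_chicks <- function(data, max_age = 30) {
  data %>%
    # Compute the flag on the full table first ...
    mutate(TooOld = !is.na(ChickAge) & ChickAge > max_age) %>%
    # ... and subset afterwards, so mutate() never runs on an empty table.
    filter(TooOld) %>%
    select(-TooOld)
}

flagged <- flag_old_chicks(capture_data)                 # one record exceeds 30
no_hits <- flag_old_chicks(capture_data, max_age = 100)  # nothing exceeds 100
```

With this ordering, a check that finds nothing simply returns a zero-row table, and no later step has to compute values on empty input.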

Fix bug in calculate_chick_mass_cutoffs 🐛

9eb38b4

This function should not try to fit a logistic model when all ChickAge or Mass records are NA.
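A guard of this kind could look roughly like the sketch below. The helper name `has_usable_chick_records()` is hypothetical and is not taken from `calculate_chick_mass_cutoffs()`. It is also slightly stricter than the commit message: it requires at least one record with both `ChickAge` and `Mass` present.

```r
# Hypothetical helper: TRUE only if at least one record has both
# ChickAge and Mass, i.e. there is something a model could be fitted to.
has_usable_chick_records <- function(data) {
  any(!is.na(data$ChickAge) & !is.na(data$Mass))
}

# Made-up examples.
all_age_missing <- data.frame(ChickAge = c(NA, NA), Mass = c(10.2, 11.5))
usable          <- data.frame(ChickAge = c(5, NA),  Mass = c(8.1, 9.3))

skip_fit <- !has_usable_chick_records(all_age_missing)  # no fit attempted
do_fit   <- has_usable_chick_records(usable)            # fit can proceed
```

The calling function would return early (for example with no cutoffs) when the guard is `FALSE`, and only otherwise go on to fit the logistic model.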

Fix typo in C5 ✏️

60d3e75

Add PopID to checks I3 and I6 🔨

eebddc5

Quality check should be able to run on single-population as well as multi-population pipeline outputs.

Update report message of check I3 💬

06ab295

Checking whether individuals have BroodIDs is only done for chicks. Now the message supplied in the report explicitly says that the record concerns a chick without a BroodID.

Update report message of check C3 💬

9ab1b4b

Part of the message "Impossible chick age may be caused by problems with hatch date." was printed for every record, but it is clearer to only show this on the pages of the report where the checks are described.

Update how nestlings are selected in check I3

dacbb5f

Previously the check used RingAge == "chick", but this may include individuals that were first caught after they fledged and are therefore not expected to have a BroodID. Now we use the more accurate Age_observed == 1.

Fix bug in WYT pipeline 🐛

8521b51

Set empty strings ("") for FemaleID and MaleID in Brood_data to NA.
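As an illustration, the replacement can be written in base R as below. The example rows are made up and this is not necessarily how the WYT pipeline implements it.

```r
# Made-up Brood_data rows with empty-string parent IDs.
brood_data <- data.frame(
  BroodID  = c("1", "2", "3"),
  FemaleID = c("F01", "", "F03"),
  MaleID   = c("", "M02", ""),
  stringsAsFactors = FALSE
)

# Replace "" with NA in the parent ID columns only.
id_cols <- c("FemaleID", "MaleID")
brood_data[id_cols] <- lapply(brood_data[id_cols], function(x) {
  x[!is.na(x) & x == ""] <- NA_character_
  x
})
```

Inside a dplyr pipeline, `dplyr::na_if(FemaleID, "")` within `mutate()` gives the same result.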

StefanVriend self-assigned this Apr 30, 2021

StefanVriend linked an issue Apr 30, 2021 that may be closed by this pull request

Outstanding issue for UAN: individuals with no age record but caught on nest #168

Closed

StefanVriend added 7 commits May 3, 2021 13:50

Rewrite create_individual_UAN()

53a4d23

`create_individual_UAN()` is now primarily based on Capture_data. Only Sex is still determined from the primary data, because the UAN pipeline was written for version 1.0 of the standard format and therefore has no Sex columns in Capture_data.

Fix bug in UAN pipeline: add pop_filter 🐛

b439a70

Previously, selecting only BOS or only PEE still returned both populations, because the pipeline had no pop_filter.
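A population filter of this kind might be sketched as follows. The helper `apply_pop_filter()` and the convention that `NULL` means "keep all populations" are assumptions made for illustration, not the UAN pipeline's actual code.

```r
# Hypothetical helper: keep only the requested populations.
# pop_filter = NULL is assumed here to mean "no filtering".
apply_pop_filter <- function(data, pop_filter = NULL) {
  if (is.null(pop_filter)) {
    return(data)
  }
  data[data$PopID %in% pop_filter, , drop = FALSE]
}

# Made-up records from the two UAN populations.
brood_data <- data.frame(
  BroodID = c("1", "2", "3"),
  PopID   = c("BOS", "PEE", "BOS"),
  stringsAsFactors = FALSE
)

bos_only <- apply_pop_filter(brood_data, "BOS")
both     <- apply_pop_filter(brood_data)
```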

Fix typo in UAN pipeline ✏️

2508ae7

Fix bug in UAN pipeline 🐛

d01ddb5

BroodIDs were wrongly filled into BroodIDLaid for individuals first caught as adults.

Fix bug in brood_check 🐛

4ca1102

Update dummy data according to updated check I3

4116bb5

Update docs 📝

de3d6bd

StefanVriend linked an issue May 4, 2021 that may be closed by this pull request

Fix issue with unique IndvID's within populations but not within datasets #170

Open

Remove grouping structure from UAN to speed up quality check ⚡

9efba56

When grouping structures added by `dplyr::group_by()` and `dplyr::rowwise()` are not removed (by `dplyr::ungroup()` or `dplyr::summarise(..., .groups = "drop")`), the quality check is very slow.
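The two ways of dropping the grouping mentioned above can be sketched like this; the data and the `MeanMass` column are made up for illustration.

```r
library(dplyr)

# Made-up capture records.
capture_data <- tibble(
  PopID = c("BOS", "BOS", "PEE"),
  Mass  = c(10, 12, 14)
)

# group_by() + mutate() leaves the result grouped ...
grouped <- capture_data %>%
  group_by(PopID) %>%
  mutate(MeanMass = mean(Mass))

# ... so ungroup() explicitly before handing the table on.
clean <- grouped %>% ungroup()

# With summarise(), .groups = "drop" removes the grouping directly.
summary_tbl <- capture_data %>%
  group_by(PopID) %>%
  summarise(MeanMass = mean(Mass), .groups = "drop")
```

`dplyr::is_grouped_df()` can be used to confirm that a pipeline output no longer carries a grouping before it is passed to the quality check.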

StefanVriend and others added 3 commits May 4, 2021 18:15

Fix bug in quality check when reporting PopIDs and Species 🐛

23b810f

This bug appeared when a pipeline output table was not a tbl/tibble, as is the case for Brood_data in the WYT pipeline.
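The commit does not say which operation failed, so the sketch below only illustrates one common way a plain data.frame and a tibble can behave differently. Single-column `[` subsetting drops to a vector for a data.frame but keeps the table shape for a tibble. The helper `get_pop_ids()` and the example values are hypothetical.

```r
library(dplyr)

# Made-up Brood_data as a plain data.frame and as a tibble.
df <- data.frame(
  PopID   = c("WYT", "WYT"),
  Species = c("PARMAJ", "CYACAE"),
  stringsAsFactors = FALSE
)
tb <- as_tibble(df)

df_col <- df[, "PopID"]  # plain character vector
tb_col <- tb[, "PopID"]  # still a one-column tibble

# Hypothetical helper that behaves the same for both table types:
# [[ always extracts the column itself.
get_pop_ids <- function(data) {
  unique(data[["PopID"]])
}
```

Using `[[` (or `dplyr::pull()`) when extracting PopIDs and Species avoids depending on which class of table a pipeline returns.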

Merge branch 'master' into quality_check_fixes

11c54f7

🔧 Add dplyr NAMESPACE

0981d8f

LiamDBailey merged commit 4457155 into master May 7, 2021

LiamDBailey deleted the quality_check_fixes branch July 9, 2021 10:01
