Quality check and pipeline code fixes #169

Merged 20 commits on May 7, 2021

StefanVriend (Collaborator) commented Apr 30, 2021

We are in the process of getting feedback on the quality check procedure, report and protocol document from the advisory council.
Before we send them the documents, I am fixing some bugs in the quality check and pipeline code that were revealed by the quality check procedure.

The finished pipelines of advisory council members are: NIOO, UAN, WYT, MON and PFN.
Quality check will be run on subsets of the pipeline outputs (approximately 5 years) so that the quality check reports are not terrifyingly large.

The quality check protocol document is here: https://github.com/SPI-Birds/documentation/blob/master/quality_check/SPI-Birds_quality-check-protocol_v1.0.pdf

  • Fix empty strings ("") for FemaleID/MaleID in Brood_data of WYT pipeline.
  • Rewrite create_individual_UAN() to use Capture_data instead of unprocessed capture information.

Call `mutate()` before `filter()`, so that `mutate()` no longer throws a warning when `filter()` has returned zero records.

Also add PopID as a grouping variable to check C5. The quality check should be able to run on single-population as well as multi-population pipeline outputs.
This function should not try to fit a logistic model when all ChickAge or Mass records are NA.
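
A minimal sketch of such a guard, assuming the model is a logistic growth curve of Mass against ChickAge fitted with `stats::nls()` and `SSlogis()`. The helper name `fit_growth_if_possible()` is hypothetical and is not taken from the quality check code.

```r
# Hypothetical helper (not from the quality check code): skip the model fit
# when there is nothing to fit.
fit_growth_if_possible <- function(data) {
  # Stop early when all ChickAge or all Mass records are NA
  if (all(is.na(data$ChickAge)) || all(is.na(data$Mass))) {
    return(NULL)
  }

  # Assumed model form: logistic growth curve with self-starting values
  complete <- data[!is.na(data$ChickAge) & !is.na(data$Mass), ]
  stats::nls(Mass ~ SSlogis(ChickAge, Asym, xmid, scal), data = complete)
}

# Example: every Mass record is missing, so no model is fitted
no_mass <- data.frame(ChickAge = c(3, 7, 14), Mass = NA_real_)
result <- fit_growth_if_possible(no_mass)
```

Returning `NULL` lets the calling check decide how to report "no model" instead of failing inside the fitting routine.
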
Call `mutate()` before `filter()`, so that `mutate()` no longer throws a warning when `filter()` has returned zero records.

Also change which column/variable is preferred for B6. When pipeline outputs from two versions (v1.0 and v1.1) are combined, datasets will contain old and new columns. Quality check should be run on the columns that actually contain the information.
Quality check should be able to run on single-population as well as multi-population pipeline outputs.
Checking whether individuals have BroodIDs is only done for chicks. Now the message supplied in the report explicitly says that the record concerns a chick without a BroodID.
Part of the message "Impossible chick age may be caused by problems with hatch date." was printed for every record, but it is clearer to only show this on the pages of the report where the checks are described.
Previously the check used RingAge == "chick", but this may include individuals first caught after they fledged, which are not expected to have a BroodID. Now we use the more accurate Age_observed == 1.
Set empty strings ("") for FemaleID and MaleID in Brood_data to NA.
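
A minimal base R sketch of this conversion. The toy `brood_data` table and its values are made up for illustration, and the actual WYT pipeline may do this differently (for example with `dplyr::na_if()`).

```r
# Toy Brood_data with empty strings in the parent ID columns (made-up values)
brood_data <- data.frame(
  BroodID  = c("B1", "B2", "B3"),
  FemaleID = c("F01", "", "F03"),
  MaleID   = c("", "M02", NA),
  stringsAsFactors = FALSE
)

# Replace "" with NA in both ID columns, leaving existing NAs untouched
id_cols <- c("FemaleID", "MaleID")
brood_data[id_cols] <- lapply(brood_data[id_cols], function(x) {
  x[!is.na(x) & x == ""] <- NA_character_
  x
})
```
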
`create_individual_UAN()` is now primarily based on Capture_data. Only Sex is determined from the primary data, because the UAN pipeline was created in version 1.0 of the standard format and therefore does not have Sex columns in Capture_data.
Previously, when selecting either BOS or PEE, the pipeline returned both populations because there was no pop_filter.
BroodIDs were wrongly filled into BroodIDLaid for individuals first caught as adults.
When grouping structures inserted by `dplyr::group_by()` and `dplyr::rowwise()` are not removed (by `dplyr::ungroup()` or `dplyr::summarise(..., .groups = "drop")`), quality check is very slow.
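
A short sketch of how leftover grouping can be detected and removed, assuming dplyr >= 1.0.0 (needed for the `.groups` argument). The toy `brood_data` table is made up for illustration.

```r
library(dplyr)

# Toy Brood_data (made-up values)
brood_data <- tibble(
  PopID          = c("BOS", "BOS", "PEE", "PEE"),
  BreedingSeason = c(2005, 2006, 2005, 2005),
  ClutchSize     = c(8, 9, 7, 10)
)

# "drop_last" is the default behaviour: the result is still grouped by PopID
still_grouped <- brood_data %>%
  group_by(PopID, BreedingSeason) %>%
  summarise(MeanClutch = mean(ClutchSize), .groups = "drop_last")

# Option 1: remove the remaining grouping explicitly
ungrouped <- still_grouped %>% ungroup()

# Option 2: drop all grouping inside summarise()
dropped <- brood_data %>%
  group_by(PopID, BreedingSeason) %>%
  summarise(MeanClutch = mean(ClutchSize), .groups = "drop")

is_grouped_df(still_grouped)  # grouping is still present here
is_grouped_df(dropped)        # no grouping left
```

Calling `is_grouped_df()` on each pipeline output table before running the quality check is one way to catch grouping that was left in by accident.
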
StefanVriend (Collaborator, Author) commented:

After some quality check and pipeline fixes, I've run the quality check on subsets (years: 2005-2015) of the datasets, resulting in relatively small reports:

  • UAN (PopID: BOS): 11 pages of errors; 393 pages total
  • MON (PopID: ROU): 4 pages of errors; 50 pages total
  • PFN (PopID: EDM): 7 pages of errors; 28 pages total
  • NIOO* (PopID: HOG): 18 pages of errors; 276 pages total
  • WYT: 16 pages of errors; 305 pages total

*Note that we haven't fixed the issue in the NIOO pipeline (the one related to individuals in Individual_data missing in Capture_data) yet. This subset seems to be unaffected (check I6 flags no missing records), but it might still be worth fixing before sending the documents to Marcel. @LiamDBailey - what do you think?

StefanVriend and others added 3 commits May 4, 2021 18:15
This bug appeared when a pipeline output table was not a tbl/tibble, as with Brood_data in the WYT pipeline.
LiamDBailey merged commit 4457155 into master on May 7, 2021
LiamDBailey deleted the quality_check_fixes branch on July 9, 2021 10:01