Release version 1.0.0 #227

dfalster · 2024-05-06T03:41:48Z

First major release of APCalign. A preprint is available at
https://www.biorxiv.org/content/10.1101/2024.02.02.578715v1.
The article has been accepted for publication in Australian Journal of Botany.

Following review, a number of changes have been implemented. These have sped up and
streamlined the package.

* Update function documentation
* Speed up `extract_genus`
* Write a much faster replacement for `stringr::word`
* Improve the speed and accuracy of the `fuzzy_match` function by
  - restricting the reference list to names with the same first letter as the input string
  - switching from `utils::adist` to `stringdist::stringdist(method = "dl")`
* Rework `standardise_names` to remove punctuation from the start of the string
* Rework `strip_names_extra` (previously `strip_names_2`) so that it only performs steps additional to `strip_names`, rather than repeating those performed by `strip_names`
* Avoid importing entire packages by using the `package::function` format throughout and removing functions from `@import`
* Add fuzzy match arguments to `create_taxonomic_update_lookup`
* Add 3 additional family-level APC matches to `match_taxa`
* Refine tests
* Make messages to console optional
* Fix failures when GitHub is down (fix CRAN issue before release, #205)

* Added redevp to gitignore
* Bumped version and refined graceful failing
* minor syntax fixes
* corrections to matches that can't match to genus (these were still assigning taxon_rank = genus)
* remove test checks for alignment codes (creating unnecessary errors)

Co-authored-by: Fonti Kar <f.kar@unsw.edu.au>
Co-authored-by: Daniel Falster <daniel.falster@unsw.edu.au>

This PR refactors a few functions to increase speed. The time to run `load_taxonomic_resources` has dropped from 15.0s to 2.2s (on Daniel's MacBook Pro M2).

* Faster version of extract_genus (#187)
* Faster version of stringr::word
* Function to standardise taxon rank
* Speed up strip_name
* update tests

* First commit updated DESCRIPTION and NEWS
* Updated installation instructions
* Added reproducibility article and exported default_version
* Added citation
* Added reproducibility article
* Update vignettes/articles/reproducibility.Rmd

Co-authored-by: Daniel Falster <daniel.falster@unsw.edu.au>

* adding progress bar for loading
* trying to get caching/output option to work
* passing output through
* reviving caching
* fixing counting
* roxygen update
* adding quiet option
* checking cached file
* documenting caching functionality
* getting message working
* removing cutting edge arrow
* reverting change back to cran, too soon
* nope arrow github not working yet

Changes to `standardise_names` to handle corner cases that were previously being missed. These mainly involve removing stray punctuation at the beginning and end of name strings. There were also minor tweaks required to `extract_genus` to ensure genera were split on "\" and that names were standardised to remove stray characters at the beginning of strings before genus names were extracted. As a final step, the expected changes to the tests for `standardise_names`, `strip_names`, `strip_names_extra`, and `extract_genus` were made. The outputs for a list of 42 unusual names are now all correct.

Closes #197
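As a rough illustration of the order of operations described above (clean the start of the string first, then extract the genus), here is a minimal Python sketch. It is not APCalign's R code: the helper names `strip_leading_stray` and `first_word_as_genus`, and the rule that anything before the first letter counts as "stray", are assumptions made for this example.

```python
import re

# Hypothetical helper names; APCalign's own functions are written in R.
# Matches any run of non-letter characters (punctuation, digits, spaces,
# underscores) at the very start of the string.
_LEADING_STRAY = re.compile(r"^[\W\d_]+")


def strip_leading_stray(name: str) -> str:
    """Drop stray characters that precede the first letter of a name."""
    return _LEADING_STRAY.sub("", name)


def first_word_as_genus(name: str) -> str:
    """Clean the start of the name, then take the first word as the genus."""
    words = strip_leading_stray(name).split()
    return words[0] if words else ""
```

Without the clean-up step, a name such as `"?Acacia aneura"` would yield `"?Acacia"` as its first word. Note that this naive rule would also drop a leading hybrid marker such as `×`, which a real implementation may want to keep.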

Updates to the family-level matching algorithms to allow:

- fuzzy matches to APC-accepted and APC-synonymous families
- updates from APC-synonymous family names to accepted APC family names

Co-authored-by: Will Cornwell <wcornwell@gmail.com>
Co-authored-by: Daniel Falster <daniel.falster@unsw.edu.au>

The `fuzzy_match` function previously did not work if `n_allowed` > 1 (the number of shortest-distance matches), even though `n_allowed` was included as an argument in the function. The actual APCalign functions still do not have `n_allowed` included as an argument (they use `n_allowed = 1`), but fixing `fuzzy_match` is the first step toward eventually implementing this. Also added simple tests to confirm that 1 vs 2 outputs are returned, as expected.

Co-authored-by: Fonti Kar <f.kar@unsw.edu.au>
Co-authored-by: Daniel Falster <daniel.falster@unsw.edu.au>
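To make the role of `n_allowed` concrete, the Python sketch below shows one plausible reading: candidates tied at the shortest distance are accepted only if there are no more than `n_allowed` of them. This reading, the function name `accept_shortest_matches`, the `max_distance` parameter and the precomputed-distance input are all assumptions for illustration, not a description of the package internals.

```python
from typing import Dict, List


def accept_shortest_matches(
    distances: Dict[str, int], max_distance: int, n_allowed: int = 1
) -> List[str]:
    """Return the candidates tied at the smallest distance, or nothing.

    `distances` maps each candidate name to its (precomputed) edit distance
    from the input string. The tied candidates are returned only if the
    smallest distance is within `max_distance` and the number of ties does
    not exceed `n_allowed`.
    """
    if not distances:
        return []
    best = min(distances.values())
    ties = [name for name, d in distances.items() if d == best]
    if best <= max_distance and len(ties) <= n_allowed:
        return ties
    return []
```

With two candidates tied at distance 1, `n_allowed=1` returns nothing because the match is ambiguous, while `n_allowed=2` returns both.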

Fixes a known issue when reading in identifiers from a column - if there were two rows with distinct identifiers but the same original_name, the code broke. Identifier has now been added to lines of code in `align_taxa.R` that were determining how many distinct rows to retain for matching. There will now, occasionally, be repeat original names run through the match algorithms, but this is necessary to attach the correct identifier to each instance of the original name. I've also added a new test. Closes issue #177
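The effect of including the identifier when deciding which rows are distinct can be sketched in Python as follows. This is only an illustration of the idea: the helper `distinct_rows` and the column values are hypothetical, and the package itself does this in R within `align_taxa.R`.

```python
from typing import Dict, Iterable, List, Sequence


def distinct_rows(
    rows: Iterable[Dict[str, str]], keys: Sequence[str]
) -> List[Dict[str, str]]:
    """Keep the first row for each distinct combination of the given keys."""
    seen = set()
    kept = []
    for row in rows:
        combo = tuple(row[key] for key in keys)
        if combo not in seen:
            seen.add(combo)
            kept.append(row)
    return kept


# Hypothetical input: the same original_name with two different identifiers.
rows = [
    {"original_name": "Acacia aneura", "identifier": "site_1"},
    {"original_name": "Acacia aneura", "identifier": "site_2"},
]

by_name_only = distinct_rows(rows, ["original_name"])
by_name_and_id = distinct_rows(rows, ["original_name", "identifier"])
```

Deduplicating on the name alone keeps one row and loses the second identifier; deduplicating on both columns keeps both rows, at the cost of running the same name through matching twice.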

Switching from `utils::adist` to `stringdist::stringdist` for matching. This is both much faster and allows us to use a more nuanced matching algorithm by implementing the Damerau–Levenshtein distance method and prioritising types of string changes (based on their algorithm).

I've run all 47,000 AusTraits names through this and 33 were different. They all seem to be names that were previously passed over during fuzzy matching (match 5's) and are now being caught. So there is some additional matching power, but nothing is being misaligned.

(An additional minor bug is also fixed: `distinct()` was being run on the entire row rather than on original_name, which led to the humorous output of more perfect matches than total taxa being checked.)
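The practical difference between a plain Levenshtein distance and a Damerau–Levenshtein-style distance is how a swap of two adjacent letters is counted. The Python sketch below illustrates this; it is not the stringdist implementation. For brevity it uses the restricted "optimal string alignment" variant, which agrees with the full Damerau–Levenshtein distance on simple cases like the one shown but can differ on some edge cases. The function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str, transpositions: bool = False) -> int:
    """Edit distance between a and b via dynamic programming.

    With transpositions=False this is the Levenshtein distance (insertions,
    deletions, substitutions). With transpositions=True a swap of two
    adjacent characters also costs 1 (optimal string alignment variant).
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if (
                transpositions
                and i > 1
                and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                # adjacent transposition
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

For a typo such as `"anuera"` for `"aneura"`, the plain distance is 2 (two substitutions) while the transposition-aware distance is 1, so a swapped pair of letters is more likely to fall within a fixed fuzzy-match threshold.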

Needed to add an `if`/`else` statement to `fuzzy_match.R` so that fuzzy matches are only searched for if the subset of the accepted list (with the same first letter) is non-empty. Previously, if no strings on the accepted list had the same first letter as the input text, warnings were generated. A test has been added to check this functionality.
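The pattern described here, subsetting candidates by first letter and then guarding against an empty subset, can be sketched in Python as below. This is an illustration under assumptions, not the code in `fuzzy_match.R`: the function names, the `distance` callable, the `max_distance` parameter and the case-sensitive first-letter comparison are all hypothetical choices.

```python
from typing import Callable, List, Optional


def same_first_letter(txt: str, accepted_list: List[str]) -> List[str]:
    """Candidates whose first character equals that of txt (case-sensitive)."""
    if not txt:
        return []
    return [name for name in accepted_list if name and name[0] == txt[0]]


def fuzzy_match_first_letter(
    txt: str,
    accepted_list: List[str],
    distance: Callable[[str, str], int],
    max_distance: int,
) -> Optional[str]:
    """Closest candidate sharing txt's first letter, or None."""
    subset = same_first_letter(txt, accepted_list)
    if not subset:
        # Nothing to compare against, so skip the distance step entirely.
        return None
    best = min(subset, key=lambda name: distance(txt, name))
    return best if distance(txt, best) <= max_distance else None
```

Any string distance function can be passed as `distance`. The early return matters because taking a minimum over an empty set is an error in Python (`min()` raises `ValueError`); in the R code described above the symptom was warnings.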

Add fuzzy match arguments to `create_taxonomic_update_lookup`

We'd omitted the fuzzy match arguments from `create_taxonomic_update_lookup`, which meant users who wanted to change the fuzzy match sliders would need to separately align and update taxonomy.

Closes issue #212

As part of #196, we found that `stringr::word` was quite slow, and so implemented a faster version. This PR makes the new word function a private function accessible via `APCalign:::word`, adds tests for the new function, and extends its use throughout.

Co-authored-by: ehwenk <ehwenk@gmail.com>

ehwenk

It is excellent to see how small refinements can keep improving the package.

I corrected a few typos in your message, but otherwise all good.

R/reexports.R

fontikar

Minor change for importing pipe but not crucial!

Co-authored-by: Elizabeth Wenk <ehwenk@gmail.com>
Co-authored-by: Fonti Kar <f.kar@unsw.edu.au>
Co-authored-by: Daniel Falster <daniel.falster@unsw.edu.au>
Co-authored-by: Will Cornwell <w.cornwell@unsw.edu.au>

fontikar and others added 30 commits April 18, 2024 10:36

Pushing develop branch

eeb33f5

Add testing for functions: standardise_names, strip_names, extract_genus

c68dc70

Rename testing files to alter order of tests (slow results last)

47729b9

Activate tests on develop branch

2a8fedb

Removing hard limit on current downloads and documentation wording improvements (#203)

e4da0ba

* removing hard cap on file size of current downloads. this is slower, but safer going forward
* better wording in documentation
* other place there was a hard cap

Message updates (#201)

c272c2d

Add message that indicates how many taxa have perfect matches to APC.

updating github actions

acc5eb0

* trying to update actions to best practices
* further updating
* more updates
* adding develop back
* changing release hash
* how do commit hashes work?
* another try
* giving up
* really giving up

Rename/delete test files

b4b13de

- fix spelling in name
- remove duplicate set of tests

Make messages to console optional

0cc6663

Updated pkgdown and fixed error #205 (#206)

6b5e46a

Added skip if github is down (#211)

cd4b3e5

Fuzzy-match: subset accepted list to same first letter (#210)

6f630f6

- amend fuzzy matching algorithm to only compare to subset of accepted_list with the same first letter
- greatly speeds up fuzzy matching

Co-authored-by: Daniel Falster <daniel.falster@unsw.edu.au>

removing unused function option and updating readme (#208)

e0e9086

* removing unused function option and updating readme
* more readme updates
* more work on readme

better description of imprecise_fuzzy_matches (#221)

0ffe4fb

* better description of `imprecise_fuzzy_matches`

Closes issue #155

cleaning up the namespace (#223)

2cd65dc

* cleaning up the namespace
* Remove importing of dplyr, stringr, remove tibble
* add explicit namespace to calls of relevant functions
* Add the pipe

Co-authored-by: Daniel Falster <daniel.falster@unsw.edu.au>

Remove dependency on forcats

b785d3e

Remove use of .data in tidyselect (as deprecated)

10e909b

Add line breaks (#224)

1bf0761

Greatly reduced the number of lines longer than 80 characters in all R files except match_taxa.R, which we will likely refactor, as it is the one file with lots of longer lines within the code itself.

Closes #188

ehwenk and others added 2 commits May 5, 2024 17:48

Update roxygen & websites (#225)

96b8267

* update roxygen documentation for all functions

Co-authored-by: Will Cornwell <w.cornwell@unsw.edu.au>
Co-authored-by: Daniel Falster <daniel.falster@unsw.edu.au>

Prepare for release: Bump version number, update news

41b4a78

[skip ci]

dfalster requested review from ehwenk and fontikar May 6, 2024 03:42

ehwenk previously approved these changes May 6, 2024

fontikar reviewed May 6, 2024

R/reexports.R (resolved)

fontikar previously approved these changes May 6, 2024

Fix typo

809b11c

dfalster dismissed stale reviews from fontikar and ehwenk via 809b11c May 6, 2024 06:41

dfalster merged commit 6bc7e1f into master May 6, 2024
9 checks passed
