Release version 1.0.0 #227
* Added redevp to gitignore
* Bumped version and refined graceful failing
* minor syntax fixes
* corrections to matches that can't match to genus (these were still assigning taxon_rank = genus)
* remove test checks for alignment codes (creating unnecessary errors)

---------

Co-authored-by: Fonti Kar <f.kar@unsw.edu.au>
Co-authored-by: Daniel Falster <daniel.falster@unsw.edu.au>
* This PR refactors a few functions to increase speed. The time to run `load_taxonomic_resources` has dropped from 15.0s to 2.2s (on Daniel's MacBook Pro M2)
* Faster version of `extract_genus` (#187)
* Faster version of `stringr::word`
* Function to standardise taxon rank
* Speed up `strip_name`
* update tests
* First commit updated DESCRIPTION and NEWS
* Updated installation instructions
* Added reproducibility article and exported default_version
* Added citation
* Added reproducibility article
* Update vignettes/articles/reproducibility.Rmd

---------

Co-authored-by: Daniel Falster <daniel.falster@unsw.edu.au>
* adding progress bar for loading
* trying to get caching/output option to work
* passing output through
* reviving caching
* fixing counting
* roxygen update
* adding quiet option
* checking cached file
* documenting caching functionality
* getting message working
* removing cutting edge arrow
* reverting change back to cran, too soon
* nope arrow github not working yet
Changes to `standardise_names` to handle corner cases that were previously being missed. This mainly focused on removing stray punctuation at the beginning and end of name strings.

There were also minor required tweaks to `extract_genus` to ensure genera were split on "\" and that names were standardised to remove stray characters at the beginning of strings before genus names were extracted.

As a final step, the expected changes were made to the tests for `standardise_names`, `strip_names`, `strip_names_extra`, and `extract_genus`. The outputs for a list of 42 unusual names are now all correct.

Closes #197
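As a rough illustration of the kind of clean-up described above, here is a minimal base R sketch. The helper `trim_stray_punctuation` is hypothetical and is not the package's `standardise_names`; it also assumes every leading or trailing punctuation character is unwanted, which is a simplification (a real name can legitimately end in a full stop, as in "sp.").

```r
# Hypothetical helper for illustration only; this is NOT APCalign's
# standardise_names(), which applies many more rules.
trim_stray_punctuation <- function(x) {
  # Remove any run of punctuation or whitespace at the start of the string.
  x <- gsub("^[[:punct:][:space:]]+", "", x)
  # Remove any run of punctuation or whitespace at the end of the string.
  # Simplifying assumption: trailing full stops are treated as stray too.
  x <- gsub("[[:punct:][:space:]]+$", "", x)
  x
}

messy_names <- c("?Acacia aneura", " -Banksia serrata;", "Eucalyptus regnans")
cleaned_names <- trim_stray_punctuation(messy_names)
print(cleaned_names)
```

Punctuation inside a name (for example a hyphen within an epithet) is left untouched, because both patterns are anchored to the ends of the string.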
…provements (#203)

* removing hard cap on file size of current downloads. this is slower, but safer going forward
* better wording in documentation
* other place there was a hard cap
Add message that indicates how many taxa have perfect matches to APC.
* trying to update actions to best practices
* further updating
* more updates
* adding develop back
* changing release hash
* how do commit hashes work?
* another try
* giving up
* really giving up
- fix spelling in name
- remove duplicate set of tests
Updates to the family-level matching algorithms to allow:

- fuzzy matches to APC-accepted and APC-synonymous families
- updates from APC-synonymous family names to accepted APC family names

---------

Co-authored-by: Will Cornwell <wcornwell@gmail.com>
Co-authored-by: Daniel Falster <daniel.falster@unsw.edu.au>
- amend fuzzy matching algorithm to only compare to the subset of accepted_list with the same first letter
- greatly speeds up fuzzy matching

---------

Co-authored-by: Daniel Falster <daniel.falster@unsw.edu.au>
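The first-letter restriction above can be sketched in a few lines of base R. This is an illustration only: `fuzzy_match_first_letter` and its `max_distance` argument are hypothetical names, not the package's `fuzzy_match` signature, and the sketch uses `utils::adist` simply to stay within base R.

```r
# Hypothetical sketch of first-letter pre-filtering; APCalign's own
# fuzzy_match() takes different arguments and applies further rules.
fuzzy_match_first_letter <- function(txt, accepted_list, max_distance = 2) {
  # Keep only reference names that share the input's first letter, so the
  # (comparatively expensive) distance calculation runs on a small subset.
  first_letter <- tolower(substr(txt, 1, 1))
  candidates <- accepted_list[tolower(substr(accepted_list, 1, 1)) == first_letter]

  # Nothing starts with this letter: report no match.
  if (length(candidates) == 0) {
    return(NA_character_)
  }

  # Edit distance from the input to each remaining candidate.
  distances <- as.vector(utils::adist(txt, candidates))
  best <- which.min(distances)

  # Reject the closest candidate if it is still too far away.
  if (distances[best] > max_distance) {
    return(NA_character_)
  }
  candidates[best]
}

accepted <- c("Acacia aneura", "Banksia serrata", "Eucalyptus regnans")
print(fuzzy_match_first_letter("Banksia serata", accepted))
print(fuzzy_match_first_letter("Zieria sp", accepted))
```

The trade-off is that a name whose very first letter is mistyped can no longer be matched; the speed-up comes from never computing distances to names that start with a different letter.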
The `fuzzy_match` function previously did not work when `n_allowed` > 1 (`n_allowed` is the number of shortest-distance matches), even though `n_allowed` was included as an argument in the function. The actual APCalign functions still do not include `n_allowed` as an argument (they use `n_allowed = 1`), but fixing `fuzzy_match` is the first step toward eventually implementing this. Also added simple tests to confirm that 1 vs 2 outputs are returned, as expected.

---------

Co-authored-by: Fonti Kar <f.kar@unsw.edu.au>
Co-authored-by: Daniel Falster <daniel.falster@unsw.edu.au>
Fixes a known issue when reading in identifiers from a column - if there were two rows with distinct identifiers but the same original_name, the code broke. Identifier has now been added to lines of code in `align_taxa.R` that were determining how many distinct rows to retain for matching. There will now, occasionally, be repeat original names run through the match algorithms, but this is necessary to attach the correct identifier to each instance of the original name. I've also added a new test. Closes issue #177
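To make the de-duplication issue concrete, the following base R sketch uses invented data to show how many rows survive when duplicates are dropped on `original_name` alone versus on the combination of `original_name` and `identifier`. It is not the code in `align_taxa.R`; the data frame and column values are made up for the example.

```r
# Invented example data: the same original name recorded under two identifiers.
taxa <- data.frame(
  original_name = c("Acacia aneura", "Acacia aneura", "Acacia aneura"),
  identifier = c("site_1", "site_2", "site_1"),
  stringsAsFactors = FALSE
)

# De-duplicating on the name alone collapses everything to one row,
# so the second identifier is lost.
by_name <- taxa[!duplicated(taxa$original_name), ]

# De-duplicating on name and identifier together keeps one row per
# distinct pair, so each identifier can be attached to its own result.
by_name_and_id <- taxa[!duplicated(taxa[, c("original_name", "identifier")]), ]

print(nrow(by_name))
print(nrow(by_name_and_id))
```

The cost, as noted above, is that the same name is occasionally run through the matching algorithms more than once.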
Switching from `utils::adist` to `stringdist::stringdist` for matching. This is both much faster and allows us to use a more nuanced matching algorithm, by implementing the Damerau–Levenshtein distance method and prioritising types of string changes (based on their algorithm).

I've run all 47,000 AusTraits names through this and 33 came out different. They all seem to be names that were previously passed over during fuzzy matching (match 5's) and are now being caught. So there is some additional matching power, but nothing is being misaligned.

(An additional minor typo is also fixed: `distinct()` was being run on the entire row rather than on original_name, which led to the humorous output of more perfect matches than total taxa being checked.)
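A small comparison may help show what the Damerau–Levenshtein method changes in practice. The sketch below assumes the `stringdist` package is installed and uses a made-up misspelling in which two adjacent letters are swapped; it is not taken from the package's tests.

```r
# Requires the stringdist package: install.packages("stringdist")
typo <- "Acacia anuera"     # invented misspelling: "u" and "e" are swapped
accepted <- "Acacia aneura"

# utils::adist computes a generalized Levenshtein distance, in which a swap
# of two adjacent characters has to be expressed as two single-character edits.
levenshtein_distance <- as.vector(utils::adist(typo, accepted))

# method = "dl" is the full Damerau-Levenshtein distance, which counts a
# transposition of adjacent characters as a single edit.
dl_distance <- stringdist::stringdist(typo, accepted, method = "dl")

print(levenshtein_distance)
print(dl_distance)
```

Because transposed letters are a common typing error, counting them as one edit rather than two lets such names fall inside a given distance threshold, which is consistent with the extra names reported as now being caught.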
Needed to add an `if`/`else` branch to `fuzzy_match.R` so that fuzzy matches are only searched for if the subset of the accepted list (names with the same first letter) is non-empty. Previously, if no strings on the accepted list shared a first letter with the input text, warnings were generated. Test added to check this functionality.
Add fuzzy match arguments to `create_taxonomic_update_lookup`

We'd omitted the fuzzy match arguments from `create_taxonomic_update_lookup`, which meant users who wanted to change the fuzzy match sliders would need to separately align and update taxonomy.

Closes issue #212
As part of #196, we found that `stringr::word` was quite slow, and so implemented a faster version. This PR:

- makes the new word function a private function accessible via `APCalign:::word`
- adds tests for the new function
- extends use of this new function throughout

Co-authored-by: ehwenk <ehwenk@gmail.com>
* removing unused function option and updating readme
* more readme updates
* more work on readme
* better description of `imprecise_fuzzy_matches` closes issue #155
* cleaning up the namespace
* Remove importing of dplyr, stringr, remove tibble
* add explicit namespace to calls of relevant functions
* Add the pipe

---------

Co-authored-by: Daniel Falster <daniel.falster@unsw.edu.au>
Have greatly reduced the number of lines > 80 characters in all R files except match_taxa.R, which we will likely refactor, as this is the one file with lots of longer lines within the code itself. Closes #188
* update roxygen documentation for all functions

---------

Co-authored-by: Will Cornwell <w.cornwell@unsw.edu.au>
Co-authored-by: Daniel Falster <daniel.falster@unsw.edu.au>
It is excellent to see how small refinements can keep improving the package.
I corrected a few typos in your message, but otherwise all good.
Minor change for importing pipe but not crucial!
First major release of APCalign. A preprint is available at https://www.biorxiv.org/content/10.1101/2024.02.02.578715v1. The article has been accepted for publication in Australian Journal of Botany.

Following review, a number of changes have been implemented. These have sped up and streamlined the package.

* Update function documentation
* Speed up `extract_genus`
* Write a replacement function for `stringr::word` that is much faster.
* Additional speed up and accuracy of `fuzzy_match` function by
  - Restricting reference list to names with the same first letter as input string.
  - Switching from `utils::adist` to `stringdist::stringdist(method = "dl")`
* Rework `standardise_names` to remove punctuation from the start of the string
* Rework `strip_names_extra` (previously `strip_names_2`) to just perform additional functions to `strip_names`, rather than repeating those performed by `strip_names`.
* Avoid importing entire packages by using package::function format throughout and removing functions from @import
* Add fuzzy match arguments to `create_taxonomic_update_lookup`
* Add 3 additional family-level APC matches to `match_taxa`.
* Refine tests
* Make messages to console optional
* Fix issue with fails when github is down (#205)

---------

Co-authored-by: Elizabeth Wenk <ehwenk@gmail.com>
Co-authored-by: Fonti Kar <f.kar@unsw.edu.au>
Co-authored-by: Daniel Falster <daniel.falster@unsw.edu.au>
Co-authored-by: Will Cornwell <w.cornwell@unsw.edu.au>