select_taxa returns incorrect IDs for more specific search terms #96
Comments
Hi! It sounds like the issue here is that
Looking at these in detail, Diuris pedunculata gives a 'misappliedName' error and defaults to the genus Diuris instead. This might explain some of the issues you describe with unnamed species or subspecies that you don't want; anything in that genus is being returned by your search, regardless of whether it is assigned to a species or subspecies. Otherwise, the search has returned what the ALA believes is the accepted name for each taxon. As another example, Eucalyptus cannonii returns Eucalyptus macrorhyncha subsp. cannonii (and therefore the species name Eucalyptus macrorhyncha); so the results are different from your input, but are not 'wrong' according to our taxonomic information. It is possible that our taxonomic information is incorrect or out of date; but that isn't a problem with galah per se. This is important because |
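One way to see at a glance which search terms were resolved to a different accepted name is to compare each term with the name that came back. This is only a minimal sketch in base R: the data frame is typed in by hand from the examples above rather than returned by a live select_taxa() call, and the column names search_term and scientific_name, as well as the unchanged Diuris venosa row, are assumptions for illustration.

```r
# Hand-typed illustration, NOT live select_taxa() output.
# Column names are assumed for this sketch.
matches <- data.frame(
  search_term     = c("Diuris pedunculata",
                      "Eucalyptus cannonii",
                      "Diuris venosa"),
  scientific_name = c("Diuris",
                      "Eucalyptus macrorhyncha subsp. cannonii",
                      "Diuris venosa"),
  stringsAsFactors = FALSE
)

# Flag rows where the returned name differs from what was searched for
matches$name_changed <- matches$search_term != matches$scientific_name

# Keep only the changed rows for manual review
changed <- matches[matches$name_changed, c("search_term", "scientific_name")]
print(changed)
```

Rows flagged this way are not necessarily errors; they are the ones worth checking against the taxonomy before downloading records.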
Hi Martin, Thanks for having a look at this. That makes sense and should be fine for the species that select_taxa is correctly identifying, even if they include more than just the subspecies, for instance. However, does that mean I would have to manually download the data for any species incorrectly recognised by select_taxa? And for the species which are not formally described, which seem to have the correct taxon concept ID (usually in the format ALA_Typhonium_sp_aff_brownii rather than https://id.biodiversity.org.au/node/apni/2918275) but return records with NAs in place of a species name, is there any way to get the original name to carry through the process so I can assign the records to the correct species? Or would this have to be done manually as well? Thanks for your time and help with this, and apologies if these questions go beyond the normal bounds of a GitHub issue. Cheers Tom
Hi Tom, Unfortunately, yes, I think you might need to download the data for all species recognised by However, I think there might be a misunderstanding with how First I'll create a
# packages
library(tidyverse)
library(galah)
# For reproducibility, I only used the species from target_species.csv within genus Diuris
target_species <- tibble(ScientificName = c("Diuris aequalis", "Diuris arenaria",
                                            "Diuris bracteata", "Diuris byronensis",
                                            "Diuris disposita", "Diuris eborensis",
                                            "Diuris flavescens", "Diuris pedunculata",
                                            "Diuris praecox",
                                            "Diuris sp. (Oaklands, D.L. Jones 5380)",
                                            "Diuris venosa"))
# Use select_taxa() to search for species on ALA
taxa <- select_taxa(target_species$ScientificName)
Now if I check to see whether all species names match between target_species and the results returned from
# Are the species in target_species the same as in taxa?
# How many species names from taxa match with target_species?
missing_taxa_in_taxa <- taxa %>%
  as_tibble() %>%
  filter(!species %in% target_species$ScientificName)
missing_taxa_in_taxa %>% count() # 2 missing
#> # A tibble: 1 x 1
#> n
#> <int>
#> 1 2
missing_taxa_in_taxa %>% select(species)
#> # A tibble: 2 x 1
#> species
#> <chr>
#> 1 <NA>
#> 2 <NA>
But when I look at the In other words, just because there is no species name in the species column doesn't mean the "wrong" result is being returned. The
missing_taxa_in_taxa %>% select(scientific_name, species)
#> # A tibble: 2 x 2
#> scientific_name species
#> <chr> <chr>
#> 1 Diuris <NA>
#> 2 Diuris sp. (Oaklands, D.L. Jones 5380) <NA>
Now, I can download the Diuris records using
# Get records
ala_recs_diuris <- ala_occurrences(taxa = taxa)
ala_recs_diuris %>% distinct(scientificName) %>% count() # 149 returned
A quick solution to only include the records you originally searched for is to filter
# filter
ala_recs_filtered <- ala_recs_diuris %>%
  filter(scientificName %in% target_species$ScientificName)
You might need to double check that you are getting everything you want, though, as this method might be prone to mismatch - you might miss some species if the scientific name in the ALA doesn't match the one you provided. Alternatively, if you are happy with the
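To make that double check concrete, the two sets of names can be compared in both directions. The following is a rough sketch in base R with hand-typed vectors standing in for target_species$ScientificName and the distinct scientificName values in the downloaded records; the vectors are illustrative only and are not real query output.

```r
# Stand-in for target_species$ScientificName (subset, typed by hand)
target_names <- c("Diuris aequalis",
                  "Diuris pedunculata",
                  "Diuris sp. (Oaklands, D.L. Jones 5380)")

# Stand-in for the distinct scientificName values in the occurrence records
# (illustrative values, not a real download)
returned_names <- c("Diuris aequalis",
                    "Diuris",
                    "Diuris sp. (Oaklands, D.L. Jones 5380)")

# Names you asked for that would be dropped by an exact-match filter
not_returned <- setdiff(target_names, returned_names)

# Names in the records that you never asked for
extra <- setdiff(returned_names, target_names)

print(not_returned)
print(extra)
```

Anything in not_returned would be silently lost by the exact-match filter, so those names are the ones to look up by hand.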
Hi Dax, Thanks for that run through, it certainly makes more sense how the select_taxa function is working now. This doesn't quite address the problem I was referring to in my last response, though I might not have understood where the issue was well enough to explain myself. From what you've said here, the select_taxa function is working as it should and the species column, for whatever reason, only includes formally described species. However, I think when it passes the list of matched species to the ala_occurrences function something is going wrong, and these unnamed species (Bertya sp. (Chambigne NR, M. Fatemi 24), Eucalyptus sp. cattai etc.) are getting lost in the process even though exact matches are being found and they have some value in the taxon_concept_id column. I wanted to make sure I wasn't just missing something, so I ran that missing species code chunk on the full target_species list provided above and modified it so I'd only get the species that weren't in my original list rather than the species that had been added in error, like all the extra Diuris. For reference:
target_taxa_in_taxa <- ala_recs_names %>%
missing_taxa_in_taxa <- target_species %>%
There are 50 taxa on the original list that aren't in the ala_occurrences output. Some are instances where they are there but under another name that ALA's taxonomy has corrected (e.g. Commersonia procumbens -> Androcalva procumbens). Some are subspecies where ALA has decided to ignore the subspecies (e.g. Boronia inflexa subsp. torringtonensis -> Boronia inflexa). But most are these undescribed species that end up as NA in the species column of the select_taxa output. To check your filter by taxon_concept_id workaround, I looked through all the possible fields that could be included in the output and included every one that sounds relevant to species identification and that you could use in combination with the select_taxa output data frame to match records to species.
ala_recs <- ala_occurrences(taxa = select_taxa(target_species$
If you filter any of these columns by the original search terms from the target_species list, or by the matched taxon_concept_id from the select_taxa output, no records turn up for most of the 50 missing species that are unnamed. I think this is because when the select_taxa output is passed to ala_occurrences, whichever column is sent off to ALA to compile the records has the wrong data in it. I think this must be the taxon_concept_id column and/or the species column, because it won't recognise the taxon_concept_id for some of these taxa when it's in the format ALA_Bertya_sp_Chambigne_NR_M_Fatemi_24, but it will for others, e.g. Eucalyptus sp. Cattai. Apparently this is because their respective ALA pages do or don't have these identifiers linked (e.g. https://bie.ala.org.au/species/https://id.biodiversity.org.au/node/apni/2892151#names). The subspecies I don't understand, because their taxon_concept_id is correct, but I guess it might relate to the species column excluding the subspecies. I understand this probably falls outside the scope of the actual workings of the galah package and that there's ongoing work to improve and streamline ALA. But from a user standpoint it's pretty opaque and unintuitive when the same search terms return the correct results online. I only happened to catch this issue by chance, and then it has taken me getting your help and spending a few hours over a couple of days to understand the depth of it, and I still have to go and figure out a workaround that can fit into a repeatable workflow. I appreciate your time and help in working through this, though, and in general think the package is a massive help in using ALA data.
This was an exceptional explanation, Tom. As a result, I was able to get more of an idea about the source of the error and I appreciate you going to the effort of checking this in more detail (and for saying nice things about our package). First, I'll reproduce the error you're referring to more clearly for documentation. The error occurs when
library(galah)
# galah_config(email = "your-email@email.com")
# search for species using select_taxa
taxa <- select_taxa("Bertya sp. (Chambigne NR, M. Fatemi 24)")
# Get taxon_concept_id
taxa$taxon_concept_id
#> [1] "ALA_Bertya_sp_Chambigne_NR_M_Fatemi_24"
# Pass this id into ala_occurrences
occs <- ala_occurrences(taxa = select_taxa(taxa$taxon_concept_id, is_id = TRUE))
#> Error in check_count(count): This query does not match any records.
I've been looking more closely at where the errors are happening in For example, I accidentally noticed that searching the same search terms but without parentheses made
# with parentheses
taxa <- select_taxa(c("Bertya sp. (Chambigne NR, M. Fatemi 24)",
                      "Bertya sp. (Clouds Creek, M. Fatemi 4)",
                      "Diuris sp. (Oaklands, D.L. Jones 5380)"))
taxa$taxon_concept_id
#> [1] "ALA_Bertya_sp_Chambigne_NR_M_Fatemi_24"
#> [2] "ALA_Bertya_sp_Clouds_Creek_M_Fatemi_4"
#> [3] "ALA_Diuris_sp_Oaklands_D_L_Jones_5380"
# without parentheses
taxa <- select_taxa(c("Bertya sp. Chambigne NR, M. Fatemi 24",
                      "Bertya sp. Clouds Creek, M. Fatemi 4",
                      "Diuris sp. Oaklands, D.L. Jones 5380"))
taxa$taxon_concept_id
#> [1] "https://id.biodiversity.org.au/node/apni/2892151"
#> [2] "https://id.biodiversity.org.au/node/apni/2907136"
#> [3] "https://id.biodiversity.org.au/taxon/apni/51290527"
# Pass this id into ala_occurrences
occs <- ala_occurrences(taxa = select_taxa(taxa$taxon_concept_id, is_id = TRUE))
#> This query will return 49062 records
To try this with all the species in
library(tidyverse)
target_species_edited <- target_species %>%
  mutate(
    ScientificName = str_remove_all(ScientificName, "[()]")
  )
However, using the filtering method that I gave previously using Other missing results from We'll need to run more tests to find all of the sources of these
Thanks for that breakdown and all your time and help on this. I think between what you've explained here and a little manual wrangling I should be able to come up with a workaround now.
Great news! For easier reference, here is a summary of the identified issue outside of our discussion:
# search terms with parentheses
taxa <- select_taxa(c("Bertya sp. (Chambigne NR, M. Fatemi 24)",
                      "Bertya sp. (Clouds Creek, M. Fatemi 4)",
                      "Diuris sp. (Oaklands, D.L. Jones 5380)"))
taxa$taxon_concept_id
#> [1] "ALA_Bertya_sp_Chambigne_NR_M_Fatemi_24"
#> [2] "ALA_Bertya_sp_Clouds_Creek_M_Fatemi_4"
#> [3] "ALA_Diuris_sp_Oaklands_D_L_Jones_5380"
# Pass IDs to ala_occurrences
occs <- ala_occurrences(taxa = select_taxa(taxa$taxon_concept_id, is_id = TRUE))
#> Error in check_count(count): This query does not match any records.
# search terms without parentheses
taxa <- select_taxa(c("Bertya sp. Chambigne NR, M. Fatemi 24",
                      "Bertya sp. Clouds Creek, M. Fatemi 4",
                      "Diuris sp. Oaklands, D.L. Jones 5380"))
taxa$taxon_concept_id
#> [1] "https://id.biodiversity.org.au/node/apni/2892151"
#> [2] "https://id.biodiversity.org.au/node/apni/2907136"
#> [3] "https://id.biodiversity.org.au/taxon/apni/51290527"
# Pass IDs to ala_occurrences
occs <- ala_occurrences(taxa = select_taxa(taxa$taxon_concept_id, is_id = TRUE))
#> This query will return 49062 records
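Until this is resolved, it may help to flag which returned IDs are in the ALA_ format before passing them to ala_occurrences, since those are the ones that failed in the example above. This is a small sketch in base R on hand-copied IDs from this thread; the variable names are made up for illustration, and an ALA_ prefix is only a hint that an ID is worth checking (some IDs in that format reportedly do return records).

```r
# IDs copied by hand from the output above (not a live query)
ids <- c("ALA_Bertya_sp_Chambigne_NR_M_Fatemi_24",
         "https://id.biodiversity.org.au/node/apni/2892151",
         "ALA_Diuris_sp_Oaklands_D_L_Jones_5380",
         "https://id.biodiversity.org.au/taxon/apni/51290527")

# TRUE where the ID uses the ALA_ format rather than a URL
is_ala_format <- startsWith(ids, "ALA_")

# IDs to check manually before relying on the occurrence download
ids_to_check <- ids[is_ala_format]
print(ids_to_check)
```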
Describe the bug
When I run the ala_occurrences and select_taxa functions with a list of 360 target species, they return a collection of occurrences with issues including:
I think what may be happening is that whatever values select_taxa is providing to the ala_occurrences function are slightly wrong or possibly generalised. If I run select_taxa over the list on its own, every species has an exactly matched ScientificName, but the species column excludes subspecies and species not formally described.
galah version
1.31
To Reproduce
Steps to reproduce the behaviour:
Read in target_species.csv (attached)
target_species.csv
Run the following code chunk using galah package
ala_recs <- ala_occurrences(taxa = select_taxa(target_species$ScientificName))
Expected behaviour
What I would expect to happen is that the ala_occurrences function returns all available records of the species in the target_species data frame and only these species.
Additional context
All species in this list are taxonomically valid (currently), even though some are not formally described, and all can be, and have been, searched for manually in ALA online as they are spelt in this list and return results correctly. So the records are there; they just don't seem to be returned to me for some species, while in other cases additional species not on the list are returned.