auk_split doesn't work #81

Open
auman-chan opened this issue Jun 6, 2024 · 10 comments

Comments

@auman-chan

I ran into a problem where auk_split() doesn't work: it only exports files without any rows. It only works if I split by species before filtering.

Is there any solution or suggestion? Thanks!

image

@mstrimas
Contributor

mstrimas commented Jun 6, 2024

Can you provide the output of sessionInfo() and the code that generated the above output?

@auman-chan
Author

auman-chan commented Jun 7, 2024

Here is the code I used. I got the EBD sample from the EBD download website.

library(auk)
library(dplyr)

# list.files("test")
ebd_file <- "test/ebd_US-AL-101_202204_202204_relApr-2022.txt"
ebd_out <- "test/output.txt"
prefix_spe <- "test/spe_"
prefix_spe2 <- "test/spe2_"
ebd_in <- auk_ebd(file = ebd_file)

data <- ebd_in %>%
  auk_complete() %>%
  auk_year(c(2012, 2022)) %>%
  auk_duration(duration = c(0, 300)) %>%
  auk_distance(distance = c(0, 5)) %>%
  auk_protocol(protocol = c("Stationary", "Area"))

df <- auk_filter(data, file = ebd_out, overwrite = TRUE) %>%
  read_ebd()

splist <- unique(df$scientific_name)[1:5]
 splist
[1] "Cardinalis cardinalis"    "Mimus polyglottos"        "Poecile carolinensis"    
[4] "Sitta pusilla"            "Thryothorus ludovicianus"

spe_split <- auk_split(file = ebd_out,
                       species = splist,
                       prefix = prefix_spe,
                       overwrite = TRUE)

 list.files("test")
 [1] "ebd_US-AL-101_202204_202204_relApr-2022.txt" "output.txt"                                 
 [3] "spe_Cardinalis_cardinalis.txt"               "spe_Mimus_polyglottos.txt"                  
 [5] "spe_Poecile_carolinensis.txt"                "spe_Sitta_pusilla.txt"                      
 [7] "spe_Thryothorus_ludovicianus.txt"            "spe2_Cardinalis_cardinalis.txt"             
 [9] "spe2_Mimus_polyglottos.txt"                  "spe2_Poecile_carolinensis.txt"              
[11] "spe2_Sitta_pusilla.txt"                      "spe2_Thryothorus_ludovicianus.txt"          
[13] "test.R"                                     
 file.size("test/spe_Mimus_polyglottos.txt")
[1] 693

At 693 bytes, the split file contains only the column names.

image

Here is my session info:

R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=zh_CN.UTF-8       LC_NUMERIC=C               LC_TIME=zh_CN.UTF-8       
 [4] LC_COLLATE=zh_CN.UTF-8     LC_MONETARY=zh_CN.UTF-8    LC_MESSAGES=zh_CN.UTF-8   
 [7] LC_PAPER=zh_CN.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C       

time zone: Asia/Shanghai
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.4 auk_0.7.1  

loaded via a namespace (and not attached):
 [1] crayon_1.5.2      vctrs_0.6.5       cli_3.6.2         rlang_1.1.4       stringi_1.8.4    
 [6] generics_0.1.3    assertthat_0.2.1  glue_1.7.0        bit_4.0.5         hms_1.1.3        
[11] readxl_1.4.3      writexl_1.4.2     fansi_1.0.6       cellranger_1.1.0  tibble_3.2.1     
[16] tzdb_0.4.0        lifecycle_1.0.4   stringr_1.5.1     compiler_4.3.3    pkgconfig_2.0.3  
[21] rstudioapi_0.15.0 R6_2.5.1          readr_2.1.5       tidyselect_1.2.1  utf8_1.2.4       
[26] parallel_4.3.3    vroom_1.6.5       pillar_1.9.0      magrittr_2.0.3    withr_3.0.0      
[31] tools_4.3.3       bit64_4.0.5   

@auman-chan
Author

As a workaround, I selected the species with read_delim_chunked(), then removed duplicate group checklists and rolled up the taxonomy with distinct() and filter(). I still hope this pipeline can be fixed.

@mstrimas
Contributor

mstrimas commented Jun 7, 2024

I ran your exact code on the sample EBD file and it appears to be working fine. I have a Mac, so I also tried running it in an Ubuntu Docker container to emulate your environment, and I'm not having any issues there either. For example, this is the file for Cardinal that I'm getting:
spe_Cardinalis_cardinalis.txt

It's hard to troubleshoot since I can't replicate the issue... You might try running in a Docker container as well to test.

@auman-chan
Author

OK, I will try on Windows. For now I split the species with read_delim_chunked() and write_delim(), and then import the resulting files with read_ebd().
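
The chunked workaround described in this comment could look roughly like the sketch below. It is only an illustration, not the exact code used in this thread: the tiny temporary file stands in for a real EBD extract, the column name "SCIENTIFIC NAME" is assumed to match the raw EBD header, and the chunk size and output file naming are arbitrary choices.

```r
library(readr)

# Tiny stand-in for an EBD text file (tab-separated); a real file has many
# more columns. The header name "SCIENTIFIC NAME" is an assumption.
ebd_file <- tempfile(fileext = ".txt")
writeLines(c(
  "SCIENTIFIC NAME\tOBSERVATION COUNT",
  "Cardinalis cardinalis\t2",
  "Mimus polyglottos\t1",
  "Cardinalis cardinalis\t3",
  "Sitta pusilla\t1"
), ebd_file)

splist <- c("Cardinalis cardinalis", "Mimus polyglottos")

# Keep only rows for the target species in each chunk; DataFrameCallback
# row-binds the per-chunk results into a single data frame.
keep_species <- DataFrameCallback$new(function(chunk, pos) {
  chunk[chunk[["SCIENTIFIC NAME"]] %in% splist, ]
})

subset_df <- read_delim_chunked(
  ebd_file, keep_species,
  delim = "\t", quote = "", chunk_size = 2,
  col_types = cols(.default = col_character())
)

# Write one tab-separated file per species (hypothetical naming scheme).
out_files <- character()
for (sp in splist) {
  out <- file.path(tempdir(), paste0("spe_", gsub(" ", "_", sp), ".txt"))
  write_delim(subset_df[subset_df[["SCIENTIFIC NAME"]] == sp, ], out, delim = "\t")
  out_files[sp] <- out
}
```

Because this reads the file directly in R, it does not depend on awk being able to resolve the file path.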

@auman-chan
Author

Additionally, I have another question. read_ebd() and auk_unique() keep only distinct observations based on the value of group_identifier, even when that value is missing. It seems that all rows with NA in group_identifier are treated as duplicates of each other, and read_ebd() keeps only one of them.

However, I don't think observations with a missing group_identifier are duplicates, as they usually come from different locations and were recorded at different times.

Is there a reason for this behavior?
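
To make the concern concrete, here is a minimal base R sketch on made-up data. It is not auk's implementation; it only shows how a naive de-duplication on group_identifier would collapse every ungrouped checklist into one row, and how an NA-aware version avoids that. The column checklist_id and all values are invented for illustration.

```r
# Five made-up checklists: S1 and S2 share a group, S3 to S5 are ungrouped.
obs <- data.frame(
  checklist_id = c("S1", "S2", "S3", "S4", "S5"),
  group_identifier = c("G1", "G1", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Naive approach: duplicated() treats all NA values as equal to each other,
# so only the first ungrouped checklist survives.
naive <- obs[!duplicated(obs$group_identifier), ]

# NA-aware approach: drop a row only when it has a non-missing
# group_identifier that has already been seen.
is_dup <- !is.na(obs$group_identifier) & duplicated(obs$group_identifier)
na_aware <- obs[!is_dup, ]

nrow(naive)     # 2 rows: S1 and S3
nrow(na_aware)  # 4 rows: S1, S3, S4, S5
```

Comparing row counts like this before and after de-duplication is one way to confirm whether rows with a missing group_identifier are being dropped.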

@mstrimas
Contributor

I’m on vacation until June 25 so won’t be able to look into this in detail until then. However, auk_unique() shouldn’t impact rows that have NA for group_identifier. If that is happening, I’ll fix it when I return.

@auman-chan
Author

It seems the bug is related to the path of the EBD file.

When I read the file from another disk, with a path like "/media/username/disk_name/ebird_data/ebd.txt", the split function works. But when the file is on the system disk and referenced as "/home/username/R//imp/ebd.txt", the function returns empty files.

@mstrimas
Contributor

mstrimas commented Jul 8, 2024

That's strange. I'm not sure why that would be happening and I'm not able to reproduce the issue. If you figure out how to fix it, let me know and I can make the change.

@auman-chan
Author

auman-chan commented Jul 10, 2024

I found the problem. My folder name included whitespace. R could resolve the path (file.exists() returned TRUE), but awk could not. It would be good for the package to check for this situation, since the error from the shell was not passed to the R console.

Separately, I will look further into the auk_unique() question.
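
A check of the kind suggested here might look like the following base R sketch. The helper check_path_whitespace() is hypothetical and not part of auk; how the package actually builds its awk command is not shown in this thread, so the shQuote() step only illustrates the general idea of keeping a path with spaces together as a single shell argument.

```r
# Hypothetical helper (not part of auk): fail early if a path has whitespace.
check_path_whitespace <- function(path) {
  has_space <- grepl("[[:space:]]", path)
  if (any(has_space)) {
    stop("Path contains whitespace: ",
         paste(shQuote(path[has_space]), collapse = ", "),
         call. = FALSE)
  }
  invisible(path)
}

# A folder name with a space is perfectly valid as far as R is concerned.
dir_with_space <- file.path(tempdir(), "my data")
dir.create(dir_with_space, showWarnings = FALSE)
ebd_path <- file.path(dir_with_space, "ebd.txt")
writeLines("header", ebd_path)
file.exists(ebd_path)

# Pasted unquoted into a shell command, the same path would be split into
# two arguments. shQuote() wraps it so it survives as one argument.
quoted <- shQuote(ebd_path)

# The helper turns the silent failure into an explicit R error message.
res <- tryCatch(check_path_whitespace(ebd_path),
                error = function(e) conditionMessage(e))
```

Either approach would address the symptom reported here: quoting lets the command succeed, while the explicit check at least surfaces the problem in the R console.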
