auk_split doesn't work #81

auman-chan · 2024-06-06T06:43:53Z

I ran into a problem: auk_split() doesn't work. The function only exports files without any rows. It only works if I split by species before filtering.
[image]

Is there any solution or suggestion? Thanks!

mstrimas · 2024-06-06T14:54:54Z

Can you provide the output of sessionInfo() and the code that generated the above output?

auman-chan · 2024-06-07T12:53:45Z

Here is the code I used. I got the EBD sample from the EBD download website.

library(auk)
library(dplyr)

# input file, filtered output file, and prefixes for the split files
# list.files("test")
ebd_file <- "test/ebd_US-AL-101_202204_202204_relApr-2022.txt"
ebd_out <- "test/output.txt"
prefix_spe <- "test/spe_"
prefix_spe2 <- "test/spe2_"
ebd_in <- auk_ebd(file = ebd_file)

# define the filters
data <- ebd_in %>%
  auk_complete() %>%
  auk_year(c(2012, 2022)) %>%
  auk_duration(duration = c(0, 300)) %>%
  auk_distance(distance = c(0, 5)) %>%
  auk_protocol(protocol = c("Stationary", "Area"))

# run the filters, then read the filtered file into R
df <- auk_filter(data, file = ebd_out, overwrite = TRUE) %>%
  read_ebd()

# first five species in the filtered data
splist <- unique(df$scientific_name)[1:5]
splist
[1] "Cardinalis cardinalis"    "Mimus polyglottos"        "Poecile carolinensis"    
[4] "Sitta pusilla"            "Thryothorus ludovicianus"

# split the filtered file into one file per species
spe_split <- auk_split(file = ebd_out,
                       species = splist,
                       prefix = prefix_spe,
                       overwrite = TRUE)

list.files("test")
 [1] "ebd_US-AL-101_202204_202204_relApr-2022.txt" "output.txt"                                 
 [3] "spe_Cardinalis_cardinalis.txt"               "spe_Mimus_polyglottos.txt"                  
 [5] "spe_Poecile_carolinensis.txt"                "spe_Sitta_pusilla.txt"                      
 [7] "spe_Thryothorus_ludovicianus.txt"            "spe2_Cardinalis_cardinalis.txt"             
 [9] "spe2_Mimus_polyglottos.txt"                  "spe2_Poecile_carolinensis.txt"              
[11] "spe2_Sitta_pusilla.txt"                      "spe2_Thryothorus_ludovicianus.txt"          
[13] "test.R"                                     
file.size("test/spe_Mimus_polyglottos.txt")
[1] 693

A split file of only 693 B means it contains nothing but the column names.
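File size is only an indirect signal, so a more direct check is to count lines. The following is a minimal sketch, assuming a split file is "empty" when it holds at most one line (the header). `is_header_only()` is a hypothetical helper, not part of auk, and the two temporary files stand in for real split output.

```r
# Hypothetical helper (not part of auk): TRUE if a text file has at most
# one line, i.e. a header and no data rows.
is_header_only <- function(path) {
  # Reading two lines is enough to tell header-only from header-plus-data.
  length(readLines(path, n = 2L, warn = FALSE)) <= 1L
}

# Two temporary stand-in files; the header and row are made up.
header <- paste(c("GLOBAL UNIQUE IDENTIFIER", "SCIENTIFIC NAME"), collapse = "\t")
empty_split <- tempfile(fileext = ".txt")
full_split <- tempfile(fileext = ".txt")
writeLines(header, empty_split)
writeLines(c(header, "URN:EXAMPLE\tMimus polyglottos"), full_split)

# TRUE for the header-only file, FALSE for the file with one data row.
header_only <- vapply(c(empty_split, full_split), is_header_only, logical(1))
unname(header_only)
```

The same helper could be applied with `vapply()` to the paths returned by `list.files("test", full.names = TRUE)` to flag every empty split file at once.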

Here is my session information:

R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=zh_CN.UTF-8       LC_NUMERIC=C               LC_TIME=zh_CN.UTF-8       
 [4] LC_COLLATE=zh_CN.UTF-8     LC_MONETARY=zh_CN.UTF-8    LC_MESSAGES=zh_CN.UTF-8   
 [7] LC_PAPER=zh_CN.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C       

time zone: Asia/Shanghai
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.4 auk_0.7.1  

loaded via a namespace (and not attached):
 [1] crayon_1.5.2      vctrs_0.6.5       cli_3.6.2         rlang_1.1.4       stringi_1.8.4    
 [6] generics_0.1.3    assertthat_0.2.1  glue_1.7.0        bit_4.0.5         hms_1.1.3        
[11] readxl_1.4.3      writexl_1.4.2     fansi_1.0.6       cellranger_1.1.0  tibble_3.2.1     
[16] tzdb_0.4.0        lifecycle_1.0.4   stringr_1.5.1     compiler_4.3.3    pkgconfig_2.0.3  
[21] rstudioapi_0.15.0 R6_2.5.1          readr_2.1.5       tidyselect_1.2.1  utf8_1.2.4       
[26] parallel_4.3.3    vroom_1.6.5       pillar_1.9.0      magrittr_2.0.3    withr_3.0.0      
[31] tools_4.3.3       bit64_4.0.5

auman-chan · 2024-06-07T14:37:48Z

As a workaround, I selected the species with read_delim_chunked(), then removed duplicate group checklists and rolled up the taxonomy with distinct() and filter(). I still hope this pipeline can be fixed.
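The chunked-reading workaround described above might look roughly like the following. This is only a sketch, not the exact code used in this thread: it assumes the raw EBD header names the species column `SCIENTIFIC NAME`, and it runs on a tiny made-up temporary file instead of a real EBD download. `target_species` and `keep_species` are illustrative names.

```r
library(readr)

# Tiny stand-in for an EBD text file; real files have many more columns.
ebd_like <- tempfile(fileext = ".txt")
writeLines(c(
  "SCIENTIFIC NAME\tOBSERVATION COUNT",
  "Mimus polyglottos\t2",
  "Sitta pusilla\t1",
  "Cyanocitta cristata\t3"
), ebd_like)

target_species <- c("Mimus polyglottos", "Sitta pusilla")

# Keep only rows for the target species in each chunk; DataFrameCallback
# combines the per-chunk results into a single data frame.
keep_species <- DataFrameCallback$new(function(chunk, pos) {
  chunk[chunk[["SCIENTIFIC NAME"]] %in% target_species, , drop = FALSE]
})

selected <- read_delim_chunked(
  ebd_like,
  callback = keep_species,
  delim = "\t",
  chunk_size = 2,  # deliberately small so more than one chunk is processed
  col_types = cols(.default = col_character())
)

nrow(selected)
```

Reading every column as character keeps the chunks type-consistent when they are combined; columns can be converted afterwards as needed.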

mstrimas · 2024-06-07T15:06:48Z

I ran your exact code on the sample EBD file and it appears to be working fine. I have a Mac, so I also tried running it in an Ubuntu Docker container to emulate your environment, and I'm not having any issues there either. For example, this is the file for Cardinal that I'm getting:
spe_Cardinalis_cardinalis.txt

It's hard to troubleshoot since I can't replicate the issue... You might try running in a Docker container as well to test.

auman-chan · 2024-06-09T11:11:37Z

OK, I will try it on Windows. For now I split the species with read_delim_chunked() and write_delim(), and then import the resulting files with read_ebd().

auman-chan · 2024-06-09T11:45:19Z

Additionally, I have another question. read_ebd() and auk_unique() seem to keep only distinct observations by the value of group_identifier, even when that value is missing. It looks as if all rows with NA in group_identifier are treated as duplicates, so read_ebd() keeps only one of them.

However, I don't think observations with a missing group_identifier are duplicates, since they are usually from different locations and recorded at different times.

Is there a reason for this behavior?
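The concern above can be illustrated in base R without auk. This sketch makes no claim about how auk_unique() is implemented; it only shows how a naive de-duplication on group_identifier collapses NA rows, and one NA-aware alternative. The data frame is made up for illustration.

```r
# Toy checklist table with made-up values.
obs <- data.frame(
  checklist_id = c("S1", "S2", "S3", "S4", "S5"),
  group_identifier = c("G1", "G1", NA, NA, "G2"),
  stringsAsFactors = FALSE
)

# Naive de-duplication: duplicated() treats NA as equal to NA, so only the
# first checklist with a missing group_identifier survives.
naive <- obs[!duplicated(obs$group_identifier), ]

# NA-aware de-duplication: a row is dropped only when it has a non-missing
# group_identifier that has already been seen.
is_dup <- !is.na(obs$group_identifier) & duplicated(obs$group_identifier)
na_aware <- obs[!is_dup, ]

naive$checklist_id     # "S1" "S3" "S5"
na_aware$checklist_id  # "S1" "S3" "S4" "S5"
```

Comparing row counts for NA group_identifier before and after de-duplication on real data would show which of the two behaviors is actually occurring.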

mstrimas · 2024-06-11T14:53:04Z

I’m on vacation until June 25 so won’t be able to look into this in detail until then. However, auk_unique() shouldn’t impact rows that have NA for group_identifier. If that is happening, I’ll fix it when I return.

auman-chan · 2024-07-04T02:36:32Z

It seems the bug is related to the path of the EBD file.

When I read the file from another disk, with a path like "/media/username/disk_name/ebird_data/ebd.txt", the split function works. But when the file is on the system disk and referenced as "/home/username/R//imp/ebd.txt", the function returns empty files.

mstrimas · 2024-07-08T17:22:39Z

That's strange. I'm not sure why that would be happening, and I'm not able to reproduce the issue. If you figure out how to fix it, let me know and I can make the change.

auman-chan · 2024-07-10T09:44:30Z

I know what's wrong. My folder name included a whitespace character. R handles such a path (file.exists() finds the file), but awk does not. It would be better to add a check for this situation, since the error from the shell isn't passed on to the R console.

Separately, I will look further into the auk_unique() question.
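A check of the kind suggested above could be as simple as the following base R sketch. `check_no_whitespace()` is a hypothetical helper, not an existing auk function, and both paths are invented for illustration.

```r
# Hypothetical guard (not part of auk): fail early, inside R, when a path
# contains whitespace that a shell command might split on.
check_no_whitespace <- function(path) {
  if (grepl("[[:space:]]", path)) {
    stop("Path contains whitespace: ", path, call. = FALSE)
  }
  invisible(path)
}

# Illustrative paths only.
ok_path <- "/home/username/data/ebd.txt"
bad_path <- "/home/username/my data/ebd.txt"

check_no_whitespace(ok_path)  # passes silently

# Capture the error message instead of halting the script.
result <- tryCatch(
  check_no_whitespace(bad_path),
  error = function(e) conditionMessage(e)
)
result
```

Quoting the path with base R's `shQuote()` before it reaches the shell is another possible approach; which one fits auk best is a decision for the maintainer.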
