Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache$import() corrupts fst and data.frame targets #1120

Closed
3 tasks done
brendanf opened this issue Dec 19, 2019 · 5 comments
Closed
3 tasks done

cache$import() corrupts fst and data.frame targets #1120

brendanf opened this issue Dec 19, 2019 · 5 comments
Assignees

Comments

@brendanf
Copy link
Contributor

Prework

using 36861bf

Description

When importing a target that was stored using format = "fst" or format = "diskframe", the stored object in the original cache is corrupted. For fst, the copy in the new cache is ok. For diskframe, both versions are corrupted.

Reproducible example

library(drake)
suppressPackageStartupMessages(library(disk.frame))

cache1 <- new_cache(path = tempfile())
cache2 <- new_cache(path = tempfile())
plan <- drake_plan(
   data = data.frame(
      x = runif(1000),
      y = runif(1000)
   ),
   data_fst = target(
      data,
      format = "fst"
   ),
   data_disk = target(
      data,
      format = "diskframe"
   )
)

make(plan, cache = cache1)
#> target data
#> target data_disk
#> Warning: You selected disk.frame format for target data_disk, so drake
#> will try to convert it from class "data.frame" to a disk.frame object.
#> For optimal performance please create disk.frame objects yourself using an
#> outdir on the same drive as drake's cache (say, with as.disk.frame(outdir =
#> drake_tempfile())).
#> target data_fst
#> Warning: You selected fst format for target data_fst, so drake will convert
#> it from class c("data.table", "data.frame") to a plain data frame.
head(readd(data, cache = cache1))
#>            x          y
#> 1 0.06426745 0.46801031
#> 2 0.09030193 0.08022266
#> 3 0.50495667 0.43436283
#> 4 0.50929448 0.68074308
#> 5 0.99376134 0.14853709
#> 6 0.05075097 0.25130674
head(readd(data_fst, cache = cache1))
#>            x          y
#> 1 0.06426745 0.46801031
#> 2 0.09030193 0.08022266
#> 3 0.50495667 0.43436283
#> 4 0.50929448 0.68074308
#> 5 0.99376134 0.14853709
#> 6 0.05075097 0.25130674
head(readd(data_disk, cache = cache1))
#>             x          y
#> 1: 0.06426745 0.46801031
#> 2: 0.09030193 0.08022266
#> 3: 0.50495667 0.43436283
#> 4: 0.50929448 0.68074308
#> 5: 0.99376134 0.14853709
#> 6: 0.05075097 0.25130674

cache2$import(data, from = cache1)
head(readd(data, cache = cache1))
#>            x          y
#> 1 0.06426745 0.46801031
#> 2 0.09030193 0.08022266
#> 3 0.50495667 0.43436283
#> 4 0.50929448 0.68074308
#> 5 0.99376134 0.14853709
#> 6 0.05075097 0.25130674
head(readd(data, cache = cache2))
#>            x          y
#> 1 0.06426745 0.46801031
#> 2 0.09030193 0.08022266
#> 3 0.50495667 0.43436283
#> 4 0.50929448 0.68074308
#> 5 0.99376134 0.14853709
#> 6 0.05075097 0.25130674

cache2$import(data_fst, from = cache1)
head(readd(data_fst, cache = cache1))
#> Error in fst::read_fst(.self$file_return_key(key)): Error opening fst file for reading, please check access rights and file availability
head(readd(data_fst, cache = cache2))
#>            x          y
#> 1 0.06426745 0.46801031
#> 2 0.09030193 0.08022266
#> 3 0.50495667 0.43436283
#> 4 0.50929448 0.68074308
#> 5 0.99376134 0.14853709
#> 6 0.05075097 0.25130674

cache2$import(data_disk, from = cache1)
head(readd(data_disk, cache = cache1))
#> Error in fst::read_fst(path2, from = 1, to = n, as.data.table = TRUE): Error opening fst file for reading, please check access rights and file availability
head(readd(data_disk, cache = cache2))
#> Error in fst::read_fst(path2, from = 1, to = n, as.data.table = TRUE): Error opening fst file for reading, please check access rights and file availability

Created on 2019-12-19 by the reprex package (v0.3.0)

Session info
sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.3 LTS
#> 
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] tidyselect_0.2.5 disk.frame_0.3.0 purrr_0.3.3      dplyr_0.8.3     
#> [5] drake_7.8.0.9000
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.1         pryr_0.1.4         compiler_3.5.1    
#>  [4] pillar_1.4.2       highr_0.8          tools_3.5.1       
#>  [7] zeallot_0.1.0      digest_0.6.22      bit_1.1-14        
#> [10] jsonlite_1.6       evaluate_0.14      tibble_2.1.3      
#> [13] pkgconfig_2.0.3    rlang_0.4.1        igraph_1.2.4.1    
#> [16] filelock_1.0.2     yaml_2.2.0         parallel_3.5.1    
#> [19] xfun_0.10          furrr_0.1.0        bigreadr_0.2.0    
#> [22] bigassertr_0.1.1   storr_1.2.1        stringr_1.4.0     
#> [25] knitr_1.25         globals_0.12.5     vctrs_0.2.0       
#> [28] fs_1.3.1           bit64_0.9-7        data.table_1.12.6 
#> [31] glue_1.3.1         listenv_0.8.0      R6_2.4.0          
#> [34] future.apply_1.3.0 base64url_1.4      rmarkdown_1.16    
#> [37] txtq_0.2.0         magrittr_1.5       codetools_0.2-15  
#> [40] backports_1.1.5    htmltools_0.4.0    fst_0.9.0         
#> [43] assertthat_0.2.1   future_1.15.1      stringi_1.4.3     
#> [46] crayon_1.3.4

Expected result

Both caches should have working copies of the targets.

@brendanf brendanf changed the title cache imports corrupt fst and data.frame targets cache$import() corrupts fst and data.frame targets Dec 19, 2019
@brendanf
Copy link
Contributor Author

As an aside, this is unexpected:

#> Warning: You selected fst format for target data_fst, so drake will convert
#> it from class c("data.table", "data.frame") to a plain data frame.

How did it become a data.table?

@wlandau
Copy link
Member

wlandau commented Dec 19, 2019

Well that's insidious and embarrassing. Thanks for catching it. Fixed in 81fe55b.

@wlandau
Copy link
Member

wlandau commented Dec 19, 2019

Apparently #1120 (comment) happened because disk.frame converts data frames to data.tables in place.

make(plan)
#> target data # starts off as a data frame
#> target data_disk # as.disk.frame() changes data to a data.table
#> target data_fst # data is now a data.table.

I guess this makes sense because it is memory efficient, but it is a side effect I did not expect, and it violates drake's immutability assumption for targets (cc @xiaodaigh). To work around it in drake, we must duplicate the data in memory so the target does not get corrupted.

library(drake)
suppressPackageStartupMessages(library(disk.frame))

cache1 <- new_cache(path = tempfile())

plan <- drake_plan(
  data = data.frame(
    x = runif(1000),
    y = runif(1000)
  ),
  data_fst = target(
    data,
    format = "fst"
  ),
  data_disk = target(
    as.disk.frame(
      rlang::duplicate(data), # Duplicate the data to avoid corrupting the target.
      outdir = drake_tempfile(cache = cache1)
    ),
    format = "diskframe"
  )
)

make(plan, cache = cache1)
#> target data
#> target data_disk
#> target data_fst

Created on 2019-12-19 by the reprex package (v0.3.0)

@xiaodaigh
Copy link
Contributor

Sounds like an issue I contributed to. Just leaving a note here: I think {disk.frame} will move towards respecting user choice more and more so these types of conversion will be assessed in the future. Sorry about the inconvenience.

@wlandau
Copy link
Member

wlandau commented Dec 19, 2019

It's okay, glad you are aware now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants