
getCRUCLdata: Download and Use CRU CL2.0 Climatology Data in R #96

Closed
13 tasks done
adamhsparks opened this issue Jan 30, 2017 · 41 comments

Comments

@adamhsparks
Member

adamhsparks commented Jan 30, 2017

Summary

  • What does this package do? (explain in 50 words or less):
    The getCRUCLdata package provides two functions that automate downloading and importing CRU CL2.0 climatology data, facilitate the calculation of minimum and maximum temperature, and format the data into a tidy data frame or a list of raster stack objects for use in an R session, or for easy export to a raster format file for use in a geographic information system (GIS).

  • Paste the full DESCRIPTION file inside a code block below:

Package: getCRUCLdata
Type: Package
Title: Download and Use CRU CL2.0 Climatology Data in R
Version: 0.1.2
Authors@R: person(given = "Adam", family = "Sparks", email = "adamhsparks@gmail.com", role = c("aut", "cre"))
Description: Provides functions that automate downloading and importing
    University of East Anglia Climate Research Unit (CRU) CL2.0 climatology data
    into R, facilitates the calculation of minimum temperature and maximum
    temperature and formats the data into a tidy data frame or a list of raster
    stack objects for use in an R session.  CRU CL2.0 data are a gridded
    climatology of 1961-1990 monthly means released in 2002 and cover all land
    areas (excluding Antarctica) at 10-minute resolution.  For more
    information see the description of the data provided by the University of
    East Anglia Climate Research Unit,
    <https://crudata.uea.ac.uk/cru/data/hrg/tmc/readme.txt>.
License: MIT + file LICENSE
Depends: R (>= 3.0.0)
Imports:
    curl,
    dplyr,
    plyr,
    purrr (>= 0.2.0),
    raster,
    tidyr,
    utils
LazyData: TRUE
RoxygenNote: 5.0.1
ByteCompile: TRUE
Suggests:
    testthat,
    knitr,
    rmarkdown,
    covr
URL: https://github.com/adamhsparks/getCRUCLdata
BugReports: https://github.com/adamhsparks/getCRUCLdata/issues
VignetteBuilder: knitr
  • URL for the package (the development repository, not a stylized html page):
    https://github.com/adamhsparks/getCRUCLdata

  • Who is the target audience?
    Anyone interested in using climate data or generating raster files from the CRU CL2.0 data that are easy to use in a GIS

  • Are there other R packages that accomplish the same thing? If so, what is different about yours?
    Not that I'm aware of

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration with Travis CI and/or another service.

Publication options

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

  • Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:

  • If this is a resubmission following rejection, please explain the change in circumstances:

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

Ivan Hanigan, https://github.com/ivanhanigan

@sckott
Contributor

sckott commented Jan 31, 2017

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Currently seeking reviewers. It's a good fit and not overlapping.

  • Tests do succeed, but they take a long time; it would be nice to get them to run faster somehow
  • Ran goodpractice::gp(), but there is nothing to deal with at this time.

Reviewers: @ldecicco-USGS @ivanhanigan
Due date: 2017-02-22

@adamhsparks
Member Author

@sckott, I'm working on getting the test time down. I've managed to reduce it by half already while keeping code coverage the same. I'll see what more I can do.

@sckott
Contributor

sckott commented Feb 1, 2017

thanks for trying, @adamhsparks!

@sckott
Contributor

sckott commented Feb 1, 2017

reviewers assigned, see #96 (comment)

@ldecicco-USGS

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (such as being a major contributor to the software).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and URL, Maintainer and BugReports fields in DESCRIPTION

Paper (for packages co-submitting to JOSS)

The package contains a paper.md with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:


Review Comments

Installation

I'm not sure if this is actually an issue with the package, but when I tried to install via CRAN, there was an error via the Windows binary:

> install.packages("getCRUCLdata")
Installing package into ‘C:/Users/ldecicco/Documents/R/win-library/3.3’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/bin/windows/contrib/3.3/getCRUCLdata_0.1.4.zip'
Content type 'application/zip' length 160600 bytes (156 KB)
downloaded 0 bytes

Warning in install.packages :
  downloaded length 0 != reported length 160600
Warning in install.packages :
  error 1 in extracting from zip file
Warning in install.packages :
  cannot open compressed file 'getCRUCLdata/DESCRIPTION', probable reason 'No such file or directory'
Error in install.packages : cannot open the connection

I could install via source just fine. This could be an issue with my CRAN mirror or something else...but I haven't experienced it before.

Vignette

The first thing I tried to do was run the first example in the vignette, and was unsuccessful:

library(getCRUCLdata)
CRU_data <- create_CRU_df(pre = TRUE,
                          pre_cv = TRUE,
                          rd0 = TRUE,
                          tmp = TRUE,
                          dtr = TRUE,
                          reh = TRUE,
                          tmn = TRUE,
                          tmx = TRUE,
                          sunp = TRUE,
                          frs = TRUE,
                          wnd = TRUE,
                          elv = TRUE)
 
Downloading requested data files.
 
Error in .f(.x[[i]], ...) : Too many retries...server may be under load

I then tried setting a single parameter to TRUE, and still could not get data to return:

CRU_data <- create_CRU_df(pre = TRUE)
 
Downloading requested data files.
 
Error in .f(.x[[i]], ...) : Too many retries...server may be under load

Examples

I had the same problem running the examples from the help files as running the examples from the vignette.

Tests

Similarly, the tests via testthat didn't pass, failing with the same error, "Too many retries...server may be under load". However, looking at the tests, they looked like appropriate tests. According to Coveralls, there's 59% code coverage. This isn't great, but it's not too bad.

Troubleshooting

When I went to the URLs to test the service, they worked fine. I use httr for the package dataRetrieval (so I'm not as familiar with purrr). I tried this:

library(httr)

temp <- tempfile()
temp <- paste0(temp,".dat.gz")
obs_url <- "https://crudata.uea.ac.uk/cru/data/hrg/tmc/grid_10min_dtr.dat.gz"
doc <- GET(obs_url, write_disk(temp))

headerInfo <- headers(doc)
doc2 <- read.table(gzfile(temp))

unlink(temp)

doc2 came back as a table with 14 columns and 566262 rows. So, maybe give httr a try? As it is, I can't get anything back from the package, and I don't think it's the underlying service's fault.

@sckott
Contributor

sckott commented Feb 17, 2017

Thanks for your review, @ldecicco-USGS!

@adamhsparks
Member Author

Thanks for the review @ldecicco-USGS.

It's interesting. I'm not able to replicate the installation issue because, unfortunately, I don't have a Windows machine to test on; I rely on Winbuilder and Appveyor to alert me to any issues. I checked the zip package from CRAN, and if I'm reading the error correctly, it says that there was no DESCRIPTION file. However, the zip file I downloaded had one in it.

Moving on to the other items, I'm also unable to replicate the issues. I just ran the code pasted here that timed out; it worked, and the tests pass on my computer, Travis and Appveyor when downloading, so I assume everything is fine. I wonder if a network issue is causing the problems?

I'll look at using httr. The purrr code comes from one of my other packages, where the server will stop responding, so it limits the number of requests made; I incorporated it here to be nice to this server as well. So I'm stumped about what the issue is.

@adamhsparks
Member Author

@ldecicco-USGS, I've modified the download function in the devel branch to use httr::GET() as suggested, if you'd like to check it.

https://github.com/adamhsparks/getCRUCLdata/tree/devel
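
For context, a minimal sketch of the httr-based approach (the helper name and retry cap here are hypothetical, not necessarily the devel-branch code): fetch one data file with httr::GET() and httr::write_disk(), retrying a limited number of times before giving up with the error seen in the review.

library(httr)

fetch_CRU_file <- function(url, dest, max_tries = 3) {
  # try the download up to max_tries times before erroring out
  for (i in seq_len(max_tries)) {
    res <- try(GET(url, write_disk(dest, overwrite = TRUE), timeout(120)),
               silent = TRUE)
    if (!inherits(res, "try-error") && status_code(res) == 200L) {
      return(dest)
    }
  }
  stop("Too many retries...server may be under load", call. = FALSE)
}

# usage:
# f <- fetch_CRU_file(
#   "https://crudata.uea.ac.uk/cru/data/hrg/tmc/grid_10min_pre.dat.gz",
#   file.path(tempdir(), "grid_10min_pre.dat.gz"))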

@ivanhanigan

Package Review for getCRUCLdata

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (such as being a major contributor to the software).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and URL, Maintainer and BugReports fields in DESCRIPTION

Paper (for packages co-submitting to JOSS)

The package contains a paper.md with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:

  • 3hrs

Review Comments

This is a useful R package that is well written and provides access to an important data collection.
I reviewed the GitHub version, master branch, installed today, 21 Feb 2017, on Ubuntu Linux. I made comments against the line numbers for files in the devel branch.
The documentation is clear and attractive. Functionality on my Linux machine is mostly good, with some functionality not working or returning errors.

Major comments

test 1 create_CRU_df
  • Success: I ran the suggested line from the README for all 12 parameters and retrieved a data frame with 6800028 obs. of 15 variables, 0.8 GB according to object.size()
  • Comment re single-threading: I ran this on a 16-core machine and was disappointed to see only one core spinning. Are there options for parallelisation?
  • Comment re download size: the download is significant, and any problem (e.g., a broken connection with a cloud VM, a power outage, etc.) may cause frustration if the object is lost to a crashed R session. I ran save.image() straight after a successful run.
  • Issue re with(CRU_data, plot(lon, lat)): this caused my RStudio Server to return status 502 (unresponsive) and I killed the session. I tried again, left it running for 30 minutes and then logged in to see the plot had completed, but the session was still clunky; I got unresponsive warnings and then terminated it. This is a NeCTAR research cloud XXL running on QCIF infrastructure (64 GB, 16 cores, 2.3 GHz Opterons), so it should not have an issue. Suggest you pack in some easily visualisable meta information, such as the lat/lon points, as a small data frame or SpatialPoints object so it is easily plottable.
  • Issue re the next line, which causes a warning and fails:
t_rh <- create_CRU_df(tmp = TRUE,
+                       reh = TRUE)
 
Downloading requested data files.
 
  |=================================================================================================================| 100%
Error in full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y) : 
  'lat' column not found in lhs, cannot join
  • Issue re the help file for this function: it has an example, but it also fails:
 CRU_pre_tmp <- create_CRU_df(pre = TRUE, tmp = TRUE)
 
Downloading requested data files.
 
  |=================================================================================================================| 100%
Error in full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y) : 
  'lat' column not found in lhs, cannot join
test 2 create_CRU_stack
  • Comment: I then went ahead and tested create_CRU_stack. I immediately realised this was downloading more data, and wondered whether, since I already have the data frame I previously downloaded, it might be possible to just convert that to a stack to save more downloading. The converse is also true: if the user has already downloaded a raster stack (or stored it as local tifs), then perhaps create_CRU_df could be told to use the data that already exists.
  • Comment: I see in devel/vignettes/getCRUCLdata.Rmd lines 112-115 the useful suggestion to store rasters locally as GeoTIFFs. Recommend a similar suggestion above for the data frame. Also suggest you load the raster library here too, for noobs who may get upset by the warning if they try to run writeRaster without that package attached.
  • Success re CRU_stack: I ran the following and got successful GeoTIFFs
writeRaster(CRU_stack$pre, filename = paste0("~/tmp/pre_", names(CRU_stack$pre)), bylayer = TRUE, format = "GTiff")
writeRaster(CRU_stack$reh, filename = paste0("~/tmp/reh_", names(CRU_stack$reh)), bylayer = TRUE, format = "GTiff")
  • Loading the precipitation December tif into QGIS succeeds, and it is projected as WGS84, thanks. The value of 240 for Darwin is very close to the Darwin Airport value of 233.9 reported by the BoM for the same 1961-90 period. Awesome, thanks.
  • Issue re the next line from vignettes/getCRUCLdata:
tmn_tmx <- create_CRU_stack(tmn = TRUE,
                            tmx = TRUE)
Downloading requested data files.
 
  |======================================                                                                           |  33%
Error in raster::crop(y, raster::extent(-180, 180, -60, 85)) : 
  object 'y' not found
							
  • Issue re the help for this function: "Details: This function generates a data.frame object in R with the following possible fields as specified by the user:" should read "raster stack", not "data.frame".
  • Success: the example in that help file worked (CRU_pre_tmp <- create_CRU_stack(pre = TRUE, tmp = TRUE))

Minor comments

Re devel/README.Rmd:

  • line 34, re '10 minute resolution': can you state this in decimal degrees to avoid confusion for GIS noobs (also in devel/vignettes/getCRUCLdata.Rmd, line 19)?
  • Further comment: this appears in the file devel/man/getCRUdata.Rd as 10 arc-seconds, so you at least need to resolve which units to declare, but I strongly recommend aligning on decimal degrees.
  • line 57 should read "the stable version of getCRUCLdata is available from CRAN", not "GSODR is available"

Re paper.md

  • This could have more of a statement of need and audience. Currently it echoes the README, but I think it would be nice to expand the discussion a bit here.

Re general

  • There might be an opportunity to optimise the processing and also to provide some safeguards, in terms of automatically storing downloaded data locally, to protect users from crashes etc. that might otherwise force them to download the same data multiple times.
  • I did not assess code style, code duplication, automated tests or packaging guidelines, as I lack the programming expertise for that.
  • I checked a couple of external URLs, which succeeded in taking me to the University of East Anglia pages that describe these data, thanks!

My sessionInfo

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 15.10

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8    LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] getCRUCLdata_0.1.4

loaded via a namespace (and not attached):
[1] magrittr_1.5   R6_2.2.0       assertthat_0.1 DBI_0.5-1      tools_3.3.1    dplyr_0.5.0    tibble_1.2     Rcpp_0.12.9  

@adamhsparks
Member Author

@ivanhanigan Thank you for a thorough and helpful review.

I've started addressing a few of the more minor issues that you've pointed out in the "devel" branch and will continue to work through these suggestions.

@sckott
Contributor

sckott commented Feb 21, 2017

thanks for your review, @ivanhanigan!

@adamhsparks
Member Author

adamhsparks commented Feb 26, 2017

@ivanhanigan and @ldecicco-USGS,
I've taken your suggestions from the reviews and incorporated them into the latest release, v0.1.5.

Highlights

Major Changes

  • create_CRU_stack() and create_CRU_df() now only work with locally available files. If you need to fetch the data and create a data frame or raster stack, please use the new functions, get_CRU_df() and get_CRU_stack()
  • R >= 3.2.0 is now required

Minor Changes

  • Improved documentation with examples on mapping and graphing and more detail regarding the data itself
  • Changed the method by which files are downloaded to use httr::GET()
  • Ingest data using data.table::fread() to decrease the time needed to run the functions
  • Functions check whether data file(s) have already been downloaded during the current R session; if so, the file(s) are not requested again
  • Months are returned as a factor object in the tidy data frame

Detailed discussion

@ivanhanigan:

Regarding parallelisation: I've intentionally not gone this route, due to cross-platform issues and different processors making things difficult with little reward, based on my experience with another R package. However, I have used fread() from the data.table package to decrease the runtime by ingesting the raw data more quickly. This resulted in a small but notable decrease in the time necessary to run any of the functions.
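
A minimal sketch of that ingest step (assuming the file has already been downloaded and a gzip binary is on the PATH; the shell-command form mirrors the fread() call visible in the error output later in this thread):

library(data.table)

gz <- file.path(tempdir(), "grid_10min_tmp.dat.gz")
# data.table of this era accepts a shell command as the input argument;
# gzip decompresses to stdout and fread() parses the whitespace-delimited table
dat <- fread(paste0("gzip -dc ", gz), header = FALSE)
# current data.table versions can instead read the .gz directly
# (requires the R.utils package):
# dat <- fread(gz, header = FALSE)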

Regarding the downloading issues: I've included two new functions that work with local files only and do not rely on R to download the files. Since CRAN guidelines specify that an R package should not write to disk (other than to R's tempfile() or tempdir()), I've elected to go this route so that a user may use an FTP client, web browser or other means to download and save the files locally before using this package in R. This is also useful for anyone who wants to use the same data more than once: having the raw data on hand saves the download time, with the data simply imported into R from the local files.

Regarding saving data when downloading: the functions now only download files that do not already exist in the current R session. If you download temperature and create a stack, then creating a data frame of temperature afterwards will not trigger a new download unless you restart your R session.
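
The pattern is roughly this (a sketch with a hypothetical file name, not the package's internal code): the destination in tempdir() persists for the life of the R session, so each file is fetched at most once per session.

CRU_url <- "https://crudata.uea.ac.uk/cru/data/hrg/tmc/grid_10min_tmp.dat.gz"
dest <- file.path(tempdir(), basename(CRU_url))

# only download when the file is not already in this session's tempdir()
if (!file.exists(dest)) {
  utils::download.file(CRU_url, destfile = dest, mode = "wb")
}
# a new R session gets a fresh tempdir(), so the file would be fetched again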

Regarding with(CRU_data, plot(lon, lat)): I've included examples using ggplot2 to generate maps and graphs in the README and vignette; these take only a few seconds to generate on the machines I have available to test on.

Regarding the 10-minute (arc-second) units issue: I've corrected the documentation and added the values both in arcminutes (the original from New et al.) and as a decimal value of degrees.

I've updated the paper.md file to be more detailed and addressed other comments regarding functionality and documentation as suggested.

Thank you both, @ldecicco-USGS and @ivanhanigan, again for the very constructive and helpful comments. They have helped me make this a much more robust R package that hopefully benefits R users.

@sckott
Contributor

sckott commented Feb 28, 2017

Thanks for the changes @adamhsparks

@ivanhanigan @ldecicco-USGS are you two happy with the changes?

@ivanhanigan

ivanhanigan commented Mar 1, 2017 via email

@ldecicco-USGS

I'll have a look sometime this week! Hopefully this afternoon if meetings pan out as scheduled.

@adamhsparks
Member Author

adamhsparks commented Mar 2, 2017

Hello @ldecicco-USGS and @ivanhanigan, please don't rush to check these updates. I'm planning more.

I just found out about the rappdirs package while looking at @sckott's ropensci::ccafs package. I'll modify the code to use a persistent cache directory for storing the files. This will take me a week or two, I'd guess, since I'm pretty busy with "real" work right now.

However, I think that this will provide a much nicer end-user experience.
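
For the record, a minimal sketch of the rappdirs idea (hypothetical, not the final implementation): resolve a persistent, OS-appropriate cache directory and skip the download when the file is already cached there.

library(rappdirs)

cache_dir <- rappdirs::user_cache_dir("getCRUCLdata")
if (!dir.exists(cache_dir)) dir.create(cache_dir, recursive = TRUE)

f <- file.path(cache_dir, "grid_10min_tmp.dat.gz")
if (!file.exists(f)) {
  utils::download.file(
    "https://crudata.uea.ac.uk/cru/data/hrg/tmc/grid_10min_tmp.dat.gz",
    destfile = f, mode = "wb")
}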

@sckott
Contributor

sckott commented Mar 2, 2017

thanks for the update @adamhsparks

see also https://github.com/ropensci/hoardr, which is not quite on CRAN yet - it wraps rappdirs to be a little more user friendly; see e.g. usage here https://github.com/ropenscilabs/cmipr/blob/master/R/onLoad.R - and then you just export and document the caching object created in onLoad: https://github.com/ropenscilabs/cmipr/blob/master/R/caching.R

no pressure to try hoardr by the way
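
A rough sketch of the hoardr pattern in those links (the object name is hypothetical; in a package this would be created in .onLoad(), then exported and documented):

library(hoardr)

CRU_cache <- hoardr::hoard()              # R6 cache-management object
CRU_cache$cache_path_set("getCRUCLdata")  # resolves an OS-appropriate user cache dir
CRU_cache$mkdir()                         # create the directory if needed
CRU_cache$list()                          # list any cached files
# CRU_cache$delete_all()                  # wipe the cache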

@sckott
Contributor

sckott commented Mar 8, 2017

@ldecicco-USGS do you have an estimate of how long your review took (for our records)?

@ldecicco-USGS

I'll be happy to look again when @adamhsparks does his next round of updates (see 2 comments up). I think the first round only took ~1 hour.

@sckott
Contributor

sckott commented Mar 9, 2017

okay, thanks @ldecicco-USGS

@adamhsparks
Member Author

Ok folks. I've had another go at this.

I've added the ability to cache files in the user's home filespace, as I mentioned, and cleaned up a few other minor items while I was at it.

Please have another look and let me know of any issues or suggestions.

@ivanhanigan

ivanhanigan commented Mar 15, 2017 via email

@ldecicco-USGS

Same here, hopefully today or tomorrow.

@ivanhanigan

Hi Adam, thanks for the new release of the getCRUCLdata package.
I started a new review BUT have not done a comprehensive review yet.

I thought I'd just check in on something I think is worth discussing: cached files from a one-off download of the data vs repeated downloading on-the-fly.

When I looked at the code in the README, it was obvious that "~/Downloads" does not work:

t <- create_CRU_df(tmp = TRUE, dsn = "~/Downloads")

So I checked the function and found that downloading actually goes to this directory:

dsn <- "/home/ivan_hanigan/.config/getCRUCLdata"
# which works just as well
t2 <- create_CRU_df(tmp = TRUE, dsn = dsn)
identical(t, t2)
# TRUE

It seems to me that this location will not be very clear to users of the program if asked "where is the CRU data you downloaded as a one-off?". It seems like the package assumes people will download the data whenever the function is invoked, 'on-the-fly'.

The concern I wanted to raise before proceeding with my review is that a 'one-off' data download might be considered more important by the user, with a prompt to choose a place to store the result for further management, versus an 'on-the-fly' download to a temporary/system location created for the current session.
The latter assumes good network connections, and there are associated issues with online data provision (including trust that the data provider will not change the data at the online location between downloads) that require some consideration.

So I just thought I'd flag this issue now and suggest the following:

What about making the get_CRU_* functions ask the user where to download the CRU data to?
That would make the location explicit and make the dataset a more recognisable contribution to the files that they have on their system.
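
Something along these lines, perhaps (a sketch only; the function name is made up, and rappdirs is assumed since the package already uses it): prompt interactively for a location, falling back to the cache directory in non-interactive sessions.

choose_CRU_dir <- function(default = rappdirs::user_config_dir("getCRUCLdata")) {
  # ask the user where to put the data; keep the default in scripts
  if (interactive()) {
    ans <- readline(sprintf("Download CRU data to [%s]: ", default))
    if (nzchar(ans)) {
      return(path.expand(ans))
    }
  }
  default
}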

PS I noticed a duplicated chunk of text in this

?get_CRU_df
"Duplicated?
# Download data and create a data frame of precipitation and temperature
# without caching the data files
CRU_pre_tmp <- get_CRU_df(pre = TRUE, tmp = TRUE)

# Download data and create a data frame of precipitation and temperature
# without caching the data files
CRU_pre_tmp <- get_CRU_df(pre = TRUE, tmp = TRUE)
"

All the best, and I will continue the review this week, in between other work responsibilities.

Thanks Adam!


@adamhsparks
Member Author

Hi @ivanhanigan,
I don't understand why using create_CRU_df(dsn = "~/Downloads") won't work as I've illustrated.

It seems to work properly for me; could you give me the line numbers in the code that you think are incorrect? create_CRU_df() does not download anything. It requires the data to already exist locally, e.g., in "~/Downloads". get_CRU_df() downloads to the location you've described and caches the files.

Is there a need to rename the functions so that the differences are clearer?

PS
You're right. It was a duplicate. I've fixed that.

@ivanhanigan

Hi,
It is not incorrect, but it depends on the user moving the data from the cache_dir to the folder ~/Downloads in your example.

# in the README it says
t <- create_CRU_df(tmp = TRUE, dsn = "~/Downloads")
# but the function get_CRU_df() downloads and caches data in a special directory
# so you probably need to explain to the user
# they need to move/copy the data to the folder they then specified to the dsn argument
# we can find this with 
cachedir <- rappdirs::user_config_dir("getCRUCLdata")
# on my system this is "/home/ivan_hanigan/.config/getCRUCLdata"
# so I can move the files from there to the correct folder, or leave them there 
# (but that seems like sloppy data management to me)
# I just move them to a data folder here
system(sprintf("mv %s/ data_provided/", cachedir))
# and then when I come back to use them I know they are handy
t3 <- create_CRU_df(tmp = TRUE, dsn = "data_provided")
# and identical to what I downloaded fresh with 
t <- get_CRU_df(tmp = TRUE, cache = FALSE)
identical(t, t3)
# TRUE

Or do you recommend that users store the raw data in the .config/getCRUCLdata location?

@adamhsparks
Member Author

adamhsparks commented Mar 27, 2017

@ivanhanigan, it's not intended for the user to move data at all. It's intended for cases where a user has network issues; using an external FTP program was previously suggested.

The help for create_CRU_* says this:

This function automates importing CRU CL v. 2.0 climatology data into R from locally available data files and creates a list of raster stacks of the data. If requested, minimum and maximum temperature may also be automatically calculated as described in the data readme.txt file. This function can be useful if you have network connection issues that mean automated downloading of the files using R does not work properly. In this instance it is recommended to use an FTP client (e.g., FileZilla), web browser or command line command (e.g., wget or curl) to download the files, save locally and use this function to import the data into R.

The caching works only with the get_CRU_df() function. If you use that function, you would need to re-use it to take advantage of the caching functionality. Perhaps I should just drop the caching, since it seems to be creating confusion? Or is it a matter of the README and vignette being confusing?

@adamhsparks
Member Author

I've tried to revise the README text for clarity.

@ldecicco-USGS

I pulled the latest changes from the master branch here and built from that. When I tried the first example in the vignette:

CRU_data <- create_CRU_df(pre = TRUE,
                          pre_cv = TRUE,
                          rd0 = TRUE,
                          tmp = TRUE,
                          dtr = TRUE,
                          reh = TRUE,
                          tmn = TRUE,
                          tmx = TRUE,
                          sunp = TRUE,
                          frs = TRUE,
                          wnd = TRUE,
                          elv = TRUE)
 Error in .validate_dsn(dsn) : 
File directory does not exist: .

I thought maybe that was because the vignette is dated, so I tried the first example for that function:

CRU_pre_tmp <- create_CRU_df(pre = TRUE, tmp = TRUE, dsn = "~/Downloads")
Error in .validate_dsn(dsn) : 
File directory does not exist: ~/Downloads.

So...I needed to supply some actual path in the dsn argument. I did this:

CRU_pre_tmp <- create_CRU_df(pre = TRUE, tmp = TRUE, dsn = "D:/LADData")

Creating data frame now.

  |                                                     |   0%gzip: D:/LADData/ is a directory -- ignored
 Show Traceback
 
 Rerun with Debug
 Error in data.table::fread(paste0("gzip -dc ", .files), header = FALSE) : 
  File is empty: C:\Users\ldecicco\AppData\Local\Temp\1\RtmpszBo52\filefac5fb84c54 In addition: Warning messages:
1: running command 'C:\Windows\system32\cmd.exe /c (gzip -dc D:/LADData/) > C:\Users\ldecicco\AppData\Local\Temp\1\RtmpszBo52\filefac5fb84c54' had status 2 
2: In shell(paste("(", input, ") > ", tt, sep = "")) :
  '(gzip -dc D:/LADData/) > C:\Users\ldecicco\AppData\Local\Temp\1\RtmpszBo52\filefac5fb84c54' execution failed with error code 2

So, I think I agree with @ivanhanigan that the documentation on what I'm supposed to put in the dsn argument needs to be improved.

@adamhsparks
Member Author

adamhsparks commented Mar 28, 2017

@ldecicco-USGS,
I don't know what's happened here. That's not the first example in the vignette, and the function arguments as you've shown them are not all supplied properly, so it won't run and it stops as it should.

This is the first example in the vignette:

library(getCRUCLdata)

CRU_data <- get_CRU_df(pre = TRUE,
                       pre_cv = TRUE,
                       rd0 = TRUE,
                       tmp = TRUE,
                       dtr = TRUE,
                       reh = TRUE,
                       tmn = TRUE,
                       tmx = TRUE,
                       sunp = TRUE,
                       frs = TRUE,
                       wnd = TRUE,
                       elv = TRUE,
                       cache = TRUE)

The equivalent for create_CRU_df() would be:

library(getCRUCLdata)

CRU_data <- create_CRU_df(pre = TRUE,
                          pre_cv = TRUE,
                          rd0 = TRUE,
                          tmp = TRUE,
                          dtr = TRUE,
                          reh = TRUE,
                          tmn = TRUE,
                          tmx = TRUE,
                          sunp = TRUE,
                          frs = TRUE,
                          wnd = TRUE,
                          elv = TRUE,
                          dsn = "~/Downloads")

This assumes you have downloaded the CRU files external to R and put them in "~/Downloads". The function worked properly as designed, telling you that you didn't tell it where to find the files to process.

@adamhsparks
Member Author

@ldecicco-USGS,
I've added more description to the documentation, but I still don't understand the first example that you posted; I don't know where it comes from.

Thanks!

@ivanhanigan

ivanhanigan commented Mar 28, 2017 via email

@ldecicco-USGS

OK, I think what was happening with my first example is that I was using an older version of the vignette. I re-built the vignettes and got things working.

I think where there could be confusion (there was for me, at least...), and what might easily be straightened out for the user, is to have the very first example in the vignette show how to actually download the data. The vignette starts with "Using getCRUCLdata", which is definitely the cool part of the package, but maybe having a section above that, "Getting CRUCL Data", with explicit instructions for downloading the data for the first time would be helpful. Then all the examples with cache=TRUE immediately work.

Otherwise, looks good!

A minor suggestion that might help people like me (who...maybe...have been known to only skim the instructions 😓): I kept trying to run create_CRU_df without having run get_CRU_df first. I saw the "dsn" argument and thought it would save the data there, not find the data there. Perhaps you could put a custom error message in that function, so that if the dsn argument doesn't contain the correct files, it suggests using get_CRU_df to download the data?
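
Something like this might do it (a sketch; the function name and message text here are assumptions, and the package's actual check is .validate_dsn() per the errors above):

check_dsn <- function(dsn) {
  dsn <- path.expand(dsn)
  if (!dir.exists(dsn)) {
    stop("File directory does not exist: ", dsn, call. = FALSE)
  }
  # CRU CL 2.0 files follow the grid_10min_*.dat.gz naming scheme
  if (length(list.files(dsn, pattern = "^grid_10min_.*\\.dat(\\.gz)?$")) == 0) {
    stop("No CRU CL 2.0 data files found in ", dsn,
         ". Use get_CRU_df() or get_CRU_stack() to download them first.",
         call. = FALSE)
  }
  invisible(dsn)
}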

@adamhsparks
Member Author

adamhsparks commented Mar 28, 2017

@ldecicco-USGS, that's a good suggestion; I'm guilty of skimming documentation too. I'll implement that change.

I'll reorganise the documentation as you and @ivanhanigan have suggested. You've both been very helpful, thank you.

@adamhsparks
Member Author

Hi folks, I've edited the README and vignette to include a quick start and a more advanced section on caching and using create_CRU_*() functions.

No changes to functionality, only documentation.

I've also linted the package and fixed some typos in the documentation ("a a", etc.).

@sckott
Contributor

sckott commented Mar 31, 2017

Is everything taken care of, @adamhsparks? If so, I'll take a final look.

@adamhsparks
Member Author

@sckott, I hope that I've addressed @ldecicco-USGS' and @ivanhanigan's comments re: documentation.

I'm still trying to change the error message shown when the DSN doesn't exist or the files aren't found. The error itself works and stops execution, but the issue I'm having is getting it to pass the test. :/
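
In case it helps, a minimal sketch of how such an error can be tested (the expected message text here is an assumption; match it to the actual wording): testthat::expect_error() matches the error message by regular expression.

library(testthat)

test_that("create_CRU_df() fails informatively for a bad dsn", {
  expect_error(create_CRU_df(tmp = TRUE, dsn = "no/such/dir"),
               "does not exist")
})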

@adamhsparks
Member Author

@sckott, I think we're good. I've tidied it up a bit more here and there. Have a look now when you have time.

@sckott
Contributor

sckott commented Apr 3, 2017

@adamhsparks Looks good to me. approved :)

i assume you know the drill from here?

@adamhsparks
Member Author

adamhsparks commented Apr 4, 2017

I think I do. I've added the footer and have transferred it over to rOpenSci.

Thanks for your time, suggestions and effort, @ivanhanigan, @ldecicco-USGS and @sckott. Once again, the rOpenSci review process really made the package a much better offering.

@sckott
Contributor

sckott commented Apr 4, 2017

thx for your submission!
