Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Suggestion] Decreasing the size of test data #104

Closed
pritamd47 opened this issue Jan 9, 2020 · 11 comments
Closed

[Suggestion] Decreasing the size of test data #104

pritamd47 opened this issue Jan 9, 2020 · 11 comments

Comments

@pritamd47
Copy link
Contributor

Currently the size of test data that is downloaded is ~200MB, and depending on the connection speed can take quite some time to download (I couldn't run tests on ArchGDAL while I was not in the university, due to low connection speed).

I believe if the datasets can be trimmed to — say cropped versions of the datasets where-ever possible — something smaller than their current size, can drastically improve the time required to test for fresh installs (or when downloading datasets is necessitated).

Looking for people's thoughts on this suggestion.

@yeesian
Copy link
Owner

yeesian commented Jan 10, 2020

Yeah that's a good idea :)

@pritamd47
Copy link
Contributor Author

Thank you for the feedback, I'll start working on this ASAP.

@evetion
Copy link
Collaborator

evetion commented Jan 13, 2020

Seems like a good idea to me!

Commenting here as a way to follow this activity, as GeoArrays over at https://github.com/evetion/GeoArrays.jl/blob/master/test/get_testdata.jl sneakily also uses the same testdata, so this could break tests.

@pritamd47
Copy link
Contributor Author

Yeah I noticed this about GeoArrays. In my opinion, after ArchGDAL, same treatment can be given to GeoArrays as well – or maybe hosting different sets of test data can also work.

@pritamd47
Copy link
Contributor Author

I was working on this, starting with the largest raster - ospy/data4/aster.img

I was testing the testsets of test/test_array.jl with a cropped version of the raster file.

@yeesian can you specify what this test does (int indexing starting line 33, in
test/test_array.jl? And if I can choose any arbitrary index, and check against its pixel value (verifying using Rasterio-python); or has this value of 0xff or 255 at this index been chosen with any specific purpose in mind?

Here is the code:

33 @testset "int indexing" begin
34                @test ds[755, 2107, 1] == 0xff
35            end

@visr
Copy link
Collaborator

visr commented Jan 19, 2020

The largest rasters are a good place to start :)

In #10 I noted that these two files took up the bulk of the test size. In general, for unit testing, small test datasets suffice. Larger datasets can be useful for benchmarking however. But since we currently don't have benchmarks set up, perhaps we should thake these two rasters out of the unit tests altogether, and do these same test on one of the smaller rasters instead.

To answer your question, if I open the raster in QGIS, I only see a value of 255 at a few select spots, where there is a little cloud in the satellite image. To test whether indexing works correctly, it's important to not pick a value that occurs a lot, or you risk getting it right for the wrong reason. It would probably be best to test not only the first band, but the other bands as well, as well as a neighbouring cell with different values. This will protect against off-by-one errors better.

By the way I just noticed that a lot of the ospy/data# data is identical. You can see the checksums being equal as well. So this would be good to take out as well. The two big rasters I mention in #10 are identical as well, I hadn't noticed that before.

@pritamd47
Copy link
Contributor Author

Thanks for the suggestions! Comparing the checksums didn't really cross my mind, and now that I am looking at the data, these are indeed the same datasets. Removing the redundant files brings down the size to 122.5 MB (of which, 82.7 MB is aster.img).

And choosing the pixel values of the clouds do seem reasonable – that being said, I'll see if the clipping of the large aster.img can be done in a better way. I was trying to keep the No-data Pixels in the cropped version, now will also try to keep your suggestion in mind and include some rare-pixel values.

@pritamd47
Copy link
Contributor Author

[UPDATE]

So I've been working on this issue, and with help and suggestions from fellow contributors, I've managed to decrease the size of the required data-set from ~200MB to 32MB.

Changes Made:

  • Removed Aster.img, and replaced with a cropped version of the same Cropped_aster.img
  • Rewrote some tests (made minor changes to values such as count and sum of pixels, verifying with rasterio and QGIS)
  • Some tests which require combination of certain Feature data-sets and the Aster.img raster have been commented out – which I plan to rewrite so that they use other files instead. These are tests covering the Homeworks by Chris Garrard. The coverage hence has dropped to 86% – which will improve once the tests are rewritten.
  • [Temporary] Currently the data is being downloaded from a fork , instead of yeesian/ArchGDALDatasets.

Other changes which I plan to work on:

  • To check if redundancy can further be checked, and tests rewritten such that minimum number of files need to be downloaded.
  • Rewrite broken tests and create new data-sets if required.

Please feel free to have a look at the changes made at pritamd47/ArchGDAL.jl and make any suggestions.

@evetion
Copy link
Collaborator

evetion commented Jan 24, 2020

Nice work!

Change overview: master...pritamd47:master
Feel free to make a PR based on the above, it makes it easier to review and also comment on those changes. You can put [WIP] or don't merge yet somewhere to indicate it's still a work in progress.

@pritamd47
Copy link
Contributor Author

Feel free to make a PR based on the above, it makes it easier to review and also comment on those changes. You can put [WIP] or don't merge yet somewhere to indicate it's still a work in progress.

Will do this soon. I plan to write some tests and then make a pull request for discussion over the changes.

@yeesian
Copy link
Owner

yeesian commented Sep 19, 2020

Closed by #121

@yeesian yeesian closed this as completed Sep 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants