[Suggestion] Decreasing the size of test data #104

pritamd47 · 2020-01-09T05:47:27Z

Currently the size of test data that is downloaded is ~200MB, and depending on the connection speed can take quite some time to download (I couldn't run tests on ArchGDAL while I was not in the university, due to low connection speed).

I believe if the datasets can be trimmed to — say cropped versions of the datasets where-ever possible — something smaller than their current size, can drastically improve the time required to test for fresh installs (or when downloading datasets is necessitated).

Looking for people's thoughts on this suggestion.

yeesian · 2020-01-10T03:46:57Z

Yeah that's a good idea :)

pritamd47 · 2020-01-10T05:00:57Z

Thank you for the feedback, I'll start working on this ASAP.

evetion · 2020-01-13T20:23:53Z

Seems like a good idea to me!

Commenting here as a way to follow this activity, as GeoArrays over at https://github.com/evetion/GeoArrays.jl/blob/master/test/get_testdata.jl sneakily also uses the same testdata, so this could break tests.

pritamd47 · 2020-01-14T05:30:31Z

Yeah I noticed this about GeoArrays. In my opinion, after ArchGDAL, same treatment can be given to GeoArrays as well – or maybe hosting different sets of test data can also work.

pritamd47 · 2020-01-19T16:24:09Z

I was working on this, starting with the largest raster - ospy/data4/aster.img

I was testing the testsets of test/test_array.jl with a cropped version of the raster file.

@yeesian can you specify what this test does (int indexing starting line 33, in
test/test_array.jl? And if I can choose any arbitrary index, and check against its pixel value (verifying using Rasterio-python); or has this value of 0xff or 255 at this index been chosen with any specific purpose in mind?

Here is the code:

33 @testset "int indexing" begin
34                @test ds[755, 2107, 1] == 0xff
35            end

visr · 2020-01-19T18:21:05Z

The largest rasters are a good place to start :)

In #10 I noted that these two files took up the bulk of the test size. In general, for unit testing, small test datasets suffice. Larger datasets can be useful for benchmarking however. But since we currently don't have benchmarks set up, perhaps we should thake these two rasters out of the unit tests altogether, and do these same test on one of the smaller rasters instead.

To answer your question, if I open the raster in QGIS, I only see a value of 255 at a few select spots, where there is a little cloud in the satellite image. To test whether indexing works correctly, it's important to not pick a value that occurs a lot, or you risk getting it right for the wrong reason. It would probably be best to test not only the first band, but the other bands as well, as well as a neighbouring cell with different values. This will protect against off-by-one errors better.

By the way I just noticed that a lot of the ospy/data# data is identical. You can see the checksums being equal as well. So this would be good to take out as well. The two big rasters I mention in #10 are identical as well, I hadn't noticed that before.

pritamd47 · 2020-01-19T21:19:50Z

Thanks for the suggestions! Comparing the checksums didn't really cross my mind, and now that I am looking at the data, these are indeed the same datasets. Removing the redundant files brings down the size to 122.5 MB (of which, 82.7 MB is aster.img).

And choosing the pixel values of the clouds do seem reasonable – that being said, I'll see if the clipping of the large aster.img can be done in a better way. I was trying to keep the No-data Pixels in the cropped version, now will also try to keep your suggestion in mind and include some rare-pixel values.

pritamd47 · 2020-01-23T17:12:17Z

[UPDATE]

So I've been working on this issue, and with help and suggestions from fellow contributors, I've managed to decrease the size of the required data-set from ~200MB to 32MB.

Changes Made:

Removed Aster.img, and replaced with a cropped version of the same Cropped_aster.img
Rewrote some tests (made minor changes to values such as count and sum of pixels, verifying with rasterio and QGIS)
Some tests which require combination of certain Feature data-sets and the Aster.img raster have been commented out – which I plan to rewrite so that they use other files instead. These are tests covering the Homeworks by Chris Garrard. The coverage hence has dropped to 86% – which will improve once the tests are rewritten.
[Temporary] Currently the data is being downloaded from a fork , instead of yeesian/ArchGDALDatasets.

Other changes which I plan to work on:

To check if redundancy can further be checked, and tests rewritten such that minimum number of files need to be downloaded.
Rewrite broken tests and create new data-sets if required.

Please feel free to have a look at the changes made at pritamd47/ArchGDAL.jl and make any suggestions.

evetion · 2020-01-24T16:43:33Z

Nice work!

Change overview: master...pritamd47:master
Feel free to make a PR based on the above, it makes it easier to review and also comment on those changes. You can put [WIP] or don't merge yet somewhere to indicate it's still a work in progress.

pritamd47 · 2020-01-25T05:36:58Z

Feel free to make a PR based on the above, it makes it easier to review and also comment on those changes. You can put [WIP] or don't merge yet somewhere to indicate it's still a work in progress.

Will do this soon. I plan to write some tests and then make a pull request for discussion over the changes.

yeesian · 2020-09-19T17:10:16Z

Closed by #121

pritamd47 mentioned this issue Jan 29, 2020

WIP: Changing/Omitting tests to decrease size of downloaded datasets #106

Closed

pritamd47 mentioned this issue May 13, 2020

Decreasing test size #121

Merged

yeesian closed this as completed Sep 19, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Suggestion] Decreasing the size of test data #104

[Suggestion] Decreasing the size of test data #104

pritamd47 commented Jan 9, 2020

yeesian commented Jan 10, 2020

pritamd47 commented Jan 10, 2020

evetion commented Jan 13, 2020

pritamd47 commented Jan 14, 2020

pritamd47 commented Jan 19, 2020

visr commented Jan 19, 2020

pritamd47 commented Jan 19, 2020

pritamd47 commented Jan 23, 2020

evetion commented Jan 24, 2020

pritamd47 commented Jan 25, 2020

yeesian commented Sep 19, 2020

[Suggestion] Decreasing the size of test data #104

[Suggestion] Decreasing the size of test data #104

Comments

pritamd47 commented Jan 9, 2020

yeesian commented Jan 10, 2020

pritamd47 commented Jan 10, 2020

evetion commented Jan 13, 2020

pritamd47 commented Jan 14, 2020

pritamd47 commented Jan 19, 2020

visr commented Jan 19, 2020

pritamd47 commented Jan 19, 2020

pritamd47 commented Jan 23, 2020

evetion commented Jan 24, 2020

pritamd47 commented Jan 25, 2020

yeesian commented Sep 19, 2020