
Incomplete download of MNIST data #1875

Closed
mattwescott opened this issue May 21, 2019 · 9 comments · Fixed by #1882

Comments

@mattwescott
Contributor

mattwescott commented May 21, 2019

Description

make test-examples fails to fully download the MNIST data dependency for the AIR example.

Details

On osx==10.14.1 and docker==18.09.2

git checkout 3ac5a02e0e6b0a11ae796707413c11df2c14ff6b
make build pyro_branch=dev pytorch_branch=release python_version=3.6
make run pyro_branch=dev pytorch_branch=release python_version=3.6
cd pyro
pip install numpy==1.15
pip install scipy==1.2
make test-examples > output 2>&1

Output

Resulting files in .data

$ ls -lh
-rw-r--r--  1 matt  staff    19M May 21 15:03 train-images-idx3-ubyte.gz.part
@neerajprad
Member

I think the observations library is trying to download the files from http://yann.lecun.com/exdb/mnist/. Is this an intermittent failure, or does it fail on multiple retries? I do not have a better solution for this other than moving away from observations.

@null-a - I think only the AIR example is using observations. Do we have any other sources for the multi-MNIST dataset? The dataset generation code seems straightforward, so if not, we can write our own data loader for this. Also see #1871.

@mattwescott
Contributor Author

It seemed to fail on multiple retries, and IIRC the same error came up outside of Docker yesterday.

Looks like the download might be coming from here...

@patch_dependency('torchvision.datasets.MNIST', torchvision)

The original output seemed to indicate that the download was successful, with the load attempt then failing because .part was all that made it into the folder.

@neerajprad
Member

Looks like the download might be coming from here...

The AIR example does not use torchvision.datasets; the error is coming from the observations library download, which seems to be a Google Storage link mirroring the original site.

@null-a
Collaborator

null-a commented May 22, 2019

I'm not sure what's going on here; this is working consistently for me at the moment (dev branch, no Docker, macOS). Out of interest, are you able to grab the file that failed manually, with something like curl -O https://storage.googleapis.com/cvdf-datasets/mnist/train-images-idx3-ubyte.gz? If so, perhaps you could download all four .gz files to the examples/air/.data directory manually as a work-around?

Do we have any other sources for the multi-MNIST dataset?

@neerajprad I'm not aware of a canonical version of this; everyone seems to roll their own from MNIST, and of those I don't know of any that are hosted in a suitable way.

The dataset generation code seems straightforward

That's right. In my original implementation I just generated the data on demand from the MNIST data available via PyTorch; only later was this switched over to observations. Are you suggesting we pretty much revert to something like that? If so, I could make a PR for that if you like. (Also happy for someone else to do it, of course.)
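For concreteness, a rough sketch of what generating the data on demand from torchvision's MNIST could look like; the canvas size, digit count, and helper name below are illustrative assumptions, not the original AIR code:

# Illustrative sketch only: paste random MNIST digits onto a larger blank canvas.
# Canvas size, digit count, and names are assumptions, not Pyro's AIR code.
import numpy as np
from torchvision import datasets

def make_multi_mnist_image(digits, canvas_size=50, max_digits=2, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    canvas = np.zeros((canvas_size, canvas_size), dtype=np.uint8)
    for _ in range(rng.integers(0, max_digits + 1)):       # 0..max_digits digits per image
        digit = digits[rng.integers(len(digits))]          # one 28x28 uint8 image
        x, y = rng.integers(0, canvas_size - 28, size=2)   # random top-left corner
        region = canvas[y:y + 28, x:x + 28]
        np.copyto(region, np.maximum(region, digit))       # overlay by pixel-wise max
    return canvas

mnist = datasets.MNIST(".data", train=True, download=True)
digits = mnist.data.numpy()   # (60000, 28, 28); older torchvision exposes .train_data
sample = make_multi_mnist_image(digits)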

@mattwescott
Contributor Author

This time I saved the initial error output, which might be more helpful:

$ make build pyro_branch=dev pytorch_branch=release python_version=3.6
$ make run pyro_branch=dev pytorch_branch=release python_version=3.6
$ pip install numpy==1.15.0 #  prevent binary warnings that are escalated into errors #1873
$ curl -O https://storage.googleapis.com/cvdf-datasets/mnist/train-images-idx3-ubyte.gz 
$ echo $?
0
$ make test-examples > make_text_examples.txt 2>&1

make_text_examples.txt

@mattwescott
Contributor Author

If so perhaps you could download all four .gz files to the examples/air/.data directory manually as a work-around?

I'm not blocked by this, but I'm reluctant to add more environment workarounds. For now I'll remove -x from make test so it doesn't bail on the first failure; it seems easier to track specific test failures than environment state.

As an aside, I've seen it work well to run the CI tests inside the Docker container(s). IIRC, we factored out a base container that was rebuilt infrequently, and on each push rebuilt the application container with --no-cache to closely match a fresh dev environment, running the tests with the same flow and commands used locally.

@neerajprad
Member

Are you suggesting we pretty much revert to something like that? If so, I could make a PR for that if you like? (Also happy for someone else to do it, of course.)

Thanks, @null-a, that would be great! The observations library is deprecated in favor of tensorflow datasets, which doesn't have multi-MNIST at the moment, and it might be best to avoid adding tensorflow as a dependency anyway. I think we can just reinstate your original dataset creation utility in the pyro.contrib.examples module. We already have the MNIST dataset in S3, backed by CloudFront, at this url. Could you add a PR reinstating your earlier code? I'll be happy to contribute any fixes required and get it merged.
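If that utility were reinstated under pyro.contrib.examples, it might look roughly like the following; the mirror URL is only a placeholder for the CloudFront address mentioned above, and the function name is hypothetical:

# Hypothetical sketch of a loader under pyro.contrib.examples. The mirror URL
# is a placeholder for the CloudFront address mentioned above, not a real one.
import os
import urllib.request

MNIST_MIRROR = "https://<pyro-cloudfront-mirror>/mnist/"   # placeholder URL

def load_mnist(root=".data"):
    """Download the raw MNIST archives into `root` if they are not cached."""
    os.makedirs(root, exist_ok=True)
    for name in ("train-images-idx3-ubyte.gz", "train-labels-idx1-ubyte.gz"):
        path = os.path.join(root, name)
        if not os.path.exists(path):
            urllib.request.urlretrieve(MNIST_MIRROR + name, path)
    return root   # the AIR example would parse these files and build multi-MNIST from them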

@null-a
Collaborator

null-a commented May 23, 2019

I wonder if the original issue @mattwescott is running into is a race condition that occurs when trying to fetch a dataset from multiple processes. Trying this locally, if I clear out the directory in which the VAE example caches data (rm -rf examples/vae/utils/.data/) and run pytest tests/test_examples.py -k 'vae/vae.py', it succeeds. However, clearing the cached data and running pytest -n auto tests/test_examples.py -k 'vae/vae.py' results in a failure that looks quite similar to the one reported. (Something similar happens with the AIR example.) @neerajprad Is it possible that this could be the problem without it biting us elsewhere?
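If that is indeed the cause, one way to make concurrent workers safe would be to serialize the download behind an exclusive file lock; the sketch below is only an illustration of the idea, not code that exists in Pyro:

# Sketch: serialize dataset downloads across pytest-xdist workers with a file
# lock so concurrent processes don't clobber each other's .part files.
# fcntl is POSIX-only; the helper name is hypothetical.
import fcntl
import os

def download_once(target_dir, fetch):
    """Call fetch(target_dir) at most once across concurrent processes."""
    os.makedirs(target_dir, exist_ok=True)
    lock_path = os.path.join(target_dir, ".download.lock")
    done_path = os.path.join(target_dir, ".complete")
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)      # block until this process holds the lock
        try:
            if not os.path.exists(done_path):
                fetch(target_dir)             # whatever actually downloads the files
                open(done_path, "w").close()  # mark the cache as fully populated
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)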

@neerajprad
Member

neerajprad commented May 23, 2019

@null-a: I think you are right; that is exactly what is happening here, since we have multiple tests for each example (there are two for AIR). This probably leads to a race condition when two subprocesses try to download to the same directory. I think we should change make test-examples to not run pytest with xdist; given that our examples are compute-heavy, I doubt there is much benefit to distributing them across cores anyway. I will test and make this change shortly.

EDIT: Running with xdist halves the time it takes to run test-examples on my laptop with 12 cores, from 21 minutes to 10 minutes. If there were a way to group tests to run on the same xdist worker, that would be ideal, but support for this feature is lacking; see pytest-dev/pytest-xdist#365. I think it is best to change the default to not use xdist so that new users aren't tripped up. There are other ways around this, for instance populating all the datasets beforehand, but I'm not sure it is worth spending too much time on this.
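For reference, the "populate all the datasets beforehand" alternative could be as simple as a serial prefetch script run before pytest -n auto; the mirror and cache directories below are the ones mentioned in this thread, but the exact layout each example expects is an assumption:

# Sketch of prefetching serially before running pytest -n auto, so no two xdist
# workers ever race on the same download. Cache-directory layout is an assumption.
import os
import urllib.request

MIRROR = "https://storage.googleapis.com/cvdf-datasets/mnist/"
FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]
CACHE_DIRS = ["examples/air/.data", "examples/vae/utils/.data"]

for cache in CACHE_DIRS:
    os.makedirs(cache, exist_ok=True)
    for name in FILES:
        path = os.path.join(cache, name)
        if not os.path.exists(path):
            urllib.request.urlretrieve(MIRROR + name, path)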
