
Incomplete download of MNIST data #1875

Closed
mattwescott opened this issue May 21, 2019 · 9 comments · Fixed by #1882

Comments

@mattwescott
Contributor

mattwescott commented May 21, 2019

Description

make test-examples fails to fully download the MNIST data dependency for the AIR example.

Details

On osx==10.14.1 and docker==18.09.2

git checkout 3ac5a02e0e6b0a11ae796707413c11df2c14ff6b
make build pyro_branch=dev pytorch_branch=release python_version=3.6
make run pyro_branch=dev pytorch_branch=release python_version=3.6
cd pyro
pip install numpy==1.15
pip install scipy==1.2
make test-examples > output 2>&1

Output

Resulting files in .data

$ ls -lh
-rw-r--r--  1 matt  staff    19M May 21 15:03 train-images-idx3-ubyte.gz.part
@neerajprad
Member

I think the observations library is trying to download the files from http://yann.lecun.com/exdb/mnist/. Is this an intermittent failure, or does it fail on multiple retries? I do not have a better solution for this other than moving away from observations.

@null-a - I think only the AIR example is using observations. Do we have any other sources for the multi-MNIST dataset? The dataset generation code seems straightforward, so if not, we can write our own data loader for this. Also see #1871.

@mattwescott
Contributor Author

It seemed to fail on multiple retries, and IIRC the same error came up outside of Docker yesterday.

Looks like the download might be coming from here...

@patch_dependency('torchvision.datasets.MNIST', torchvision)

The original output seemed to indicate that the download was successful, with the load attempt then failing because .part was all that made it into the folder.

@neerajprad
Member

Looks like the download might be coming from here...

The AIR example does not use torchvision.datasets; the error is coming from the observations library download, which seems to be a Google Storage link mirroring the original site.

@null-a
Collaborator

null-a commented May 22, 2019

I'm not sure what's going on here; this is working consistently for me at the moment (dev branch, no Docker, macOS). Out of interest, are you able to grab the file that failed manually, with something like curl -O https://storage.googleapis.com/cvdf-datasets/mnist/train-images-idx3-ubyte.gz? If so, perhaps you could download all four .gz files to the examples/air/.data directory manually as a work-around?

Do we have any other sources for the multi-MNIST dataset?

@neerajprad I'm not aware of a canonical version of this; everyone seems to roll their own from MNIST, and of those I don't know of any that are hosted in a suitable way.

The dataset generation code seems straightforward

That's right. In my original implementation I just generated the data on demand from the MNIST data available via PyTorch; only later was this switched over to observations. Are you suggesting we pretty much revert to something like that? If so, I could make a PR for that if you like. (Also happy for someone else to do it, of course.)
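For concreteness, a rough sketch of what generating the data on demand from torchvision's MNIST could look like; the canvas size, digit count, and helper name below are illustrative assumptions, not the original AIR code:

# Illustrative sketch only: paste random MNIST digits onto a larger blank canvas.
# Canvas size, digit count, and names are assumptions, not Pyro's AIR code.
import numpy as np
from torchvision import datasets

def make_multi_mnist_image(digits, canvas_size=50, max_digits=2, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    canvas = np.zeros((canvas_size, canvas_size), dtype=np.uint8)
    for _ in range(rng.integers(0, max_digits + 1)):       # 0..max_digits digits per image
        digit = digits[rng.integers(len(digits))]          # one 28x28 uint8 image
        x, y = rng.integers(0, canvas_size - 28, size=2)   # random top-left corner
        region = canvas[y:y + 28, x:x + 28]
        np.copyto(region, np.maximum(region, digit))       # overlay by pixel-wise max
    return canvas

mnist = datasets.MNIST(".data", train=True, download=True)
digits = mnist.data.numpy()   # (60000, 28, 28); older torchvision exposes .train_data
sample = make_multi_mnist_image(digits)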

@mattwescott
Contributor Author

This time I saved the initial error output, which might be more helpful:

$ make build pyro_branch=dev pytorch_branch=release python_version=3.6
$ make run pyro_branch=dev pytorch_branch=release python_version=3.6
$ pip install numpy==1.15.0 #  prevent binary warnings that are escalated into errors #1873
$ curl -O https://storage.googleapis.com/cvdf-datasets/mnist/train-images-idx3-ubyte.gz 
$ echo $?
0
$ make test-examples > make_text_examples.txt 2>&1

make_text_examples.txt

@mattwescott
Contributor Author

If so perhaps you could download all four .gz files to the examples/air/.data directory manually as a work-around?

I'm not blocked by this, but I'm reluctant to add more environment workarounds. For now I'll remove -x from make test so it doesn't bail on the first failure; it seems easier to track specific test failures than environment state.

As an aside, I've seen it work well to run the CI tests inside the Docker container(s). IIRC, we factored out a base container that was rebuilt infrequently, and on each push rebuilt the application container with --no-cache to closely match a fresh dev environment, running the tests with the same flow and commands used locally.

@neerajprad
Member

Are you suggesting we pretty much revert to something like that? If so, I could make a PR for that if you like? (Also happy for someone else to do it, of course.)

Thanks, @null-a, that would be great! The observations library is deprecated in favor of tensorflow datasets, which doesn't have multi-MNIST at the moment, and it might be best to avoid adding tensorflow as a dependency anyway. I think we can just reinstate your original dataset creation utility in the pyro.contrib.examples module. We already have the MNIST dataset in S3, backed by CloudFront, at this url. Could you add a PR reinstating your earlier code? I'll be happy to contribute any fixes required and get it merged.
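If that utility were reinstated under pyro.contrib.examples, it might look roughly like the following; the mirror URL is only a placeholder for the CloudFront address mentioned above, and the function name is hypothetical:

# Hypothetical sketch of a loader under pyro.contrib.examples. The mirror URL
# is a placeholder for the CloudFront address mentioned above, not a real one.
import os
import urllib.request

MNIST_MIRROR = "https://<pyro-cloudfront-mirror>/mnist/"   # placeholder URL

def load_mnist(root=".data"):
    """Download the raw MNIST archives into `root` if they are not cached."""
    os.makedirs(root, exist_ok=True)
    for name in ("train-images-idx3-ubyte.gz", "train-labels-idx1-ubyte.gz"):
        path = os.path.join(root, name)
        if not os.path.exists(path):
            urllib.request.urlretrieve(MNIST_MIRROR + name, path)
    return root   # the AIR example would parse these files and build multi-MNIST from them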

@null-a
Collaborator

null-a commented May 23, 2019

I wonder if the original issue @mattwescott is running into is a race condition that occurs when trying to fetch a dataset from multiple processes. Trying this locally, if I clear out the directory in which the VAE example caches data (rm -rf examples/vae/utils/.data/) and run pytest tests/test_examples.py -k 'vae/vae.py', it succeeds. However, clearing the cached data and running pytest -n auto tests/test_examples.py -k 'vae/vae.py' results in a failure that looks quite similar to the one reported. (Something similar happens with the AIR example.) @neerajprad Is it possible that this could be the problem without it biting us elsewhere?
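If that is indeed the cause, one way to make concurrent workers safe would be to serialize the download behind an exclusive file lock; the sketch below is only an illustration of the idea, not code that exists in Pyro:

# Sketch: serialize dataset downloads across pytest-xdist workers with a file
# lock so concurrent processes don't clobber each other's .part files.
# fcntl is POSIX-only; the helper name is hypothetical.
import fcntl
import os

def download_once(target_dir, fetch):
    """Call fetch(target_dir) at most once across concurrent processes."""
    os.makedirs(target_dir, exist_ok=True)
    lock_path = os.path.join(target_dir, ".download.lock")
    done_path = os.path.join(target_dir, ".complete")
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)      # block until this process holds the lock
        try:
            if not os.path.exists(done_path):
                fetch(target_dir)             # whatever actually downloads the files
                open(done_path, "w").close()  # mark the cache as fully populated
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)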

@neerajprad
Member

neerajprad commented May 23, 2019

@null-a: I think you are right; that is exactly what is happening here, since we have multiple tests for each example (there are two for AIR). This probably leads to a race condition when two subprocesses try to download to the same directory. I think we should change make test-examples to not run pytest with xdist; given that our examples are compute-heavy, I doubt there is much benefit to distributing them across cores anyway. I will test and make this change shortly.

EDIT: Running with xdist halves the time it takes to run test-examples on my laptop with 12 cores, from 21 minutes to 10 minutes. If there were a way to group tests to run on the same xdist worker, that would be ideal, but support for this feature is lacking; see pytest-dev/pytest-xdist#365. I think it is best to change the default to not use xdist so that new users aren't tripped up. There are other ways around this, for instance populating all the datasets beforehand, but I'm not sure it is worth spending too much time on this.
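For reference, the "populate all the datasets beforehand" alternative could be as simple as a serial prefetch script run before pytest -n auto; the mirror and cache directories below are the ones mentioned in this thread, but the exact layout each example expects is an assumption:

# Sketch of prefetching serially before running pytest -n auto, so no two xdist
# workers ever race on the same download. Cache-directory layout is an assumption.
import os
import urllib.request

MIRROR = "https://storage.googleapis.com/cvdf-datasets/mnist/"
FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]
CACHE_DIRS = ["examples/air/.data", "examples/vae/utils/.data"]

for cache in CACHE_DIRS:
    os.makedirs(cache, exist_ok=True)
    for name in FILES:
        path = os.path.join(cache, name)
        if not os.path.exists(path):
            urllib.request.urlretrieve(MIRROR + name, path)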
