
How to run CaffeOnSpark with a pre-existing model? #265

Open
lakshya97 opened this issue Jun 27, 2017 · 5 comments

@lakshya97

lakshya97 commented Jun 27, 2017

Hi,

I'm looking for a way to separate the training and testing phases in CaffeOnSpark. In other words, I'd like to create an MNIST model, train it in one phase, and then test it in another (saving that model so it can be tested against different data). Is it possible to do this without interleaving the data (as is done in the wiki example)? For example, first I would train the model and save it without testing anything. Then I could reuse that existing model (without retraining on the same training data all over again) on multiple different test datasets.

Is there a way to do this? Additionally, regardless of the separation of the phases, is there a way to use an existing, trained CaffeOnSpark model on new data (instead of training an entirely new model each time you wish to run it)? How could I do this, and what commands would I need to modify?

Thanks!

@junshi15
Collaborator

Yes, you can test with an existing model.
https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_EC2
You just need to remove the "-train -persistent" options from the training command and point "-model" at the existing model file. Note that the first "hadoop fs -rm" below deletes only the old test output, not the model itself:

hadoop fs -rm -r -f /cifar10_test_result
spark-submit --master ${MASTER_URL} \
    --files cifar10_quick_solver.prototxt,cifar10_quick_train_test.prototxt,mean.binaryproto \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -test \
        -conf cifar10_quick_solver.prototxt \
        -clusterSize ${SPARK_WORKER_INSTANCES} \
        -devices ${DEVICES} \
        -connection ethernet \
        -model /cifar10.model.h5 \
        -output /cifar10_test_result
hadoop fs -ls /cifar10.model.h5
hadoop fs -cat /cifar10_test_result

@lakshya97
Author

What about LMDB on YARN? I imagine it would be similar, but we would remove the lines reading

-train
-features accuracy,loss -label label

and replace them with just -test, right?

Are there any other files we would need to modify? Also, where would we store the model (is there any need to remove it from HDFS?), and how would we tell CaffeOnSpark to read from that model instead of generating a new one? I thought we would remove the line reading "hadoop fs -rm -f hdfs:///mnist.model" if we want it to read from the existing model stored in HDFS, but is this wrong?

Thank you!!
(Below is what I imagine it should look like.)

hadoop fs -rm -r -f hdfs:///mnist_features_result
spark-submit --master yarn --deploy-mode cluster \
    --num-executors ${SPARK_WORKER_INSTANCES} \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -test \
        -conf lenet_memory_solver.prototxt \
        -devices ${DEVICES} \
        -connection ethernet \
        -model hdfs:///mnist.model \
        -output hdfs:///mnist_features_result
hadoop fs -ls hdfs:///mnist.model
hadoop fs -cat hdfs:///mnist_features_result/*

@junshi15
Collaborator

"LMDB" is a data format, to use it, you need change "source_class" in lenet_memory_train_test.prototxt. We do not recommend "LMDB" for large data set since it is not a distributed data format.

"hadoop fs -rm" deletes the file/directory, if you don't want to delete it, don't do it. Note the job will fail if your program writes to an existing directory, since overwriting is not allowed.

Only "-train" generates new model. "-test" and "-features" read the provided model. Don't delete the existing model if you use either "-test" or "-features" since it won't be able to read it.

@lakshya97
Author

Thank you, I did all that and it runs fine now :).
One other question: you said LMDB is not a distributed data format, but Spark still partitions the work across the workers, so we can still use it for distributed learning, right? I am finding that when I use 3 nodes for a ~1 GB LMDB file, it is much faster than using 1 node (I keep the batch size the same at 64, so I get 3x the throughput per iteration and should therefore need 1/3 of the original number of iterations). Am I wrong?

Thank you

@junshi15
Collaborator

CaffeOnSpark will copy the entire LMDB file to all executors, since we cannot really partition it without reading it first, as opposed to a DataFrame or a SequenceFile, where you can read just part of the file.

https://github.com/yahoo/CaffeOnSpark/blob/master/caffe-grid/src/main/scala/com/yahoo/ml/caffe/LmdbRDD.scala#L43

Spark does partition the data afterwards, so each executor only processes its own partitions.
You effectively used a 3x batch size, so you may want to look at your accuracy; sometimes you need to tweak the learning rate, and you may need a little more than 1/3 of the original total number of iterations.
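
To make the arithmetic concrete (assuming the batch_size of 64 from this thread):

examples per iteration, 1 node:  1 x 64 = 64
examples per iteration, 3 nodes: 3 x 64 = 192
iterations per epoch, 3 nodes:   N / 192, i.e. one third of N / 64

Each epoch therefore needs a third as many iterations, but every weight update now averages gradients over 192 examples instead of 64. That larger effective batch is why the learning rate may need retuning, and why the total iteration count to reach the same accuracy usually lands somewhat above 1/3 of the original.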
