Snapshot model weights/solver state to HDF5 files #2836
Conversation
Force-pushed from 0241c9f to 86cdba4.
This satisfies part of #1211.
Can I access the network's blobs using HDF5? If so, please show an example.
I've skimmed through this and it mostly looks good, thanks @erictzeng. My one piece of feedback right now is that …
<< "Error reading weights from " << trained_filename; | ||
// Check that source layer doesn't have more params than target layer | ||
int num_source_params = hdf5_get_num_links(layer_hid); | ||
CHECK_LE(num_source_params, target_blobs.size()) |
Should this check equality? You might want to know, for instance, that the source layer has a bias but the target does not. Sorry, the check in lines 799-808 covers the rest.
This will be a good switch, and the backward compatibility saves a lot of heartache, but we might consider bringing the documentation and examples along with us, as there are references to the current extensions here and there. This looks good to me code-wise (once Jeff's comment is addressed), but you could squash related changes and fixes when you're done. Since the weight sharing tests don't cover save and restore, it could be worth adding a test for that too. Thanks @erictzeng!
The tests I added in #2866 do cover this (though they're less unit tests and more integration tests than what you propose, as they also rely on the solver snapshot/restore correctness).
@jeffdonahue oh sweet, …
@bhack this lets us keep the same dependencies and interface for defining models. Migrating away from protobuf to a new format needs a good argument and its own issue, since model definitions would change.
@shelhamer FlatBuffers supports .proto parsing for easier migration from Protocol Buffers.
Force-pushed from 5511688 to 6799ddc.
That should be all comments addressed! The constant has been lowered to 32 as requested, and history has been squashed. Let me know if anything else seems off. @Yeongtae I'm not sure I fully understand what you're asking, but this PR allows you to access network parameters via HDF5, if that's what you want. The parameters are stored in a fairly simple structure. Here's how you'd peek at the conv1 parameters in lenet:
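A minimal h5py sketch along those lines (the snapshot filename is a placeholder; parameters live under `/data/<layer name>/<param index>` per this PR's layout):

```python
import h5py

# Placeholder filename; any weights snapshot produced by this PR should work.
with h5py.File('lenet_iter_10000.caffemodel.h5', 'r') as f:
    conv1 = f['data']['conv1']    # one group per layer under /data
    weights = conv1['0'][...]     # dataset 0: the layer's weights
    biases = conv1['1'][...]      # dataset 1: the layer's biases
    print(weights.shape, biases.shape)
```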
The datasets 0 and 1 correspond to the weights and biases of the layer, respectively.
```cpp
                                  H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
CHECK_GE(layer_data_hid, 0)
    << "Error saving weights to " << filename << ".";
hid_t layer_diff_hid = H5Gcreate2(diff_hid, layer_name.c_str(),
```
Shouldn't the `diff` dataset only be created if `write_diff` is set?
Force-pushed from 6799ddc to 73058c8.
Summary of changes:

- HDF5 helper functions were moved into a separate file, util/hdf5.cpp
- hdf5_save_nd_dataset now saves N-D blobs and can save diffs instead of data
- Minor fix for a memory leak in the HDF5 functions (delete instead of delete[])
- Extra methods have been added to both Net and Solver, enabling snapshotting and restoring from HDF5 files
- snapshot_format was added to SolverParameter, with possible values HDF5 or BINARYPROTO (default HDF5)
- kMaxBlobAxes was reduced to 32 to match the limitations of HDF5
Force-pushed from 73058c8 to c9b333e.
Everything looks good, thanks Eric!
My vote still goes to FlatBuffers as a natural Google successor to protobuf. But with this merge, HDF5 is the de facto standard for Caffe models now, and nobody replied about the evaluation process for a protobuf substitute.
Adapt HDF5DataLayer Prefetch to BVLC#2836
What about a Python interface for saving a net to HDF5? This could be useful for "net surgery".
However, I got this error: …
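In the meantime, a rough pycaffe sketch of what such saving could look like, mirroring the `/data/<layer name>/<param index>` layout the snapshots use (the prototxt and weights paths are placeholders, and no dedicated save-to-HDF5 method is assumed to exist):

```python
import h5py
import caffe  # assumes pycaffe is built and on PYTHONPATH

# Placeholder paths; substitute your own model definition and weights.
net = caffe.Net('lenet.prototxt', 'lenet.caffemodel', caffe.TEST)

# Write one group per layer, with datasets "0", "1", ... for its params.
with h5py.File('surgery.caffemodel.h5', 'w') as f:
    data = f.create_group('data')
    for layer_name, blobs in net.params.items():
        g = data.create_group(layer_name)
        for i, blob in enumerate(blobs):
            g.create_dataset(str(i), data=blob.data)
```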
@shelhamer There seem to be some "hiccups" with snapshotting to the HDF5 format.
@shaibagon I'm not aware of any issue, so could you post an issue with details to reproduce the problem with Caffe master? I don't know anything about the OpenCV DNN package mentioned at that SO link. Please mention @erictzeng in the issue as the author of this PR.
Snapshot model weights/solver state to HDF5 files

* erictzeng/hdf5_snapshot: (29 commits)
  Update example bash scripts to expect .h5, new extensions in .gitignore
  TestSnapshot expects .h5 snapshots, explicitly checks history.
  Snapshot model weights/solver state to HDF5 files.
  TestGradientBasedSolver: add TestSnapshot to verify behavior when restoring net/solver from snapshot
  add double_data, double_diff to BlobProto for weights/snapshots saved when using Dtype == double
  Fix typo
  PythonLayer takes parameters by string
  [pytest] open exception file with mode for python3
  [pycaffe,build] include Python first in caffe tool
  ImageData layer default batch size of 1, and check for zero batch size
  Change log levels in upgrade_proto
  [docs] add CONTRIBUTING.md which will appear on GitHub new Issue/PR pages
  [docs] fix contrastive loss eq
  [docs] fix lmdb fetch url and path
  [docs] clear up PYTHONPATH confusion
  Fix path to mnist_autoencoder.prototxt
  [docs] set lmdb url to github mirror
  [docs] matlab 2015a compatible
  Travis scripts for python3 and pytest for cmake. Also fixes CUDA CMake build issue BVLC#2722.
  [examples] fix link to point to new tutorial notebook
  ...

Conflicts:
  .travis.yml
  include/caffe/python_layer.hpp
  scripts/travis/travis_build_and_test.sh
  scripts/travis/travis_install.sh
  src/caffe/proto/caffe.proto
  src/caffe/solver.cpp
  src/caffe/test/test_gradient_based_solver.cpp
  tools/caffe.cpp
This pull request enables Caffe to snapshot model weights and solver states to HDF5 files and makes this format the default. This format provides a number of advantages.

To avoid confusion with the old snapshotting methods, snapshotting to HDF5 files adopts new file extensions, namely `.caffemodel.h5` and `.solverstate.h5`. When restoring either weights or solver history from a file, the extension of the file is checked. If the extension is `.h5`, it is loaded as an HDF5 file. All other extensions are treated as a binary protobuf file and loaded as before.

The default snapshot format is switched to HDF5 in this PR. If you prefer the old method, you can add `snapshot_format: BINARYPROTO` to your solver prototxt to restore binary protobuf snapshotting, as in the sketch below.
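A hypothetical solver prototxt fragment (the snapshot interval and prefix are placeholders; only `snapshot_format` is introduced by this PR):

```
# Snapshot every 5000 iterations to files named snapshots/lenet_iter_*.
snapshot: 5000
snapshot_prefix: "snapshots/lenet"
# Revert to the pre-PR binary protobuf snapshots; omit for the new HDF5 default.
snapshot_format: BINARYPROTO
```

Training can then be resumed from either format with the usual `caffe train --solver=... --snapshot=...` invocation, since the file extension determines how the snapshot is loaded.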
A few miscellaneous details:

- Snapshot/restore behavior is verified by a new `TestSnapshot` test for gradient-based solvers.
- The HDF5 helper functions formerly in `util/io.cpp` have been moved out to their own file, `util/hdf5.cpp`, and additional helper functions have been added.
- The interfaces of the `Net` and the `Solver` have changed, since we now have methods for both BinaryProto and HDF5. Everything in Caffe checks out, but downstream users who have implemented their own non-SGD solvers/solvers with nonstandard snapshotting may have a bad time.

Potential caveats:

- This PR changes the behavior of `hdf5_save_nd_dataset`. Previously, said function always saved 4-D blobs. It has since been changed to save N-D blobs instead. This could potentially break people's workflows if they were relying on HDF5OutputLayers to output 4-D blobs.
- There aren't any tests that compare the loaded solver history, though the saved state can be inspected by hand, as sketched below.
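A hedged h5py sketch for that manual inspection (the filename is a placeholder, and the dataset names assume the layout the SGD solver writes in this PR: `iter`, `learned_net`, and `current_step` at the root, plus a `history` group):

```python
import h5py

# Placeholder filename for a solver state snapshot produced by this PR.
with h5py.File('lenet_iter_5000.solverstate.h5', 'r') as f:
    print('iter:', f['iter'][...])                # iteration of the snapshot
    print('learned_net:', f['learned_net'][...])  # path to the paired weights file
    for name, ds in f['history'].items():         # one dataset per history blob
        print('history', name, ds.shape)
```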
Possible extensions

These extensions won't end up in this PR, but possible things to do after this wraps up: