Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow H5T_INTEGER in HDF5 files #2978

Merged
merged 2 commits into from
Sep 24, 2015
Merged

Conversation

lukeyeager
Copy link
Contributor

Caffe is currently hardcoded to only support the H5T_FLOAT datatype class (originally added by @sergeyk in #203). All I had to do was allow the H5T_INTEGER class and now I can use HDF5 datasets created with dtype='uint8'. That lets me create datasets with much smaller filesizes, comparable to LMDB sizes with uncompressed data (see comparison here - NVIDIA/DIGITS#226 (comment)).

Was there a particular reason that H5T_INTEGER was disallowed?

CHECK_EQ(class_, H5T_FLOAT) << "Expected float or double data";
switch (class_) {
case H5T_FLOAT:
LOG(INFO) << "Datatype class: H5T_FLOAT";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remove this and the below LOG(INFO)s (or change to VLOG, or one of the LOG variations that only happens the first time it's hit, if you prefer)? I think this creates too much noise while training nets with an HDF5DataLayer, especially if using relatively small HDF5 files.

@jeffdonahue
Copy link
Contributor

This LGTM except as commented above. Thanks @lukeyeager!

@lukeyeager
Copy link
Contributor Author

Updated LOG to LOG_FIRST_N. Neat trick!

@jeffdonahue
Copy link
Contributor

Great, thanks again @lukeyeager!

jeffdonahue added a commit that referenced this pull request Sep 24, 2015
Allow H5T_INTEGER in HDF5 files
@jeffdonahue jeffdonahue merged commit 349ff65 into BVLC:master Sep 24, 2015
f.write(script_dir + '/sample_data.h5\n')
f.write(script_dir + '/sample_data_2_gzip.h5\n')
f.write('src/caffe/test/test_data/sample_data.h5\n')
f.write('src/caffe/test/test_data/sample_uint8_gzip.h5\n')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops -- I missed this in my review: we should have kept the script_dir-relative paths here and below. I'll fix this in the near future (or feel free to send a patch, @lukeyeager or anyone else).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like I had a good reason for doing this, but I can't remember what it was now...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, this lets you run the generate_sample_data.py script from the src/caffe/test/test_data/ directory and still generate the correct paths in the text file. Otherwise, if you run the script again, the paths in the textfiles change from:

src/caffe/test/test_data/sample_data.h5

to:

/home/lyeager/caffe/caffe/src/caffe/test/test_data/sample_data.h5

I chose to fix it this way. The other way to fix it would be to remove the os.path.abspath() call on line 8. Would you rather me fix it that way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah...thanks for the explanation, I didn't realize that was an issue when I merged #2274. In retrospect I think we should have just stuck with the existing behavior which required this script to be run from Caffe root, unless we were going to move to running this script to generate the data on-the-fly at test time (rather than tracking the test data files as we do now), which I think is probably a better way to do it (but might be rather involved w.r.t. the potential payoff...). Anyway, I agree with you & retract my previous suggestion, and I think we should probably revert most or all of #2274 since we can't use the script_dir-relative paths everywhere.

@lukeyeager lukeyeager deleted the h5t_integer branch September 24, 2015 20:20
lukeyeager added a commit to lukeyeager/caffe that referenced this pull request Sep 24, 2015
aidangomez pushed a commit to aidangomez/caffe that referenced this pull request Sep 28, 2015
@ih4cku
Copy link
Contributor

ih4cku commented Oct 27, 2015

Is it ok to use uint8 for hdf5 data when caffe only uses H5LTread_dataset_float and H5LTread_dataset_double?

@lukeyeager
Copy link
Contributor Author

Try it out and let me know. It works for me, so I'm assuming H5LTread_dataset_float is able to read uint8 (or really H5T_INTEGER) data and automatically convert it to float32, which makes sense.

@ih4cku
Copy link
Contributor

ih4cku commented Oct 27, 2015

I made a simple test, no problem.
More specifically, I create the h5 file with h5py filled with uint8 data, read with C++ into a float array and check each value.

I had been lazy not verifying myself. Thanks for adding this feature.@lukeyeager

acmiyaguchi pushed a commit to acmiyaguchi/caffe that referenced this pull request Nov 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants