leveldb/lmdb refactoring #1238

Merged (22 commits) Oct 15, 2014

Conversation

@kmatzen (Contributor) commented Oct 7, 2014

This refactoring places leveldb and lmdb access behind a common interface. The interface supports:

  • open
  • close
  • put
  • sequential scan

It would be nice to use this interface to implement an even simpler database where binary blobs are written sequentially to a file and then mmapped in during training. It's not clear why leveldb and lmdb are used when all that's being done is a sequential scan, and I'd like timing measurements to learn which backend is best.
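As a rough sketch, a common interface along these lines might look like the following. This is illustrative only; buffer_t, Mode, and DatabaseFactory mirror names discussed later in this thread, not the exact merged code.

#include <memory>
#include <string>
#include <vector>

class Database {
 public:
  typedef std::vector<char> buffer_t;
  enum Mode { New, ReadWrite, ReadOnly };

  class const_iterator;  // sequential scan over (key, value) pairs

  virtual ~Database() {}

  virtual void open(const std::string& filename, Mode mode) = 0;
  virtual void close() = 0;
  // Non-const pointers mirror LMDB's MDB_val, whose data pointer is not
  // const; see the interface discussion further down the thread.
  virtual void put(buffer_t* key, buffer_t* value) = 0;
  virtual void commit() = 0;

  virtual const_iterator begin() const = 0;  // start of sequential scan
  virtual const_iterator end() const = 0;
};

// Callers pick a backend by name, e.g.
//   shared_ptr<Database> database = DatabaseFactory("leveldb");
std::shared_ptr<Database> DatabaseFactory(const std::string& type);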

There are also a few bug fixes scattered throughout this pull request: io.hpp allocates too little memory for a strcpy, there's a delete/delete[] mismatch, OpenCV 3.0 alpha doesn't work as-is, and more. See #1261.

Existing tests pass, but none of the tools are tested as far as I can tell. This patch might also break any special leveldb or lmdb configurations: there was no explanation of why some flags were chosen, so I copied and pasted them blindly.

@shelhamer (Member) commented:

Nice refactoring! Can you switch the cifar10 data to lmdb while you're at it since Caffe is shifting to lmdb defaults everywhere? #1128 #1131.

@shelhamer (Member) commented:

@kmatzen please make another PR of your misc. fixes in the first commit. Regarding OpenCV, please take a look at #1247 and compare.

@kmatzen (Contributor, Author) commented Oct 10, 2014

@shelhamer Sure, no problem. #1261 has the OpenCV 3 and other minor fixes. I updated create_cifar10.sh to use lmdb, but I did not touch other build scripts.

@shelhamer (Member) commented:

@kmatzen please squash the lint commit and drop c4b632a now that it's a separate PR so this'll be clear to merge.

@sguada (Contributor) commented Oct 11, 2014

@kmatzen I think this refactoring, combined with #1239, could allow ImageDataLayer to be included as another type of DB, one which in reality doesn't store any data and just reads the needed files when requested.

CHECK(status.ok()) << "Failed to open leveldb " << leveldb_names[i];
feature_dbs.push_back(shared_ptr<leveldb::DB>(db));
LOG(INFO)<< "Opening database " << database_names[i];
shared_ptr<Database> database = DatabaseFactory("leveldb");
Review comment (Contributor):

One should be able to specify the type of database instead of assuming "leveldb".
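For illustration, a change in the spirit of this comment might read the backend from a command-line flag rather than hard-coding it, reusing the Database/DatabaseFactory sketch shown earlier in this thread. The flag name, its default, and the helper function are assumptions, not code from the PR.

#include <gflags/gflags.h>
#include <memory>
#include <string>

DEFINE_string(backend, "lmdb", "Database backend to use: leveldb or lmdb");

std::shared_ptr<Database> OpenFeatureDatabase(const std::string& name) {
  // Pick the backend from the flag instead of assuming "leveldb".
  std::shared_ptr<Database> database = DatabaseFactory(FLAGS_backend);
  database->open(name, Database::ReadWrite);
  return database;
}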


@sguada (Contributor) commented Oct 12, 2014

Overall a nice PR. @kmatzen, please address the comments above before merging.

@kmatzen (Contributor, Author) commented Oct 12, 2014

Just a few more comments about parts of the interface that I don't quite like.

(1) If there is a failure anywhere (e.g., committing when the DB was opened in read-only mode), the whole thing just fails on a CHECK. It would be nice to replace that behavior with either exceptions or error codes that the caller is then responsible for handling (e.g., LevelDB's Status). I don't really see either in Caffe's existing codebase. What's the accepted method?
(2) The key and value for put and the key for get are all passed by non-const pointer. I did this since LMDB's MDB_val.mv_data is a non-const void*. I'd rather pass these by const reference. Any recommendations? I could pass by const reference and then make a copy inside put or get for LMDB. LevelDB is fine since Slice takes const char*.
(3) The iterator invalidation behavior is going to lead to bugs if used incorrectly. LevelDB iterators must be deallocated before the database is deallocated. If you create an iterator, then the database must own and track that iterator, but creating a const_iterator usually implies that you are not mutating the collection, or that the collection itself is const. Maybe add a mutable vector<shared_ptr<leveldb::Iterator> > collection to the LeveldbDatabase and store a weak_ptr<leveldb::Iterator> in the LeveldbState?
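A minimal sketch of the tracking idea in (3), assuming hypothetical LeveldbDatabase and LeveldbState classes along the lines discussed here rather than the PR's actual code:

#include <memory>
#include <vector>
#include "leveldb/db.h"

class LeveldbState {
 public:
  explicit LeveldbState(std::weak_ptr<leveldb::Iterator> iter) : iter_(iter) {}
  bool valid() const { return !iter_.expired(); }  // iterator still alive?

 private:
  std::weak_ptr<leveldb::Iterator> iter_;  // non-owning view of the iterator
};

class LeveldbDatabase {
 public:
  explicit LeveldbDatabase(leveldb::DB* db) : db_(db) {}

  // Creating an iterator is logically const, so the tracking vector is
  // mutable; the database owns every iterator it hands out.
  LeveldbState NewState() const {
    std::shared_ptr<leveldb::Iterator> iter(
        db_->NewIterator(leveldb::ReadOptions()));
    iterators_.push_back(iter);
    return LeveldbState(iter);
  }

  ~LeveldbDatabase() {
    iterators_.clear();  // destroy all iterators before the leveldb::DB
    db_.reset();
  }

 private:
  std::unique_ptr<leveldb::DB> db_;
  mutable std::vector<std::shared_ptr<leveldb::Iterator> > iterators_;
};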

@sguada mentioned this pull request Oct 12, 2014
@sguada (Contributor) commented Oct 12, 2014

@kmatzen thanks for your updates. Regarding your questions:

  1. Not sure about the CHECK macros; in general, a CHECK should happen only when it is crucial for valid execution. If the DB cannot be opened, it doesn't make sense to continue; however, when checking whether a key exists, maybe it makes sense not to fail and to return false instead.
  2. I think the arguments to put and get should be consistent with our standard: const buffer_t& for inputs and buffer_t* for outputs. Maybe put and get could return a bool to indicate whether they succeed or fail.
  3. Not sure about this one; I don't fully follow the logic behind it, but maybe adding a check before deallocating the database could be enough.

One more thing: I'm not sure buffer_t is the best name for the type, since these are vector<char>. What about vector_char, or even better, having key_type and value_type, which in this case would both be vector<char>?

Would it be possible to get a vector with all the keys by calling vector<key_type> keys()?
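For concreteness, the signatures proposed in this comment might look roughly like this; the names follow the discussion above, not the merged code:

#include <vector>

class Database {
 public:
  typedef std::vector<char> key_type;
  typedef std::vector<char> value_type;

  virtual ~Database() {}

  // Inputs by const reference, outputs by pointer, bool to signal success.
  virtual bool put(const key_type& key, const value_type& value) = 0;
  virtual bool get(const key_type& key, value_type* value) = 0;

  // All keys currently stored in the database.
  virtual std::vector<key_type> keys() const = 0;
};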

@kloudkl (Contributor) commented Oct 13, 2014

This PR makes it possible to simplify #1074 and provide a simple solution to #1155.

Edit: I just read the code. #1155 has been solved.

@kloudkl (Contributor) commented Oct 13, 2014

This is a very good example proving the power of "programming to an interface, not an implementation".

@kloudkl (Contributor) commented Oct 13, 2014

The tests for the different DB implementations are almost the same; they could reuse the common parts.

};

typedef vector<char> key_type;
typedef vector<char> value_type;
Review comment (Contributor):

What's the difference between vector<char> and string? It seems that you are following the conventions of the standard template library. Why not make Database more generic?

template<typename KeyType, typename ValueType>
class Database {
  ...
};
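For illustration, usage under that suggestion might look like the following (hypothetical; the PR as merged fixes both types to vector<char>):

#include <string>
#include "caffe/proto/caffe.pb.h"  // for caffe::Datum

template <typename KeyType, typename ValueType>
class Database;  // as sketched in the comment above

// Hypothetical instantiations:
typedef Database<std::string, std::string> StringDatabase;
typedef Database<std::string, caffe::Datum> DatumDatabase;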


@kloudkl (Contributor) commented Oct 13, 2014

Please pay attention to properly ordering the keys as described in #1158.

@kloudkl (Contributor) commented Oct 13, 2014

Besides the extract_features tool, there are also a few other tools that directly use the raw data store.

@kloudkl (Contributor) commented Oct 13, 2014

#1266 tries to get input data from a remote service, which is not a database. A more generic abstraction would be better called Dataset.

Kevin James Matzen added 3 commits October 14, 2014 19:40
…finable KCoder and VCoder which default to a set of DefaultCoder's based on types K and V. Reworked the DefaultCoder's such that if none are available, a static assertion fails with a relevant message.
@kmatzen (Contributor, Author) commented Oct 14, 2014

Rebased

@sguada (Contributor) commented Oct 15, 2014

@kmatzen is this PR ready to be reviewed again and merged?

@kmatzen (Contributor, Author) commented Oct 15, 2014

Yes, go for it.

sguada added a commit that referenced this pull request Oct 15, 2014
leveldb/lmdb refactoring
@sguada merged commit a23c9bf into BVLC:dev Oct 15, 2014
@shelhamer (Member) commented:

While this PR has unified the leveldb and lmdb interfaces, the joint interface has yielded a net addition of 1,000+ lines of code while introducing new classes. In my opinion it's desirable to cut this back and simplify, as I'm having a hard time reasoning about the DB operations now -- although I admit that could be my own failing.

My concern is sharpened by the LMDB crash I am now encountering with my reshaping data layer (#1313):

F1017 02:29:37.982517 28564 lmdb_dataset.cpp:274] Check failed: 0 == retval (0 vs. -30790) mdb_txn_begin failed MDB_READERS_FULL: Environment maxreaders limit reached

although all the layer does is use iter_->value. This could be my fault in misusing the interface, but I'm troubled by the amount of effort I am now spending to figure this out.
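For reference, MDB_READERS_FULL (-30790) means the LMDB environment has run out of reader-table slots, which typically happens when read transactions are opened but never aborted or committed. A hedged sketch of the usual mitigations using the plain LMDB C API, not the PR's code:

#include <stddef.h>
#include <lmdb.h>

void OpenEnv(MDB_env** env, const char* path) {
  mdb_env_create(env);
  // Must be called after mdb_env_create and before mdb_env_open;
  // the default reader-table size is 126.
  mdb_env_set_maxreaders(*env, 1024);
  mdb_env_open(*env, path, MDB_RDONLY, 0664);
}

void ScanOnce(MDB_env* env) {
  MDB_txn* txn = NULL;
  mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
  /* ... read keys/values via mdb_get or a cursor ... */
  mdb_txn_abort(txn);  // always release the reader slot when done
}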

@sguada (Contributor) commented Oct 18, 2014

@shelhamer I also spent quite a bit of time trying to understand it. I think we need more tests for this; for instance, I had a similar error while calling first_key() and last_key(), which I only discovered because I had a test that asked for both.

Maybe the deserializer, which shares data to avoid copies, is introducing some undesired side effects.

@shelhamer do you encounter the same error when using leveldb?

@kmatzen (Contributor, Author) commented Oct 18, 2014

See #1319 and let me know if that helps.

@longjon (Contributor) commented Oct 18, 2014

With @shelhamer, I am concerned about the haste with which we are merging code here and in #1239.

I had hoped this PR would reduce the duplication and overhead of working with the databases, resulting in a mildly negative net number of lines of code added. Instead we are up 1400. That's too much for me to give a full and coherent analysis of right now, but here are a few concerns that come to mind on a first read:

  • We have type parameters for KCoder and VCoder, but it seems there is only one coder, and it's unclear why this abstraction is needed. (As far as I can tell, #1239, "Allow using encoded images in Datum, LevelDB, LMDB", doesn't use it.)
  • We have DEFAULT_CODER_NOT_AVAILABLE, which I guess is some kind of template metaprogramming trick to tell us something the compiler already knows when sizeof(T) == 0, although that is not the complement of the cases for which the DefaultCoder is available... (a generic sketch of this trick follows the list).
  • As far as I know, we always store data as Datum, but now we have string and vector<char> options as well, at the cost of a bunch more code. Is there an existing use case for this?
  • Do we really need a std::iterator for datasets? I'm not opposed to making good use of STL, but this seems chiefly to introduce a bunch of boilerplate...
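For context, the sizeof(T) == 0 trick referenced above usually takes a shape like the following. This is a generic illustration, not the PR's actual coder code: the assertion depends on T, so it only fires when the unspecialized primary template is instantiated, i.e. when no specialized coder exists for the requested type.

#include <vector>

template <typename T>
struct DefaultCoder {
  // Depends on T, so it only fires if this primary template is instantiated.
  static_assert(sizeof(T) == 0, "DEFAULT_CODER_NOT_AVAILABLE for this type");
};

template <>
struct DefaultCoder<int> {  // example specialization; the real coders differ
  static bool serialize(const int& value, std::vector<char>* out) {
    const char* bytes = reinterpret_cast<const char*>(&value);
    out->assign(bytes, bytes + sizeof(value));
    return true;
  }
};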

I'm sure there may be some good reasons for the above that I've missed, but, as @shelhamer and @sguada seem to agree, it's not rapidly digestible. Abstractions should not be added before they are needed. More code means more bugs, more mental effort, and more maintenance.

Now we have the problem @shelhamer has run into; dev is apparently broken as described in #1319, which is only a partial fix; and there are some concerning comments such as #1316 (comment).

I'm all for moving fast in dev, but I don't see any reason to merge code until (1) it is both necessary and sufficient to provide some benefit over what came before, and (2) we are confident that it doesn't break things. Code doesn't have to be merged to exist; we have PRs, branches, and forks as well.

Of course, this is no fault of @kmatzen's, who has clearly put a lot of good thought into how this should work. But I think discussion deserves to be superlinear in the number of lines of code added, and I think there is no reason to merge code before having confidence in its correctness.

Now, we could attempt to revert dev to a known good state, or we could dig our way out with fixes such as #1319 and some culling as suggested by @shelhamer. Opinions are welcome, but I don't think @shelhamer or I have time to do a lot of hacking on this, so my hope is that we can fix the current issues and simplify the interface with a few small, obviously correct PRs.

@kmatzen (Contributor, Author) commented Oct 18, 2014

I agree with what's been said here. My objective is not to push code upstream quickly. My objective is to receive a high quality code review. It would be nice in the future to receive more feedback such as "I don't understand this, please provide more comments", "you're missing test coverage for this one area", or even "this design is bad, let's discuss a better one" before it is merged.

@sguada (Contributor) commented Oct 18, 2014

Sorry for pushing this quickly. @kmatzen, could you do a simplified version which only has key=string and value=Datum? If not, I will work on that tomorrow.

@kmatzen (Contributor, Author) commented Oct 18, 2014

Sure. Do you want me to remove the instantiations for the other value types or do you want me to remove the templating entirely?

@sguada (Contributor) commented Oct 18, 2014

I think for a start it would be better to remove the whole template and keep it simple, just abstracting the code that was in DataLayer before.

Let's use a DatasetParam to build the Dataset; it would be defined in caffe.proto and contain all the needed parameters, such as source and backend (or any other parameters specific to the backend).
The basic functionality should be:

  • new Dataset(DatasetParam)
  • open(Mode)
  • close()
  • get(key, value)
  • put(key, value)
  • commit()
  • And a way to iterate either through next() or iterators.

To simplify even further, we could allow only reads in ReadMode and only writes in WriteMode; that way the transactions don't get mixed.
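A minimal sketch of this proposal, assuming key = string, value = Datum, and the proposed (not yet existing) DatasetParam message in caffe.proto; the names are illustrative:

#include <string>
#include "caffe/proto/caffe.pb.h"  // caffe::Datum; DatasetParam would be added here

class Dataset {
 public:
  enum Mode { New, ReadOnly, ReadWrite };

  explicit Dataset(const caffe::DatasetParam& param) : param_(param) {}
  virtual ~Dataset() {}

  virtual bool open(Mode mode) = 0;
  virtual void close() = 0;

  virtual bool get(const std::string& key, caffe::Datum* value) = 0;
  virtual bool put(const std::string& key, const caffe::Datum& value) = 0;
  virtual bool commit() = 0;

  // Sequential access; returns false when the scan is exhausted.
  virtual bool next(std::string* key, caffe::Datum* value) = 0;

 protected:
  caffe::DatasetParam param_;  // source, backend, backend-specific options
};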

@longjon (Contributor) commented Oct 18, 2014

@kmatzen Yep, we appreciate your contributions, and sorry to be giving feedback post hoc; that's our fault. (Personally I did not look at this until it was brought to my attention, due to being in deadline mode.)

Thanks for taking action on this guys; @sguada's plan looks good to me.

@shelhamer (Member) commented:

Sounds good everybody. While we're revising, how about DataSource instead of DataSet, since it is really about a format / container and not a given set of data?

@futurely

Confidence in the correctness of a PR should be proportional to its test coverage. #1177 aimed to reduce the human effort needed to figure that out.

@KangolHsu commented Oct 11, 2017

@shelhamer I got a similar error, but it occurred all of a sudden. I swear I did not change anything.

F1011 17:58:04.521750 10269 db_lmdb.hpp:15] Check failed: mdb_status == 0 (-30790 vs. 0) MDB_READERS_FULL: Environment maxreaders limit reached
