Data source identification breaks the loading of multiple instances of the same net #3108
Comments
Different symptom but same underlying issue reported here: #3037
I can confirm that this is a bug in need of attention.
Then, for the sake of discussion, here is my fix from last month: jolibrain@70fd4f7. One of the many reasons you may not want this in vanilla Caffe is that it requires a C++11 compiler.
So I have the same issue... is there a fix yet?
@tarunsharma1 if you use the fork I maintain at https://github.com/beniz/caffe it should work fine. This fork remains up to date with master with a short delay. You'd need a C++11 compiler, however.
This is for other/new users who have the same issue and want a quick, easy hack around it; it is not a permanent solution. It turns out that the issue is with opening the same LMDB twice, irrespective of whether you use the CPU or the GPU. A quick fix is to make a copy of your LMDB, give it a different name, and use the two different LMDB names in your two networks respectively. (Attached screenshots: "Does not work" vs. "Works".)
So, for the record, I've tested my branch against #3037 and it doesn't fix the problem there. I believe this is because my fix uses the thread id, which resolves the issue when training multiple models from the same data source in multiple threads, but not when the duplicate source comes from within the same net.
Fixed by the parallelism reformation in #4563 |
Source identification at https://github.com/BVLC/caffe/blob/master/include/caffe/data_reader.hpp#L68 was introduced by bcc8f50 and leads to raising a fatal check at https://github.com/BVLC/caffe/blob/master/src/caffe/data_reader.cpp#L98 whenever two nearly identical nets train concurrently (e.g. on a single GPU).
The problem occurs if two nets are trained concurrently and share:
- the same data source
- the same data layer name

This typically occurs when training several nets with different layer parameters but identical source and layer names.
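To make the collision concrete, here is a minimal, self-contained C++ sketch of the idea behind the source identification. The struct, map, and field names below are illustrative stand-ins, not the actual Caffe types: the point is only that keying shared reading bodies by "layer name + source" makes two nets in one process with an identical data layer name and source collide on the same key and share a single reading body, which is what trips the fatal check.

```cpp
// Minimal sketch (not the actual Caffe code) of keying shared data readers
// by "layer name + source". All names below are illustrative stand-ins.
#include <iostream>
#include <map>
#include <string>

// Stand-in for the relevant fields of caffe::LayerParameter.
struct LayerParam {
  std::string name;    // data layer name, e.g. "data"
  std::string source;  // database path, e.g. "examples/mnist/mnist_train_lmdb"
};

// Identity of a data source: layer name + source path.
static std::string source_key(const LayerParam& param) {
  return param.name + ":" + param.source;
}

int main() {
  // key -> number of nets attached to the same reading body
  std::map<std::string, int> bodies;

  // Two nearly identical nets trained concurrently in the same process:
  LayerParam net_a = {"data", "examples/mnist/mnist_train_lmdb"};
  LayerParam net_b = {"data", "examples/mnist/mnist_train_lmdb"};

  ++bodies[source_key(net_a)];
  ++bodies[source_key(net_b)];  // same key: the second net reuses the first body

  // Only one distinct body exists, although two independent nets expect one each.
  std::cout << "distinct reading bodies: " << bodies.size() << std::endl;  // prints 1
  return 0;
}
```

This also explains why the copy-the-LMDB workaround mentioned above avoids the problem: renaming either the layer or a copy of the database changes the key, so the two nets no longer collide.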
My current solution to this problem is to enrich the source identification routine with a hash of the running thread, but my understanding is that it might break the original detection of identical sources from within the same net. For this reason, I am holding off on a PR in order to gather more thoughts on this issue first.
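For illustration, here is a hedged sketch of that idea, not the actual patch in jolibrain@70fd4f7: appending a hash of the current thread id to the key separates nets trained from different threads, which is also why a C++11 compiler is needed (std::thread and std::hash<std::thread::id>). As noted above, such a key would no longer collapse identical sources used from within a single net.

```cpp
// Hedged sketch of a thread-aware source key; illustrative only, not the
// actual patch. Requires C++11 (<thread>, std::hash<std::thread::id>).
#include <functional>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>

// Stand-in for the relevant fields of caffe::LayerParameter.
struct LayerParam {
  std::string name;
  std::string source;
};

// Same layer name and source, but different training threads now yield
// different keys, so each thread gets its own reading body.
static std::string source_key(const LayerParam& param) {
  std::ostringstream key;
  key << param.name << ":" << param.source << ":"
      << std::hash<std::thread::id>()(std::this_thread::get_id());
  return key.str();
}

int main() {
  LayerParam data = {"data", "examples/mnist/mnist_train_lmdb"};

  std::string key_main = source_key(data);
  std::string key_worker;
  std::thread worker([&] { key_worker = source_key(data); });
  worker.join();

  // The keys differ because they embed different thread ids.
  std::cout << (key_main != key_worker ? "distinct keys" : "same key") << std::endl;
  return 0;
}
```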