
Print blob debug info during training if SolverParameter "debug_info" field is set #796

Merged — 2 commits into BVLC:dev from solver-debug-info, Jul 26, 2014

Conversation

jeffdonahue
Contributor

(Heavily inspired by the venerable cuda-convnet.)

This PR lets you add "debug_info: true" to your solver prototxt to compute and print debug information during the forward pass, backward pass, and updates (it only prints on "display" iterations, when the training loss is normally displayed). Specifically, it prints each blob's L1 norm (sum of absolute values) followed by its mean absolute value in parentheses. This can be helpful for debugging training issues: for example, if you see a nan/inf loss, it lets you figure out which layer first produces a nan/inf and when. Here's what the output looks like when you add debug_info: true to the end of lenet_consolidated_solver.prototxt:

I0726 14:56:33.147119 27342 solver.cpp:143] Iteration 0, Testing net (#0)
I0726 14:56:35.349648 27342 solver.cpp:181] Test score #0: 0.067
I0726 14:56:35.349707 27342 solver.cpp:181] Test score #1: 2.30256
I0726 14:56:35.349717 27342 solver.cpp:143] Iteration 0, Testing net (#1)
I0726 14:56:41.799576 27342 solver.cpp:181] Test score #0: 0.0670334
I0726 14:56:41.799634 27342 solver.cpp:181] Test score #1: 2.30258
I0726 14:56:44.788625 27342 net.cpp:424]     [debug_info] [Forward] Layer mnist, top blob data data: 7065.09 (0.140806)
I0726 14:56:44.790845 27342 net.cpp:424]     [debug_info] [Forward] Layer mnist, top blob label data: 290 (4.53125)
I0726 14:56:44.795804 27342 net.cpp:424]     [debug_info] [Forward] Layer conv1, top blob conv1 data: 227884 (0.309087)
I0726 14:56:44.798096 27342 net.cpp:424]     [debug_info] [Forward] Layer pool1, top blob pool1 data: 64825.6 (0.351701)
I0726 14:56:44.807307 27342 net.cpp:424]     [debug_info] [Forward] Layer conv2, top blob conv2 data: 194818 (0.951262)
I0726 14:56:44.809547 27342 net.cpp:424]     [debug_info] [Forward] Layer pool2, top blob pool2 data: 53716 (1.04914)
I0726 14:56:44.811975 27342 net.cpp:424]     [debug_info] [Forward] Layer ip1, top blob ip1 data: 21170.2 (0.661568)
I0726 14:56:44.814205 27342 net.cpp:424]     [debug_info] [Forward] Layer relu1, top blob ip1 data: 16661.7 (0.520679)
I0726 14:56:44.816563 27342 net.cpp:424]     [debug_info] [Forward] Layer ip2, top blob ip2 data: 2039.79 (3.18717)
I0726 14:56:44.819965 27342 net.cpp:439]     [debug_info] [Backward] Layer loss, bottom blob ip2 diff: 0.378046 (0.000590697)
I0726 14:56:44.822183 27342 net.cpp:439]     [debug_info] [Backward] Layer ip2, bottom blob ip1 diff: 2.73874 (8.55855e-05)
I0726 14:56:44.824498 27342 net.cpp:449]     [debug_info] [Backward] Layer ip2, param blob 0 diff: 50.081 (0.0100162)
I0726 14:56:44.826732 27342 net.cpp:449]     [debug_info] [Backward] Layer ip2, param blob 1 diff: 0.168365 (0.0168365)
I0726 14:56:44.829035 27342 net.cpp:439]     [debug_info] [Backward] Layer relu1, bottom blob ip1 diff: 1.87708 (5.86588e-05)
I0726 14:56:44.831616 27342 net.cpp:439]     [debug_info] [Backward] Layer ip1, bottom blob pool2 diff: 2.14405 (4.1876e-05)
I0726 14:56:44.833940 27342 net.cpp:449]     [debug_info] [Backward] Layer ip1, param blob 0 diff: 527.024 (0.00131756)
I0726 14:56:44.836174 27342 net.cpp:449]     [debug_info] [Backward] Layer ip1, param blob 1 diff: 0.585842 (0.00117168)
I0726 14:56:44.838559 27342 net.cpp:439]     [debug_info] [Backward] Layer pool2, bottom blob conv2 diff: 2.14405 (1.0469e-05)
I0726 14:56:44.850530 27342 net.cpp:439]     [debug_info] [Backward] Layer conv2, bottom blob pool1 diff: 6.14938 (3.33625e-05)
I0726 14:56:44.852756 27342 net.cpp:449]     [debug_info] [Backward] Layer conv2, param blob 0 diff: 55.5542 (0.00222217)
I0726 14:56:44.855058 27342 net.cpp:449]     [debug_info] [Backward] Layer conv2, param blob 1 diff: 0.189715 (0.00379431)
I0726 14:56:44.857542 27342 net.cpp:439]     [debug_info] [Backward] Layer pool1, bottom blob conv1 diff: 6.14938 (8.34063e-06)
I0726 14:56:44.877032 27342 net.cpp:449]     [debug_info] [Backward] Layer conv1, param blob 0 diff: 3.22905 (0.0064581)
I0726 14:56:44.879184 27342 net.cpp:449]     [debug_info] [Backward] Layer conv1, param blob 1 diff: 0.21928 (0.010964)
I0726 14:56:44.879220 27342 solver.cpp:276] Iteration 100, lr = 0.00992565
I0726 14:56:44.882715 27342 net.cpp:466]     [debug_info] [Update] Layer conv1, param 0 data: 94.5529 (0.189106); diff: 0.101136 (0.000202272)
I0726 14:56:44.887137 27342 net.cpp:466]     [debug_info] [Update] Layer conv1, param 1 data: 1.17435 (0.0587177); diff: 0.00966474 (0.000483237)
I0726 14:56:44.890619 27342 net.cpp:466]     [debug_info] [Update] Layer conv2, param 0 data: 996.6 (0.039864); diff: 1.48126 (5.92504e-05)
I0726 14:56:44.894016 27342 net.cpp:466]     [debug_info] [Update] Layer conv2, param 1 data: 0.499523 (0.00999046); diff: 0.00582091 (0.000116418)
I0726 14:56:44.897387 27342 net.cpp:466]     [debug_info] [Update] Layer ip1, param 0 data: 1194.18 (0.00298545); diff: 14.1294 (3.53234e-05)
I0726 14:56:44.900827 27342 net.cpp:466]     [debug_info] [Update] Layer ip1, param 1 data: 1.96221 (0.00392441); diff: 0.0264453 (5.28906e-05)
I0726 14:56:44.904175 27342 net.cpp:466]     [debug_info] [Update] Layer ip2, param 0 data: 134.482 (0.0268964); diff: 1.19755 (0.00023951)
I0726 14:56:44.907567 27342 net.cpp:466]     [debug_info] [Update] Layer ip2, param 1 data: 0.668925 (0.0668925); diff: 0.00728625 (0.000728625)
I0726 14:56:44.907627 27342 solver.cpp:116] Iteration 100, loss = 0.308106
I0726 14:56:47.179862 27342 net.cpp:424]     [debug_info] [Forward] Layer mnist, top blob data data: 6074.86 (0.121071)
I0726 14:56:47.179925 27342 net.cpp:424]     [debug_info] [Forward] Layer mnist, top blob label data: 286 (4.46875)
I0726 14:56:47.181391 27342 net.cpp:424]     [debug_info] [Forward] Layer conv1, top blob conv1 data: 220081 (0.298503)
I0726 14:56:47.181501 27342 net.cpp:424]     [debug_info] [Forward] Layer pool1, top blob pool1 data: 61939.6 (0.336044)
I0726 14:56:47.184751 27342 net.cpp:424]     [debug_info] [Forward] Layer conv2, top blob conv2 data: 181365 (0.885573)
I0726 14:56:47.184839 27342 net.cpp:424]     [debug_info] [Forward] Layer pool2, top blob pool2 data: 42201.8 (0.824253)
I0726 14:56:47.185045 27342 net.cpp:424]     [debug_info] [Forward] Layer ip1, top blob ip1 data: 21178.1 (0.661815)
I0726 14:56:47.185094 27342 net.cpp:424]     [debug_info] [Forward] Layer relu1, top blob ip1 data: 13912.3 (0.434759)
I0726 14:56:47.185168 27342 net.cpp:424]     [debug_info] [Forward] Layer ip2, top blob ip2 data: 2183.11 (3.4111)
I0726 14:56:47.185251 27342 net.cpp:439]     [debug_info] [Backward] Layer loss, bottom blob ip2 diff: 0.362956 (0.000567119)
I0726 14:56:47.185317 27342 net.cpp:439]     [debug_info] [Backward] Layer ip2, bottom blob ip1 diff: 3.45429 (0.000107947)
I0726 14:56:47.185358 27342 net.cpp:449]     [debug_info] [Backward] Layer ip2, param blob 0 diff: 37.824 (0.0075648)
I0726 14:56:47.185397 27342 net.cpp:449]     [debug_info] [Backward] Layer ip2, param blob 1 diff: 0.162711 (0.0162711)
I0726 14:56:47.185441 27342 net.cpp:439]     [debug_info] [Backward] Layer relu1, bottom blob ip1 diff: 2.06665 (6.45827e-05)
I0726 14:56:47.185824 27342 net.cpp:439]     [debug_info] [Backward] Layer ip1, bottom blob pool2 diff: 2.93497 (5.73236e-05)
I0726 14:56:47.185894 27342 net.cpp:449]     [debug_info] [Backward] Layer ip1, param blob 0 diff: 523.536 (0.00130884)
I0726 14:56:47.185935 27342 net.cpp:449]     [debug_info] [Backward] Layer ip1, param blob 1 diff: 0.776996 (0.00155399)
I0726 14:56:47.186048 27342 net.cpp:439]     [debug_info] [Backward] Layer pool2, bottom blob conv2 diff: 2.93497 (1.43309e-05)
I0726 14:56:47.191087 27342 net.cpp:439]     [debug_info] [Backward] Layer conv2, bottom blob pool1 diff: 8.45792 (4.58872e-05)
I0726 14:56:47.191174 27342 net.cpp:449]     [debug_info] [Backward] Layer conv2, param blob 0 diff: 58.7563 (0.00235025)
I0726 14:56:47.191213 27342 net.cpp:449]     [debug_info] [Backward] Layer conv2, param blob 1 diff: 0.200319 (0.00400639)
...
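For reference, enabling the logging above is a one-line addition to the solver configuration. A minimal sketch (the other fields and their values here are illustrative, not copied from lenet_consolidated_solver.prototxt):

```
# solver.prototxt (sketch; only debug_info is the field this PR adds)
net: "lenet_train_test.prototxt"   # hypothetical net path
display: 100                       # debug info prints on these iterations
debug_info: true                   # enables [Forward]/[Backward]/[Update] logging
```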

@jeffdonahue
Contributor Author

I decided that printing both the sum and mean of absolute values is a bit visually overwhelming, and the sum isn't particularly useful, so in the last commit I changed it to print just the mean. Here's what it looks like now:

I0726 15:20:45.515188  3817 solver.cpp:143] Iteration 0, Testing net (#0)
I0726 16:22:05.940804 17278 solver.cpp:181] Test score #0: 0.067
I0726 16:22:05.949410 17278 solver.cpp:181] Test score #1: 2.30256
I0726 16:22:05.949596 17278 solver.cpp:143] Iteration 0, Testing net (#1)
I0726 16:22:12.493443 17278 solver.cpp:181] Test score #0: 0.0670334
I0726 16:22:12.493547 17278 solver.cpp:181] Test score #1: 2.30258
I0726 16:22:15.888589 17278 net.cpp:423]     [Forward] Layer mnist, top blob data data: 0.140806
I0726 16:22:15.888959 17278 net.cpp:423]     [Forward] Layer mnist, top blob label data: 4.53125
I0726 16:22:15.891015 17278 net.cpp:423]     [Forward] Layer conv1, top blob conv1 data: 0.309087
I0726 16:22:15.891459 17278 net.cpp:423]     [Forward] Layer pool1, top blob pool1 data: 0.351701
I0726 16:22:15.896097 17278 net.cpp:423]     [Forward] Layer conv2, top blob conv2 data: 0.951262
I0726 16:22:15.896312 17278 net.cpp:423]     [Forward] Layer pool2, top blob pool2 data: 1.04914
I0726 16:22:15.896662 17278 net.cpp:423]     [Forward] Layer ip1, top blob ip1 data: 0.661568
I0726 16:22:15.897069 17278 net.cpp:423]     [Forward] Layer relu1, top blob ip1 data: 0.520679
I0726 16:22:15.897349 17278 net.cpp:423]     [Forward] Layer ip2, top blob ip2 data: 3.18717
I0726 16:22:15.898224 17278 net.cpp:437]     [Backward] Layer loss, bottom blob ip2 diff: 0.000590697
I0726 16:22:15.898407 17278 net.cpp:437]     [Backward] Layer ip2, bottom blob ip1 diff: 8.55855e-05
I0726 16:22:15.898635 17278 net.cpp:446]     [Backward] Layer ip2, param blob 0 diff: 0.0100162
I0726 16:22:15.898865 17278 net.cpp:446]     [Backward] Layer ip2, param blob 1 diff: 0.0168365
I0726 16:22:15.899135 17278 net.cpp:437]     [Backward] Layer relu1, bottom blob ip1 diff: 5.86588e-05
I0726 16:22:15.899755 17278 net.cpp:437]     [Backward] Layer ip1, bottom blob pool2 diff: 4.1876e-05
I0726 16:22:15.900161 17278 net.cpp:446]     [Backward] Layer ip1, param blob 0 diff: 0.00131756
I0726 16:22:15.900212 17278 net.cpp:446]     [Backward] Layer ip1, param blob 1 diff: 0.00117168
I0726 16:22:15.900527 17278 net.cpp:437]     [Backward] Layer pool2, bottom blob conv2 diff: 1.0469e-05
I0726 16:22:15.907366 17278 net.cpp:437]     [Backward] Layer conv2, bottom blob pool1 diff: 3.33625e-05
I0726 16:22:15.907748 17278 net.cpp:446]     [Backward] Layer conv2, param blob 0 diff: 0.00222217
I0726 16:22:15.908145 17278 net.cpp:446]     [Backward] Layer conv2, param blob 1 diff: 0.00379431
I0726 16:22:15.908788 17278 net.cpp:437]     [Backward] Layer pool1, bottom blob conv1 diff: 8.34063e-06
I0726 16:22:15.913436 17278 net.cpp:446]     [Backward] Layer conv1, param blob 0 diff: 0.0064581
I0726 16:22:15.913817 17278 net.cpp:446]     [Backward] Layer conv1, param blob 1 diff: 0.010964
I0726 16:22:15.913840 17278 solver.cpp:276] Iteration 100, lr = 0.00992565
I0726 16:22:15.914631 17278 net.cpp:461]     [Update] Layer conv1, param 0 data: 0.189106; diff: 0.000202272
I0726 16:22:15.915423 17278 net.cpp:461]     [Update] Layer conv1, param 1 data: 0.0587177; diff: 0.000483237
I0726 16:22:15.916214 17278 net.cpp:461]     [Update] Layer conv2, param 0 data: 0.039864; diff: 5.92504e-05
I0726 16:22:15.917011 17278 net.cpp:461]     [Update] Layer conv2, param 1 data: 0.00999046; diff: 0.000116418
I0726 16:22:15.917820 17278 net.cpp:461]     [Update] Layer ip1, param 0 data: 0.00298545; diff: 3.53234e-05
I0726 16:22:15.918645 17278 net.cpp:461]     [Update] Layer ip1, param 1 data: 0.00392441; diff: 5.28906e-05
I0726 16:22:15.919437 17278 net.cpp:461]     [Update] Layer ip2, param 0 data: 0.0268964; diff: 0.00023951
I0726 16:22:15.920228 17278 net.cpp:461]     [Update] Layer ip2, param 1 data: 0.0668925; diff: 0.000728625
I0726 16:22:15.920287 17278 solver.cpp:116] Iteration 100, loss = 0.308106
I0726 16:22:19.327172 17278 net.cpp:423]     [Forward] Layer mnist, top blob data data: 0.121071
I0726 16:22:19.327553 17278 net.cpp:423]     [Forward] Layer mnist, top blob label data: 4.46875
I0726 16:22:19.329766 17278 net.cpp:423]     [Forward] Layer conv1, top blob conv1 data: 0.298503
I0726 16:22:19.330212 17278 net.cpp:423]     [Forward] Layer pool1, top blob pool1 data: 0.336044
I0726 16:22:19.334609 17278 net.cpp:423]     [Forward] Layer conv2, top blob conv2 data: 0.885573
I0726 16:22:19.334990 17278 net.cpp:423]     [Forward] Layer pool2, top blob pool2 data: 0.824253
I0726 16:22:19.335494 17278 net.cpp:423]     [Forward] Layer ip1, top blob ip1 data: 0.661815
I0726 16:22:19.335890 17278 net.cpp:423]     [Forward] Layer relu1, top blob ip1 data: 0.434759
I0726 16:22:19.336326 17278 net.cpp:423]     [Forward] Layer ip2, top blob ip2 data: 3.4111
I0726 16:22:19.337154 17278 net.cpp:437]     [Backward] Layer loss, bottom blob ip2 diff: 0.000567119
I0726 16:22:19.337543 17278 net.cpp:437]     [Backward] Layer ip2, bottom blob ip1 diff: 0.000107947
I0726 16:22:19.337934 17278 net.cpp:446]     [Backward] Layer ip2, param blob 0 diff: 0.0075648
I0726 16:22:19.338327 17278 net.cpp:446]     [Backward] Layer ip2, param blob 1 diff: 0.0162711
I0726 16:22:19.338732 17278 net.cpp:437]     [Backward] Layer relu1, bottom blob ip1 diff: 6.45827e-05
I0726 16:22:19.339470 17278 net.cpp:437]     [Backward] Layer ip1, bottom blob pool2 diff: 5.73236e-05
I0726 16:22:19.339874 17278 net.cpp:446]     [Backward] Layer ip1, param blob 0 diff: 0.00130884
I0726 16:22:19.340273 17278 net.cpp:446]     [Backward] Layer ip1, param blob 1 diff: 0.00155399
I0726 16:22:19.340744 17278 net.cpp:437]     [Backward] Layer pool2, bottom blob conv2 diff: 1.43309e-05
I0726 16:22:19.347610 17278 net.cpp:437]     [Backward] Layer conv2, bottom blob pool1 diff: 4.58872e-05
I0726 16:22:19.347973 17278 net.cpp:446]     [Backward] Layer conv2, param blob 0 diff: 0.00235025
I0726 16:22:19.348204 17278 net.cpp:446]     [Backward] Layer conv2, param blob 1 diff: 0.00400639
I0726 16:22:19.348851 17278 net.cpp:437]     [Backward] Layer pool1, bottom blob conv1 diff: 1.14718e-05
I0726 16:22:19.353513 17278 net.cpp:446]     [Backward] Layer conv1, param blob 0 diff: 0.00488937
I0726 16:22:19.353886 17278 net.cpp:446]     [Backward] Layer conv1, param blob 1 diff: 0.00962426
I0726 16:22:19.353912 17278 solver.cpp:276] Iteration 200, lr = 0.00985258
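After this change, the number printed for each blob is simply the mean absolute value of its data or diff array. A minimal NumPy sketch of that statistic (the function name is my own, not Caffe's):

```python
import numpy as np

def blob_debug_stat(blob):
    """Mean absolute value of a blob's data or diff,
    as printed by debug_info after this change."""
    return np.mean(np.abs(blob))

# Toy example: abs values are [2, 1, 0.5, 0.5], mean is 1.0.
data = np.array([[-2.0, 1.0], [0.5, -0.5]])
print(blob_debug_stat(data))  # 1.0
```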

I think I'll merge this myself if nobody objects within a few hours -- it's an invisible change unless you specifically enable it. The content and formatting of the output can be adjusted later if people dislike it. One thing people may want is to see, e.g., only the [Update] information; that can also be introduced later by adding finer-grained debug options to the SolverParameter proto message.
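Such finer-grained options could look something like the following proto sketch (purely hypothetical follow-up, not part of this PR; field names and numbers are placeholders):

```
// caffe.proto sketch -- hypothetical finer-grained debug flags
message SolverParameter {
  // ... existing fields ...
  optional bool debug_info = 1 [default = false];  // field number illustrative
  // possible future variants, each gating one section of the output:
  // optional bool debug_info_forward  = ...;
  // optional bool debug_info_backward = ...;
  // optional bool debug_info_update   = ...;
}
```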

@Yangqing
Member

SGTM

@jeffdonahue
Contributor Author

Thanks for taking a look, @Yangqing!

jeffdonahue added a commit that referenced this pull request Jul 26, 2014
Print blob debug info during training if SolverParameter "debug_info" field is set
@jeffdonahue jeffdonahue merged commit 0e3d970 into BVLC:dev Jul 26, 2014
@jeffdonahue jeffdonahue deleted the solver-debug-info branch July 26, 2014 23:39
@shelhamer shelhamer mentioned this pull request Aug 7, 2014
mitmul pushed a commit to mitmul/caffe that referenced this pull request Sep 30, 2014
Print blob debug info during training if SolverParameter "debug_info" field is set
RazvanRanca pushed a commit to RazvanRanca/caffe that referenced this pull request Nov 4, 2014
Print blob debug info during training if SolverParameter "debug_info" field is set