DeepMark #101
Oh lastly, a good timeline for this would be to get an initial round of benchmarks by June 15th (since I only gave some of you a heads-up right now) |
So awesome and useful. What are the data sets one should benchmark on? ImageNet, CIFAR10? It would also be nice to compare the accuracy of current implementations for each framework (although that would probably be a lot of work). |
For text, I'd hope to expand beyond just RNN character generation. It doesn't capture many of the complexities of other models, such as variable sequence lengths or bidirectional RNNs. The Attention Sum Reader is a simple architecture (bidirectional GRU + dot product) that currently has SotA and could allow for optimizing sequences of different lengths, a major issue in RNNs. The model also has three different dataset sizes, small (Children's Book Test), medium (CNN), and large (Daily Mail), which are publicly available. |
This is great, thanks for organizing this! One thing I've also been thinking about like @daviddao is how to validate that the models are actually computing the same thing -- I've seen some benchmarks elsewhere that have raised personal doubts that the benchmark code written in different frameworks are computing the same function. As part of the benchmark framework, maybe the API could include a way to validate that given specified initialization and input, the outputs (forward and backward) are approximately equal. Open to thoughts :). Cheers! |
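One minimal sketch of such a validation step, assuming each framework's script can dump its forward output and input gradients to an `.npz` file under agreed-upon keys (the file names and tolerances here are illustrative, not part of any framework's API):

```python
import numpy as np

def validate_outputs(ref_file, test_file, rtol=1e-4, atol=1e-6):
    """Compare tensors dumped by two framework implementations of the same model."""
    ref, test = np.load(ref_file), np.load(test_file)
    for key in ref.files:  # e.g. 'forward', 'grad_input'
        if not np.allclose(ref[key], test[key], rtol=rtol, atol=atol):
            max_err = np.max(np.abs(ref[key] - test[key]))
            raise AssertionError("%s mismatch, max abs error %.3e" % (key, max_err))
    print("all tensors match within tolerance")

# validate_outputs("torch_alexnet.npz", "tensorflow_alexnet.npz")
```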
Nice! I am very excited for this. I have https://github.com/craffel/lstm_benchmarks, which is an out-of-date benchmark of Theano vs. rnnlib vs. currennt (which, at the time I wrote the benchmarks, were essentially the only options for LSTM). The task was CHIME noisy speech recognition, which has pretty limited adoption, so I would not strongly advocate for it being added as a task. And I assume that rnnlib and currennt shouldn't be included in these benchmarks as they are RNN-only, right? I'll be happy to contribute to some of the Theano RNN benchmarks once it becomes appropriate to do so.
This would be very cool, but from my own experience with the LSTM benchmark it can be very difficult - you have to make sure literally every hyperparameter is identical, and you effectively can't use any RNGs. Not to say it's impossible, but it would add a lot of overhead to implementing new benchmarks. |
Caveat: per Paul Graham, it's better to go deep and do something very well than to blur one's 'focus' over many things. I worry gently that if there are too many benchmarks, then:
|
👍
Training on something like ImageNet would move the focus away from pure computation to fast dataset iteration -- this would be interesting as well, but should probably become a separate benchmark, since not all frameworks actually provide any tools for this. The other extreme would be training on random dummy data (e.g. sampled from a Gaussian), but this makes sense only if we can guarantee the running time does not depend on the input data. So probably we should have some realistic set of inputs for each task, just large enough to fill two batches or so?
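A rough sketch of the "just large enough to fill two batches" option: pregenerate a small, fixed set of Gaussian inputs once and cycle over it during the timed run (shapes and seed below are illustrative assumptions):

```python
import numpy as np

def make_synthetic_batches(batch_size=128, shape=(3, 224, 224), n_batches=2, seed=0):
    """Pregenerate fixed Gaussian input batches so timing depends neither on
    disk I/O nor on the particular data."""
    rng = np.random.RandomState(seed)
    data = rng.randn(n_batches, batch_size, *shape).astype(np.float32)
    labels = rng.randint(0, 1000, size=(n_batches, batch_size)).astype(np.int64)
    return data, labels

# during the benchmark, cycle over the same two batches every iteration
```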
This seems useful. It requires initial model parameters to be dumped in some format and loaded into each framework, but it would help to ensure that all implementations are the same. |
In the strongest case, weight initialization could be defined precisely as:
|
I guess getting the same stream of pseudo-random values in all different frameworks is more difficult than importing a set of tensors into all different frameworks. We wouldn't want to exclude candidates from being benchmarked because they fail to implement the same RNG. |
Having gone through the exact same process, to compare DeepCL with convnetjs, I found it significantly easier to make convnetjs use the exact same weight generator as DeepCL than to load weights from a file https://github.com/hughperkins/DeepCL/blob/master/prototyping/convnetjs-reference/testconvnet2.js#L143-L201 . It was a long time ago, so I don't remember why. I do remember I initially tried writing weights to a file, though, and I couldn't get it to work as easily as syncing weight generators, for some reason. |
If that's the case, one could of course create the initial weights in a reproducible way and save them to files, so implementers for the different benchmarked frameworks can choose whatever is easiest. |
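A possible sketch of that approach, generating the initial weights reproducibly and saving them to an `.npz` file that each framework then loads (the naming convention and uniform range are made-up placeholders):

```python
import numpy as np

def dump_initial_weights(path, layer_shapes, seed=1234):
    """Create reproducible initial weights once and save them to disk, so every
    framework loads identical tensors instead of trying to sync RNG streams."""
    rng = np.random.RandomState(seed)
    weights = {name: rng.uniform(-0.05, 0.05, size=shape).astype(np.float32)
               for name, shape in layer_shapes.items()}
    np.savez(path, **weights)

dump_initial_weights("alexnet_init.npz",
                     {"conv1_W": (64, 3, 11, 11), "conv1_b": (64,)})
```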
@daviddao @vrv @hughperkins @f0k for V1, I thought we should just go with synthetic data. It's very hard to set up to-convergence benchmarks, as there are very fine details wrt convergence being guaranteed, for example: some of the models (like googlenetv3) have taken a year to reproduce outside of the paper. @Smerity In terms of evaluating perf, we can add a bidirectional RNN too. In fact, DeepSpeech2 has bidirectional RNNs, so that should be sufficient? @vrv definitely a great idea, but very very hard and takes a ton of resources. I feel like at least for V1, we should just go with code review + synthetic data. @craffel Awesome! At the moment I don't have a point of contact for Theano, maybe a combination of you, @f0k and @benanne could work (especially if they're implemented in Lasagne). |
That would be nice :) though I am personally interested in which of the Theano-based libraries manage to eke out the most performance, since their implementations are nonidentical. |
Thanks @soumith for organizing this effort! I think this would definitely help us advance the field to the next level. I am also very interested in benchmarking not only the training pipeline, but a wider range of evaluation criteria. The reason is as follows: if I may make a bold claim, I believe that all frameworks will again very quickly converge to the same performance, because there is no fundamental difference between them. What we saw at convnet-benchmarks is that almost everyone is using the same underlying library, and we are effectively benchmarking framework overheads, something that is good to know of course, but seems to be overwhelmed by other factors, such as ease of use etc.

Given the wide attention of this benchmark, I think it would be great if we can draw attention to some of the more practical issues, such as small batch sizes at deployment time - several frameworks (including some non-open-source production systems I've worked on) have historically ignored this, and I think it is worthwhile to invite people to invest more in this direction. I don't have a perfect idea for this yet, of course. One thing we can do is to simply benchmark different batch sizes, but a more complex, and potentially more useful, way is probably to set up a harness that can simulate requests generated from a Poisson distribution with latency requirements, and see whether frameworks can address that in an optimal fashion - this might be too application specific, though. Just my 2 cents. (Also adding @ajtulloch to the conversation. Andrew first raised this point when we were discussing offline.) |
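A toy sketch of the Poisson-arrival harness idea, treating the framework's inference call as a black-box `predict` function (the rate, deadline, and function names are placeholders, not an actual proposal for the benchmark API):

```python
import time
import numpy as np

def latency_benchmark(predict, make_input, rate_hz=50.0, n_requests=1000,
                      deadline_ms=10.0, seed=0):
    """Replay Poisson-distributed single requests against `predict` and report
    latency percentiles plus the fraction that meet a deadline."""
    rng = np.random.RandomState(seed)
    latencies = []
    for _ in range(n_requests):
        time.sleep(rng.exponential(1.0 / rate_hz))   # Poisson arrival process
        x = make_input()
        start = time.time()
        predict(x)
        latencies.append((time.time() - start) * 1000.0)
    latencies = np.array(latencies)
    print("p50=%.2f ms  p99=%.2f ms  %.1f%% under %.1f ms"
          % (np.percentile(latencies, 50), np.percentile(latencies, 99),
             100.0 * np.mean(latencies <= deadline_ms), deadline_ms))
```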
How about per-pixel scene labelling and optical flow? Regards -David
|
I agree that we should evaluate more aspects. Some of them are already covered in this proposal, for example:
One of the most important factors, the tradeoff between ease of use and optimization, is unfortunately not easy to benchmark, as each person has their own taste. What @Yangqing suggested is more about measuring perf for the production and serving pipeline, which could be a whole new area of directions. As this benchmark is primarily about training, one alternative could be making a deep-serving benchmark under the DeepMark organization dedicated to this topic. |
It would be interesting to log the power dissipation in each testcase, as
|
I like this idea. A Titan draws 250 watts peak (I think?). Running 24 hours a day for a year, 250 watts is about ~$600, which is in the same order of magnitude as the purchase price. And power dissipation is plausibly going to become the main bottleneck in years to come. ("And over here we have our farm of 1000 Titan 2026s, and over there is the 20MW pebble bed we are using to power them" :-) ) |
Agreed. Soumith's current benchmarks are useful, but they mainly evaluate "who can make the thinnest wrapper around cuDNN, Neon, or similar?" It would be useful to benchmark implementations of emerging algorithms for which tuned libraries may not yet exist -- certain versions of LSTMs and RNNs, for instance. |
@forresti yea, for historical context it used to be not like that, but it is like that now. I think for LSTMs and RNNs, a lot of perf is still up for grabs. |
To be fair, cudnn and neon are competing with each other. The opencl implementations mostly copy the caffe cuda implementation of im2col as far as I know :-D but have different performance from cudnn. There is also 16-bit vs 32-bit. (Edit: by the way, the cudnn vs neon comparison is exactly what comes to mind about power consumption. I don't know if it's still the case, but as far as I know it used to be the case that cudnn ran cooler than neon, and it'd be useful to be able to see this in the results) |
@hughperkins Good point. I didn't mean to imply that there isn't a diverse array of low-level computational libraries for DNNs. To tune up my comment a bit: "When doing speed/efficiency benchmarks, it's hard to avoid conflating low-level computational libraries (cuDNN, Neon, various OpenCL efforts, ...) and higher-level frameworks (Caffe, Torch, Tensorflow, Theano, ...)." |
I would say that convolution is far from a solved problem. I still have a long list of optimizations I want to make. The biggest area to explore is how to best leverage lower precision without sacrificing accuracy. The obvious target there would be xnor nets, but maybe a bit more precision is best for the highest levels of accuracy. The 4x int8 performance that Pascal will soon have (unfortunately not in P100 though) is a tantalizing format to target. And also obviously the native fp16 support. Another area is better efficiency at smaller batch sizes. I have some brand new work there that I'd like to show off. This is important for both inference and scaling to many nodes.

Power comparisons are useful but only when looking at implementations that have the same computational throughput. Or just use some kind of flops/watt metric. With my newer kernels I'm getting very good at squeezing the most out of cache utilization, and hence I'm hitting and maintaining higher boost clocks (while using smaller and more versatile tiles).

As for the frameworks, the big area to focus on is graph based optimizations: maximizing data locality (compounding), memory allocation vs compute trade-offs, auto-parallelizing independent work across streams and gpus, and lots of other creative things computational graphs greatly simplify.

As for synthetic vs real data and parameters: in fp32 I think only the distribution matters for performance comparisons. But in lower precision like fp16 it's very easy to saturate or underflow with synthetic data, which leads to far higher performance than is warranted. At the very least you want to account for the fan-in when setting weight magnitudes (Kaiming, Xavier, etc). Batch norm helps a lot here too. Basically you should be able to prove that you can train with the params you benchmark with.

At the end of the day we care about speed and usability. I think these benchmarks should make both pretty clear. For usability you'll be able to inspect the script to see who has the cleanest syntax and solves the most problems for you without extra steps. |
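The point about fan-in-aware weight magnitudes can be made concrete with a small sketch of Kaiming-style initialization in NumPy (framework-agnostic; the shapes below are illustrative):

```python
import numpy as np

def kaiming_init(fan_in, shape, dtype=np.float16, seed=0):
    """Scale weights by fan-in (Kaiming/He style) so that activations stay in a
    sane range -- in fp16, badly scaled synthetic weights can saturate or
    underflow and make the benchmarked timings unrepresentative."""
    rng = np.random.RandomState(seed)
    std = np.sqrt(2.0 / fan_in)
    return (rng.randn(*shape) * std).astype(dtype)

# 3x3 conv with 256 input channels and 512 output channels:
w = kaiming_init(fan_in=256 * 3 * 3, shape=(512, 256, 3, 3))
```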
@scott-gray That all sounds great. :) |
Well, the ideal would be joules per batch. But I think this will be tricky to measure. Might need some specialized hardware device, that sits on the power bus? |
Maybe it wouldn't be quite so tricky. You'd just need to collect some running average of the on chip power stats during the execution of the epoch. Something like this would give you realtime stats:
Or even better tie your benchmark script directly into NVML queries: But I guess you'd want to be running these queries continuously so maybe as a separate process would be better. You'd just need to synchronize the collection with the execution of the network. Just a small bit of shell scripting should achieve this. |
Interesting. Seems it's just a c-interface, so accessible using ffi etc.
|
And python bindings can be found here: |
But, it's worth pointing out that the boost clock is already tightly coupled with these real-time power and temperature measurements so the overall timings should be reflective of this. So perhaps it's not worth the effort. |
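For reference, a minimal sketch of the separate-process idea using the `nvidia-smi` command line rather than NVML bindings (the sampling interval and the averaging done afterwards are arbitrary choices):

```python
import subprocess
import threading

class PowerLogger(object):
    """Poll GPU power draw (watts) in a background thread while a benchmark runs."""

    def __init__(self, device=0, interval_s=0.1):
        self.device, self.interval_s = device, interval_s
        self.samples, self._stop = [], threading.Event()

    def _poll(self):
        cmd = ["nvidia-smi", "-i", str(self.device),
               "--query-gpu=power.draw", "--format=csv,noheader,nounits"]
        while not self._stop.wait(self.interval_s):
            self.samples.append(float(subprocess.check_output(cmd).decode()))

    def __enter__(self):
        self._thread = threading.Thread(target=self._poll)
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()

# with PowerLogger() as p:
#     run_one_epoch()
# print("average draw: %.1f W" % (sum(p.samples) / len(p.samples)))
```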
I'd hope this also supports CPU training/testing. |
@jbalma yes I find strong scaling the training problem to be very interesting, and of course that is an active area of research. I hope you find a good forum for pursuing it. Let me know if I can help. I would also point out with regards to profiling, an external profiling tool will of course not be able to correlate timings with the graph structure of the network, so it cannot group statistics by layer, which is essential for understanding the performance of the network. I think all successful machine learning frameworks will be instrumented for profiling eventually, and tensorflow appears to be taking the lead there. Now imagine the power of being able to compare network performance graphs across different frameworks, because they all were designed to output profiler stats in the same format. |
|
If we want metrics per layer (not per operation, or per operation type), we'll need to figure out how/where to introduce timing checkpoints in the graph without hindering optimization. That's a fundamental problem for frameworks based on computation graphs: They are free to rearrange operations across layer boundaries, so you cannot necessarily correlate the optimized graph with the network layers at all. |
As always, we can always do better! |
Or, we could use the existing per-layer timings method, which seems to me to be a reasonably rigorous way of doing it, and avoids fudge factors based on differing opinions on how to do in-situ timings, like:
I'm sure there are a bunch more of such questions.... |
The event-based timing could be realized as Ops that insert events into the CUDA stream. If inserted between layers, these Ops would naturally block any optimizations across layers if existing optimizers are not adapted to ignore them. But again, this would prevent real in-situ timings, because the graph might not be fully optimized.
Yes... in-situ timings would be the best, but depending on the framework, we cannot get per-layer timings without changing the process, and maybe not even distinguish the forward and backward pass, just training (fw+bw) and inference (fw only). For the start, we should probably have per-epoch or per-batch timings only. |
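For the per-batch starting point, the main pitfall is asynchronous GPU execution; a sketch of a wall-clock loop that stays comparable across frameworks, where `synchronize` stands in for whatever call flushes the device queue in the framework being measured (e.g. cudaDeviceSynchronize via its bindings):

```python
import time

def time_batches(run_batch, synchronize, n_warmup=10, n_timed=100):
    """Time training iterations with explicit device synchronization, so queued
    asynchronous kernels are not mistaken for finished work."""
    for _ in range(n_warmup):   # let cudnn algorithm selection / autotuning settle
        run_batch()
    synchronize()
    start = time.time()
    for _ in range(n_timed):
        run_batch()
    synchronize()               # wait for all outstanding work before stopping the clock
    return (time.time() - start) / n_timed
```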
On the Caffe side, Mike Houston and team at NVIDIA have agreed to do the benchmark scripts. So that concludes all the volunteers for each framework :) |
@soumith |
Sure :) |
Opinion: the new system should be a collection of specialized benchmarks, each with its own repo, being:
... etc. Therefore, I see no reason for this repo disappearing in any way, shape or form, personally :-) |
On a somewhat related note: GEMM seems to be at the heart of convolution. It's used by fft, winograd, im2col, and presumably also implicit gemm. So, could it be worth having some GEMM benchmarks? For OpenCL, there are at least 3 GEMM implementations I know of, ie: clBLAS, clBLAST, ViennaCL. (Edit: and for CUDA, there's at least cublas, and the sass gemm implementation that is part of neon) |
@hughperkins Do you mean something like https://github.com/dividiti/gemmbench? |
Possibly. I'm not sure what I mean to be honest. I'm not sure that's quite exactly what I was thinking of. I was thinking of something more like the simple tables in convnet-benchmarks, but comparing these 5 or so GEMM implementations. Presumably, the actual workloads, if we are targeting convolution, should be workloads sampled by running a forwards-backwards batch through a few common convolutional models, such as those currently in convnet-benchmarks. |
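A rough sketch of such a table generator, timing whatever BLAS NumPy is linked against on GEMM shapes derived from a convolution layer (the im2col-style shape in the example is illustrative; a real harness would call clBLAS/clBLAST/ViennaCL/cuBLAS instead):

```python
import time
import numpy as np

def time_gemm(m, n, k, repeats=10, dtype=np.float32):
    """Time an (m x k) * (k x n) GEMM and report achieved GFLOP/s."""
    a = np.random.randn(m, k).astype(dtype)
    b = np.random.randn(k, n).astype(dtype)
    np.dot(a, b)                                  # warm-up call
    start = time.time()
    for _ in range(repeats):
        np.dot(a, b)
    elapsed = (time.time() - start) / repeats
    return elapsed, 2.0 * m * n * k / elapsed / 1e9

# im2col-style shape: 256 output channels, 256*3*3 fan-in, 13*13 output positions
print("%.4f s, %.1f GFLOP/s" % time_gemm(256, 13 * 13, 256 * 3 * 3))
```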
@hughperkins Also, what makes benchmarking the GEMMs more difficult is that their performance differs a lot on a variety of devices. And then the GEMMs can also be autotuned, which works more or less depending on what BLAS and architecture combination is used. I think benchmarking on the network and layer level is enough for the DeepMark project. |
Ah, sounds good :-) |
Hi Andrew, So ... guess what? I wrote a correctness checking script :-) And ... ironically enough... it targets Neon :-D Because I need it for testing the OpenCL port. It outputs for each layer:
Example results, for neon on Titan X, using vgga model: Maxwell kernels, Winograd, SASS:
Kepler kernels, Direct, OpenCL:
https://github.com/hughperkins/neon-benchmarks (Edited with the layer 0 results for Maxwell CUDA kernels, summary page at https://github.com/hughperkins/neon-benchmarks/blob/master/results/vgga_summary.md ) |
For the benchmarks with correctness checker, added stride and padding, so it can handle eg alexnet now: https://github.com/hughperkins/neon-benchmarks/blob/master/results/alexnet_summary.md Nervana Neon CUDA/SASS Winograd kernels for Maxwell
Nervana Neon Kepler direct kernels, in CUDA
OpenCL port of Nervana Neon Kepler direct kernels
However, there are still a couple of things we'd want if we wanted to generalize this to other networks:
Actually, torch can run from python, as can theano, tensorflow, mxnet (I think?), caffe, DeepCL, chainer. So maybe python is all that is needed??? |
Andrew, hmmm, just noticed, this is figure 4 in your paper. Hadn't noticed that before :-D Well, I hadn't read the paper earlier... I see that you are using element-wise max though. I think this might be sensitive to outliers, and also larger for layers with more output values? Maybe median or average, possibly also with standard deviation, would be good? (edited to spell median with a 'd' and an 'i' ...) |
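To make that concrete, a small sketch of reporting several per-layer error statistics instead of only the element-wise max (the relative-error definition here is one common choice, not necessarily the one used in the paper):

```python
import numpy as np

def layer_error_stats(ref, test, eps=1e-8):
    """Summarize the difference between two layer outputs with several statistics;
    the max alone is sensitive to outliers and tends to grow with layer size."""
    rel = np.abs(ref - test) / (np.abs(ref) + eps)
    return {"max": float(rel.max()),
            "median": float(np.median(rel)),
            "mean": float(rel.mean()),
            "std": float(rel.std())}
```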
Possibly relevant: Fathom: Reference Workloads for Modern Deep Learning Methods
|
See also Baidu Research DeepBench |
Hi everyone. For now, we have run some classic benchmarks with a K40 / Maxwell Titan X / GTX 1080 / Pascal Titan X, using various architectures and frameworks. Results are available here. |
@pedropgusmao from the curve it seems that you basically need to increase the workspace limit in Caffe: the value was chosen to support all platforms (old and new), but it usually comes out slow for the most recent cudnn on the most recent hardware. |
@Yangqing , thank you very much. Do you have any suggested values for workspace_limit_bytes considering both those GPUs and the DGX-1? Again, thanks. |
@pedropgusmao take a look at https://www.tensorflow.org/performance/performance_guide with TF 1.0.0. I believe we are also working on benchmarks for some of these models, so you'll have comparison code at some point soon. |
@vrv, thanks a lot! We will modify our code to follow those suggestions. We look forward to seeing your results. |
Hi all,
The reason I've been slow on convnet-benchmarks these days is because i've been working on the side on DeepMark.
I initially wrote convnet-benchmarks to increase competition among frameworks so that we can work towards faster ConvNets, and they served their purpose well. After the release of convnet-benchmarks, multiple frameworks pulled up their socks to speed up convnets, with a deep sense of prestige for being on top of these benchmarks. In these two years, we as a community accelerated GPU ConvNets across all frameworks by 4x to 10x, efficiently implementing tricks such as FFT and Winograd, and powered by faster hardware. Alex Krizhevsky, Yangqing Jia, Scott Gray, Nicolas Vasilache, Sander Dieleman, Michael Mathieu, Julien Demouth and many other human compilers helped make this a reality -- looking at the diversity in terms of where each of us work(ed) professionally shows that this kind of acceleration was truly a community effort with a ton of openness, something that is plain awesome! :)
I've also enjoyed reading the deeply technical discussions that take place on convnet-benchmarks (my favorites: #93 , #66 , #59 in recent times ).
Moving on, convnet-benchmarks do not accurately capture everything we think of when we say deep learning. We don't have recurrent nets, we don't have video use-cases, speech, NLP, etc. There is a need for more comprehensive benchmarks, especially as the space is getting ready for dedicated hardware chips, multi-GPU and multi-machine frameworks, and more complex use-cases.
I've sat down with a few of you at NIPS and GTC to discuss and freeze the initial round of benchmarks for what I am calling DeepMark. My initial plan was to work on the initial set of benchmark scripts by myself and cover the most popular frameworks, and then let the direction and maintenance of the benchmarks be community-driven. But the breadth of this effort has been overwhelming to say the least. After careful thought, I've decided that I'll just ask everyone to pitch in for their part of the benchmarks, e.g. by writing the scripts, especially as many of you were very receptive to the idea offline.
Here are the initial set of use-cases we want to cover:
Networks
Images
Video
Audio
Text
Platform
Metrics
Frameworks
Everyone who wants to join in, but I thought an initial set that is important to cover would be:
Scripts format
Guarantees
Governance
Timeframe
My hope is that this new set of benchmarks will not only increase competition but will also be beneficial in other ways to the community, serving as common examples to get started, etc.
Let me know what you think :)
Soumith
cc: @hughperkins @f0k @scott-gray @rajatmonga @vrv @benanne @nouiz @Yangqing @tqchen @unnonouno