Runtime measurement #51
Comments
Thanks for your interest! I have not used the PyTorch profiler before. I will probably try it soon, but it will take some days.
Hi Marspit, I have just tested the runtime with a short script where the same tensor is fed into ENet 100 times. The model was warmed up and gradient calculation was disabled. As I no longer have a 2080 Ti, the experiments were conducted on a 3090 with no other workload. Our results are: (1) Measuring by time.time() (synchronized, of course): 2.9389 s. So I don't think our measurement leads to a speed inconsistency as large as 50%. From my individual experience, I/O interaction could be one performance bottleneck if other programs are being executed at the same time.
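For reference, a minimal sketch of the kind of benchmark described above: warm up the model, disable gradients, and time 100 forward passes of the same tensor with explicit CUDA synchronization. The `ENet` constructor and the input shape here are assumptions for illustration; the actual model may take additional inputs (e.g. sparse depth).

```python
import time
import torch

model = ENet().cuda().eval()                 # assumed model constructor
x = torch.randn(1, 3, 352, 1216).cuda()      # assumed input size

with torch.no_grad():
    # warm-up iterations so lazy CUDA initialization is not timed
    for _ in range(10):
        model(x)
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()                 # wait for all kernels to finish
    elapsed = time.time() - start

print(f"total: {elapsed:.4f} s, per sample: {elapsed / 100 * 1000:.1f} ms")
```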
Hi JUGGHM,
2.9 s is for a 100-time repetition on one 3090, indicating an inference speed of ~30 ms per sample.
Hi there,
thank you very much for your excellent work and for publishing it.
I am trying to implement a lightweight version of ENet that aims to be computationally faster.
In order to have the runtime of ENet on my hardware (a Tesla V100 GPU) as a benchmark, I tried to measure it. Being aware of issue #4, I took the torch.cuda.synchronize() command into account.
Measuring the time this way, I obtained a runtime of 10.5 ms for processing a single image.
However, I realized that I cannot compute more than 6 depth images per second (while the GPU load is at 100%), which indicated to me that something was wrong.
Doing further investigation, I came across the PyTorch profiler, which seems to be the official tool for correct GPU time measurement. Measuring the time that way, I got 150 ms, which is consistent with my maximum frame rate, since data preprocessing on the CPU comes on top.
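A hedged sketch of what that profiler-based measurement could look like, assuming `model` and `x` are defined as in the benchmark sketch above:

```python
import torch
from torch.profiler import profile, ProfilerActivity

with torch.no_grad():
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        model(x)
        torch.cuda.synchronize()

# The CUDA time columns in this table reflect actual kernel execution time
# on the GPU, not just the (asynchronous) launch overhead on the CPU.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```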
Is it possible that the times you measured are still not the proper execution times of the network, but rather the kernel launch times?
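To illustrate the distinction (again assuming `model` and `x` from the sketches above): CUDA kernel launches return immediately, so an unsynchronized measurement captures only the launch overhead, while the synchronized one includes the actual GPU execution.

```python
import time
import torch

with torch.no_grad():
    t0 = time.time()
    y = model(x)                       # launches kernels asynchronously
    launch_time = time.time() - t0     # may be far shorter than the real GPU work

    torch.cuda.synchronize()           # block until the GPU has finished
    total_time = time.time() - t0      # launch + actual kernel execution

print(f"launch only: {launch_time * 1000:.2f} ms, with execution: {total_time * 1000:.2f} ms")
```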