
Runtime measurement #51

Open
MarSpit opened this issue May 30, 2022 · 4 comments

Comments

@MarSpit

MarSpit commented May 30, 2022

Hi there,
thank you very much for your excellent work and for publishing it.

I am trying to implement a "lightweight" version of ENet that aims to be faster in computation.
To have the runtime of ENet on my hardware (a Tesla V100 GPU) as a benchmark, I tried to measure it. Being aware of issue #4, I took the torch.cuda.synchronize() command into account.
Measuring the time this way, I obtained a runtime of 10.5 ms for processing a single image.
However, I realized that I cannot compute more than 6 depth images per second (at 100 % GPU load), which indicated to me that something was wrong.
Investigating further, I came across the PyTorch profiler, which seems to be the official tool for correct GPU time measurement. Measuring the time that way, I got 150 ms, which is consistent with my maximum frame rate, given that data preprocessing on the CPU comes on top.

Is it possible that the times you measured are still not the proper execution times of the network, but rather the kernel launch times?
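
For reference, here is a minimal sketch of the two timing approaches I compared (synchronized time.time() vs. the PyTorch profiler). The model and input shape below are only placeholders, not the actual ENet:

```python
import time
import torch
import torch.profiler

# Placeholder model and input: substitute the actual ENet and a KITTI-sized sample.
model = torch.nn.Conv2d(3, 64, 3, padding=1).cuda().eval()
x = torch.randn(1, 3, 352, 1216, device="cuda")

with torch.no_grad():
    # Warm up so lazy CUDA initialization does not distort the measurement.
    for _ in range(10):
        model(x)
    torch.cuda.synchronize()

    # (a) Wall-clock time with explicit synchronization.
    t0 = time.time()
    model(x)
    torch.cuda.synchronize()
    print(f"time.time() + synchronize: {(time.time() - t0) * 1e3:.2f} ms")

    # (b) PyTorch profiler, recording both CPU and CUDA activity.
    with torch.profiler.profile(
            activities=[torch.profiler.ProfilerActivity.CPU,
                        torch.profiler.ProfilerActivity.CUDA]) as prof:
        model(x)
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```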

@JUGGHM
Owner

JUGGHM commented May 31, 2022


Thanks for your interest! I have not used the PyTorch profiler before. I will probably try it soon, but it will take some days.

@JUGGHM
Owner

JUGGHM commented Jun 13, 2022


Hi Marspit, I have just tested the runtime with a short script where a same tensor is fed into ENet for 100 times. The model has been warmed up and gradient calculation is forbidden. As I have no 2080ti now, the experiments are conducted on a 3090 without no other payload. Our results are:

(1) Measuring with time.time() (synchronized, of course): 2.9389 s
(2) Measuring with torch.cuda.Event: 2.9519 s
(3) Measuring with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA]): GPU time 2.946 s, CPU time 2.575 s

So I don't think our measurement method leads to such a large speed inconsistency (up to 50%). From individual experience, I/O interaction could be one performance bottleneck if other programs are being executed at the same time.
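
For completeness, a minimal sketch of the three measurement methods above over 100 forward passes; the model below is only a placeholder standing in for ENet:

```python
import time
import torch
import torch.profiler

# Placeholder model and input standing in for ENet; gradients disabled, model warmed up.
model = torch.nn.Conv2d(3, 64, 3, padding=1).cuda().eval()
x = torch.randn(1, 3, 352, 1216, device="cuda")
N = 100

with torch.no_grad():
    for _ in range(10):  # warm-up
        model(x)
    torch.cuda.synchronize()

    # (1) time.time() with synchronization around the whole loop.
    t0 = time.time()
    for _ in range(N):
        model(x)
    torch.cuda.synchronize()
    print(f"time.time(): {time.time() - t0:.4f} s")

    # (2) CUDA events recorded on the default stream.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(N):
        model(x)
    end.record()
    torch.cuda.synchronize()
    print(f"torch.cuda.Event: {start.elapsed_time(end) / 1e3:.4f} s")  # elapsed_time returns ms

    # (3) PyTorch profiler with CPU and CUDA activities.
    with torch.profiler.profile(
            activities=[torch.profiler.ProfilerActivity.CPU,
                        torch.profiler.ProfilerActivity.CUDA]) as prof:
        for _ in range(N):
            model(x)
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```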

@MarSpit
Author

MarSpit commented Jun 13, 2022

Hi JUGGHM,
many thanks for checking my point and doing the new measurements. The times you measured in the different ways seem pretty similar; I am not sure why I get such a difference between the profiler and time.time() with the synchronize command.
However, there seems to be a big difference from the time you state for ENet on the 2080 Ti GPU, which is 0.064 s vs. the ~2.9 s you measured now.
Do you have a guess what the reason for this is? ~2.9 s seems pretty long on a GPU?

@JUGGHM
Owner

JUGGHM commented Jun 14, 2022


The 2.9 s is for 100 repetitions on one 3090, indicating an inference speed of ~30 ms per sample.
