Improve speed and memory consumption of binned PrecisionRecallCurve #1493
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master   #1493      +/-   ##
==========================================
- Coverage      89%     51%      -38%
==========================================
  Files         216     216
  Lines       11243   11275       +32
==========================================
- Hits         9983    5720     -4263
- Misses       1260    5555     +4295
```
Hi @Callidior, thanks for this PR. In general the implementation looks good to me. In #1492 you mention that which implementation is faster depends on the number of samples and the number of thresholds (and potentially the number of classes as well). Is there a way we could derive a condition based on the product of those, or should we just make this adjustable via a flag (with the new implementation as default)? I am only asking because, while you say that you'd benefit from the new implementation, there might also be a lot of people benefitting from the current implementation (e.g. when memory isn't too much of a constraint). So ideally we would find a way to make both implementations available, or at least switch between them appropriately.
Hi @justusschock! Thanks for your feedback. I completely agree that there are also common use cases (such as image-level classification with < 100 samples per batch) for which the current implementation would be the better choice. An automatic switch that selects the more suitable algorithm would be my preferred solution as well. For the setup I benchmarked above for binary classification, one would probably switch to the old implementation (this PR) from roughly 10k samples upwards. I could run some more tests with different numbers of thresholds, but the result might even be hardware-dependent. What is more predictable, though, is the memory consumption. I would definitely not spend much more than 100 MB on computing a metric if I could do the same with 1-2 MB. We could estimate the memory needed by the current implementation based on the number of samples, classes, and thresholds. What do you think?
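For illustration only, a back-of-the-envelope estimate of the peak intermediate memory of the two strategies could look like the sketch below; the function names and the assumed shapes (a boolean comparison tensor for the vectorized path, one reused mask plus a small int64 confusion-matrix state for the loop) are my reading of the discussion, not the actual torchmetrics internals.

```python
# Rough, illustrative memory estimates; shapes and names are assumptions.

def vectorized_update_bytes(num_samples: int, num_thresholds: int, num_classes: int = 1) -> int:
    # The broadcasted comparison materializes a boolean tensor of shape
    # (num_samples, num_classes, num_thresholds), roughly one byte per element.
    return num_samples * num_classes * num_thresholds

def loop_update_bytes(num_samples: int, num_thresholds: int, num_classes: int = 1) -> int:
    # The loop over thresholds reuses a single boolean mask over the batch and
    # keeps a small (num_thresholds, num_classes, 2, 2) int64 confusion-matrix state.
    return num_samples * num_classes + num_thresholds * num_classes * 2 * 2 * 8

num_samples = 8 * 512 * 512  # a small semantic-segmentation batch: 8 images of 512 x 512 pixels
print(vectorized_update_bytes(num_samples, 200) / 1024**2)  # ~400 MB intermediate
print(loop_update_bytes(num_samples, 200) / 1024**2)        # ~2 MB intermediate
```

The numbers line up with the 100 MB vs. 1-2 MB figures mentioned above once the batch reaches segmentation scale, which is why an estimate of this kind could drive an automatic switch.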
A more detailed speed benchmark for the binary case (time in ms):
[Benchmark table]
The documentation of the binned version promises constant memory consumption.
That is true for the metric's state but not for intermediate computations during the update step. The documentation does not mention speed, but it puts a promise of memory efficiency in the foreground, and that promise is currently not kept. With the current implementation, I don't see any benefit of using the binned version of the metric: if you have few samples, the accurate version (thresholds=None) would be the better choice anyway.

@justusschock: Given that users can already switch between the accurate and the binned version, I don't think an explicit second switch for two different implementations of the binned version makes sense.
@Callidior thank you for investigating this issue.
The thing your analysis is missing is that you are assuming that all samples are evaluated at the same time; however, this is not how it is done in practice in any deep learning pipeline, where samples arrive in mini-batches. Thus, there is a significant difference in memory between doing

```python
import torch
from torchmetrics.classification import BinaryPrecisionRecallCurve

metric = BinaryPrecisionRecallCurve(thresholds=200)

preds, target = torch.rand(1_000_000), torch.randint(2, (1_000_000,))
output = metric(preds, target)
```

and

```python
preds, target = torch.rand(1_000_000), torch.randint(2, (1_000_000,))
for i in range(1000):
    metric.update(preds[1000 * i : 1000 * (i + 1)], target[1000 * i : 1000 * (i + 1)])
output = metric.compute()
```

This is the reason the binned version exists: the non-binned version would have to save all samples in memory in both cases, whereas the binned version only needs to save a reduced state (the per-threshold confusion-matrix counts).
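As a hedged sketch of what such a reduced state can look like (shapes, names, and the 2x2-per-threshold layout are illustrative, not the exact torchmetrics internals), the binned metric only accumulates per-threshold confusion-matrix counts, so its state stays the same size no matter how many mini-batches arrive:

```python
import torch

def init_state(num_thresholds: int) -> torch.Tensor:
    # One 2x2 confusion matrix per threshold; its size does not grow with the
    # number of samples seen so far.
    return torch.zeros(num_thresholds, 2, 2, dtype=torch.long)

def update_state(state, preds, target, thresholds):
    # Loop over thresholds so that only one boolean mask over the mini-batch
    # exists at any time.
    for i, t in enumerate(thresholds):
        pred_pos = preds >= t
        state[i, 1, 1] += (pred_pos & (target == 1)).sum()   # true positives
        state[i, 0, 1] += (pred_pos & (target == 0)).sum()   # false positives
        state[i, 1, 0] += (~pred_pos & (target == 1)).sum()  # false negatives
        state[i, 0, 0] += (~pred_pos & (target == 0)).sum()  # true negatives
    return state

thresholds = torch.linspace(0, 1, 200)
state = init_state(len(thresholds))
for _ in range(10):  # mini-batches arriving one by one
    preds, target = torch.rand(1000), torch.randint(2, (1000,))
    update_state(state, preds, target, thresholds)

precision = state[:, 1, 1] / (state[:, 1, 1] + state[:, 0, 1]).clamp(min=1)
recall = state[:, 1, 1] / (state[:, 1, 1] + state[:, 1, 0]).clamp(min=1)
```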
I would agree with @justusschock that the correct approach is to implement some kind of heuristic that switches between the two implementations.
@SkafteNicki Fair point. I updated the PR to implement the suggested heuristic switch between the two implementations. Further benchmark results for the binary and multi-class case on both CPU and GPU can be found below, as well as the benchmark script. The picture on the CPU doesn't look too different from the situation on the GPU. Based on those benchmarks, I'd propose the following heuristic:
Speed Benchmark
GPU: NVIDIA GeForce RTX 3090
[Benchmark tables: Binary (GPU and CPU) and Multi-Class (GPU). White cells mean that more than 24 GB of GPU memory would have been required.]
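Purely as an illustration of the shape such a switch could take (the cut-off constant and the function name below are assumptions, not the values actually merged in this PR), the dispatch can key on the product of the number of elements and the number of thresholds:

```python
import torch

# Hypothetical cut-off on the product of element count and threshold count;
# the constant used in the merged heuristic may differ.
_MAX_BROADCAST_ELEMENTS = 50_000_000

def choose_update_strategy(preds: torch.Tensor, num_thresholds: int) -> str:
    """Pick the update path: broadcast everything at once for small inputs,
    loop over thresholds for large ones (e.g. semantic segmentation batches)."""
    if preds.numel() * num_thresholds <= _MAX_BROADCAST_ELEMENTS:
        return "vectorized"
    return "loop_over_thresholds"

print(choose_update_strategy(torch.rand(1_000), 200))       # vectorized
print(choose_update_strategy(torch.rand(4_000_000), 200))   # loop_over_thresholds
```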
@Callidior this heuristic is fine with me. Thanks for all the detailed benchmarks!
What does this PR do?
Fixes #1492 by reverting the implementation of the update step of binary and multiclass PrecisionRecallCurve with a fixed number of thresholds to the implementation from torchmetrics 0.9.
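For illustration, a minimal sketch of the trade-off (not the actual torchmetrics code; sizes, names, and shapes are assumptions): the broadcasted update on master builds one large comparison tensor, while a per-threshold loop like the one in torchmetrics 0.9 only ever holds a single mask over the batch.

```python
import torch

preds = torch.rand(100_000)                 # scores for one batch
target = torch.randint(2, (100_000,))
thresholds = torch.linspace(0, 1, 200)

# Broadcasted update: one (num_samples, num_thresholds) boolean intermediate.
# At segmentation scale (millions of pixels) this tensor alone reaches hundreds of MB.
preds_t = preds.unsqueeze(-1) >= thresholds          # shape (100_000, 200)
tp = (preds_t & (target == 1).unsqueeze(-1)).sum(0)
fp = (preds_t & (target == 0).unsqueeze(-1)).sum(0)

# Per-threshold loop (as in torchmetrics 0.9): only one (num_samples,) mask at a time.
tp_loop = torch.zeros(len(thresholds), dtype=torch.long)
fp_loop = torch.zeros(len(thresholds), dtype=torch.long)
for i, t in enumerate(thresholds):
    mask = preds >= t
    tp_loop[i] = (mask & (target == 1)).sum()
    fp_loop[i] = (mask & (target == 0)).sum()

# Both paths count the same thing; they differ only in peak memory and speed.
assert torch.equal(tp, tp_loop) and torch.equal(fp, fp_loop)
```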
I did not touch the multilabel case, because it is difficult to adapt the ignore_index filtering logic, a first attempt at implementing it on a per-threshold basis was much slower, and multilabel classification is not common in semantic segmentation scenarios, which is where we typically have to deal with huge numbers of samples.

Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Performance Comparison
Hardware: NVIDIA RTX 3090
[Benchmark table: BinaryPrecisionRecallCurve.update() with thresholds=200]
[Benchmark table: MulticlassPrecisionRecallCurve.update() with num_classes=100, thresholds=200]
The implementation proposed here is not faster in all cases, but it is always more memory efficient. In particular, it is up to 100x slower for small numbers of samples, such as in a typical image-level classification scenario. However, it is up to 100x faster for scenarios with many samples per batch, such as semantic segmentation use cases. In those scenarios, the extreme memory consumption of the previous implementation is the biggest deal-breaker.
Did you have fun?
Make sure you had fun coding 🙃