OOM Error while running eval6 poisoning speech command audio_p10_undefended.json #1761
Comments
@Uncertain-Quark Can you post the armory logs that happen prior to the OOM error? I'm having trouble locating which tensor allocation this is.
If I crank up my batch_size, I get a similar error:
However, I do not think that this is where your error is occurring, as the tensor allocation in your case is 2D.
I think that what is happening is internal to the ART TensorFlowV2Classifier:
I am not sure if this is the core of the issue, but I could get rid of the error I was facing by changing lines 78-80 of https://github.com/Trusted-AI/adversarial-robustness-toolbox/blob/main/art/preprocessing/standardisation_mean_std/tensorflow.py
In that code, the subtraction loads the tensor onto the GPU, which causes it to allocate the full (85511, 16000) tensor. I could resolve it by converting x to x.numpy(), but I am not sure that is the optimal way of solving it.
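For reference, here is a minimal sketch of the kind of change being described. The actual contents of that ART file may differ; the function name and shapes below are illustrative, and the x.numpy() conversion is just the workaround mentioned above, not an official fix.

```python
import numpy as np
import tensorflow as tf

def standardise(x, mean, std):
    # Assumed shape of the preprocessing step: if x arrives as a single
    # tf.Tensor holding the whole dataset, the subtraction
    #   x_norm = (x - mean) / std
    # materializes the full (85511, 16000) result on the GPU, which is what
    # runs out of memory. Workaround described above: pull the data back to
    # host memory first so the arithmetic happens in numpy on the CPU.
    x_host = x.numpy() if tf.is_tensor(x) else np.asarray(x)
    x_norm = (x_host - mean) / std
    return x_norm
```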
The main problem is that we are working with a (85511, 16000) tensor in the first place. That is either 10.19 GB or 5.09 GB, depending on whether it is a float64 or float32. This normalization operation likely doubles that (storing both the original tensor and the normalized tensor, at least temporarily), which is probably where you are exceeding 11 GB in your GPU. Switching that in the ART code would only fix the issue for datasets of a certain size and GPUs of a certain size. A much better solution is to wrap the numpy arrays in a generator so that the TF operations are only working on (batch_size, 16000) tensors.
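As an illustration of the generator idea (not the actual armory implementation; the names here are made up), batching on the host keeps every tensor that reaches TensorFlow at (batch_size, 16000):

```python
import numpy as np

def batch_generator(x, y, batch_size=64):
    # x, y are full numpy arrays kept in host RAM. At float32, the full
    # (85511, 16000) array is ~5.5 GB, but each yielded batch is only
    # batch_size * 16000 * 4 bytes (~4 MB at batch_size=64).
    n = len(x)
    for start in range(0, n, batch_size):
        yield x[start:start + batch_size], y[start:start + batch_size]

# Hypothetical usage: hand one batch at a time to the classifier so the GPU
# never holds more than a single batch of audio.
# for x_batch, y_batch in batch_generator(x_train, y_train):
#     classifier.fit(x_batch, y_batch, batch_size=len(x_batch), nb_epochs=1)
```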
@Uncertain-Quark I have a fix in this PR: #1767. Once merged in, it would just require adding this to your config:
@davidslater After the 0.16.2 release, I am stuck with the issue again. This is the error log:
Traceback (most recent call last):
  File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/runpy.py", line 196, in _run_module_as_main
tensorflow.python.framework.errors_impl.ResourceExhaustedError: {{function_node _wrapped__Sub_device/job:localhost/replica:0/task:0/device:GPU:0}} failed to allocate memory [Op:Sub]
While running the following test case: scenario_configs/eval6/poisoning/audio_dlbd/audio_p10_undefended.json, I run into an OOM issue.
Below is the exact error:
I am using an NVIDIA 1080 Ti with armory 0.16.0.
Is the issue simply that there is not enough VRAM? I am able to run the same scenario on CPU clusters.
I tried reducing the batch size from 64 all the way down to 2, and it still gives the same error.