
OOM Error while running eval6 poisoning speech command audio_p10_undefended.json #1761

Closed
Uncertain-Quark opened this issue Nov 14, 2022 · 7 comments
Labels
bug Something isn't working


@Uncertain-Quark

While running the following test case: scenario_configs/eval6/poisoning/audio_dlbd/audio_p10_undefended.json, I run into an OOM issue.

Below is the exact error:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[85511,16000] and type
 float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Sub]

I am using an NVIDIA 1080 Ti with armory 0.16.0.
Is the issue just that there is not enough VRAM? I am able to run the same scenario on CPU clusters.

I tried reducing the batch size from 64 all the way down to 2, and it still gives the same error.

@davidslater davidslater added this to the Release 0.16.1 milestone Nov 15, 2022
@davidslater davidslater added the bug Something isn't working label Nov 15, 2022
@davidslater davidslater self-assigned this Nov 15, 2022
@davidslater
Contributor

@Uncertain-Quark Can you post the armory logs that happen prior to the OOM error? I'm having trouble locating which tensor allocation this is.

@davidslater
Contributor

If I crank up my batch_size, I get a similar error:

  File "/workspace/armory/scenarios/poison.py", line 313, in fit                                                                                                                     
    self.model.fit(                                                                                                                                                                  
    │    │     └ <function InputFilter.__init__.<locals>.make_replacement.<locals>.replacement_function at 0x7f5883283a60>                                                           
    │    └ art.estimators.classification.tensorflow.TensorFlowV2Classifier(model=<keras.engine.functional.Functional object at 0x7f58506...
    └ <armory.scenarios.poison.Poison object at 0x7f5965611d90>                                                                                                                      

  File "/opt/conda/lib/python3.9/site-packages/art/estimators/classification/classifier.py", line 74, in replacement_function                                    
    return fdict[func_name](self, *args, **kwargs)
           │     │          │      │       └ {'batch_size': 512, 'nb_epochs': 1, 'verbose': False, 'shuffle': True}                                                                  
           │     │          │      └ (array([[-4.5776367e-04, -5.4931641e-04, -3.6621094e-04, ...,
           │     │          │                 4.8828125e-04,  6.4086914e-04,  7.6293945e-04],
           │     │          │              ...                                           
           │     │          └ art.estimators.classification.tensorflow.TensorFlowV2Classifier(model=<keras.engine.functional.Functional object at 0x7f58506...
           │     └ 'fit'                                                                                                                                                             
           └ {'__module__': 'art.estimators.classification.tensorflow', '__qualname__': 'TensorFlowV2Classifier', '__doc__': '\n    This c...
  File "/opt/conda/lib/python3.9/site-packages/art/estimators/classification/tensorflow.py", line 961, in fit
    self._train_step(self.model, images, labels)
    │    │           │    │      │       └ <tf.Tensor: shape=(512,), dtype=int64, numpy=
    │    │           │    │      │         array([11,  6, 11, 11,  7, 11, 11, 11,  5, 11, 10,  2,  7, 11, 11, 11, 11,
    │    │           │    │      │             ...
    │    │           │    │      └ <tf.Tensor: shape=(512, 16000), dtype=float32, numpy=
    │    │           │    │        array([[-2.8991699e-03, -3.3569336e-03, -3.1127930e-03, ...,
    │    │           │    │                 2...
    │    │           │    └ <property object at 0x7f5888ddaf40>
    │    │           └ art.estimators.classification.tensorflow.TensorFlowV2Classifier(model=<keras.engine.functional.Functional object at 0x7f58506...
    │    └ <function get_art_model.<locals>.train_step at 0x7f5859cba670>
    └ art.estimators.classification.tensorflow.TensorFlowV2Classifier(model=<keras.engine.functional.Functional object at 0x7f58506...

  File "/workspace/armory/baseline_models/tf_graph/audio_resnet50.py", line 60, in train_step
    predictions = model(samples, training=True)
                  │     └ <tf.Tensor: shape=(512, 16000), dtype=float32, numpy=
                  │       array([[-2.8991699e-03, -3.3569336e-03, -3.1127930e-03, ...,
                  │                2...
                  └ <keras.engine.functional.Functional object at 0x7f58506c0a90>

  File "/opt/conda/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/opt/conda/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 7209, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
          │    │                    └ _NotOkStatusException()
          │    └ <function _status_to_exception at 0x7f589e786040>
          └ <module 'tensorflow.python.eager.core' from '/opt/conda/lib/python3.9/site-packages/tensorflow/python/eager/core.py'>

tensorflow.python.framework.errors_impl.ResourceExhaustedError: Exception encountered when calling layer "conv4_block5_3_conv" (type Conv2D).

{{function_node __wrapped__BiasAdd_device_/job:localhost/replica:0/task:0/device:GPU:0}} OOM when allocating tensor with shape[512,8,9,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:BiasAdd]

Call arguments received by layer "conv4_block5_3_conv" (type Conv2D):
  • inputs=tf.Tensor(shape=(512, 8, 9, 256), dtype=float32)

However, I do not think that this is where your error is occurring, as the tensor allocation in your case is 2D.

@davidslater
Contributor

I think that what is happening is internal to the ART TensorFlowV2Classifier:
https://github.com/Trusted-AI/adversarial-robustness-toolbox/blob/main/art/estimators/tensorflow.py#L185
Essentially, it tries to create tensors out of the large numpy inputs, which are then sent to the GPU; the error surfaces at this call in the Armory scenario code:
https://github.com/twosixlabs/armory/blob/master/armory/scenarios/poison.py#L313
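
To illustrate (a rough sketch of the behavior, not the actual ART code): with a GPU present, eager elementwise ops like the Sub in the normalization run on GPU:0, so they need device memory for the full array before any batching happens.

import numpy as np
import tensorflow as tf

# Hypothetical sizes matching this run: 85511 clips of 16000 samples (~5.1 GiB as float32).
x = np.zeros((85511, 16000), dtype=np.float32)

# The elementwise Sub executes on GPU:0 when a GPU is available and requires a
# full-size device allocation, independent of the training batch_size.
x_gpu = tf.convert_to_tensor(x)
x_norm = x_gpu - 0.0  # the Sub op allocates a second full-size tensor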

@Uncertain-Quark
Author

Uncertain-Quark commented Nov 18, 2022

So I am not sure if this is the core of the issue, but I could get rid of the problem I was facing by changing lines 78-80 of https://github.com/Trusted-AI/adversarial-robustness-toolbox/blob/main/art/preprocessing/standardisation_mean_std/tensorflow.py:

x_norm = x - self._broadcastable_mean
x_norm = x_norm / self._broadcastable_std
x_norm = tf.cast(x_norm, dtype=ART_NUMPY_DTYPE)  # pylint: disable=E1123,E1120

In the code above, the subtraction runs on the GPU, which forces it to allocate the full (85511, 16000) tensor. I could resolve it by converting x to x.numpy() first.

But I am not sure that is the optimal way of solving it.
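
For reference, a minimal sketch of that workaround (assuming x arrives as a tf.Tensor inside StandardisationMeanStdTensorFlow.forward, as in the traceback above; this is my reading of the change, not the upstream ART code):

import tensorflow as tf
from art.config import ART_NUMPY_DTYPE

# Do the normalization in NumPy on the host instead of as GPU eager ops.
x_np = x.numpy() if isinstance(x, tf.Tensor) else x
x_norm = (x_np - self._broadcastable_mean) / self._broadcastable_std
x_norm = tf.cast(x_norm, dtype=ART_NUMPY_DTYPE)

Note that the final tf.cast still produces one full-size tensor, which is why the generator approach discussed below looks like the more robust fix.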

@davidslater
Contributor

The main problem is that we are working with an (85511, 16000) tensor in the first place. That is either 10.19 GiB or 5.10 GiB, depending on whether it is float64 or float32. This normalization operation likely doubles that (storing both the original tensor and the normalized tensor, at least temporarily), which is probably where you are exceeding the 11 GB on your GPU.
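
(Back-of-the-envelope check on those numbers, not Armory code:)

n = 85511 * 16000          # elements
print(n * 4 / 2**30)       # ~5.1 GiB as float32
print(n * 8 / 2**30)       # ~10.2 GiB as float64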

Switching that in the ART code would only fix the issue for sufficiently small datasets and sufficiently large GPUs.

A much better solution is to wrap the numpy arrays in a generator so that the TF operations are only working on (batch_size, 16000) tensors.
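
A minimal sketch of that idea (illustrative only, not the actual Armory fix): keep the full arrays in host memory and let a tf.data pipeline hand the model one (batch_size, 16000) tensor at a time.

import numpy as np
import tensorflow as tf

def batched_dataset(x: np.ndarray, y: np.ndarray, batch_size: int = 64) -> tf.data.Dataset:
    # tf.data ops run on the host; only the batches consumed by the training
    # loop are copied to the GPU, so peak device memory scales with batch_size.
    return (
        tf.data.Dataset.from_tensor_slices((x, y))
        .shuffle(buffer_size=1024)
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE)
    )

# for x_batch, y_batch in batched_dataset(x_train, y_train):
#     train_step(model, x_batch, y_batch)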

@davidslater
Contributor

@Uncertain-Quark I have a fix in this PR: #1767

Once merged in, it would just require adding this to your config:

    ...
    "scenario": {
        "kwargs": {
            "fit_generator": true
        },
    ...

@Uncertain-Quark
Author

Uncertain-Quark commented Jan 10, 2023

@davidslater after the 0.16.2 release, I am stuck with the issue again. This is the error log:

Traceback (most recent call last):

File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
│ │ └ {'__name__': '__main__', '__doc__': '\nMain script for running scenarios. Users will run a scenario by calling:\n armory r...
│ └ <code object at 0x7f454058fb50, file "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/armory/s...
└ <function _run_code at 0x7f4629d19900>
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
│ └ {'__name__': '__main__', '__doc__': '\nMain script for running scenarios. Users will run a scenario by calling:\n armory r...
└ <code object at 0x7f454058fb50, file "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/armory/s...
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/armory/scenarios/main.py", line 228, in
run_config(
└ <function run_config at 0x7f451f411120>
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/armory/scenarios/main.py", line 144, in run_config
scenario.evaluate()
│ └ <function Scenario.evaluate at 0x7f4511fa9120>
└ <armory.scenarios.poison.Poison object at 0x7f462941aef0>

File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/armory/scenarios/scenario.py", line 440, in evaluate
self._evaluate()
│ └ <function Scenario._evaluate at 0x7f4511fa9090>
└ <armory.scenarios.poison.Poison object at 0x7f462941aef0>
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/armory/scenarios/scenario.py", line 429, in _evaluate
self.load()
│ └ <function Poison.load at 0x7f4511faa050>
└ <armory.scenarios.poison.Poison object at 0x7f462941aef0>
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/armory/scenarios/poison.py", line 477, in load
self.fit()
│ └ <function Poison.fit at 0x7f4511fa9cf0>
└ <armory.scenarios.poison.Poison object at 0x7f462941aef0>
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/armory/scenarios/poison.py", line 332, in fit
self.model.fit(
│ │ └ <function InputFilter.__init__.<locals>.make_replacement.<locals>.replacement_function at 0x7f45480ea710>
│ └ art.estimators.classification.tensorflow.TensorFlowV2Classifier(model=<keras.engine.sequential.Sequential object at 0x7f45102...
└ <armory.scenarios.poison.Poison object at 0x7f462941aef0>
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/art/estimators/classification/classifier.py", line 73, in replacement_function
return fdict[func_name](self, *args, **kwargs)
│ │ │ │ └ {'batch_size': 64, 'nb_epochs': 20, 'verbose': False, 'shuffle': True}
│ │ │ └ (array([[-4.5776367e-04, -5.4931641e-04, -3.6621094e-04, ...,
│ │ │ 4.8828125e-04, 6.4086914e-04, 7.6293945e-04],
│ │ │ ...
│ │ └ art.estimators.classification.tensorflow.TensorFlowV2Classifier(model=<keras.engine.sequential.Sequential object at 0x7f45102...
│ └ 'fit'
└ {'__module__': 'art.estimators.classification.tensorflow', '__qualname__': 'TensorFlowV2Classifier', '__doc__': '\n This c...
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/art/estimators/classification/tensorflow.py", line 959, in fit
x_preprocessed, y_preprocessed = self._apply_preprocessing(x, y, fit=True)
│ │ │ └ array([[0., 0., 0., ..., 0., 0., 0.],
│ │ │ [0., 0., 0., ..., 0., 0., 0.],
│ │ │ [0., 0., 0., ..., 0., 0., 0.],
│ │ │ ...,...
│ │ └ array([[-4.5776367e-04, -5.4931641e-04, -3.6621094e-04, ...,
│ │ 4.8828125e-04, 6.4086914e-04, 7.6293945e-04],
│ │ ...
│ └ <function TensorFlowV2Estimator._apply_preprocessing at 0x7f4550441630>
└ art.estimators.classification.tensorflow.TensorFlowV2Classifier(model=<keras.engine.sequential.Sequential object at 0x7f45102...
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/art/estimators/tensorflow.py", line 192, in _apply_preprocessing
x, y = preprocess.forward(x, y)
│ │ │ │ └ <tf.Tensor: shape=(85511, 12), dtype=float32, numpy=
│ │ │ │ array([[0., 0., 0., ..., 0., 0., 0.],
│ │ │ │ [0., 0., 0., ..., 0., 0., 0...
│ │ │ └ <tf.Tensor: shape=(85511, 16000), dtype=float32, numpy=
│ │ │ array([[-4.5776367e-04, -5.4931641e-04, -3.6621094e-04, ...,
│ │ │ ...
│ │ └ <function StandardisationMeanStdTensorFlow.forward at 0x7f451047d990>
│ └ StandardisationMeanStdTensorFlow(mean=0.0, std=1.0, apply_fit=True, apply_predict=True)
└ <tf.Tensor: shape=(85511, 16000), dtype=float32, numpy=
array([[-4.5776367e-04, -5.4931641e-04, -3.6621094e-04, ...,
...
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/art/preprocessing/standardisation_mean_std/tensorflow.py", line 78, in forward
x_norm = x - self._broadcastable_mean
│ │ └ array(0., dtype=float32)
│ └ StandardisationMeanStdTensorFlow(mean=0.0, std=1.0, apply_fit=True, apply_predict=True)
└ <tf.Tensor: shape=(85511, 16000), dtype=float32, numpy=
array([[-4.5776367e-04, -5.4931641e-04, -3.6621094e-04, ...,
...
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 7215, in raise_from_not_ok_status
raise core._status_to_exception(e) from None # pylint: disable=protected-access
│ │ └ _NotOkStatusException()
│ └ <function _status_to_exception at 0x7f455b28add0>
└ <module 'tensorflow.python.eager.core' from '/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/tensorflow...

tensorflow.python.framework.errors_impl.ResourceExhaustedError: {{function_node __wrapped__Sub_device_/job:localhost/replica:0/task:0/device:GPU:0}} failed to allocate memory [Op:Sub]
