
OOM Error while running eval6 poisoning speech command audio_p10_undefended.json #1761

Closed
Uncertain-Quark opened this issue Nov 14, 2022 · 7 comments
Labels
bug Something isn't working


@Uncertain-Quark

While running the following test case: scenario_configs/eval6/poisoning/audio_dlbd/audio_p10_undefended.json, I run into an OOM issue.

Below is the exact error:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[85511,16000] and type
 float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Sub]

I am using an NVIDIA 1080 Ti with armory 0.16.0.
Is the issue just that there is not enough VRAM? I am able to run the same scenario on CPU clusters.

I tried reducing the batch size from 64 all the way down to 2, and it still gives the same error.

@davidslater davidslater added this to the Release 0.16.1 milestone Nov 15, 2022
@davidslater davidslater added the bug Something isn't working label Nov 15, 2022
@davidslater davidslater self-assigned this Nov 15, 2022
@davidslater
Contributor

@Uncertain-Quark Can you post the armory logs that happen prior to the OOM error? I'm having trouble locating which tensor allocation this is.

@davidslater
Contributor

If I crank up my batch_size, I get a similar error:

  File "/workspace/armory/scenarios/poison.py", line 313, in fit                                                                                                                     
    self.model.fit(                                                                                                                                                                  
    │    │     └ <function InputFilter.__init__.<locals>.make_replacement.<locals>.replacement_function at 0x7f5883283a60>                                                           
    │    └ art.estimators.classification.tensorflow.TensorFlowV2Classifier(model=<keras.engine.functional.Functional object at 0x7f58506...
    └ <armory.scenarios.poison.Poison object at 0x7f5965611d90>                                                                                                                      

  File "/opt/conda/lib/python3.9/site-packages/art/estimators/classification/classifier.py", line 74, in replacement_function                                    
    return fdict[func_name](self, *args, **kwargs)
           │     │          │      │       └ {'batch_size': 512, 'nb_epochs': 1, 'verbose': False, 'shuffle': True}                                                                  
           │     │          │      └ (array([[-4.5776367e-04, -5.4931641e-04, -3.6621094e-04, ...,
           │     │          │                 4.8828125e-04,  6.4086914e-04,  7.6293945e-04],
           │     │          │              ...                                           
           │     │          └ art.estimators.classification.tensorflow.TensorFlowV2Classifier(model=<keras.engine.functional.Functional object at 0x7f58506...
           │     └ 'fit'                                                                                                                                                             
           └ {'__module__': 'art.estimators.classification.tensorflow', '__qualname__': 'TensorFlowV2Classifier', '__doc__': '\n    This c...
  File "/opt/conda/lib/python3.9/site-packages/art/estimators/classification/tensorflow.py", line 961, in fit
    self._train_step(self.model, images, labels)
    │    │           │    │      │       └ <tf.Tensor: shape=(512,), dtype=int64, numpy=
    │    │           │    │      │         array([11,  6, 11, 11,  7, 11, 11, 11,  5, 11, 10,  2,  7, 11, 11, 11, 11,
    │    │           │    │      │             ...
    │    │           │    │      └ <tf.Tensor: shape=(512, 16000), dtype=float32, numpy=
    │    │           │    │        array([[-2.8991699e-03, -3.3569336e-03, -3.1127930e-03, ...,
    │    │           │    │                 2...
    │    │           │    └ <property object at 0x7f5888ddaf40>
    │    │           └ art.estimators.classification.tensorflow.TensorFlowV2Classifier(model=<keras.engine.functional.Functional object at 0x7f58506...
    │    └ <function get_art_model.<locals>.train_step at 0x7f5859cba670>
    └ art.estimators.classification.tensorflow.TensorFlowV2Classifier(model=<keras.engine.functional.Functional object at 0x7f58506...

  File "/workspace/armory/baseline_models/tf_graph/audio_resnet50.py", line 60, in train_step
    predictions = model(samples, training=True)
                  │     └ <tf.Tensor: shape=(512, 16000), dtype=float32, numpy=
                  │       array([[-2.8991699e-03, -3.3569336e-03, -3.1127930e-03, ...,
                  │                2...
                  └ <keras.engine.functional.Functional object at 0x7f58506c0a90>

  File "/opt/conda/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/opt/conda/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 7209, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
          │    │                    └ _NotOkStatusException()
          │    └ <function _status_to_exception at 0x7f589e786040>
          └ <module 'tensorflow.python.eager.core' from '/opt/conda/lib/python3.9/site-packages/tensorflow/python/eager/core.py'>

tensorflow.python.framework.errors_impl.ResourceExhaustedError: Exception encountered when calling layer "conv4_block5_3_conv" (type Conv2D).

{{function_node __wrapped__BiasAdd_device_/job:localhost/replica:0/task:0/device:GPU:0}} OOM when allocating tensor with shape[512,8,9,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:BiasAdd]

Call arguments received by layer "conv4_block5_3_conv" (type Conv2D):
  • inputs=tf.Tensor(shape=(512, 8, 9, 256), dtype=float32)

However, I do not think that this is where your error is occurring, as the tensor allocation in your case is 2D.

@davidslater
Contributor

I think that what is happening is internal to the ART TensorFlowV2Classifier:
https://github.com/Trusted-AI/adversarial-robustness-toolbox/blob/main/art/estimators/tensorflow.py#L185
Essentially, it tries to create tensors out of the large numpy inputs, which are then sent to the GPU; the error surfaces at this call in the Armory scenario code:
https://github.com/twosixlabs/armory/blob/master/armory/scenarios/poison.py#L313
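
To illustrate (a rough sketch of the behavior, not the actual ART code): with a GPU present, eager elementwise ops like the Sub in the normalization run on GPU:0, so they need device memory for the full array before any batching happens.

import numpy as np
import tensorflow as tf

# Hypothetical sizes matching this run: 85511 clips of 16000 samples (~5.1 GiB as float32).
x = np.zeros((85511, 16000), dtype=np.float32)

# The elementwise Sub executes on GPU:0 when a GPU is available and requires a
# full-size device allocation, independent of the training batch_size.
x_gpu = tf.convert_to_tensor(x)
x_norm = x_gpu - 0.0  # the Sub op allocates a second full-size tensor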

@Uncertain-Quark
Author

Uncertain-Quark commented Nov 18, 2022

So I am not sure if this is the core of the issue, but I could get rid of the problem I was facing by changing lines 78-80 of https://github.com/Trusted-AI/adversarial-robustness-toolbox/blob/main/art/preprocessing/standardisation_mean_std/tensorflow.py:

x_norm = x - self._broadcastable_mean
x_norm = x_norm / self._broadcastable_std
x_norm = tf.cast(x_norm, dtype=ART_NUMPY_DTYPE)  # pylint: disable=E1123,E1120

In the code above, the subtraction runs on the GPU, which forces it to allocate the full (85511, 16000) tensor. I could resolve it by converting x to x.numpy() first.

But I am not sure that is the optimal way of solving it.
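
For reference, a minimal sketch of that workaround (assuming x arrives as a tf.Tensor inside StandardisationMeanStdTensorFlow.forward, as in the traceback above; this is my reading of the change, not the upstream ART code):

import tensorflow as tf
from art.config import ART_NUMPY_DTYPE

# Do the normalization in NumPy on the host instead of as GPU eager ops.
x_np = x.numpy() if isinstance(x, tf.Tensor) else x
x_norm = (x_np - self._broadcastable_mean) / self._broadcastable_std
x_norm = tf.cast(x_norm, dtype=ART_NUMPY_DTYPE)

Note that the final tf.cast still produces one full-size tensor, which is why the generator approach discussed below looks like the more robust fix.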

@davidslater
Contributor

The main problem is that we are working with an (85511, 16000) tensor in the first place. That is either 10.19 GiB or 5.10 GiB, depending on whether it is float64 or float32. This normalization operation likely doubles that (storing both the original tensor and the normalized tensor, at least temporarily), which is probably where you are exceeding the 11 GB on your GPU.
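
(Back-of-the-envelope check on those numbers, not Armory code:)

n = 85511 * 16000          # elements
print(n * 4 / 2**30)       # ~5.1 GiB as float32
print(n * 8 / 2**30)       # ~10.2 GiB as float64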

Switching that in the ART code would only fix the issue for sufficiently small datasets and sufficiently large GPUs.

A much better solution is to wrap the numpy arrays in a generator so that the TF operations are only working on (batch_size, 16000) tensors.
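
A minimal sketch of that idea (illustrative only, not the actual Armory fix): keep the full arrays in host memory and let a tf.data pipeline hand the model one (batch_size, 16000) tensor at a time.

import numpy as np
import tensorflow as tf

def batched_dataset(x: np.ndarray, y: np.ndarray, batch_size: int = 64) -> tf.data.Dataset:
    # tf.data ops run on the host; only the batches consumed by the training
    # loop are copied to the GPU, so peak device memory scales with batch_size.
    return (
        tf.data.Dataset.from_tensor_slices((x, y))
        .shuffle(buffer_size=1024)
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE)
    )

# for x_batch, y_batch in batched_dataset(x_train, y_train):
#     train_step(model, x_batch, y_batch)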

@davidslater
Contributor

@Uncertain-Quark I have a fix in this PR: #1767

Once merged in, it would just require adding this to your config:

    ...
    "scenario": {
        "kwargs": {
            "fit_generator": true
        },
    ...

@Uncertain-Quark
Author

Uncertain-Quark commented Jan 10, 2023

@davidslater after the 0.16.2 release, I am stuck with the issue again. This is the error log:

Traceback (most recent call last):

File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
│ │ └ {'__name__': '__main__', '__doc__': '\nMain script for running scenarios. Users will run a scenario by calling:\n armory r...
│ └ <code object at 0x7f454058fb50, file "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/armory/s...
└ <function _run_code at 0x7f4629d19900>
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
│ └ {'__name__': '__main__', '__doc__': '\nMain script for running scenarios. Users will run a scenario by calling:\n armory r...
└ <code object at 0x7f454058fb50, file "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/armory/s...
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/armory/scenarios/main.py", line 228, in
run_config(
└ <function run_config at 0x7f451f411120>
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/armory/scenarios/main.py", line 144, in run_config
scenario.evaluate()
│ └ <function Scenario.evaluate at 0x7f4511fa9120>
└ <armory.scenarios.poison.Poison object at 0x7f462941aef0>

File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/armory/scenarios/scenario.py", line 440, in evaluate
self._evaluate()
│ └ <function Scenario._evaluate at 0x7f4511fa9090>
└ <armory.scenarios.poison.Poison object at 0x7f462941aef0>
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/armory/scenarios/scenario.py", line 429, in _evaluate
self.load()
│ └ <function Poison.load at 0x7f4511faa050>
└ <armory.scenarios.poison.Poison object at 0x7f462941aef0>
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/armory/scenarios/poison.py", line 477, in load
self.fit()
│ └ <function Poison.fit at 0x7f4511fa9cf0>
└ <armory.scenarios.poison.Poison object at 0x7f462941aef0>
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/armory/scenarios/poison.py", line 332, in fit
self.model.fit(
│ │ └ <function InputFilter.__init__.<locals>.make_replacement.<locals>.replacement_function at 0x7f45480ea710>
│ └ art.estimators.classification.tensorflow.TensorFlowV2Classifier(model=<keras.engine.sequential.Sequential object at 0x7f45102...
└ <armory.scenarios.poison.Poison object at 0x7f462941aef0>
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/art/estimators/classification/classifier.py", line 73, in replacement_function
return fdict[func_name](self, *args, **kwargs)
│ │ │ │ └ {'batch_size': 64, 'nb_epochs': 20, 'verbose': False, 'shuffle': True}
│ │ │ └ (array([[-4.5776367e-04, -5.4931641e-04, -3.6621094e-04, ...,
│ │ │ 4.8828125e-04, 6.4086914e-04, 7.6293945e-04],
│ │ │ ...
│ │ └ art.estimators.classification.tensorflow.TensorFlowV2Classifier(model=<keras.engine.sequential.Sequential object at 0x7f45102...
│ └ 'fit'
└ {'__module__': 'art.estimators.classification.tensorflow', '__qualname__': 'TensorFlowV2Classifier', '__doc__': '\n This c...
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/art/estimators/classification/tensorflow.py", line 959, in fit
x_preprocessed, y_preprocessed = self._apply_preprocessing(x, y, fit=True)
│ │ │ └ array([[0., 0., 0., ..., 0., 0., 0.],
│ │ │ [0., 0., 0., ..., 0., 0., 0.],
│ │ │ [0., 0., 0., ..., 0., 0., 0.],
│ │ │ ...,...
│ │ └ array([[-4.5776367e-04, -5.4931641e-04, -3.6621094e-04, ...,
│ │ 4.8828125e-04, 6.4086914e-04, 7.6293945e-04],
│ │ ...
│ └ <function TensorFlowV2Estimator._apply_preprocessing at 0x7f4550441630>
└ art.estimators.classification.tensorflow.TensorFlowV2Classifier(model=<keras.engine.sequential.Sequential object at 0x7f45102...
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/art/estimators/tensorflow.py", line 192, in _apply_preprocessing
x, y = preprocess.forward(x, y)
│ │ │ │ └ <tf.Tensor: shape=(85511, 12), dtype=float32, numpy=
│ │ │ │ array([[0., 0., 0., ..., 0., 0., 0.],
│ │ │ │ [0., 0., 0., ..., 0., 0., 0...
│ │ │ └ <tf.Tensor: shape=(85511, 16000), dtype=float32, numpy=
│ │ │ array([[-4.5776367e-04, -5.4931641e-04, -3.6621094e-04, ...,
│ │ │ ...
│ │ └ <function StandardisationMeanStdTensorFlow.forward at 0x7f451047d990>
│ └ StandardisationMeanStdTensorFlow(mean=0.0, std=1.0, apply_fit=True, apply_predict=True)
└ <tf.Tensor: shape=(85511, 16000), dtype=float32, numpy=
array([[-4.5776367e-04, -5.4931641e-04, -3.6621094e-04, ...,
...
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/art/preprocessing/standardisation_mean_std/tensorflow.py", line 78, in forward
x_norm = x - self._broadcastable_mean
│ │ └ array(0., dtype=float32)
│ └ StandardisationMeanStdTensorFlow(mean=0.0, std=1.0, apply_fit=True, apply_predict=True)
└ <tf.Tensor: shape=(85511, 16000), dtype=float32, numpy=
array([[-4.5776367e-04, -5.4931641e-04, -3.6621094e-04, ...,
...
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 7215, in raise_from_not_ok_status
raise core._status_to_exception(e) from None # pylint: disable=protected-access
│ │ └ _NotOkStatusException()
│ └ <function _status_to_exception at 0x7f455b28add0>
└ <module 'tensorflow.python.eager.core' from '/home/usr/miniconda3/envs/armory_core/lib/python3.10/site-packages/tensorflow...

tensorflow.python.framework.errors_impl.ResourceExhaustedError: {{function_node __wrapped__Sub_device_/job:localhost/replica:0/task:0/device:GPU:0}} failed to allocate memory [Op:Sub]
