
Add TensorFlow Whisper model for audio classification #22109

Conversation

adit299
Contributor

@adit299 adit299 commented Mar 11, 2023

What does this PR do?

Adds support for audio classification to the TensorFlow Whisper model.

Fixes #21777

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sanchit-gandhi

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@adit299 adit299 force-pushed the Add_TensorFlow_Whisper_model_for_audio_classification branch from cd0aa86 to 246f2d2 on March 12, 2023 00:02
@adit299
Contributor Author

adit299 commented Mar 15, 2023

I just had a few questions on how to proceed with adding the TensorFlow Whisper model, just to make sure I'm on the right track.

(1) Just so that I am clear on what the task is asking for: I need to recreate what is being done in PR #21754, except in TensorFlow. More specifically, recreate the WhisperForAudioClassification class in TensorFlow, within the modeling_tf_whisper.py file.

(2) I see that there are a lot of additional lines of code within PR #21754 in various files that seem to be "registering" that the Whisper model now supports audio classification. Would I have to add any lines of code similar to this within my PR? Is there any documentation I can take a look at to learn more about this? (or anything that would help me understand more about this task in general)

@sanchit-gandhi

@amyeroberts
Collaborator

Hi @adit299 Thanks for opening this PR - excited to have this implemented in TF!

Regarding your questions:

  1. Yes, exactly.
  2. Yes, the other (equivalent TF) additions will also need to be added. Some of the additions in [Whisper] Add model for audio classification #21754 are automatically generated e.g. those in dummy_pt_objects.py. There's an in-depth guide to adding TensorFlow models here which should cover the process. Let us know if there's anything missing or unclear.

@sanchit-gandhi
Contributor

Super cool @adit299! Feel free to ping us if you have any more questions / queries! More than happy to help with the integration here!

@adit299
Contributor Author

adit299 commented Apr 4, 2023

Hello,

Just wanted to check in and provide an update. I have finished adding the TFWhisperForAudioClassification class within the modeling_tf_whisper.py file. One question regarding this:

(1) Within the modeling_tf_auto.py file I don't see any OrderedDict named TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES (or any OrderedDict that is equivalent to the MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES present within the modeling_auto.py file). I was wondering where the TFWhisperForAudioClassification class should go within the modeling_tf_auto.py file.

I will continue work on developing the model tester, and will post any issues I run into here.

@sanchit-gandhi

@amyeroberts
Collaborator

@adit299 - that's great news on the update!

For the auto mapping, if the tensorflow equivalent TF_MODEL_FOR_XXX doesn't exist, then it can be added to modeling_tf_auto.py. This means this is the first audio classification model to be added for TensorFlow 🔥🔥🔥
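Since the mapping doesn't exist yet for TF, the new entry would likely look something like the following. This is a hedged sketch only, mirroring the PyTorch `MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES` convention; the exact names in the merged file may differ:

```python
from collections import OrderedDict

# Hypothetical sketch of the new entry in modeling_tf_auto.py, mirroring the
# PyTorch MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES dict. The names here
# are assumptions based on the PT file, not the final merged code.
TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
    [
        # model type -> TF model class name
        ("whisper", "TFWhisperForAudioClassification"),
    ]
)

# In the real file, this dict is then wrapped in a lazy mapping together with
# CONFIG_MAPPING_NAMES, e.g. (names assumed from the existing TF mappings):
# TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING = _LazyAutoMapping(
#     CONFIG_MAPPING_NAMES, TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES
# )

print(TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES["whisper"])
```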

@sanchit-gandhi
Contributor

sanchit-gandhi commented Apr 26, 2023

Recently, we merged TensorFlow Wav2Vec2 For Sequence Classification: #22073

You could propagate the modelling code changes from this PR onto Whisper as a quick way of getting this working @adit299 (as we do for the PyTorch code)

@adit299
Contributor Author

adit299 commented Apr 27, 2023

By propagate, do you mean just looking at that PR and using the code written for that task as help for this current task? If so, I have already been doing that. If you are referring to some other procedure please do let me know about this as I am not aware. That would certainly help!

Questions I had:

(1) I noticed that within the Pytorch implementation of the whisper tests, it refers to a class GenerationTesterMixin which does not seem to have a similarly named Tensorflow equivalent. Would I have to add this class? I am also confused about what these classes are doing (ex. what is TFModelTesterMixin doing, etc.), so any clarification you can provide is appreciated!

class TFWhisperEncoderModelTest(TFModelTesterMixin, TFGenerationTesterMixin, unittest.TestCase):

(2) I was having trouble with translating the test_encoder_outputs method in TensorFlow. Mainly these lines:

for model_class in self.all_model_classes:
    model = model_class(config)
    model.to(torch_device)
    model.eval()

Again, a bit confused about what model.to(torch_device) is doing. I will look into this a bit more, but again any clarifications about what this method is doing would help.

Thanks again for the speedy responses!
@sanchit-gandhi @amyeroberts

@amyeroberts
Collaborator

@adit299 By propagate, we mean apply the equivalent changes from the Wav2Vec2 PR to this PR - it won't be a direct copy-paste, but there will be large proportions in common. It sounds like this is what you're doing, which is great :)

With respect to your questions:

  1. GenerationTesterMixin

I don't think this class exists yet, and you wouldn't have to add it as part of this PR. Is there anything that should be added for the TF model tests @gante ?

In terms of what these classes are doing, the mixin classes group together related functionality e.g. common tests that should be added to all models. For example, TFModelTesterMixin contains tests for the TensorFlow models. This way we can create other classes using a composition of mixins.

  2. .to and .eval methods
    model.to(...) is a PyTorch-specific method. See docs here: https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.to. It moves the model onto the specified torch device. model.eval() is also a PyTorch method: https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.eval.
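The mixin composition described in (1) can be illustrated with a small toy example (these are stand-in classes, not the real transformers mixins): each mixin contributes shared checks, and a concrete test case combines them with `unittest.TestCase`:

```python
import unittest

# Toy illustration of the mixin pattern used by the transformers test suite.
# The class names here are invented for the example.
class ModelTesterMixin:
    def check_config(self, config):
        # a "common test" every model test class gets for free
        assert "hidden_size" in config

class AudioTesterMixin:
    def check_audio_input(self, features):
        # an audio-specific shared check
        assert len(features) > 0

# A concrete test case composes the mixins it needs, exactly like
# TFWhisperEncoderModelTest(TFModelTesterMixin, ..., unittest.TestCase).
class WhisperLikeModelTest(ModelTesterMixin, AudioTesterMixin, unittest.TestCase):
    def test_everything(self):
        self.check_config({"hidden_size": 16})
        self.check_audio_input([0.1, 0.2])
```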

@gante
Member

gante commented May 4, 2023

@amyeroberts there is no generation-specific test mixin for TF. TFModelTesterMixin has some basic generate checks :)

@sanchit-gandhi
Contributor

Looks cool already @adit299! Let us know if you need a hand with the integration or when you'd like a PR review 🤗

@sanchit-gandhi
Contributor

Hey @adit299 - feel free to comment here when this PR is ready for review and we can take a look! Seems to be close to completion

@huggingface huggingface deleted a comment from github-actions bot Jun 12, 2023
@adit299
Contributor Author

adit299 commented Jun 12, 2023

Hey @sanchit-gandhi, apologies for the delay! Yes, this PR is ready for review. I haven't had much luck in getting some tests to pass however. I appreciate any help you guys can provide by taking a look.

@adit299 adit299 marked this pull request as ready for review June 12, 2023 18:07
@amyeroberts
Collaborator

@adit299 Unfortunately, diving into people's PRs to debug isn't something we can do as it's just not a scalable solution with a repo of this size. If you need help from us, then please share a detailed description of the issue, what you've tried already and ideally highlighting any relevant pieces of code.

@adit299
Contributor Author

adit299 commented Jun 19, 2023

Understandable, @amyeroberts. There are 5 tests failing right now. Here is all the information requested (to the best of my knowledge):

FAILED test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_compile_tf_model

Error -

E       TypeError: Exception encountered when calling layer 'tf_whisper_for_audio_classification_4' (type TFWhisperForAudioClassification).
E
E       call() got an unexpected keyword argument 'decoder_input_ids'
E
E       Call arguments received by layer 'tf_whisper_for_audio_classification_4' (type TFWhisperForAudioClassification):
E         • input_features={'input_features': 'tf.Tensor(shape=(2, 80, 59), dtype=float32)', 'decoder_input_ids': 'tf.Tensor(shape=(1, 2), dtype=int32)'}
E         • head_mask=None
E         • encoder_outputs=None
E         • labels=None
E         • output_attentions=None
E         • output_hidden_states=None
E         • return_dict=None

../../../src/transformers/modeling_tf_utils.py:434: TypeError

What I tried -

I suspected it had something to do with:

https://github.com/adit299/transformers/blob/3d3c7d4213e08d69254edb9c04ac28b3dfbd40f4/tests/test_modeling_tf_common.py#L739C4-L819

But that doesn't seem to be the case. Maybe the Whisper decoder is being mistakenly invoked? I am just not sure.

FAILED test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_hidden_states_output - AssertionError: Lists differ: [30, 16] != [60, 16]

Error -

../../test_modeling_tf_common.py:1002: in check_hidden_states_output
    self.assertListEqual(
E   AssertionError: Lists differ: [30, 16] != [60, 16]
E
E   First differing element 0:
E   30
E   60
E
E   - [30, 16]
E   ?  ^
E
E   + [60, 16]
E   ?  ^

The assertion failing is:

self.assertListEqual(
    list(hidden_states[0].shape[-2:]),
    [self.model_tester.seq_length, self.model_tester.hidden_size],
)

What I tried - Not sure about this one.

FAILED test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_pt_tf_model_equivalence - AttributeError: tf_whisper_encoder_17.conv1.weight not found in PyTorch model

Error -

E               AttributeError: tf_whisper_encoder_17.conv1.weight not found in PyTorch model

../../../src/transformers/modeling_tf_pytorch_utils.py:322: AttributeError

What I tried - Not sure about this one as well

FAILED test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_resize_token_embeddings - NotImplementedError

Error -
../../../src/transformers/modeling_tf_utils.py:1343: NotImplementedError

What I tried - I think this one is out of my control

FAILED test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_save_load - TypeError: Exception encountered when calling layer 'tf_whisper_for_audio_classification_20' (type TFWhisperForAudioClassification

What I tried - connected to the first error, solving that should solve this.

Please do let me know if any other clarification is needed! Apologies for the long post!

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@amyeroberts
Collaborator

Hi @adit299, thanks for giving more details about debugging the tests and apologies for the delay in my response.

I suggest looking through the artefacts from the CI run, specifically failure_long.txt, as they will give you a more detailed error message and traceback to help figure out the issues.

test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_compile_tf_model
I think your suspicions are correct. You'll need to add a new branch in the if/else logic to create the correct inputs for this model.
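The kind of branch needed can be sketched roughly as follows. This is a hypothetical illustration only (the function name and class set are invented, and the real test logic in test_modeling_tf_common.py is more involved): an encoder-only audio-classification head must not receive decoder inputs, so they are dropped from the prepared inputs:

```python
# Hypothetical sketch of the input-preparation branch the compile test needs.
# Names here are illustrative, not the actual transformers test code.
def prepare_inputs_for_compile(model_class_name, inputs):
    encoder_only_classes = {"TFWhisperForAudioClassification"}
    if model_class_name in encoder_only_classes:
        # encoder-only head: keep only the audio features, drop decoder inputs
        return {k: v for k, v in inputs.items() if not k.startswith("decoder_")}
    return inputs

# Shapes mirror the failing test's call arguments, as strings for illustration.
inputs = {
    "input_features": "tf.Tensor(shape=(2, 80, 59))",
    "decoder_input_ids": "tf.Tensor(shape=(1, 2))",
}
print(prepare_inputs_for_compile("TFWhisperForAudioClassification", inputs))
```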

test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_hidden_states_output
In this case it seems the sequence length of the hidden states doesn't match what's expected. I would create a model using the test config and check its architecture and the hidden states outputs when passed a dummy input.

test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_pt_tf_model_equivalence
It looks like a weight is in the TF model and not in the PT model. I'd check the params in each model - looking at tf_model.trainable_variables and pt_model.state_dict() to see if you can identify whether this is a case of a weight not being loaded, or a name not being properly matched.

If you create the TF whisper model with pytorch weights, do you get any warnings about weights being randomly initialized?
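One simple way to do that comparison is a set difference over the weight names from each framework. The names below are made up for illustration; in practice they would come from the TF variables and the PyTorch state dict:

```python
# Illustrative diff of weight names between the two frameworks. In a real
# debugging session these sets would be built from the actual models, e.g.
# {v.name for v in tf_model.trainable_variables} and
# set(pt_model.state_dict().keys()); the names here are invented.
tf_names = {"encoder.conv1.weight", "encoder.conv1.bias", "classifier.weight"}
pt_names = {"encoder.conv1.weight", "encoder.conv1.bias"}

only_in_tf = sorted(tf_names - pt_names)
only_in_pt = sorted(pt_names - tf_names)

print("TF-only:", only_in_tf)
print("PT-only:", only_in_pt)
```

A name that shows up on only one side points at either a missing weight or a naming mismatch in the cross-loading logic.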

test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_resize_token_embeddings - NotImplementedError

This is raised because the model doesn't have a get_input_embeddings method implemented
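The failure mode can be shown with a minimal stand-in (these are not the real transformers classes): the base class raises `NotImplementedError` until the concrete model overrides `get_input_embeddings`:

```python
# Toy stand-ins illustrating why the test raises NotImplementedError. The
# class names and the embed_layer attribute are invented for this sketch.
class BasePreTrainedModel:
    def get_input_embeddings(self):
        # mirrors the default behaviour that triggers the test failure
        raise NotImplementedError

class AudioClassificationModel(BasePreTrainedModel):
    def __init__(self):
        self.embed_layer = object()  # stand-in for the input embedding layer

    def get_input_embeddings(self):
        # the override the concrete model needs to provide
        return self.embed_layer

model = AudioClassificationModel()
print(model.get_input_embeddings() is model.embed_layer)  # True
```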

test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_save_load

From the CI artefacts, it looks like this is failing because of decoder_input_ids being in the input

@adit299 adit299 marked this pull request as draft July 17, 2023 13:35
@adit299
Contributor Author

adit299 commented Aug 7, 2023

Hello,

Apologies for the delay. I am attempting to instantiate an instance of the TFWhisperForAudioClassification model to debug some of the issues I'm having. So, I try to run this:

>>> from transformers import TFWhisperForAudioClassification

I end up getting this error:

RecursionError: maximum recursion depth exceeded while calling a Python object

Which stems from these lines of code:

def keys(self):
    mapping_keys = [
        self._load_attr_from_module(key, name)
        for key, name in self._config_mapping.items()
        if key in self._model_mapping.keys()
    ]
    return mapping_keys + list(self._extra_content.keys())

When I run a debugger, the problematic statement is:

if key in self._model_mapping.keys()

Just executing self._model_mapping.keys() on its own results in the RecursionError.

I have been trying to see what is causing this, but I'm at a loss. Is this why you suggest creating the model using a test config? Could you show how to do that if it is relevant to avoiding this error? I contemplated increasing the recursion depth on my machine (it's currently at 1000), but I'm hesitant to think that would solve it.

Thanks again for your patience, I realize I'm quite the n00b 😅

@amyeroberts @sanchit-gandhi

@@ -439,6 +445,10 @@
)

TF_MODEL_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, TF_MODEL_MAPPING_NAMES)
TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES = _LazyAutoMapping(
Collaborator


@adit299 The recursion is happening because of this line - the variable is being assigned to itself

Suggested change:

- TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES = _LazyAutoMapping(
+ TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING = _LazyAutoMapping(

Contributor Author


Thanks, this solves the issue!

@adit299
Contributor Author

adit299 commented Aug 28, 2023

Hello,

I am currently attempting to resolve the error:

Error -

E       TypeError: Exception encountered when calling layer 'tf_whisper_for_audio_classification_4' (type TFWhisperForAudioClassification).
E
E       call() got an unexpected keyword argument 'decoder_input_ids'
E
E       Call arguments received by layer 'tf_whisper_for_audio_classification_4' (type TFWhisperForAudioClassification):
E         • input_features={'input_features': 'tf.Tensor(shape=(2, 80, 59), dtype=float32)', 'decoder_input_ids': 'tf.Tensor(shape=(1, 2), dtype=int32)'}
E         • head_mask=None
E         • encoder_outputs=None
E         • labels=None
E         • output_attentions=None
E         • output_hidden_states=None
E         • return_dict=None

../../../src/transformers/modeling_tf_utils.py:434: TypeError

This error is the root cause of several of the failing tests. I think the issue is that TFWhisperForAudioClassification inherits from TFWhisperPreTrainedModel, which has the following methods:

class TFWhisperPreTrainedModel(TFPreTrainedModel):
    config_class = WhisperConfig
    base_model_prefix = "model"
    main_input_name = "input_features"

    def _get_feat_extract_output_lengths(self, input_lengths: tf.Tensor) -> int:
        """
        Computes the output length of the convolutional layers
        """
        input_lengths = (input_lengths - 1) // 2 + 1
        return input_lengths

    @property
    def dummy_inputs(self) -> Dict[str, tf.Tensor]:
        """
        Dummy inputs to build the network.

        Returns:
            `Dict[str, tf.Tensor]`: The dummy inputs.
        """
        return {
            self.main_input_name: tf.random.uniform(
                [1, self.config.num_mel_bins, self.config.max_source_positions * 2 - 1], dtype=tf.float32
            ),
            "decoder_input_ids": tf.constant([[1, 3]], dtype=tf.int32),
        }

    @property
    def input_signature(self):
        return {
            "input_features": tf.TensorSpec((None, self.config.num_mel_bins, None), tf.float32, name="input_features"),
            "decoder_input_ids": tf.TensorSpec((None, None), tf.int32, name="decoder_input_ids"),
            "decoder_attention_mask": tf.TensorSpec((None, None), tf.int32, name="decoder_attention_mask"),
        }

I believe the dummy_inputs method is introducing decoder_input_ids into the input. By commenting out a couple of lines:

@property
    def dummy_inputs(self) -> Dict[str, tf.Tensor]:
        """
        Dummy inputs to build the network.

        Returns:
            `Dict[str, tf.Tensor]`: The dummy inputs.
        """
        return {
            self.main_input_name: tf.random.uniform(
                [1, self.config.num_mel_bins, self.config.max_source_positions * 2 - 1], dtype=tf.float32
            ),
            # "decoder_input_ids": tf.constant([[1, 3]], dtype=tf.int32),
        }

    @property
    def input_signature(self):
        return {
            "input_features": tf.TensorSpec((None, self.config.num_mel_bins, None), tf.float32, name="input_features"),
            # "decoder_input_ids": tf.TensorSpec((None, None), tf.int32, name="decoder_input_ids"),
            "decoder_attention_mask": tf.TensorSpec((None, None), tf.int32, name="decoder_attention_mask"),
        }

The number of failing tests reduces to 4, although this obviously introduces new errors (attached at the bottom for reference). The PyTorch equivalent class does not contain the dummy_inputs and input_signature methods:

class WhisperPreTrainedModel(PreTrainedModel):
    config_class = WhisperConfig
    base_model_prefix = "model"
    main_input_name = "input_features"
    supports_gradient_checkpointing = True
    _no_split_modules = ["WhisperEncoderLayer", "WhisperDecoderLayer"]

    def _init_weights(self, module):
        std = self.config.init_std
        if isinstance(module, (nn.Linear, nn.Conv1d)):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()

    def _set_gradient_checkpointing(self, module, value=False):
        if isinstance(module, (WhisperDecoder, WhisperEncoder)):
            module.gradient_checkpointing = value

    def _get_feat_extract_output_lengths(self, input_lengths: torch.LongTensor):
        """
        Computes the output length of the convolutional layers
        """
        input_lengths = (input_lengths - 1) // 2 + 1
        return input_lengths

My questions are:

(1) Should I attempt to change the TensorFlow TFWhisperPreTrainedModel class to be similar to the PyTorch implementation?

or

(2) Is there some better way to proceed?

Once this is resolved, I am very close to finishing with this pull request. Thanks again for your patience!
@amyeroberts @sanchit-gandhi


New Errors:

FAILED tests/models/whisper/test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_resize_token_embeddings - ValueError: Attempt to convert a value (None) with an unsupported type (<class 'NoneType'>) to a Tensor.
FAILED tests/models/whisper/test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_save_load - AssertionError: 5.524128 not less than or equal to 1e-05

@amyeroberts
Collaborator

@adit299 dummy_inputs and input_signature are methods unique to the TensorFlow models and aren't needed in the PyTorch implementation.

TFWhisperForAudioClassification should implement its own dummy_inputs and input_signature which override the methods it inherits from TFWhisperPreTrainedModel.
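The override pattern looks roughly like this. These are stand-in classes with placeholder values (the real implementations return tf tensors and TensorSpecs): the subclass's properties shadow the inherited ones so the encoder-only head never advertises decoder inputs:

```python
# Stand-in classes sketching the property-override pattern; the real code
# would return tf tensors / tf.TensorSpec objects instead of strings.
class TFWhisperPreTrainedModelStub:
    @property
    def dummy_inputs(self):
        # base class advertises both encoder and decoder inputs
        return {"input_features": "features", "decoder_input_ids": "ids"}

class TFWhisperForAudioClassificationStub(TFWhisperPreTrainedModelStub):
    @property
    def dummy_inputs(self):
        # encoder-only head: only the audio features are needed
        return {"input_features": "features"}

    # input_signature would be overridden the same way, returning only the
    # input_features TensorSpec.

print(sorted(TFWhisperForAudioClassificationStub().dummy_inputs))
```

This keeps the base class intact for the seq2seq models while fixing the `call() got an unexpected keyword argument 'decoder_input_ids'` failures for the classification head.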

I'm going to be away mid-September to mid-October. If you have any other TensorFlow-specific questions, or questions about the differences between the TF and PT models, please ping @Rocketknight1 in my absence.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Oct 11, 2023
@Rocketknight1 Rocketknight1 reopened this Oct 11, 2023
@github-actions github-actions bot closed this Oct 20, 2023
@Rocketknight1 Rocketknight1 reopened this Oct 20, 2023
@github-actions github-actions bot closed this Oct 29, 2023