Hi, I have a question about the audio embedding. In the paper, you mention: "Given the contextual influence on sequential audio data, we extracted the corresponding 5-second audio segment for the S frames." However, in the code (`talk_video.py`, line 250), you set `audio_tensor` to the corresponding 5 *frames* of the audio embedding. Is "5-second" a typo in the paper, or did I misunderstand the pipeline?
From my understanding, the audio is first extracted from the video and then processed by wav2vec2 to obtain the audio embedding, so the audio embedding has the same length as the video (measured in number of frames). Does that mean you cut the videos into 5-second slices before running the `data_preprocess.py` script?
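To make my understanding concrete, here is a minimal sketch of how I imagine the alignment and slicing work. All shapes and rates here are assumptions for illustration (wav2vec2 features at ~50 Hz, 768-dim, video at 25 fps), not the repo's actual values, and `align_audio_to_frames` is a hypothetical helper, not a function from the codebase:

```python
import numpy as np

def align_audio_to_frames(audio_emb: np.ndarray, num_video_frames: int) -> np.ndarray:
    """Linearly interpolate audio features to the video frame rate.

    audio_emb: (T_audio, dim) wav2vec2-style feature sequence (assumed shape).
    Returns: (num_video_frames, dim) features, one vector per video frame.
    """
    t_audio, _ = audio_emb.shape
    # Fractional source positions for each target video frame.
    src = np.linspace(0.0, t_audio - 1, num_video_frames)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, t_audio - 1)
    w = (src - lo)[:, None]
    return (1.0 - w) * audio_emb[lo] + w * audio_emb[hi]

# Hypothetical numbers: 5 s of audio -> ~250 wav2vec2 steps at 50 Hz;
# 5 s of video at 25 fps -> 125 frames.
audio_emb = np.random.randn(250, 768)
frame_emb = align_audio_to_frames(audio_emb, 125)
assert frame_emb.shape == (125, 768)

# What line 250 seems to do: take a 5-*frame* window of the per-frame
# embedding around a target frame f (no boundary handling shown here).
f, S = 60, 5
audio_tensor = frame_emb[f - S // 2 : f - S // 2 + S]
assert audio_tensor.shape == (5, 768)
```

If this matches the intended pipeline, then "5-second" in the paper would refer to the raw audio context fed to wav2vec2, while the code slices 5 frames of the resulting per-frame embedding.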
Thanks for reading and answering my concerns.