Objective: Build and evaluate a speech emotion recognition (SER) system that classifies emotional states from spoken audio.
Multimodal dataset for emotional expression analysis
Contains 7,442 audio-visual clips of actors speaking sentences
Features diverse demographic variations (age, gender, ethnicity)
The dataset consists of 91 actors (48 male, 43 female, ages 20-74), with 12 unique sentences, 6 emotion categories (Anger, Disgust, Fear, Happy, Neutral, Sad), and 4 emotion intensity levels (Low, Medium, High, Unspecified)
File Naming Convention:
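The convention itself is not spelled out here. As a rough sketch, the parser below assumes CREMA-D-style names of the form `<actor>_<sentence>_<emotion>_<intensity>.wav` (for example `1001_DFA_ANG_XX.wav`). The function name `parse_clip_name` and both code tables are assumptions for illustration and should be checked against the actual files.

```python
from pathlib import Path

# Assumed code tables (hypothetical; verify against the dataset's documentation).
EMOTIONS = {"ANG": "Anger", "DIS": "Disgust", "FEA": "Fear",
            "HAP": "Happy", "NEU": "Neutral", "SAD": "Sad"}
INTENSITIES = {"LO": "Low", "MD": "Medium", "HI": "High", "XX": "Unspecified"}


def parse_clip_name(path):
    """Split '<actor>_<sentence>_<emotion>_<intensity>.wav' into labelled fields."""
    actor, sentence, emotion, intensity = Path(path).stem.split("_")
    return {
        "actor_id": int(actor),
        "sentence": sentence,
        "emotion": EMOTIONS[emotion],
        "intensity": INTENSITIES[intensity],
    }
```

The emotion field gives the classification label directly, so the labels can be built from filenames without a separate annotation file.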
Data Exploration and Preprocessing:
Preprocess the audio data for feature extraction (e.g., pad the audio signals to the length of the longest clip).
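A minimal sketch of the padding step, assuming each clip is already loaded as a 1-D NumPy array and that zeros are appended at the end (the helper name `pad_to_max_length` is hypothetical):

```python
import numpy as np


def pad_to_max_length(signals):
    """Zero-pad 1-D signals at the end so they all share the longest length."""
    max_len = max(len(s) for s in signals)
    batch = np.zeros((len(signals), max_len), dtype=np.float32)
    for i, s in enumerate(signals):
        batch[i, :len(s)] = s  # copy the clip; the tail stays zero
    return batch
```

Padding to a common length lets the clips be stacked into a single array, which is what fixed-input-size convolutional models expect.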
Feature Engineering:
Choose two distinct feature spaces to represent the emotional information in the audio:
Time/Frequency Domain: Extract features like zero-crossing rate and energy to capture temporal characteristics.
Mel Spectrogram: Convert audio to mel spectrograms, image-like representations of frequency content over time.
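Both feature spaces can be sketched with NumPy alone. The frame length, hop length, sample rate, and number of mel bands below are assumptions chosen for illustration, and the function names are hypothetical. In practice an audio library may be preferable to this hand-rolled version.

```python
import numpy as np


def frame_signal(y, frame_length=2048, hop_length=512):
    """Slice a 1-D signal into overlapping frames, shape (n_frames, frame_length)."""
    if len(y) < frame_length:
        y = np.pad(y, (0, frame_length - len(y)))
    n_frames = 1 + (len(y) - frame_length) // hop_length
    idx = np.arange(frame_length)[None, :] + hop_length * np.arange(n_frames)[:, None]
    return y[idx]


def time_domain_features(y, frame_length=2048, hop_length=512):
    """Per-frame zero-crossing rate and mean energy, shape (n_frames, 2)."""
    frames = frame_signal(y, frame_length, hop_length)
    signs = np.signbit(frames)
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)  # fraction of sign changes
    energy = np.mean(frames ** 2, axis=1)
    return np.stack([zcr, energy], axis=1)


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters evenly spaced on the mel scale, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    lower, center, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (fft_freqs - lower) / (center - lower)
    falling = (upper - fft_freqs) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))


def log_mel_spectrogram(y, sr=16000, n_fft=2048, hop_length=512, n_mels=64):
    """Log-scaled mel power spectrogram in dB, shape (n_frames, n_mels)."""
    frames = frame_signal(y, n_fft, hop_length) * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return 10.0 * np.log10(mel + 1e-10)  # small offset avoids log(0)
```

The time-domain features form a short sequence of two values per frame, suited to 1-D convolutions, while the log-mel output is a 2-D array that can be treated like a single-channel image.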
Model Development:
Time Domain Model Architecture:
Feature Extraction:
Conv1D layers (16, 32, 32, 64 filters) with increasing kernel sizes (3, 3, 5, 7) capture temporal patterns.
MaxPooling (pool size 2) downsamples for efficiency.
Dense Classification:
Dense layers (64, 32 units) learn higher-level features.
6-unit output with softmax for multi-class classification.
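The time-domain architecture above could be sketched in Keras roughly as follows. The layer sizes come from the list above. Everything else is an assumption: ReLU activations, "same" padding, one pooling layer after each convolution, a Flatten layer before the dense block, and the function name `build_time_domain_model`.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_time_domain_model(input_shape, n_classes=6):
    """Conv1D feature extractor followed by a dense classifier."""
    model = keras.Sequential([keras.Input(shape=input_shape)])
    # Filter counts and kernel sizes as listed; pooling placement is assumed.
    for filters, kernel_size in zip((16, 32, 32, 64), (3, 3, 5, 7)):
        model.add(layers.Conv1D(filters, kernel_size, padding="same", activation="relu"))
        model.add(layers.MaxPooling1D(pool_size=2, padding="same"))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dense(32, activation="relu"))
    model.add(layers.Dense(n_classes, activation="softmax"))
    return model
```

Here `input_shape` would be `(n_frames, n_features)`, for example `(128, 2)` for 128 frames of zero-crossing rate and energy.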
Mel Spectrogram Model Architecture:
Feature Extraction:
Conv2D layers (16, 32, 64, 64, 128, 128 filters) with increasing kernel sizes (3, 3, 5, 5, 7) capture time-frequency patterns.
MaxPooling (pool size 2) downsamples for efficiency.
Dense Classification:
Dense layers (128, 64 units) learn higher-level features.
6-unit output with softmax for multi-class classification.
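A comparable Keras sketch for the mel spectrogram model is given below. The list above names six filter counts but only five kernel sizes, so the sketch assumes the sixth layer reuses a kernel size of 7. That choice is a guess, not something the description confirms. ReLU activations, "same" padding, pooling after every convolution, the Flatten layer, and the name `build_mel_model` are likewise assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_mel_model(input_shape, n_classes=6):
    """Conv2D feature extractor followed by a dense classifier."""
    filter_counts = (16, 32, 64, 64, 128, 128)
    kernel_sizes = (3, 3, 5, 5, 7, 7)  # last value is assumed, not given in the text
    model = keras.Sequential([keras.Input(shape=input_shape)])
    for filters, kernel_size in zip(filter_counts, kernel_sizes):
        model.add(layers.Conv2D(filters, kernel_size, padding="same", activation="relu"))
        # "same" padding keeps pooling valid even when a dimension becomes odd.
        model.add(layers.MaxPooling2D(pool_size=2, padding="same"))
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dense(n_classes, activation="softmax"))
    return model
```

Here `input_shape` would be `(n_frames, n_mels, 1)`, treating each spectrogram as a single-channel image. Six pooling stages shrink each spatial dimension by a factor of 64, so very small spectrograms collapse to a single position before the dense layers.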