This project analyzes and classifies emotions from speech signals using several deep learning models. The datasets used are CREMA-D, RAVDESS, SAVEE, and TESS. Data augmentation techniques such as noise injection, pitch shifting, and time stretching were applied to enlarge the training data, and features such as MFCCs, Energy and Entropy of Energy, Zero Crossing Rate, Mel-Spectrogram, and Spectral Features were extracted for analysis.
Speech sentiment analysis is the task of classifying emotions in spoken language. It has applications in fields such as customer service, mental health monitoring, and human-computer interaction. This project combines multiple datasets with data augmentation and feature extraction techniques to build and evaluate four deep learning models: an ANN, a CNN, an LSTM, and a hybrid CNN-LSTM.
The following datasets were used in this project:
- CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset.
- RAVDESS: Ryerson Audio-Visual Database of Emotional Speech and Song.
- SAVEE: Surrey Audio-Visual Expressed Emotion dataset.
- TESS: Toronto Emotional Speech Set.
To enlarge the dataset and improve model robustness, the following data augmentation techniques were applied (see the sketch after this list):
- Noise Injection: Adding background noise to the audio samples.
- Pitch Shifting: Raising or lowering the pitch of the audio samples.
- Time Stretching: Changing the speed of the audio samples without altering the pitch.
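A minimal sketch of how these augmentations are commonly implemented with librosa; the function names, parameter values (noise factor, semitone steps, stretch rate), and the example file path are illustrative assumptions, not taken from the project's notebooks:

```python
import numpy as np
import librosa

def add_noise(signal, noise_factor=0.005):
    # Mix in Gaussian noise scaled to the signal's own amplitude.
    noise = np.random.normal(0, signal.std(), signal.shape)
    return signal + noise_factor * noise

def shift_pitch(signal, sr, n_steps=2):
    # Shift the pitch by n_steps semitones without changing duration.
    return librosa.effects.pitch_shift(y=signal, sr=sr, n_steps=n_steps)

def stretch(signal, rate=0.8):
    # Slow down (rate < 1) or speed up (rate > 1) without altering pitch.
    return librosa.effects.time_stretch(y=signal, rate=rate)

# Hypothetical usage on a single clip:
signal, sr = librosa.load("audio/example.wav", sr=None)
augmented = [add_noise(signal), shift_pitch(signal, sr), stretch(signal)]
```

Each augmented copy keeps the original clip's emotion label, so the training set grows by a factor of the number of augmentations applied.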
The following features were extracted from the audio data to capture different aspects of the speech signal (see the extraction sketch after this list):
- MFCCs (Mel-Frequency Cepstral Coefficients): Capture detailed spectral information and are widely used in speech and audio processing.
- Energy and Entropy of Energy: Provide information about the intensity and variability of speech, which are important for detecting emotions.
- Zero Crossing Rate: Useful for distinguishing between different types of speech sounds.
- Mel-Spectrogram: Represents the power spectrum on the mel scale, capturing both temporal and spectral structure of the audio signal.
- Spectral Features (Centroid, Spread, Roll-off): Provide a comprehensive description of the spectral characteristics of the speech signal.
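A sketch of the extraction step using librosa, under the assumption that each clip is summarized by averaging per-frame features over time; the coefficient count (n_mfcc=13) and the use of RMS as the energy measure are assumptions, and since librosa has no helper for entropy of energy, a simplified manual version is included:

```python
import numpy as np
import librosa

def entropy_of_energy(signal, n_blocks=10):
    # Shannon entropy of normalized sub-block energies: low when energy
    # is evenly spread, higher around abrupt changes.
    blocks = np.array_split(signal.astype(float) ** 2, n_blocks)
    energies = np.array([b.sum() for b in blocks])
    p = energies / (energies.sum() + 1e-12)
    return -np.sum(p * np.log2(p + 1e-12))

def extract_features(signal, sr):
    # Per-frame features, averaged over time into one fixed-length vector.
    mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    zcr = librosa.feature.zero_crossing_rate(signal)
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=signal, sr=sr))
    centroid = librosa.feature.spectral_centroid(y=signal, sr=sr)
    spread = librosa.feature.spectral_bandwidth(y=signal, sr=sr)  # spectral spread
    rolloff = librosa.feature.spectral_rolloff(y=signal, sr=sr)
    rms = librosa.feature.rms(y=signal)  # frame-level energy
    return np.hstack([
        mfccs.mean(axis=1),
        zcr.mean(axis=1),
        mel.mean(axis=1),
        centroid.mean(axis=1),
        spread.mean(axis=1),
        rolloff.mean(axis=1),
        rms.mean(axis=1),
        entropy_of_energy(signal),
    ])
```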
Four deep learning models were built and evaluated for this project (an example architecture is sketched after this list):
- ANN (Artificial Neural Network)
- CNN (Convolutional Neural Network)
- LSTM (Long Short-Term Memory)
- CNN-LSTM (Convolutional Neural Network - Long Short-Term Memory)
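For concreteness, here is a plausible Keras sketch of a 1D CNN of the kind described, operating on the pooled feature vector; the layer sizes, input length (162), and number of emotion classes (7) are illustrative assumptions rather than the project's actual hyperparameters:

```python
from tensorflow.keras import layers, models

def build_cnn(input_shape, n_classes):
    # 1D convolutions over the feature vector, then a dense classifier head.
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Hypothetical shapes: 162 features per clip, 7 emotion classes.
model = build_cnn(input_shape=(162, 1), n_classes=7)
```

In the LSTM and CNN-LSTM variants, recurrent layers are typically used in place of (or after) the convolutional stack so the model can exploit temporal ordering when frame sequences, rather than pooled vectors, are used as input.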
The models were evaluated on accuracy and related classification metrics; a typical evaluation step is sketched below. The CNN performed best, reaching a validation accuracy of 70%.
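Assuming a trained model (such as the CNN above) and a held-out validation split with one-hot labels, where X_val and y_val are hypothetical names, the evaluation might look like:

```python
from sklearn.metrics import accuracy_score, classification_report

# Collapse softmax outputs and one-hot labels to class indices.
y_pred = model.predict(X_val).argmax(axis=1)
y_true = y_val.argmax(axis=1)

print("Validation accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))  # per-class precision/recall/F1
```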
To run this project, you need Python installed along with the required libraries. Install the dependencies with:

```
pip install -r requirements.txt
```
To use the models for speech sentiment analysis, follow these steps:
- Clone this repository:

  ```
  git clone https://github.com/DevG06/Speech-Sentiment-Analysis.git
  ```

- Navigate to the project directory:

  ```
  cd Speech-Sentiment-Analysis
  ```
- Run the provided Jupyter notebooks to preprocess the data, extract features, and train the models.
This project is licensed under the MIT License.