This repo illustrates how to use a speech emotion recognition module with ROS.
We use only Korean speech audio, not text.
We need ROS, audio_common, and PyTorch.
In audio_common_msgs, we must add Audio_result.msg and command.msg.
The requirements must be installed, and ROS must also be set up. I use ROS Kinetic.
-
- We use the KESDy18 (ETRI) Korean emotion dataset.
- It contains 2,880 wav files, and we use only 4 emotions (0 = angry, 1 = neutral, 2 = sad, 3 = happy).
- You can download the data files here after submitting the License Agreement.
-
- We use the AIHub Korean emotion dataset.
- It contains about 50,000 wav files with text, age, and other metadata.
- We use only about 2,200 samples whose emotion labels are clear.
- You can download the data files here.
-
- I recorded this dataset myself.
- 11 sentences, with 2 levels and 4 emotions, giving 88 audio files.
-
For feature extraction we use the LIBROSA library.
- MFCCs: cut each audio file to a 2.5-second duration and build a 32-MFCC tensor to train DenseNet121.
- Mel-spectrogram: convert each audio file into a spectrogram image, save it, and load the images to train DenseNet (pretrained=True). See the sketch below.
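As a reference, a minimal sketch of the two preprocessing paths with librosa. The sample rate, number of mel bands, and image size are assumptions and may differ from the trainer notebooks.

```python
# Sketch of the two feature-extraction paths (parameter values are assumptions).
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def extract_mfcc(wav_path, sr=22050, duration=2.5, n_mfcc=32):
    # Load at most 2.5 s of audio and compute a (32, frames) MFCC matrix.
    y, sr = librosa.load(wav_path, sr=sr, duration=duration)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

def save_mel_spectrogram(wav_path, out_path, sr=22050, duration=2.5, n_mels=128):
    # Convert the clip into a mel-spectrogram image that DenseNet can consume.
    y, sr = librosa.load(wav_path, sr=sr, duration=duration)
    mel_db = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels), ref=np.max)
    plt.figure(figsize=(3, 3))
    librosa.display.specshow(mel_db, sr=sr)
    plt.axis('off')
    plt.savefig(out_path, bbox_inches='tight', pad_inches=0)
    plt.close()
```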
-
For the model we use DenseNet121. We chose DenseNet because the model has to be light enough to run in a CPU-only setting.
- It's hard to see the difference.
-
| Data | Pretrained | Feature Extraction | Accuracy / (custom data) |
|------|------------|--------------------|--------------------------|
| ETRI | False | MFCCs | 70% / 25% |
| ETRI | False | mel-spectrogram | 73% / 29% |
| AIHUB | True | mel-spectrogram | 69% / 40% |
| AIHUB | False | mel-spectrogram | 60% / 35% |
| ETRI+AIHUB | True | mel-spectrogram | 68% / 33% |
| ETRI+AIHUB | False | mel-spectrogram | 63% / 28% |

- Using MFCCs on ETRI overfits the training data and gives poor accuracy, so we decided to use mel-spectrograms.
- The ETRI dataset is also too artificial, so it does not fit the custom data well.
- Result confusion matrix (accuracy = 73%)
- Result confusion matrix for custom data (accuracy = 40%)
-
| Data | Pretrained | Feature Extraction | Accuracy / (custom data) |
|------|------------|--------------------|--------------------------|
| AIHUB + CUSTOM | True | mel-spectrogram | 86.57% / 83% |

- Finally, we use data augmentation.
- Result confusion matrix (AIHub + custom data with augmentation) (accuracy = 83%)
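The exact augmentations used here are not specified; purely as an illustration, a sketch of two common waveform-level augmentations (noise injection and time stretching) with librosa and numpy. The file path and parameter values are assumptions.

```python
# Illustrative waveform augmentations (the augmentations actually used in this repo may differ).
import numpy as np
import librosa

def add_noise(y, noise_level=0.005):
    # Inject small Gaussian noise into the waveform.
    return y + noise_level * np.random.randn(len(y))

def time_stretch(y, rate=1.1):
    # Speed the clip up (rate > 1) or slow it down (rate < 1) without changing pitch.
    return librosa.effects.time_stretch(y, rate=rate)

y, sr = librosa.load('./data/example.wav', sr=22050)  # path is an assumption
augmented = [add_noise(y), time_stretch(y, 0.9), time_stretch(y, 1.1)]
```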
-
- First, clone this repo.
- The training code was written in Jupyter notebooks (Python 3.8.12).
- The trainer code is located in './trainer'.
- Place the wav files in './data' and preprocess them into a CSV file or list.
- Select a model from torchvision.models (DenseNet in this code) and change the classifier input size (in_features) to fit the model:
model.classifier = nn.Linear(in_features=1024, out_features=4)
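For example, a minimal sketch of this model setup; whether ImageNet-pretrained weights are used depends on the experiment (see the Pretrained column in the results tables above).

```python
# Build DenseNet121 and replace the classifier head for the 4 emotion classes.
import torch.nn as nn
from torchvision import models

model = models.densenet121(pretrained=True)  # set pretrained=False for the from-scratch runs
model.classifier = nn.Linear(in_features=1024, out_features=4)  # DenseNet121 features have 1024 channels
```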
-
- Run record_4sec.py; the .wav file will be saved in './predict_audio'.
python record_4sec.py
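For reference, a minimal sketch of a 4-second recorder in the spirit of record_4sec.py, using PyAudio. The sample rate, channel count, and output file name are assumptions and may differ from the actual script.

```python
# Minimal 4-second recorder sketch (PyAudio); parameters are assumptions.
import os
import wave
import pyaudio

RATE = 16000          # sample rate (assumed)
CHUNK = 1024          # frames per buffer
SECONDS = 4           # recording length
OUT_DIR = './predict_audio'

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)
frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * SECONDS))]
stream.stop_stream()
stream.close()
pa.terminate()

os.makedirs(OUT_DIR, exist_ok=True)
with wave.open(os.path.join(OUT_DIR, 'recorded.wav'), 'wb') as wf:
    wf.setnchannels(1)
    wf.setsampwidth(pyaudio.get_sample_size(pyaudio.paInt16))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
```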
-
- Place the wav file in './predict_audio'. You must place only one file in this directory, or modify the code.

python predict_torch_img.py
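For orientation, a sketch of the prediction flow that predict_torch_img.py is expected to perform: convert the single wav file in './predict_audio' into a mel-spectrogram image and classify it with the trained DenseNet121. The checkpoint path, temporary image path, image size, and transforms are assumptions.

```python
# Sketch of the prediction flow (checkpoint path, transforms, and parameters are assumptions).
import glob
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

EMOTIONS = ['angry', 'neutral', 'sad', 'happy']  # labels 0..3 as defined above

# 1. Convert the single wav in ./predict_audio into a mel-spectrogram image.
wav_path = glob.glob('./predict_audio/*.wav')[0]
y, sr = librosa.load(wav_path, sr=22050, duration=2.5)
mel_db = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr), ref=np.max)
plt.figure(figsize=(3, 3))
librosa.display.specshow(mel_db, sr=sr)
plt.axis('off')
plt.savefig('./temp_mel.png', bbox_inches='tight', pad_inches=0)
plt.close()

# 2. Rebuild the model as it was trained and load the saved weights.
model = models.densenet121()
model.classifier = nn.Linear(in_features=1024, out_features=4)
model.load_state_dict(torch.load('./model/densenet121_ser.pth', map_location='cpu'))  # path is an assumption
model.eval()

# 3. Classify the spectrogram image.
preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
x = preprocess(Image.open('./temp_mel.png').convert('RGB')).unsqueeze(0)
with torch.no_grad():
    pred = model(x).argmax(dim=1).item()
print('Predicted emotion:', EMOTIONS[pred])
```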
-
- Run audio_capture.launch
-
rosrun speech-emotion-ros predict.py
-
rosrun speech-emotion-ros command.py
- Press the 'start' button to start recording, and the 'end' button to stop recording and run the prediction.
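For reference, a sketch of how the ROS side fits together: a node that buffers audio from audio_capture between 'start' and 'end' commands and publishes the predicted emotion. The topic names and the use of std_msgs/String in place of the repo's command.msg and Audio_result.msg are assumptions made for illustration.

```python
#!/usr/bin/env python
# Sketch of the predict-node structure only; topic names and message types are assumptions.
import rospy
from std_msgs.msg import String
from audio_common_msgs.msg import AudioData

class EmotionNode(object):
    def __init__(self):
        self.recording = False
        self.buffer = bytearray()
        self.result_pub = rospy.Publisher('/audio_result', String, queue_size=1)
        # audio_capture publishes encoded audio; the topic name may differ (e.g. '/audio/audio').
        rospy.Subscriber('/audio', AudioData, self.audio_cb)
        rospy.Subscriber('/command', String, self.command_cb)

    def audio_cb(self, msg):
        # Accumulate raw audio bytes while recording is active.
        if self.recording:
            self.buffer.extend(msg.data)

    def command_cb(self, msg):
        if msg.data == 'start':
            self.buffer = bytearray()
            self.recording = True
        elif msg.data == 'end':
            self.recording = False
            emotion = self.predict(bytes(self.buffer))
            self.result_pub.publish(String(data=emotion))

    def predict(self, raw_audio):
        # Placeholder: decode the audio, build a mel-spectrogram image,
        # and run it through the trained DenseNet121 (see predict_torch_img.py).
        return 'neutral'

if __name__ == '__main__':
    rospy.init_node('emotion_predict_node')
    EmotionNode()
    rospy.spin()
```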