This repo illustrates how to use a speech emotion recognition module with ROS.
We use only Korean speech audio, not text.
We need ROS, audio_common, and PyTorch.
In audio_common_msgs, we must add Audio_result.msg and command.msg.
The requirements must be installed, and ROS must also be set up. I use ROS Kinetic.
-
- We use the KESDy18 (ETRI) Korean emotion dataset.
- It contains 2,880 wav files, and we use only 4 emotions (0 = angry, 1 = neutral, 2 = sad, 3 = happy).
- You can download the data files here after submitting the License Agreement.
-
- We use the AIHub Korean emotion dataset.
- It contains about 50,000 wav files with text, age, and other metadata.
- We use only about 2,200 samples whose emotion labels are clear.
- You can download the data files here.
-
- I recorded this dataset myself.
- 11 sentences, with 2 levels and 4 emotions, giving 88 audio files.
-
For feature extraction we use the LIBROSA library.
- MFCCs: cut each audio file to a 2.5-second duration and build a 32-MFCC tensor to train DenseNet121.
- Mel-spectrogram: convert each audio file into a spectrogram image, save it, and load the images to train DenseNet (pretrained=True). See the sketch below.
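As a reference, a minimal sketch of the two preprocessing paths with librosa. The sample rate, number of mel bands, and image size are assumptions and may differ from the trainer notebooks.

```python
# Sketch of the two feature-extraction paths (parameter values are assumptions).
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def extract_mfcc(wav_path, sr=22050, duration=2.5, n_mfcc=32):
    # Load at most 2.5 s of audio and compute a (32, frames) MFCC matrix.
    y, sr = librosa.load(wav_path, sr=sr, duration=duration)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

def save_mel_spectrogram(wav_path, out_path, sr=22050, duration=2.5, n_mels=128):
    # Convert the clip into a mel-spectrogram image that DenseNet can consume.
    y, sr = librosa.load(wav_path, sr=sr, duration=duration)
    mel_db = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels), ref=np.max)
    plt.figure(figsize=(3, 3))
    librosa.display.specshow(mel_db, sr=sr)
    plt.axis('off')
    plt.savefig(out_path, bbox_inches='tight', pad_inches=0)
    plt.close()
```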
-
For the model we use DenseNet121. We chose DenseNet because the model has to be light enough to run in a CPU-only setting.
- It's hard to see the difference.
-
| Data | Pretrained | Feature Extraction | Accuracy / (custom data) |
|------|------------|--------------------|--------------------------|
| ETRI | False | MFCCs | 70% / 25% |
| ETRI | False | mel-spectrogram | 73% / 29% |
| AIHUB | True | mel-spectrogram | 69% / 40% |
| AIHUB | False | mel-spectrogram | 60% / 35% |
| ETRI+AIHUB | True | mel-spectrogram | 68% / 33% |
| ETRI+AIHUB | False | mel-spectrogram | 63% / 28% |

- Using MFCCs on ETRI overfits the training data and gives poor accuracy, so we decided to use mel-spectrograms.
- The ETRI dataset is also too artificial, so it does not fit the custom data well.
- Result confusion matrix (accuracy = 73%)
- Result confusion matrix for custom data (accuracy = 40%)
-
| Data | Pretrained | Feature Extraction | Accuracy / (custom data) |
|------|------------|--------------------|--------------------------|
| AIHUB + CUSTOM | True | mel-spectrogram | 86.57% / 83% |

- Finally, we use data augmentation.
- Result confusion matrix (AIHub + custom data with augmentation) (accuracy = 83%)
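The exact augmentations used here are not specified; purely as an illustration, a sketch of two common waveform-level augmentations (noise injection and time stretching) with librosa and numpy. The file path and parameter values are assumptions.

```python
# Illustrative waveform augmentations (the augmentations actually used in this repo may differ).
import numpy as np
import librosa

def add_noise(y, noise_level=0.005):
    # Inject small Gaussian noise into the waveform.
    return y + noise_level * np.random.randn(len(y))

def time_stretch(y, rate=1.1):
    # Speed the clip up (rate > 1) or slow it down (rate < 1) without changing pitch.
    return librosa.effects.time_stretch(y, rate=rate)

y, sr = librosa.load('./data/example.wav', sr=22050)  # path is an assumption
augmented = [add_noise(y), time_stretch(y, 0.9), time_stretch(y, 1.1)]
```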
-
- First, clone this repo.
- The training code was written in Jupyter notebooks (Python 3.8.12).
- The trainer code is located in './trainer'.
- Place the wav files in './data' and preprocess them into a CSV file or list.
- Select a model from torchvision.models (DenseNet in this code) and change the classifier input size (in_features) to fit the model:
model.classifier = nn.Linear(in_features=1024, out_features=4)
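For example, a minimal sketch of this model setup; whether ImageNet-pretrained weights are used depends on the experiment (see the Pretrained column in the results tables above).

```python
# Build DenseNet121 and replace the classifier head for the 4 emotion classes.
import torch.nn as nn
from torchvision import models

model = models.densenet121(pretrained=True)  # set pretrained=False for the from-scratch runs
model.classifier = nn.Linear(in_features=1024, out_features=4)  # DenseNet121 features have 1024 channels
```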
-
- Run record_4sec.py; the .wav file will be saved in './predict_audio'.
python record_4sec.py
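For reference, a minimal sketch of a 4-second recorder in the spirit of record_4sec.py, using PyAudio. The sample rate, channel count, and output file name are assumptions and may differ from the actual script.

```python
# Minimal 4-second recorder sketch (PyAudio); parameters are assumptions.
import os
import wave
import pyaudio

RATE = 16000          # sample rate (assumed)
CHUNK = 1024          # frames per buffer
SECONDS = 4           # recording length
OUT_DIR = './predict_audio'

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)
frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * SECONDS))]
stream.stop_stream()
stream.close()
pa.terminate()

os.makedirs(OUT_DIR, exist_ok=True)
with wave.open(os.path.join(OUT_DIR, 'recorded.wav'), 'wb') as wf:
    wf.setnchannels(1)
    wf.setsampwidth(pyaudio.get_sample_size(pyaudio.paInt16))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
```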
-
- Place the wav file in './predict_audio'. You must place only one file in this directory, or modify the code.

python predict_torch_img.py
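For orientation, a sketch of the prediction flow that predict_torch_img.py is expected to perform: convert the single wav file in './predict_audio' into a mel-spectrogram image and classify it with the trained DenseNet121. The checkpoint path, temporary image path, image size, and transforms are assumptions.

```python
# Sketch of the prediction flow (checkpoint path, transforms, and parameters are assumptions).
import glob
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

EMOTIONS = ['angry', 'neutral', 'sad', 'happy']  # labels 0..3 as defined above

# 1. Convert the single wav in ./predict_audio into a mel-spectrogram image.
wav_path = glob.glob('./predict_audio/*.wav')[0]
y, sr = librosa.load(wav_path, sr=22050, duration=2.5)
mel_db = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr), ref=np.max)
plt.figure(figsize=(3, 3))
librosa.display.specshow(mel_db, sr=sr)
plt.axis('off')
plt.savefig('./temp_mel.png', bbox_inches='tight', pad_inches=0)
plt.close()

# 2. Rebuild the model as it was trained and load the saved weights.
model = models.densenet121()
model.classifier = nn.Linear(in_features=1024, out_features=4)
model.load_state_dict(torch.load('./model/densenet121_ser.pth', map_location='cpu'))  # path is an assumption
model.eval()

# 3. Classify the spectrogram image.
preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
x = preprocess(Image.open('./temp_mel.png').convert('RGB')).unsqueeze(0)
with torch.no_grad():
    pred = model(x).argmax(dim=1).item()
print('Predicted emotion:', EMOTIONS[pred])
```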
-
- Run audio_capture.launch
-
rosrun speech-emotion-ros predict.py
-
rosrun speech-emotion-ros command.py
- Press the 'start' button to start recording, and the 'end' button to stop recording and run the prediction.
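For reference, a sketch of how the ROS side fits together: a node that buffers audio from audio_capture between 'start' and 'end' commands and publishes the predicted emotion. The topic names and the use of std_msgs/String in place of the repo's command.msg and Audio_result.msg are assumptions made for illustration.

```python
#!/usr/bin/env python
# Sketch of the predict-node structure only; topic names and message types are assumptions.
import rospy
from std_msgs.msg import String
from audio_common_msgs.msg import AudioData

class EmotionNode(object):
    def __init__(self):
        self.recording = False
        self.buffer = bytearray()
        self.result_pub = rospy.Publisher('/audio_result', String, queue_size=1)
        # audio_capture publishes encoded audio; the topic name may differ (e.g. '/audio/audio').
        rospy.Subscriber('/audio', AudioData, self.audio_cb)
        rospy.Subscriber('/command', String, self.command_cb)

    def audio_cb(self, msg):
        # Accumulate raw audio bytes while recording is active.
        if self.recording:
            self.buffer.extend(msg.data)

    def command_cb(self, msg):
        if msg.data == 'start':
            self.buffer = bytearray()
            self.recording = True
        elif msg.data == 'end':
            self.recording = False
            emotion = self.predict(bytes(self.buffer))
            self.result_pub.publish(String(data=emotion))

    def predict(self, raw_audio):
        # Placeholder: decode the audio, build a mel-spectrogram image,
        # and run it through the trained DenseNet121 (see predict_torch_img.py).
        return 'neutral'

if __name__ == '__main__':
    rospy.init_node('emotion_predict_node')
    EmotionNode()
    rospy.spin()
```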