This repository proposes an implementation of a Sign Recognition Model using the MediaPipe library for landmark extraction and Dynamic Time Warping (DTW) as a similarity metric between signs.
pip install -r requirements.txt
The architecture of the videos/ folder must be:
|data/
    |-videos/
        |-Hello/
            |-<video_of_hello_1>.mp4
            |-<video_of_hello_2>.mp4
            ...
        |-Thanks/
            |-<video_of_thanks_1>.mp4
            |-<video_of_thanks_2>.mp4
            ...
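As a quick check that a dataset follows this layout, the folder can be walked to list the videos found for each sign. This is only a minimal sketch: the function name list_sign_videos is hypothetical and is not part of the repository.

```python
from pathlib import Path


def list_sign_videos(videos_dir):
    """Map each sign name (one sub-folder per sign) to its sorted .mp4 paths.

    Hypothetical helper for illustration; not taken from the repository.
    """
    root = Path(videos_dir)
    dataset = {}
    for sign_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        videos = sorted(sign_dir.glob("*.mp4"))
        if videos:  # skip sign folders that contain no video
            dataset[sign_dir.name] = videos
    return dataset
```

Calling list_sign_videos("data/videos") on the layout above would give one entry per sign, for example "Hello" and "Thanks", each with its list of video paths.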
To automatically create a small dataset of French signs:
- Install ffmpeg (for macOS: brew install ffmpeg)
- Run: python yt_download.py
- Add more YouTube links in yt_links.csv if needed
N.B. The current dataset is too small to give good results. Feel free to add more links or import your own videos.
python main.py
- The Holistic Model of MediaPipe allows us to extract the keypoints of the Hands, Pose and Face models. For now, the implementation only uses the Hand model to predict the sign.
- In this project, a HandModel is created to describe the hand gesture at each frame. If a hand is not present, all of its positions are set to zero.
- To be invariant to orientation and scale, the feature vector of the HandModel is a list of the angles between all the connections of the hand.
- The SignModel is created from a list of landmarks (extracted from a video).
- For each frame, we store the feature vectors of each hand.
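The angle-based feature vector can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the repository's HandModel code: landmarks are taken to be (x, y, z) tuples, connections is a list of landmark index pairs supplied by the caller, and the names angle_between and hand_feature_vector are hypothetical. An absent hand (all positions zero) yields zero-length vectors, and this sketch assumes an angle of 0 in that case.

```python
import math
from itertools import combinations


def angle_between(u, v):
    """Angle in radians between two vectors; 0.0 if either has zero length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: absent hand (all zeros) maps to angle 0
    cos = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding


def hand_feature_vector(landmarks, connections):
    """Angles between every pair of connection vectors of one hand."""
    vectors = [
        tuple(b - a for a, b in zip(landmarks[i], landmarks[j]))
        for i, j in connections
    ]
    return [angle_between(u, v) for u, v in combinations(vectors, 2)]
```

Because only angles between connection vectors are kept, scaling or rotating all landmarks together leaves the feature vector unchanged, which is the invariance described above.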
- The SignRecorder class stores the HandModels of left hand and right hand for each frame when recording.
- Once the recording is finished, it computes the DTW distance between the recorded sign and each reference sign in the dataset.
- Finally, a voting logic is added to output a result only if the prediction confidence is higher than a threshold.
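One possible form of this voting step is sketched below. It is an assumption about how such a vote could work, not a copy of the SignRecorder code: the function name vote, the parameters batch_size and threshold, their default values, and the use of None for "no confident prediction" are all hypothetical.

```python
from collections import Counter


def vote(distances, batch_size=5, threshold=0.5):
    """Pick a sign from (sign_name, dtw_distance) pairs by majority vote.

    Keeps the batch_size closest reference signs and returns the most
    frequent name only if its share of the votes exceeds threshold.
    """
    nearest = sorted(distances, key=lambda item: item[1])[:batch_size]
    if not nearest:
        return None  # no reference signs to compare against
    sign, count = Counter(name for name, _ in nearest).most_common(1)[0]
    confidence = count / len(nearest)
    return sign if confidence > threshold else None
```

With three nearest references of which two are "Hello", the confidence is 2/3, so "Hello" is returned for a threshold of 0.5 and nothing is returned for a threshold of 0.7.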
- DTW is widely used for computing time series similarity.
- In this project, we compute the DTW of the variation of hand connection angles over time.
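For reference, the classic dynamic-programming form of DTW is sketched below on sequences of per-frame feature vectors. This is a textbook version written for illustration, with a Euclidean distance between frames assumed as the local cost; the repository may rely on a library or a different local cost, and the name dtw_distance is hypothetical.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Classic O(n*m) DTW between two sequences of equal-length vectors."""
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    # cost[i][j]: best cumulative cost aligning seq_a[:i] with seq_b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])  # assumed Euclidean cost
            cost[i][j] = local + min(
                cost[i - 1][j],      # frame of seq_b repeated
                cost[i][j - 1],      # frame of seq_a repeated
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]
```

Two recordings of the same sign performed at different speeds align with a low cost, since DTW lets one frame match several consecutive frames of the other sequence.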