Skip to content

rhasspy/rhasspy-silence

Repository files navigation

Rhasspy Silence

Continuous Integration GitHub license

Detect speech/silence in voice commands with webrtcvad.

Requirements

Installation

$ git clone https://github.com/rhasspy/rhasspy-silence
$ cd rhasspy-silence
$ ./configure
$ make
$ make install

How it Works

rhasspy-silence uses a state machine to decide when a voice command has started and stopped. The variables that control this machine are:

  • skip_seconds - seconds of audio to skip before voice command detection starts
  • speech_seconds - seconds of speech before voice command has begun
  • before_seconds - seconds of audio to keep before voice command has begun
  • minimum_seconds - minimum length of voice command (seconds)
  • maximum_seconds - maximum length of voice command before timeout (seconds, None for no timeout)
  • silence_seconds - seconds of silence before a voice command has finished

The sensitivity of webrtcvad is set with vad_mode, which is a value between 0 and 3 with 0 being the most sensitive.

State machine diagram for silence detection

If there is no timeout, the final voice command audio will consist of:

  • before_seconds worth of audio before the voice command had started
  • At least min_seconds of audio during the voice command

Energy-Based Silence Detection

Besides just webrtcvad, silence detection using the denoised energy of the incoming audio is also supported. There are two energy-based methods:

  • Threshold - simple threshold where energy above is considered speech and energy below is silence
  • Max/Current Ratio - ratio of maximum energy and current energy value is compared to a threshold
    • Ratio below threshold is considered speech, ratio above is silence
    • Maximum energy value can be provided (static) or set from observed audio (dynamic)

Both of the energy methods can be combined with webrtcvad. When combined, audio is considered to be silence unless both methods detect speech - i.e., webrtcvad classifies the audio chunk as speech and the energy value/ratio is above threshold. You can even combine all three methods using SilenceMethod.ALL.

Command Line Interface

A CLI is included to test out the different parameters and silence detection methods. After installation, pipe raw 16-bit 16Khz mono audo to the bin/rhasspy-silence script:

$ arecord -r 16000 -f S16_LE -c 1 -t raw | bin/rhasspy-silence <ARGS>

The characters printed to the console indicate how rhasspy-silence is classifying audio frames:

  • . - silence
  • ! - speech
  • S - transition from silence to speech
  • - - transition from speech to silence
  • [ - start of voice command
  • ] - end of voice command
  • T - timeout

By changing the --output-type argument, you can have the current audio energy or max/current ratio printed instead. These values can then be used to set threshold values for further testing.

Splitting By Silence

You can use rhasspy-silence to split audio into WAV files by silence using:

$ sox audio.wav -t raw - | bin/rhasspy-silence --quiet --split-dir splits --trim-silence

This will split raw 16Khz 16-bit PCM audio into WAV files in a directory named splits. By default, files will simply be numbered (0.wav, 1.wav, etc). Set --split-format to change this.

Adding --trim-silence is optional, and can be controlled further with other --trim-* options (see --help).

Trimming Silence

Silence can be trimmed from the start and end of an audio file with:

$ sox audio.wav -t raw - | bin/rhasspy-silence --quiet --trim-silence > trimmed.wav

See the other --trim-* options with --help for more control.

CLI Arguments

usage: rhasspy-silence [-h]
                       [--output-type {speech_silence,current_energy,max_current_ratio,none}]
                       [--chunk-size CHUNK_SIZE] [--skip-seconds SKIP_SECONDS]
                       [--max-seconds MAX_SECONDS] [--min-seconds MIN_SECONDS]
                       [--speech-seconds SPEECH_SECONDS]
                       [--silence-seconds SILENCE_SECONDS]
                       [--before-seconds BEFORE_SECONDS]
                       [--sensitivity {1,2,3}]
                       [--current-threshold CURRENT_THRESHOLD]
                       [--max-energy MAX_ENERGY]
                       [--max-current-ratio-threshold MAX_CURRENT_RATIO_THRESHOLD]
                       [--silence-method {vad_only,ratio_only,current_only,vad_and_ratio,vad_and_current,all}]
                       [--split-dir SPLIT_DIR] [--split-format SPLIT_FORMAT]
                       [--trim-silence] [--trim-ratio TRIM_RATIO]
                       [--trim-chunk-size TRIM_CHUNK_SIZE]
                       [--trim-keep-before TRIM_KEEP_BEFORE]
                       [--trim-keep-after TRIM_KEEP_AFTER] [--quiet] [--debug]

optional arguments:
  -h, --help            show this help message and exit
  --output-type {speech_silence,current_energy,max_current_ratio,none}
                        Type of printed output
  --chunk-size CHUNK_SIZE
                        Size of audio chunks. Must be 10, 20, or 30 ms for
                        VAD.
  --skip-seconds SKIP_SECONDS
                        Seconds of audio to skip before a voice command
  --max-seconds MAX_SECONDS
                        Maximum number of seconds for a voice command
  --min-seconds MIN_SECONDS
                        Minimum number of seconds for a voice command
  --speech-seconds SPEECH_SECONDS
                        Consecutive seconds of speech before start
  --silence-seconds SILENCE_SECONDS
                        Consecutive seconds of silence before stop
  --before-seconds BEFORE_SECONDS
                        Seconds to record before start
  --sensitivity {1,2,3}
                        VAD sensitivity (1-3)
  --current-threshold CURRENT_THRESHOLD
                        Debiased energy threshold of current audio frame
  --max-energy MAX_ENERGY
                        Fixed maximum energy for ratio calculation (default:
                        observed)
  --max-current-ratio-threshold MAX_CURRENT_RATIO_THRESHOLD
                        Threshold of ratio between max energy and current
                        audio frame
  --silence-method {vad_only,ratio_only,current_only,vad_and_ratio,vad_and_current,all}
                        Method for detecting silence
  --split-dir SPLIT_DIR
                        Split incoming audio by silence and write WAV file(s)
                        to directory
  --split-format SPLIT_FORMAT
                        Format for split file names (default: '{}.wav', only
                        with --split-dir)
  --trim-silence        Trim silence when splitting (only with --split-dir)
  --trim-ratio TRIM_RATIO
                        Max/current energy ratio used to detect silence (only
                        with --trim-silence)
  --trim-chunk-size TRIM_CHUNK_SIZE
                        Size of audio chunks for detecting silence (only with
                        --trim-silence)
  --trim-keep-before TRIM_KEEP_BEFORE
                        Number of audio chunks before speech to keep (only
                        with --trim-silence)
  --trim-keep-after TRIM_KEEP_AFTER
                        Number of audio chunks after speech to keep (only with
                        --trim-silence)
  --quiet               Set output type to none
  --debug               Print DEBUG messages to the console