Model performance varies significantly depending on wakeword temporal separation in audio #13

dscripka opened this issue Nov 24, 2022 · 3 comments
I'm noticing an odd issue when attempting to benchmark the performance of Porcupine models against audio files with different characteristics (background noise, SNR, etc.). Specifically, there seems to be significant variation in the model's true positive rate simply by changing the temporal spacing of the wake word in the testing data. For example, when using the "Alexa" dataset and pre-trained "alexa_linux.ppn" from the latest version of Porcupine, I see the true-positive rate of the model behave as shown below:

[Figure: true positive rate vs. temporal spacing of the wake word in the test clips]

Happy to provide additional details and even the test files that were created, if that would be useful.

I've also noticed similar performance variations with respect to wake word temporal separation when using custom Porcupine models and manually recorded test clips, so it seems possible that the issue is not limited to the "alexa_linux.ppn" model.

Expected behaviour

The model should perform similarly regardless of the temporal separation of wake words in an input audio stream.

Actual behaviour

The model shows variations of up to 10 percentage points in the true positive rate depending on the temporal separation of wake words.

Steps to reproduce the behaviour

  1. Use the "Alexa" dataset from here

  2. Using the functions in mixer.py as a foundation, create test clips of varying lengths by mixing with background noise from the DEMAND dataset (specifically, the "DLIVING" recording). The SNR was fixed at 10 dB, and the same segment of the noise audio file was used for every test clip. Each test clip was converted to 16-bit, 16 kHz, single-channel WAV format (a minimal mixing sketch is shown after this list).

  3. Initialize Porcupine, and run the test clips sequentially through the model using the default frame size (512) and default sensitivity level (0.5). Capture all of the true positive predictions and divide by the total number of test clips to calculate the true positive rate.
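
For reference, here is a minimal sketch of the mixing step (step 2), assuming the soundfile library and illustrative file names; it is not the actual mixer.py code, but it shows the general approach: scale the noise segment to a 10 dB SNR, pad the clip to control the temporal spacing of the wake word, and write a 16-bit, 16 kHz, single-channel WAV.

```python
# Minimal sketch of step 2 (not the actual mixer.py code). File names, the
# 0.5 s spacing value, and the soundfile dependency are illustrative.
# Assumes the noise recording is at least as long as the padded clip.
import numpy as np
import soundfile as sf

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it to `speech`."""
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Clean "Alexa" clip and the DEMAND "DLIVING" noise, both assumed to be 16 kHz mono
speech, sr = sf.read("alexa_clip.flac", dtype="float32")
noise, _ = sf.read("DLIVING_ch01.wav", dtype="float32")

# Pad the clip with silence to control the temporal spacing of the wake word
spacing = np.zeros(int(0.5 * sr), dtype=np.float32)  # varied per experiment
speech = np.concatenate([spacing, speech, spacing])

mixed = mix_at_snr(speech, noise, snr_db=10)

# Convert to 16-bit PCM and write a 16 kHz, single-channel WAV test clip
pcm = (np.clip(mixed, -1.0, 1.0) * 32767).astype(np.int16)
sf.write("test_clip.wav", pcm, sr, subtype="PCM_16")
```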

@kenarsa kenarsa self-assigned this Nov 24, 2022
@kenarsa kenarsa added the question label Nov 24, 2022

kenarsa commented Nov 24, 2022

@dscripka thank you. I will schedule this to be looked at. Hopefully in the next couple of weeks.

dscripka (Author) commented

Thank you, @kenarsa. I've done some additional testing and developed a simpler experiment that is easier to reproduce: it just zero-pads the clean test clips.

## Reproducible zero-padding code

```python
from pathlib import Path

import numpy as np
import torchaudio
import pvporcupine
import matplotlib.pyplot as plt
from tqdm import tqdm

# Porcupine configuration (fill in for your environment)
access_key = "YOUR_ACCESS_KEY"
keyword_paths = ["path/to/alexa_linux.ppn"]
sensitivities = [0.5]

# Load clips for testing and convert to 16-bit PCM format
clip_paths = [str(p) for p in Path("path/to/clean/test/files").glob("**/*.flac")]
clip_data = [(torchaudio.load(pth)[0].numpy().squeeze() * 32767).astype(np.int16) for pth in clip_paths]

# Iterate over clips with varying padding lengths
acc = []
paddings = np.arange(0, 16000 * 3, 1000)
for pad_samples in tqdm(paddings):
    # Instantiate Porcupine for each padding size
    porcupine = pvporcupine.create(
        access_key=access_key,
        keyword_paths=keyword_paths,
        sensitivities=sensitivities
    )

    n = 0
    for clip in clip_data:
        # Zero-pad the clip symmetrically (pad_samples at the start and at the end)
        clip = np.pad(clip, (pad_samples, pad_samples), 'constant')
        for i in range(0, len(clip) - porcupine.frame_length, porcupine.frame_length):
            frame = clip[i:i + porcupine.frame_length]
            if porcupine.process(frame) >= 0:
                n += 1
                break  # skip the rest of the file as soon as a single detection is made

    porcupine.delete()
    acc.append(n / len(clip_paths))

# Plot accuracy versus total added padding duration in seconds
plt.plot(paddings * 2 / 16000, acc)
plt.xlabel("Padding Duration (s)")
plt.ylabel("Wakeword Accuracy")
plt.show()
```
[Figure: wakeword accuracy vs. zero-padding duration]

The magnitude of the variation is much lower here, suggesting that mixing with background noise may exacerbate the underlying issue. But these variations (>2% absolute at the largest delta) still seem significant, as zero-padding should generally have no measurable effect on the performance of the model.


kenarsa commented Mar 15, 2024

sorry, it doesn't seem like this is something we can prioritize to investigate in the foreseeable future. thanks for reporting, though.

@kenarsa kenarsa closed this as completed Mar 15, 2024