
What is the purpose of pre-warming code? #124

Open
rerdavies opened this issue Oct 18, 2024 · 2 comments

@rerdavies

@mikeoliphant , @sdatkinson

I'm trying to figure out what the purpose of the prewarm() function is.

prewarm() is performing about 8,000 process() calls on large Wavenet models. This is a significant amount of compute, and consists of code that is particularly cache-unfriendly. As a result, it's causing audio underruns in my realtime audio thread. (NAM spends about 70% of its time waiting for memory fetches to complete, so low-priority threads polluting the L1 cache is a significant problem).

If the purpose is to ensure that buffers are pre-allocated, then it's not doing the right thing. Most of the matrices that cause memory allocation are sized depending on num_frames, so calling process() one frame at a time just pre-allocates the smallest possible matrix buffers. It would be more sensible to do .... process(...., 1024), or process(.... 512). Although it's not clear whether Eigen does reallocs of matrix memory (will check this), or whether it does free/malloc calls (in which case pre-warming does nothing).
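The resize-once pattern implied above could be sketched roughly as follows. All names here (BlockBuffers, Reset) are hypothetical illustrations, not the NAM core API, and the std::vector stands in for whatever Eigen matrices the real layers hold:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: size every per-block buffer once, for the largest
// block the host will ever deliver, so process() never allocates on the
// real-time audio thread regardless of the num_frames it receives.
struct BlockBuffers
{
    std::vector<float> input;  // stands in for an Eigen matrix sized by num_frames
    std::vector<float> output;
    std::size_t max_frames = 0;

    void Reset(std::size_t maxFrames)
    {
        max_frames = maxFrames;
        input.resize(maxFrames);  // one allocation, up front, off the audio thread
        output.resize(maxFrames);
    }

    void process(const float* in, float* out, std::size_t num_frames)
    {
        assert(num_frames <= max_frames); // never grow inside the audio callback
        for (std::size_t i = 0; i < num_frames; ++i)
            out[i] = in[i]; // placeholder for the real DSP
    }
};
```

The point is that pre-allocation is driven by an explicit maximum block size, not by whatever num_frames the warm-up loop happens to use.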

If the purpose is to ensure that memory for particularly large matrix buffers is paged in before handing the DSP off to a real-time thread, then this doesn't work either. The large dynamic buffers are over 65,536 bytes long, so running 8,000 one-frame process cycles isn't going to touch anything significant. And I'm pretty sure the really large buffers are paged in by calls to zeroBuffer anyway.

Is the purpose to ensure that the longest possible memory of previous inputs (the sum of worst-case dilations across all _Layers in the processing path) consists of zero-valued inputs?

At an absolute minimum, the pre-warming code should be processing N frames at a time, since the additional CPU cost of 1-frame-at-a-time processing is dramatic. I'm happy to push a fix to that effect; but I'm curious as to whether the whole pre-warming scheme shouldn't be revisited.
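The block-based warm-up being proposed could look something like this. The helper name prewarmBlocked and the callback shape are hypothetical stand-ins for DSP::process(), purely to show the chunking arithmetic:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: feed `total` samples of silence through a process
// callback in blocks of up to `block` frames, instead of one frame at a
// time. Returns the number of process() calls made.
template <typename ProcessFn>
std::size_t prewarmBlocked(ProcessFn process, std::size_t total, std::size_t block)
{
    std::vector<float> silence(block, 0.0f); // zero-valued input
    std::vector<float> scratch(block, 0.0f); // discarded output
    std::size_t calls = 0;
    for (std::size_t done = 0; done < total; done += block)
    {
        const std::size_t n = std::min(block, total - done);
        process(silence.data(), scratch.data(), n);
        ++calls;
    }
    return calls;
}
```

With total = 8000 and block = 512, this makes 16 process() calls instead of 8,000, while still pushing the same amount of silence through the model's state.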

Advice appreciated.

@sdatkinson
Owner

sdatkinson commented Oct 19, 2024

The purpose of warm-up is to avoid a "pop" at the start of processing by letting the internal state of the model reach the steady state that corresponds to a long (enough) period of silence being fed as input.

It's easiest to see conceptually in terms of RNNs like LSTMs: the hidden state is carried over as predictions are made. If you start with the hidden and cell states at zero, they may drift before settling. That drift causes the network's predictions to move, and you'll hear it as a "pop" at the start of processing.
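A toy illustration of that settling behavior (this is a one-dimensional recurrence, not the actual NAM LSTM): a state updated as h ← a·h + b under silent input drifts from its cold-start value of zero toward the fixed point b / (1 − a). Warm-up simply runs that drift to completion before real audio arrives:

```cpp
#include <cassert>
#include <cmath>

// Toy recurrent state driven by silence: with zero input, only the
// recurrence coefficient `a` and bias `b` act on the state. Starting
// from h = 0 (a cold model load), the state converges to b / (1 - a).
double settle(double a, double b, int steps)
{
    double h = 0.0; // cold start, as at model load
    for (int i = 0; i < steps; ++i)
        h = a * h + b; // one "prediction" on a silent sample
    return h;
}
```

With a = 0.9 and b = 0.1 the fixed point is 1.0; the first few outputs climb from 0.1 toward it, and that initial movement is exactly the "pop" warm-up hides.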

But it happens for the real-time implementation of the convolutional NNs (like the "WaveNet") as well since the convolutions store a state so they don't have to repeat predictions when a new (small) buffer comes in. For these, warm-up basically gets that internal state to the correct values.

One subtle difference is that for ConvNets you can explicitly figure out how much warmup is needed--it's just the receptive field of the NN. For LSTMs, it's not so simple--in fact, some LSTMs may never converge to a constant steady state.
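That receptive-field number is straightforward to compute for a stack of dilated convolutions: each layer with kernel size k and dilation d extends the view (k − 1)·d samples further into the past. A minimal sketch, with an example dilation pattern assumed for illustration (the actual values in a NAM model come from its config):

```cpp
#include <cstddef>
#include <vector>

// Receptive field of a stack of dilated 1-D convolutions: one sample for
// the current input, plus (kernelSize - 1) * dilation for each layer.
std::size_t receptiveField(const std::vector<std::size_t>& dilations, std::size_t kernelSize)
{
    std::size_t rf = 1; // the current sample itself
    for (std::size_t d : dilations)
        rf += (kernelSize - 1) * d;
    return rf;
}
```

For a WaveNet-style pattern like dilations {1, 2, 4, 8} with kernel size 3, this gives 1 + 2·(1 + 2 + 4 + 8) = 31 samples: exactly the number of silent samples after which the ConvNet's stored state is fully determined, and hence the minimum warm-up length.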

So that's why it's there. Everything else about pre-allocating etc is maybe a nice side-effect, but I'd like to clean things up so that's done explicitly instead of via warmup--aside from it not working like one might naively expect, yeah it's just yucky to try and implicitly achieve that as a secondary goal, so I wouldn't [loudly] advertise it like that 🙂

> At an absolute minimum, the pre-warming code should be processing N frames at a time, since the additional CPU cost of 1-frame-at-a-time processing is dramatic.

Agreed, that's reasonable. The underrun issue you're encountering is a concrete issue that I think is nice motivation to improve--by all means, PR me 🙂

[EDIT] It seems that I already implemented block-based warmup here:

void nam::DSP::prewarm()
Is it being overridden by a sample-by-sample warmup somewhere that you're seeing?

@sdatkinson
Owner

Ah, and to say it out loud: Doing a better job pre-allocating and managing the memory utilized by the models is something I intend to see improved, so if you are interested in contributing to that end, then I'd welcome it 🙂
