Here's RNNoise
This demo presents the RNNoise project, showing how deep learning can be applied to noise suppression. The main idea is to combine signal processing with deep learning to create a real-time noise suppression algorithm that's small and fast. No expensive GPUs required: it runs easily on a Raspberry Pi. The result is much simpler (easier to tune) and sounds better than traditional noise suppression systems (been there!).
Noise Suppression
Noise suppression is a pretty old topic in speech processing, dating back to at least the 70s. As the name implies, the idea is to take a noisy signal and remove as much noise as possible while causing minimum distortion to the speech of interest.
This is a conceptual view of a conventional noise suppression algorithm. A voice activity detection (VAD) module detects when the signal contains voice and when it's just noise. This is used by a noise spectral estimation module to figure out the spectral characteristics of the noise (how much power at each frequency). Then, knowing what the noise looks like, it can be "subtracted" (not as simple as it sounds) from the input audio.
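To make the three blocks concrete, here is a deliberately naive sketch. It is not the speexdsp algorithm: the VAD is a hypothetical energy threshold, the noise estimate is an exponential average over frames judged to be noise, and the "subtraction" is a per-bin gain with a floor. Every name and constant below is an assumption made for illustration.

```python
def is_speech(power, noise, threshold=4.0):
    """Toy VAD: a frame is speech if its total power is well above the noise estimate."""
    return sum(power) > threshold * sum(noise)


def update_noise(noise, power, smoothing=0.9):
    """Noise spectral estimation: exponential average of the power in noise-only frames."""
    return [smoothing * n + (1.0 - smoothing) * p for n, p in zip(noise, power)]


def subtraction_gains(power, noise, floor=0.1):
    """Power subtraction expressed as a gain, never allowed below a floor."""
    gains = []
    for p, n in zip(power, noise):
        g = 1.0 - n / p if p > 0.0 else floor
        gains.append(max(g, floor))
    return gains


def suppress(frames, initial_noise):
    """Run the three blocks over a sequence of per-bin power spectra."""
    noise = list(initial_noise)
    out = []
    for power in frames:
        if not is_speech(power, noise):
            noise = update_noise(noise, power)
        gains = subtraction_gains(power, noise)
        out.append([g * p for g, p in zip(gains, power)])
    return out
```

Even in this toy version there are already three knobs (`threshold`, `smoothing`, `floor`), and each one trades off a different failure mode, which hints at why tuning a real suppressor gets painful.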
From looking at the figure above, noise suppression looks simple enough: just three conceptually simple tasks and we're done, right? Right — and wrong! Any undergrad EE student can write a noise suppression algorithm that works... kinda... sometimes. The hard part is to make it work well, all the time, for all kinds of noise. That requires very careful tuning of every knob in the algorithm, many special cases for strange signals and lots of testing. There's always some weird signal that will cause problems and require more tuning and it's very easy to break more things than you fix. It's 50% science, 50% art. I've been there before with the noise suppressor in the speexdsp library. It kinda works, but it's not great.
Deep Learning and Recurrent Neural Networks
Deep learning is the new version of an old idea: artificial neural networks. Although those have been around since the 60s, what's new in recent years is that:
We now know how to make them deeper than two hidden layers
We know how to make recurrent networks remember patterns long in the past
We have the computational resources to actually train them
Recurrent neural networks (RNN) are very important here because they make it possible to model time sequences instead of just considering input and output frames independently. This is especially important for noise suppression because we need time to get a good estimate of the noise. For a long time, RNNs were heavily limited in their ability because they could not hold information for a long period of time and because the gradient descent process involved when back-propagating through time was very inefficient (the vanishing gradient problem). Both problems were solved by the invention of gated units, such as the Long Short-Term Memory (LSTM), the Gated Recurrent Unit (GRU), and their many variants.
RNNoise uses the Gated Recurrent Unit (GRU) because it performs slightly better than LSTM on this task and requires fewer resources (both CPU and memory for weights). Compared to simple recurrent units, GRUs have two extra gates. The reset gate controls whether the state (memory) is used in computing the new state, whereas the update gate controls how much the state will change based on the new input. This update gate (when off) makes it possible (and easy) for the GRU to remember information for a long period of time and is the reason GRUs (and LSTMs) perform much better than simple recurrent units.
Comparing a simple recurrent unit with a GRU. The difference lies in the GRU's reset and update gates, which make it possible to learn longer-term patterns. Both are soft switches (value between 0 and 1) computed based on the previous state of the whole layer and the inputs, with a sigmoid activation function. When the update gate is on the left, the state can remain constant over a long period of time, until a condition causes it to switch to the right.
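For readers who prefer code to diagrams, here is a minimal sketch of one GRU step in plain Python, using the textbook GRU equations rather than the RNNoise source. It assumes the convention described above, where an update gate that is off (close to 0) keeps the old state. The parameter names (`Wz`, `Wr`, `Wh` and the matching biases) are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, vector, bias):
    """Compute weights @ vector + bias for plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def simple_step(x, h, W, b):
    """Simple recurrent unit: the new state always overwrites the old one."""
    return [math.tanh(a) for a in affine(W, x + h, b)]


def gru_step(x, h, params):
    """One GRU step. Assumed convention: update gate z = 0 keeps the old state."""
    xh = x + h  # concatenate the input and the previous state
    z = [sigmoid(a) for a in affine(params["Wz"], xh, params["bz"])]  # update gate
    r = [sigmoid(a) for a in affine(params["Wr"], xh, params["br"])]  # reset gate
    # The reset gate decides how much of the old state feeds the candidate state.
    x_rh = x + [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a) for a in affine(params["Wh"], x_rh, params["bh"])]
    # The update gate blends the old state with the candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
```

With the update gate driven hard towards 0, `gru_step` returns the previous state almost unchanged no matter what the input is, which is exactly the "remember for a long time" behaviour a simple recurrent unit cannot provide.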
A Hybrid Approach
Thanks to the successes of deep learning, it is now popular to throw deep neural networks at an entire problem. These approaches are called end-to-end: it's neurons all the way down. End-to-end approaches have been applied to speech recognition and to speech synthesis. On the one hand, these end-to-end systems have proven just how powerful deep neural networks can be. On the other hand, these systems can sometimes be both suboptimal and wasteful in terms of resources. For example, some approaches to noise suppression use layers with thousands of neurons, and tens of millions of weights, to perform noise suppression. The drawback is not only the computational cost of running the network, but also the size of the model itself because your library is now a thousand lines of code along with tens of megabytes (if not more) worth of neuron weights.
That's why we went with a different approach here: keep all the basic signal processing that's needed anyway (rather than have a neural network attempt to emulate it), but let the neural network learn all the tricky parts that require endless tweaking next to the signal processing. Another thing that's different from existing work on noise suppression with deep learning is that we're targeting real-time communication rather than speech recognition, so we can't afford to look ahead more than a few milliseconds (in this case 10 ms).
Defining the problem
To avoid having a very large number of outputs, and thus a large number of neurons, we decided against working directly with samples or with a spectrum. Instead, we consider frequency bands that follow the Bark scale, a frequency scale that matches how we perceive sounds. We use a total of 22 bands, instead of the 480 (complex) spectral values we would otherwise have to consider.
Layout of the Opus bands vs the actual Bark scale. For RNNoise, we use the same base layout as Opus. Since we overlap the bands, the boundaries between the Opus bands become the center of the overlapped RNNoise bands. The bands are wider at higher frequency because the ear has poorer frequency resolution there. At low frequencies, the bands are narrower, but not as narrow as the Bark scale would give because then we would not have enough data to make good estimates.
Of course, we cannot reconstruct audio from just the energy in 22 bands. What we can do though, is compute a gain to apply to the signal for each of these bands. You can think about it as using a 22-band equalizer and rapidly changing the level of each band so as to attenuate the noise, but let the signal through.
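Here is a rough sketch of the equalizer idea: take one gain per band, spread it over the individual frequency bins, and scale the spectrum. Linear interpolation between band centers is one plausible way to do the spreading when the bands overlap. The function names and the tiny four-band layout in the usage note are hypothetical, not the real 22-band layout.

```python
def interpolate_band_gains(band_gains, band_centers, num_bins):
    """Spread one gain per band over all bins by interpolating between band centers.

    band_centers must be strictly increasing bin indices.
    """
    gains = []
    for k in range(num_bins):
        if k <= band_centers[0]:
            gains.append(band_gains[0])
        elif k >= band_centers[-1]:
            gains.append(band_gains[-1])
        else:
            # Index of the last band center at or below bin k.
            b = max(i for i, c in enumerate(band_centers) if c <= k)
            lo, hi = band_centers[b], band_centers[b + 1]
            frac = (k - lo) / (hi - lo)
            gains.append((1.0 - frac) * band_gains[b] + frac * band_gains[b + 1])
    return gains


def apply_gains(spectrum, band_gains, band_centers):
    """Scale each complex bin by its interpolated gain, like a fast-moving equalizer."""
    gains = interpolate_band_gains(band_gains, band_centers, len(spectrum))
    return [g * s for g, s in zip(gains, spectrum)]
```

For example, with centers at bins `[0, 2, 4, 8]` and band gains `[1.0, 0.0, 0.5, 1.0]`, bin 2 is fully muted while its neighbours at bins 1 and 3 are only partly attenuated. As long as every band gain stays between 0 and 1, so does every interpolated bin gain.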
There are several advantages to operating with per-band gains. First, it makes for a much simpler model since there are fewer bands to compute. Second, it makes it impossible to create so-called musical noise artifacts, where only a single tone gets through while its neighbours are attenuated. These artifacts are common in noise suppression and quite annoying. With bands that are wide enough, we either let a whole band through, or we cut it all. The third advantage comes from how we optimize the model. Since the gains are always bounded between 0 and 1, simply using a sigmoid activation function (whose output is also between 0 and 1) to compute them ensures that we can never do something stupid, like adding noise that wasn't there in the first place.
The main drawback of the lower resolution we get from using bands is that we do not have a fine enough resolution to suppress the noise between pitch harmonics. Fortunately, it's not so important and there is even an easy trick to do it (see the pitch filtering part below).
Since the output we're computing is based on 22 bands, it makes little sense to have more frequency resolution on the input, so we use the same 22 bands to feed spectral information to the neural network. Because audio has a large dynamic range, it's much better to compute the log of the energy rather than to feed the energy directly. And while we're at it, it never hurts to decorrelate the features using a DCT. The resulting data is a cepstrum based on the Bark scale, which is closely related to the Mel-Frequency Cepstral Coefficients (MFCC) that are very commonly used in speech recognition.
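The feature pipeline described here (band energies, then log, then DCT) can be sketched as follows. To keep it short, this version assumes non-overlapping rectangular bands, which is a simplification of the overlapped bands described earlier. The function names, the `eps` guard against `log(0)` and the band edges in the usage note are hypothetical.

```python
import math


def band_energies(spectrum, band_edges):
    """Sum |X[k]|^2 over each band; band i covers bins band_edges[i] to band_edges[i+1]-1."""
    return [sum(abs(x) ** 2 for x in spectrum[lo:hi])
            for lo, hi in zip(band_edges[:-1], band_edges[1:])]


def dct_ii(values):
    """Orthonormal DCT-II, used here to decorrelate the log energies."""
    n = len(values)
    out = []
    for k in range(n):
        s = sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def band_cepstrum(spectrum, band_edges, eps=1e-9):
    """Log band energies followed by a DCT: a cepstrum on the chosen band layout."""
    log_e = [math.log10(e + eps) for e in band_energies(spectrum, band_edges)]
    return dct_ii(log_e)
```

Calling `band_cepstrum(spectrum, [0, 2, 4, 6, 8])` on an 8-bin spectrum returns four coefficients. For a perfectly flat spectrum only the first one is non-zero, which is the decorrelation at work: the overall level ends up in one coefficient and the spectral shape in the others.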