Linear predictive coding (LPC) is a widely used technique in audio signal processing, especially speech processing. It has found particular use in voice compression, where it allows very high compression ratios. Being so widely adopted, LPC is covered in many textbooks and taught in most advanced audio signal processing courses. So why another article on LPC?
Despite covering LPC during my undergraduate coursework in electrical engineering, it wasn't until implementing LPC for my own research project that I understood what the goals of LPC were and what it was doing mathematically to meet those goals. A large part of the confusion stemmed from the somewhat cryptic name, Linear Predictive Coding. “Linear” made sense; “Predictive” and “Coding”, however, sounded baffling, at least to the undergraduate me. What is it trying to predict? And coding? What does that mean?
As we will find in the following sections, the name LPC does make sense. However, it is one catered to a specific use, speech signal transmission, possibly obscuring other applications, such as cross-synthesis. It takes a moment to understand the exclamation, “I used linear predictive coding to cross-synthesize my voice with a creaking ship!”.
Thus the purpose of this article is to explain LPC from a general perspective, one that makes sense to me and hopefully to others trying to grasp what LPC is. Another purpose is to present a complete derivation in simple linear algebra terms before mentioning other concepts and approaches.
Finally, I prefer calling LPC, All-Pole Resonance Modeling.
The original objective of LPC was to model human voice production. LPC is a source-filter model: there is a sound source that goes through a filter (Figure 1). The source, $e\left(n\right)$, models the vocal cords, while the resonant filter, $h\left(n\right)$, models the vocal tract. The resulting signal is,
$$x\left(n\right)=h\left(n\right)\ast e\left(n\right). \tag{1}$$
There are two possible signals for the source: an impulse train or random white noise. These model pitched sounds and plosives/fricatives respectively. The common characteristic of both the impulse train and white noise is that they are spectrally flat; all spectral information is modeled in the filter. The keen reader will notice the source signal is labeled $e\left(n\right)$. This choice will be explained in the following sections.
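To make the source-filter picture concrete, here is a small NumPy sketch (the article's own examples are in MATLAB) that drives a two-pole resonator with either source type. The function name, pole radius, and resonance frequency are invented illustration values, not taken from the article.

```python
import numpy as np

def all_pole_filter(e, a):
    """Run a source e(n) through an all-pole filter:
    x(n) = e(n) + sum_{k=1..p} a[k-1] * x(n-k)."""
    p = len(a)
    x = np.zeros(len(e))
    for n in range(len(e)):
        x[n] = e[n] + sum(a[k] * x[n - 1 - k]
                          for k in range(p) if n - 1 - k >= 0)
    return x

# Two spectrally flat sources: an impulse train (pitched)
# and white noise (unpitched).
e_pitch = np.zeros(400)
e_pitch[::40] = 1.0                      # period of 40 samples
e_noise = np.random.default_rng(0).standard_normal(400)

# One resonance: a conjugate pole pair at radius r and angle w0 gives
# denominator 1 - 2r cos(w0) z^-1 + r^2 z^-2, i.e. a = [2r cos(w0), -r^2].
r, w0 = 0.97, 2 * np.pi * 0.05
a = np.array([2 * r * np.cos(w0), -r * r])

x = all_pole_filter(e_pitch, a)          # a buzzy, pitched tone
```

Swapping `e_pitch` for `e_noise` yields the unpitched (plosive/fricative-like) case with the same spectral envelope, since all spectral shaping lives in the filter.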
LPC assumes the filter is a $p$-th-order all-pole filter. Though not physiologically exact, this provides an extendable method for modeling resonances. It also allows for a tractable solution when estimating $h\left(n\right)$ from $x\left(n\right)$, which we will cover in §4.
Though initially developed for speech signals, the assumption of a spectrally flat source signal and a resonant filter applies well to modeling signals from most tonal instruments as well as many naturally occurring sounds.
For a filter design problem, like that of designing sound with an analog synthesizer, one would control filters, $h\left(n\right)$, with a source signal, $e\left(n\right)$, to create an interesting result, $x\left(n\right)$. However, for the problem at hand, the only known is $x\left(n\right)$, the resulting signal of the system. From $x\left(n\right)$, we need to estimate $h\left(n\right)$ and $e\left(n\right)$.
Let's take a moment to think about a system with $p$ poles only. For a discrete time system, each pole corresponds to a delay, thus the system has memory and the current sample $x\left(n\right)$ will be the result of the current input $e\left(n\right)$ and past samples $x\left(n-k\right)$, $k=1\dots p$. Inversely, this means that $h\left(n\right)$ imposes a relation between the past samples and the current sample. If we have enough examples of this imposed relation, we should be able to extract some information about the filter. Let's push this idea further.
We can formulate the relation between the input and output as,
$$\begin{aligned} X\left(z\right) &= H\left(z\right)E\left(z\right) \\ &= \frac{1}{1-\sum_{k=1}^{p} a_k z^{-k}}\, E\left(z\right). \end{aligned} \tag{2}$$
With a simple reordering, $X\left(z\right)$ can be expressed in terms of the input $E\left(z\right)$ and the past samples ${z}^{-k}X\left(z\right)$,

$$X\left(z\right)=\sum_{k=1}^{p} a_k z^{-k} X\left(z\right) + E\left(z\right), \tag{3a}$$

or, back in the time domain,

$$x\left(n\right)=\sum_{k=1}^{p} a_k\, x\left(n-k\right) + e\left(n\right). \tag{3b}$$

The problem has now become that of finding the coefficients ${a}_{k}$, $k=1\dots p$. Again for reasons to be revealed, let's leave $e\left(n\right)$ as it is for now.
On a sidenote, this all-pole approach is also known as auto-regression (AR), since the signal's current value is regressed on its own past values.
If we have $N\gg p$ samples, which is easily the case, then we have $N$ equations for the $p$ unknowns ${a}_{k}$; the system is overdetermined. ^{1} Let's transform equation 3b to matrix form.
For ${x}_{i}=\left[\begin{array}{ccc} x\left(n-1+i\right) & \cdots & x\left(n-p+i\right)\end{array}\right]\in {\mathbb{R}}^{1\times p}$ and
$a=\left[\begin{array}{ccc} a_1 & \cdots & a_p \end{array}\right]^{T}\in {\mathbb{R}}^{p\times 1}$,
we get the following $N$ equations.
$$\begin{aligned} x\left(n\right) &= {x}_{0}\cdot a + e\left(n\right) \\ x\left(n+1\right) &= {x}_{1}\cdot a + e\left(n+1\right) \\ &\;\;\vdots \\ x\left(n+N\right) &= {x}_{N}\cdot a + e\left(n+N\right) \end{aligned} \tag{4}$$
Stacking the variables,
$b=\left[\begin{array}{c} x\left(n\right) \\ \vdots \\ x\left(n+N\right) \end{array}\right]\in {\mathbb{R}}^{N\times 1}$,
$e=\left[\begin{array}{c} e\left(n\right) \\ \vdots \\ e\left(n+N\right) \end{array}\right]\in {\mathbb{R}}^{N\times 1}$ and
$A=\left[\begin{array}{c} {x}_{0} \\ \vdots \\ {x}_{N} \end{array}\right]\in {\mathbb{R}}^{N\times p}$,
then replacing the variables in equation 4 and rearranging the matrix multiplication term, we get

$$\begin{aligned} b &= A\cdot a + e && \text{(5a)} \\ &= \widehat{b} + e, && \text{(5b)} \end{aligned}$$

where $\widehat{b}\equiv A\cdot a$.
The equation above is known as linear regression, where $\widehat{b}$ is an estimate of $b$ based on $A$ with weights $a$. The error of estimation or residual is $e$.
As mentioned before, the system is overdetermined and there is no exact solution; we can only try to find a “good fit”. A well-studied solution, and the one of choice for LPC, is least squares: we choose the values ${a}_{k}$ that minimize $\parallel e{\parallel}^{2}$, the power of the residual $e$. The solution is simple to compute:
$$\begin{aligned} a &= {A}^{\dagger} b \\ &= {\left({A}^{T}A\right)}^{-1}{A}^{T}b, \end{aligned} \tag{6}$$

where ${A}^{\dagger}$ is the Moore-Penrose pseudoinverse.
Now that we have found values for $a$ by equation 6, we can calculate the residual $e$. If we have done a good job of modeling the resonances, we should be left with either an impulse train of a certain frequency (pitched), a random noise signal (unpitched) or a mixture of the two. If not, some of the resonance will have leaked into $e$, and it will not be spectrally flat.
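As a sanity check on equation 6, the following NumPy sketch (function and variable names are my own, not from the article's MATLAB code) builds $A$ and $b$ from a signal synthesized with known coefficients, solves the least-squares problem, and recovers both the coefficients and a roughly flat residual.

```python
import numpy as np

def lpc_coefficients(x, p):
    """Least-squares solution of equation (6): each row of A holds
    the p past samples, b holds the current sample."""
    N = len(x)
    A = np.column_stack([x[p - k:N - k] for k in range(1, p + 1)])
    b = x[p:]
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    return a, b - A @ a                  # coefficients and residual e

# Synthesize a signal from a known 2-pole model...
rng = np.random.default_rng(1)
e = rng.standard_normal(2000)
a_true = np.array([1.6, -0.81])          # poles well inside the unit circle
x = np.zeros(len(e))
for n in range(len(e)):
    x[n] = e[n] + sum(a_true[k] * x[n - 1 - k]
                      for k in range(2) if n - 1 - k >= 0)

# ...then recover it. The residual power should be close to var(e) = 1.
a_est, resid = lpc_coefficients(x, p=2)
```

With enough samples, `a_est` lands close to `a_true` and `resid` is close to the original white-noise source, which is exactly the "spectrally flat residual" test described above.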
We first need to determine whether the signal is pitched or unpitched, which requires pitch detection. Pitch detection is a widely studied subject; it is not a trivial task and is beyond the scope of this article. Simple methods include zero-crossing detection and auto-correlation; more sophisticated methods use cepstra, adaptive filtering or perceptual models. For the pitched case, we need the power of the source signal along with the fundamental frequency. For a random signal, all we need to measure is the variance of the samples, ${\sigma}^{2}$, which is also the power of the source signal $e\left(n\right)$.
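Pitch detection itself is out of scope, but a minimal auto-correlation detector looks something like the sketch below. The 60–400 Hz search range is an assumption typical for speech, not a value from the article.

```python
import numpy as np

def autocorr_pitch(x, fs, fmin=60.0, fmax=400.0):
    """Crude auto-correlation pitch detector: pick the lag with the
    largest autocorrelation inside the plausible pitch range."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # r[k] for k >= 0
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(r[lo:hi])
    return fs / lag

fs = 8000
t = np.arange(2400) / fs
x = np.sin(2 * np.pi * 200 * t)     # 200 Hz test tone
f0 = autocorr_pitch(x, fs)          # close to 200 Hz
```

A real detector would also compare the peak's height against `r[0]` to decide pitched versus unpitched before trusting the lag.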
It is known that 3 peak frequencies of the vocal tract, called formants, are enough to discriminate most vowels. Thus setting $p=6$, one conjugate pole pair per formant, will work well for speech signals. This is one reason LPC works well with speech. Choosing the optimal value of $p$ for other classes of signals is an open problem.
We have found the coefficients ${a}_{k}$ and have a model for the signal. Great! However, the solution from §4 will only produce a single resonant sound, like “ah”, not the whole sentence “I'm cool like that.”. Real world signals, like music and speech, are time varying. So what do we do? Divide and conquer! Specifically, we need to cut the signal into small chunks and model each chunk using the equations from §4. This also adds a few conditions to the equations covered in §4, but the overall process holds.
The usual approach in audio processing is to use a technique called overlap-add (OLA). For OLA, we window the signal with a window function $w\left(n\right)$ that has a constant OLA property.
$$\sum _{m=-\infty}^{\infty}w\left(n-mR\right)=1 \tag{7}$$
Equation 7 is the constant OLA equation, where $R$ is called the step size. Thus instead of running LPC on the full signal $x\left(n\right)$, we use a windowed version ${x}_{m}\left(n\right)=w\left(n\right)x\left(n+mR\right)$. Because of the constant overlap-add property, simply adding the windowed signals will return the original signal.
$$\sum _{m=-\infty}^{\infty}w\left(n-mR\right)x\left(n\right)=x\left(n\right) \tag{8}$$
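For instance, the periodic Hann window satisfies equation 7 at 50% overlap, which a few lines of NumPy can verify numerically (using $N=240$, $R=120$, the speech-friendly values used later in the article):

```python
import numpy as np

N, R = 240, 120                              # window length, 50% hop
n = np.arange(N)
w = 0.5 * (1 - np.cos(2 * np.pi * n / N))    # periodic Hann window

# Sum shifted copies of w, then inspect a region away from the edges
L = 10 * R
acc = np.zeros(L + N)
for m in range(0, L, R):
    acc[m:m + N] += w
interior = acc[N:L]                          # fully covered samples
```

Every sample of `interior` sums to 1, since for the periodic Hann window $w(n) + w(n + N/2) = 1$ exactly; the first and last few hops are only partially covered, which is why the edges are excluded.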
A window function of length $N$ is zero outside the window. Thus the windowing imposes the condition that any value outside the window is zero. This means for ${x}_{i}$ in equation 4, $x\left(n-m+i\right)=0$ if $n-m+i<0$ or $n-m+i\ge N$, where $m=1\dots p$.
We mention this because the terms in equation 5b now become auto-correlations, which is often how LPC is explained; the resulting equations are also known as the Yule-Walker equations. This approach allows for faster implementations using the FFT to calculate the auto-correlations.
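A sketch of this auto-correlation formulation follows. It solves the Toeplitz system with a dense solver rather than the Levinson-Durbin recursion, and uses a synthetic AR signal rather than speech; both are simplifications for illustration.

```python
import numpy as np

def lpc_autocorr(x, p):
    """Yule-Walker form: solve R a = r, where R is the p-by-p Toeplitz
    matrix of autocorrelations and r holds lags 1..p. (Levinson-Durbin
    would solve this in O(p^2); here we just use a dense solver.)"""
    r_full = np.correlate(x, x, mode="full")[len(x) - 1:]
    R = np.array([[r_full[abs(i - j)] for j in range(p)]
                  for i in range(p)])
    return np.linalg.solve(R, r_full[1:p + 1])

# Synthetic AR(2) signal with known coefficients
rng = np.random.default_rng(2)
e = rng.standard_normal(4000)
x = np.zeros(len(e))
for n in range(len(e)):
    x[n] = e[n] + sum(c * x[n - 1 - k]
                      for k, c in enumerate([1.3, -0.6]) if n > k)

# Window the frame, then estimate; result is close to [1.3, -0.6]
a = lpc_autocorr(x * np.hanning(len(x)), p=2)
```

The windowing slightly biases the estimate relative to the least-squares solution of §4, but in exchange the autocorrelation matrix is Toeplitz and the resulting filter is guaranteed stable.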
Now let's look at an example of using LPC on a speech signal. Let's choose a window of length $N=240$, the length of a 30 ms window at 8 kHz, reasonable values for speech. Let's also use 50% OLA, which means $R=0.5N=120$. Since it is speech and we are only interested in 3 peaks (the formants), we choose $p=6$. After LPC, each step of 120 samples is reduced to 7 numbers: the 6 coefficients ${a}_{k}$ and the variance of the residual, ${\sigma}^{2}$. Thus we have reduced the amount of data from 120 samples to 7 (about 17 to 1). That's quite impressive. In a sense, we have encoded the original audio into a compact form.
To decode the signal, we run the coefficients through the LPC model in Figure 1. Specifically, using the variance ${\sigma}^{2}$ to control the source, and the coefficients ${a}_{k}$ for the filter, we get an estimated signal ${\widehat{x}}_{m}\left(n\right)=w\left(n\right)\left(e\left(n\right)\ast {h}_{m}\left(n\right)\right)$. We then overlap-add the decoded windowed signals to obtain the full signal $\widehat{x}\left(n\right)$. This can be viewed as playing an LPC synth with LPC encoded parameters. We will explore this idea in §8.
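A sketch of the unpitched decode path in NumPy follows. The frame parameters here are invented; a real decoder would switch to an impulse-train source at the detected pitch for pitched frames.

```python
import numpy as np

def decode_frame(a, sigma2, N, rng):
    """Resynthesize one unpitched frame: white noise scaled to the
    residual power sigma^2, run through the all-pole filter."""
    e = np.sqrt(sigma2) * rng.standard_normal(N)
    x = np.zeros(N)
    for n in range(N):
        x[n] = e[n] + sum(a[k] * x[n - 1 - k]
                          for k in range(len(a)) if n > k)
    return x

rng = np.random.default_rng(3)
N, R = 240, 120                              # 30 ms frames, 50% overlap
w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(N) / N))   # periodic Hann

# Four frames of made-up parameters (a_k, sigma^2); in practice these
# come from the encoder, one set per hop.
params = [(np.array([1.2, -0.5]), 1.0)] * 4
out = np.zeros((len(params) - 1) * R + N)
for m, (a, s2) in enumerate(params):
    out[m * R:m * R + N] += w * decode_frame(a, s2, N, rng)
```

Each decoded frame is windowed and overlap-added at hop $R$, exactly the $\widehat{x}\left(n\right)$ reconstruction described above.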
Here are some results from an example implementation. It includes a rudimentary pitch/noise detector and a simple source-filter implementation for resynthesis.
Here is MATLAB code used to create the examples. (download)
Now that we have covered how LPC works, let's review the name.
Linear Prediction. The system in Figure 1 is a linear system, and we use least squares, which solves linear equations. More specifically, the system performs linear prediction: in equations 3b and 5b, we use the past values of $x\left(n\right)$ linearly to find the coefficients ${a}_{k}$ that best estimate, or predict, the current value.
Coding. In §6, we covered how LPC can be used to encode (analyze) and decode (resynthesize) speech signals. So LPC can be viewed as a coding algorithm. Another well-known audio coding algorithm is the MP3 codec (coder/decoder). The two coding schemes make an interesting comparison: where LPC tries to model how the sound is created (source modeling), MP3 models how the sound is perceived (listener modeling).
Viewing LPC as a coding method can obscure other possible uses. Looking at LPC from an analysis/synthesis perspective allows for different applications. A good example is that of cross-synthesis. (Figure 2) After running a signal through LPC, we are left with a model of the filter, $h\left(n\right)$ and a model of the source, $e\left(n\right)$. The filter controls the resonance or tone of the signal, while the source controls the utterance. If we run two different signals through LPC, it would be possible to use the source of one model (carrier, ${e}_{c}\left(n\right)$) and the filter of the other model (modulator, ${h}_{m}\left(n\right)$) to create a different, hopefully interesting, output signal $y\left(n\right)={h}_{m}\left(n\right)\ast {e}_{c}\left(n\right)$, thus cross-synthesis.
This creates particularly interesting results when taking the filter from speech signals and the source from a spectrally rich sound texture such as wind, or creaking wood.
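A toy cross-synthesis sketch along the lines of Figure 2 is below, using two synthetic resonant signals as stand-ins for real recordings; all pole placements, names, and signal lengths are invented for illustration.

```python
import numpy as np

def lpc(x, p):
    """Least-squares LPC: return coefficients a and the residual e(n)."""
    A = np.column_stack([x[p - k:len(x) - k] for k in range(1, p + 1)])
    a, *_ = np.linalg.lstsq(A, x[p:], rcond=None)
    return a, x[p:] - A @ a

def all_pole(e, a):
    """Run a source e(n) through the all-pole filter with coefficients a."""
    x = np.zeros(len(e))
    for n in range(len(e)):
        x[n] = e[n] + sum(a[k] * x[n - 1 - k]
                          for k in range(len(a)) if n > k)
    return x

def pole_pair(r, w0):
    """Coefficients [2r cos(w0), -r^2] for one conjugate pole pair."""
    return np.array([2 * r * np.cos(w0), -r * r])

rng = np.random.default_rng(4)

# Modulator: noise through a low resonance (stand-in for a voice)
modulator = all_pole(rng.standard_normal(4000), pole_pair(0.95, 0.15 * np.pi))

# Carrier: impulse train through a bright resonance (a pitched texture)
impulses = np.zeros(4000)
impulses[::50] = 1.0
carrier = all_pole(impulses, pole_pair(0.9, 0.4 * np.pi))

p = 6
a_mod, _ = lpc(modulator, p)     # keep the modulator's filter h_m
_, e_car = lpc(carrier, p)       # keep the carrier's source e_c
y = all_pole(e_car, a_mod)       # y(n) = h_m(n) * e_c(n)
```

In practice this analysis runs per overlap-added frame, as in §6, so that the resonances of the modulator track its articulation over time.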
Let's look at some examples.
Here is MATLAB code used to create the examples. (download)
As useful as LPC has proven to be, it is not without limitations. For one, the audio signal should roughly fit into the source-filter model; a signal with many sources mixed together may not work so well.
Further improvements in sound quality can be achieved, without significantly increasing the data needed, by using excitation codes. This approach is called Code-Excited Linear Prediction (CELP).
The keen reader may have noticed that within a windowed segment, there is no change in power: the assumption is that the signal within a segment is stationary. A transient or attack will be smeared by a window's width. Thus, LPC by itself is not suited for sounds with many transients, such as fire crackling. A recent approach to this problem is time-frequency LPC (TFLPC), or cascaded time-frequency linear prediction (CTFLP), where the time envelope of the source signal after LPC is also modeled, in the time domain, using an all-pole model.
Despite the limitations, LPC is a good starting point for analyzing and modeling sounds and worth understanding.
^{1} The order of the all-pole filter, $p$, sets a lower bound for the number of samples LPC needs to effectively model the signal.