I have worked on the problem of audio demodulation with Malcolm Slaney, Les Atlas, and Pascal Clark, among others. My solution is to pose audio demodulation as a convex optimization problem. In certain cases, the problem can be further simplified and solved with quadratic programming. This method is advantageous because it is intuitively defined in a principled and perceptually motivated framework. The optimization approach also allows us to strictly enforce the exclusion of high-frequency content from the modulator, regardless of the nature of the carrier signal.
We have demonstrated that this more principled approach to demodulation separates the low-frequency modulators from the fine-temporal structure more effectively than the Hilbert envelope (Sell and Slaney, TASLP 2010 and ICASSP 2010). As a result of this cleaner separation, the linear convex envelope outperformed the Hilbert envelope in a speech recognition experiment on the Broadcast News database (Clark, Sell, and Atlas, ICASSP 2011). The cleaner separation also allowed for experimentation into separability of speech information between the modulator and carrier as a function of subband bandwidth, providing insight into the minimal bandwidth necessary for modulation-based speech recognition (Sell and Slaney, ICASSP 2010).
The current version (v1.0) of the Optimization Demodulation Package poses demodulation in two forms, one in the linear domain and one in the logarithmic domain.
For now, these methods find a numerical solution, solving for the modulator value at each sample. Depending on the sample rate and window length, these calculations can take several minutes (especially the logarithmic method).
See "Solving Demodulation as an Optimization Problem" (Sell and Slaney, IEEE Transactions on Audio, Speech, and Language Processing, Nov 2010), available HERE, for detailed descriptions of these methods. The code package itself contains a simple usage demonstration.
DOWNLOAD CODE
Any feedback on the code, error reports, or suggestions are appreciated.
The examples that follow use linear convex demodulation to estimate the signal envelopes, which solves the following optimization problem:
The linear method minimizes the norm of the modulator without allowing the modulator to ever have a smaller magnitude than the signal. The cost function also includes a weighted norm on the modulator's spectrum. This weighting is designed so that high frequencies (over 60 Hz) are penalized so heavily that they will never occur, and low frequencies (under 40 Hz) are penalized so lightly that their cost is negligible.
An intuitive way to think of the problem is that the modulator is like a blanket being thrown over the top of furniture, like a set of chairs. In this analogy, the norm cost plays the role of gravity, pulling the modulator downward, and the spectral weighting sets the rigidity of the blanket, preventing it from just molding precisely to the shape of the furniture. The restriction that the modulator never have a smaller magnitude than the signal is analogous, then, to the tall points of the furniture holding the blanket up.
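The blanket picture can be tried out numerically. The following is only a minimal sketch, not the solver used in the Optimization Demodulation Package: it assumes squared 2-norms for both cost terms, a finite stop-band weight (w_stop) with a linear ramp between 40 Hz and 60 Hz, and plain projected gradient descent run for a fixed number of iterations. The function names and parameter values are hypothetical, chosen for illustration.

```python
import numpy as np

def spectral_weights(n, fs, f_pass=40.0, f_stop=60.0, w_stop=10.0):
    """Per-bin weight: 0 below f_pass, w_stop above f_stop, linear ramp between.
    (Assumed shape; the actual weighting in the paper may differ.)"""
    f = np.abs(np.fft.fftfreq(n, d=1.0 / fs))
    ramp = np.clip((f - f_pass) / (f_stop - f_pass), 0.0, 1.0)
    return w_stop * ramp

def cost(m, w):
    """Squared norm of the modulator plus weighted squared norm of its spectrum."""
    spectrum = np.fft.fft(m, norm="ortho")
    return np.sum(m ** 2) + np.sum((w * np.abs(spectrum)) ** 2)

def linear_convex_envelope(x, fs, n_iter=500, **weight_kwargs):
    """Projected gradient descent on cost(m, w) subject to m >= |x|."""
    floor = np.abs(x)                      # the "furniture" holding the blanket up
    w = spectral_weights(len(x), fs, **weight_kwargs)
    # Step size 1/L, where L is the Lipschitz constant of the gradient.
    step = 1.0 / (2.0 * (1.0 + np.max(w) ** 2))
    m = np.full(len(x), floor.max())       # flat, feasible starting blanket
    for _ in range(n_iter):
        spectrum = np.fft.fft(m, norm="ortho")
        # Gradient: "gravity" term plus "rigidity" (spectral penalty) term.
        grad = 2.0 * m + 2.0 * np.real(np.fft.ifft(w ** 2 * spectrum, norm="ortho"))
        # Gradient step, then projection back onto the constraint m >= |x|.
        m = np.maximum(floor, m - step * grad)
    return m, w

# Toy signal: a 3 Hz modulator on a 200 Hz sinusoidal carrier.
fs = 1000
t = np.arange(fs) / fs
true_mod = 1.0 + 0.5 * np.cos(2 * np.pi * 3 * t)
x = true_mod * np.cos(2 * np.pi * 200 * t)
m, w = linear_convex_envelope(x, fs)
```

Because the constraint is a simple lower bound per sample, the projection step is just an elementwise maximum. A finite stop-band weight only approximates the strict exclusion of high frequencies described above; a larger w_stop tightens it at the price of slower convergence.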
Because of the strength of the weighted spectral norm, high frequency content never leaks into the modulator, regardless of the spectral content in the carrier. As can be seen below, the correct envelope is extracted for a variety of carrier types.
In practice, demodulation is usually performed on subbands of the signal (the outputs of a filterbank). However, the bandwidth of these filters has strong implications for the relative content of the carrier and modulator. Because of the strict ability of linear convex demodulation to separate the low-frequency modulator from the fine-temporal structure, we can use it as a tool to examine this relationship. As can be seen in the image below, the information in this particular speech signal transitions from the carrier to the modulator as the bandwidth narrows.
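To experiment with subband bandwidth, one needs a filterbank whose bandwidth is easy to vary. The sketch below is one assumed way to do this, using adjacent brick-wall FFT masks of equal width; it is not necessarily the filterbank used in the papers cited above, and the function name is hypothetical. Each returned subband could then be passed to a demodulator.

```python
import numpy as np

def brickwall_subbands(x, fs, bandwidth_hz):
    """Split x into adjacent, non-overlapping subbands of equal width
    by masking bins of its real FFT."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    n_bands = int(np.ceil((fs / 2.0) / bandwidth_hz))
    # Band index of every FFT bin; the Nyquist bin is folded into the last band.
    band_index = np.minimum((freqs // bandwidth_hz).astype(int), n_bands - 1)
    bands = []
    for b in range(n_bands):
        mask = band_index == b
        bands.append(np.fft.irfft(spectrum * mask, n=len(x)))
    return np.array(bands)

# Example: one second of noise at 8 kHz, split into 500 Hz wide subbands.
fs = 8000
x = np.random.default_rng(0).standard_normal(fs)
bands = brickwall_subbands(x, fs, bandwidth_hz=500.0)
```

Since the masks do not overlap and together cover every bin, the subbands sum back to the original signal, which makes it easy to compare bandwidths without changing the total signal content.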
We were able to further test this effect with a speech detection experiment on modulators and carriers, showing recognizability as a function of subband bandwidth.
Two interesting lessons can be learned from the above plot. First, the Hilbert envelope fails to exclude the fine-temporal structure at wide bandwidths: speech remains recognizable from the Hilbert envelope regardless of subband bandwidth. Second, the speech information transitions from modulator to carrier at subband bandwidths of roughly 500 Hz, suggesting that this is about the widest subband bandwidth at which modulator-based speech detection remains viable.
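The first lesson can be reproduced on a toy signal. The sketch below is an illustration under assumed parameters, not the experiment from the papers: it computes the Hilbert envelope from an FFT-based analytic signal and measures how much of the envelope's energy lies above 60 Hz, once for a single-tone carrier (a narrow band) and once for a two-tone carrier 300 Hz apart (a wider band). The helper names are hypothetical.

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal, formed by zeroing the
    negative-frequency FFT bins and doubling the positive ones."""
    n = len(x)
    spectrum = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0                       # keep DC as is
    if n % 2 == 0:
        gain[n // 2] = 1.0              # keep the Nyquist bin as is
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spectrum * gain))

def energy_fraction_above(signal, fs, cutoff_hz):
    """Share of one-sided spectral power lying above cutoff_hz."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return power[freqs > cutoff_hz].sum() / power.sum()

fs = 8000
t = np.arange(fs) / fs
modulator = 1.0 + 0.5 * np.cos(2 * np.pi * 4 * t)          # slow 4 Hz modulator
narrow = modulator * np.cos(2 * np.pi * 1000 * t)          # single-tone carrier
wide = modulator * (np.cos(2 * np.pi * 1000 * t)
                    + np.cos(2 * np.pi * 1300 * t))        # two tones, 300 Hz apart

frac_narrow = energy_fraction_above(hilbert_envelope(narrow), fs, 60.0)
frac_wide = energy_fraction_above(hilbert_envelope(wide), fs, 60.0)
```

For the single-tone carrier the Hilbert envelope recovers the slow modulator, so frac_narrow is essentially zero. For the two-tone carrier the envelope also follows the 300 Hz beating between the tones, so a noticeable share of its energy sits above 60 Hz: carrier structure has leaked into the envelope.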
My current work in demodulation is exploring the effectiveness of convex demodulation features for speech and speaker recognition. I am also developing an extension of the method that simultaneously extracts spectral and temporal envelopes for an arbitrary signal.