The Short-Time Fourier Transform (STFT)

Computation of the STFT consists of the following steps:

1. Read $ M$ samples of the input signal $ x$ into a local buffer,

$\displaystyle x_m(n) \mathrel{\stackrel{\mathrm{\Delta}}{=}}x(n+mR), \qquad n=-M_h,-M_h+1,\ldots\,,-1,0,1,\ldots\,,M_h-1,M_h$

where $ x_m$ is called the $ m$th frame of the input signal, and $ M\mathrel{\stackrel{\mathrm{\Delta}}{=}}2M_h+1$ is the frame length (which we assume is odd for reasons to be discussed later). The time advance $ R$ (in samples) from one frame to the next is called the hop size.

2. Multiply the data frame pointwise by a length $ M$ spectrum analysis window $ w(n), n=-M_h,\ldots\,,M_h$ to obtain the $ m$th windowed data frame:

$\displaystyle \tilde{x}_m(n) \mathrel{\stackrel{\mathrm{\Delta}}{=}}x_m(n) w(n), \qquad n=-{\frac{M-1}{2}},\ldots\,,{\frac{M-1}{2}}$

3. Extend $ \tilde{x}_m$ with zeros on both sides to obtain a zero-padded windowed data frame:

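$\displaystyle \tilde{x}_m^\prime (n) \mathrel{\stackrel{\mathrm{\Delta}}{=}}\left\{\begin{array}{ll} \tilde{x}_m(n), & \vert n\vert\leq {\frac{M-1}{2}} \\ [5pt] 0, & {\frac{M-1}{2}} < n \leq \frac{N}{2}-1 \\ [5pt] 0, & -\frac{N}{2}\leq n < -{\frac{M-1}{2}} \end{array}\right.$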

where $ N$ is the FFT size, chosen to be a power of two larger than $ M$. The number $ N/M$ is called the zero-padding factor.

4. Take a length $ N$ FFT of $ \tilde{x}_m^\prime $ to obtain the STFT at time $ m$:

$\displaystyle \tilde{x}_m^\prime (e^{j\omega_k })=\sum _{n=-N/2}^{N/2-1} \tilde{x}_m^\prime (n) e^{-j\omega_k n T}$

where $ \omega_k = 2\pi k f_s / N $, and $ f_s=1/T$ is the sampling rate in Hz. The STFT bin number is $ k$. Each bin $ \tilde{x}_m^\prime (e^{j\omega_k })$ of the STFT can be regarded as a sample of the complex signal at the output of a lowpass filter whose input is $ \tilde{x}_m^\prime (n) e^{-j\omega_k n T}$; this signal is $ \tilde{x}_m^\prime (n)$ frequency-shifted so that frequency $ \omega_k $ is moved to 0 Hz. In this interpretation, the hop size $ R$ is the downsampling factor applied to each filter output, and the analysis window $ w(\,\cdot\,)$ is the impulse response of the anti-aliasing filter used with the downsampling.
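Steps 1 through 4 can be sketched in NumPy as follows. This is only a minimal illustration under stated assumptions, not the PARSHL implementation: the function name `stft_frame` and its arguments are hypothetical, frame $ m$ is assumed to be centered on sample $ mR$, and the zero-padded frame is stored with $ n=0$ at buffer index 0 and negative $ n$ wrapped to the end of the buffer, so that a standard FFT (which sums over $ n=0,\ldots\,,N-1$) evaluates the sum over $ n=-N/2,\ldots\,,N/2-1$ given in step 4.

```python
import numpy as np


def stft_frame(x, m, w, R, N):
    """Return the length-N FFT of the m-th windowed, zero-padded frame.

    Hypothetical helper (not from the paper): x is the input signal,
    w an odd-length analysis window, R the hop size, N the FFT size.
    Assumes frame m is centered on sample m*R.
    """
    x = np.asarray(x)
    w = np.asarray(w)
    M = len(w)
    if M % 2 == 0:
        raise ValueError("window length M is assumed to be odd")
    if N < M:
        raise ValueError("FFT size N must be at least M")
    Mh = (M - 1) // 2
    center = m * R
    if center - Mh < 0 or center + Mh >= len(x):
        raise ValueError("frame extends beyond the signal")

    # Step 1: read M samples, n = -Mh..Mh, around the frame center.
    frame = x[center - Mh:center + Mh + 1]

    # Step 2: multiply pointwise by the analysis window.
    windowed = frame * w

    # Step 3: zero-pad to length N. Samples n = 0..Mh go to the start
    # of the buffer; samples n = -Mh..-1 wrap around to the end.
    buf = np.zeros(N, dtype=complex)
    buf[:Mh + 1] = windowed[Mh:]
    buf[N - Mh:] = windowed[:Mh]

    # Step 4: length-N FFT; bin k corresponds to omega_k*T = 2*pi*k/N.
    return np.fft.fft(buf)


# Example call with illustrative values: a sinusoid analyzed with a
# length-31 Hann window, hop size 8, and FFT size 64.
x = np.cos(2 * np.pi * 0.1 * np.arange(400))
X = stft_frame(x, m=10, w=np.hanning(31), R=8, N=64)
```

Because the window is centered on $ n=0$ in the buffer, a frame that is symmetric about its center yields a real spectrum, which is one practical consequence of choosing an odd frame length.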

The zero-padding factor is the interpolation factor for the spectrum, i.e., each FFT bin is replaced by $ N/M$ bins, interpolating the spectrum.
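The interpolating effect of zero padding can be illustrated with a small numerical experiment. The sketch below is an assumption-laden example rather than anything taken from PARSHL: the helper `peak_frequency`, the test frequency `f0`, and the sizes are all hypothetical choices. It windows one frame of a complex sinusoid and locates the largest-magnitude bin for two FFT sizes, so that the denser frequency grid obtained with the larger zero-padding factor can be compared with the coarser one. Padding at the end of the buffer, as `np.fft.fft` does, differs from the centered padding of step 3 only by a circular shift, which changes the phase but not the magnitude used here.

```python
import numpy as np


def peak_frequency(frame, w, N):
    """Frequency (cycles/sample) of the largest-magnitude FFT bin.

    Hypothetical helper: the windowed frame is zero-padded at the end
    to length N by np.fft.fft before the transform is taken.
    """
    X = np.fft.fft(frame * w, n=N)
    k = int(np.argmax(np.abs(X)))
    return k / N


M = 31                                  # odd frame length
n = np.arange(M) - (M - 1) // 2         # n = -Mh..Mh
f0 = 0.11                               # illustrative frequency, cycles/sample
frame = np.exp(2j * np.pi * f0 * n)     # complex sinusoid, one frame
w = np.hanning(M)

coarse = peak_frequency(frame, w, 32)   # zero-padding factor 32/31
fine = peak_frequency(frame, w, 256)    # zero-padding factor 256/31

print("peak error, N=32: ", abs(coarse - f0))
print("peak error, N=256:", abs(fine - f0))
```

In both cases the peak bin can be off by up to half the bin spacing $ 1/N$, so the larger FFT size places the peak closer to the true frequency even though the underlying windowed spectrum is the same.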


``PARSHL: An Analysis/Synthesis Program for Non-Harmonic Sounds Based on a Sinusoidal Representation'', by Julius O. Smith III and Xavier Serra, Proceedings of the International Computer Music Conference (ICMC-87, Tokyo), Computer Music Association, 1987.
