Before Step 2 above, the FFT hop size within the MRSTFT of Step 1 would typically be determined by the shortest window length used (and its type). After the non-uniform downsampling in Step 2, however, the effective window lengths (and shapes) have been modified. Provided the spectrum is not undersampled by this operation, the effective duration of the time-domain window at each frequency will always be shorter than that of the original FFT window. In principle, the effective time-domain window is the product of the original FFT window used in the MRSTFT and the ``auditory window,'' which is given by the inverse Fourier transform of the auditory filter frequency response (the spectral interpolation kernel) translated to zero center frequency. (This is only approximately true when the auditory filter frequency response spans multiple frequency ranges for which FFTs were performed at different resolutions.)
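The product relationship can be checked numerically. The following is a minimal sketch under stated assumptions, not the procedure used in the text: a Hann-shaped kernel stands in for the auditory filter response, a single FFT resolution is used, and the names `smoothing_kernel`, `auditory_window`, and `rms_duration` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def smoothing_kernel(n, halfwidth):
    """Zero-centered, unit-sum smoothing kernel on an n-bin circular frequency axis.
    ASSUMPTION: a Hann shape stands in for the auditory filter response."""
    taps = np.hanning(2 * halfwidth + 1)
    taps /= taps.sum()
    kernel = np.zeros(n)
    kernel[:halfwidth + 1] = taps[halfwidth:]   # bins 0 .. +halfwidth
    kernel[-halfwidth:] = taps[:halfwidth]      # bins -halfwidth .. -1 (wrapped)
    return kernel

def auditory_window(kernel):
    """Inverse DFT of the zero-centered kernel, scaled so its value at time 0 is 1."""
    return len(kernel) * np.real(np.fft.ifft(kernel))

def rms_duration(w_centered):
    """RMS duration (in samples) of a window centered at index n // 2."""
    n = len(w_centered)
    t = np.arange(n) - n // 2
    energy = w_centered ** 2
    return np.sqrt(np.sum(t ** 2 * energy) / np.sum(energy))

n = 256
# Periodic Hann window, peak exactly at index n // 2 (stand-in for the FFT window)
fft_window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)
kernel = smoothing_kernel(n, halfwidth=4)

# Center the auditory window at n // 2 and form the effective window as a product
aud = np.fft.fftshift(auditory_window(kernel))
effective = fft_window * aud

print("RMS duration, original FFT window:", rms_duration(fft_window))
print("RMS duration, effective window:   ", rms_duration(effective))
```

Smoothing the spectrum of the FFT window by circular convolution with `kernel` gives the same result as multiplying the window by `aud` in the time domain, which is the sense in which the effective window is a product. The wider the kernel, the narrower `aud` becomes and the shorter the effective window.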
Since the time-domain window durations are shortened by the spectral smoothing inherent in Step 2, the proper step size from frame to frame is somewhat smaller than that dictated by the MRSTFT windows. One reliable method for determining the maximum allowable hop size for each FFT in the MRSTFT is to study the inverse Fourier transform of the widest (highest-frequency) auditory filter shape (translated to 0 Hz center frequency) used as a smoothing kernel in that FFT. This new window can be multiplied by the original window and overlap-added to itself, as in Eq. (7.2), at various increasing hop sizes $R$ (starting with $R=1$, which is always valid), until the overlap-add begins to show ripple at the frame rate $f_s/R$. Alternatively, the bandwidth of the highest-frequency auditory filter can be used to determine the appropriate hop size in the time domain, as elaborated in Chapter 9 (especially §9.8.1).
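The overlap-add test can be sketched as follows. This is an illustrative sketch only: the Gaussian ``auditory window,'' its width, the ripple measure (peak-to-peak over mean), and the tolerance of $10^{-3}$ are all assumptions chosen for the example, and `ola_ripple` and `max_hop_by_ola` are hypothetical helper names.

```python
import numpy as np

def ola_ripple(window, hop):
    """Relative peak-to-peak ripple of the window overlap-added to itself at `hop`."""
    n = len(window)
    total = 4 * n
    acc = np.zeros(total)
    for start in range(0, total - n + 1, hop):
        acc[start:start + n] += window
    steady = acc[n:2 * n]   # region fully covered by overlapping frames
    return (steady.max() - steady.min()) / steady.mean()

def max_hop_by_ola(window, tol=1e-3):
    """Increase the hop from 1 until the overlap-add ripple first exceeds `tol`.
    ASSUMPTION: `tol` is an arbitrary illustrative threshold."""
    best = 1
    for hop in range(1, len(window) + 1):
        if ola_ripple(window, hop) > tol:
            break
        best = hop
    return best

n = 256
t = np.arange(n) - n // 2
hann = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)   # original FFT window
# ASSUMPTION: Gaussian stand-in for the inverse transform of the widest auditory filter
aud = np.exp(-0.5 * (t / 10.0) ** 2)
effective = hann * aud                                     # shortened effective window

hop_orig = max_hop_by_ola(hann)
hop_eff = max_hop_by_ola(effective)
print("largest hop before ripple, original window: ", hop_orig)
print("largest hop before ripple, effective window:", hop_eff)
```

Because the search stops at the first hop size that shows ripple, it is conservative: some windows overlap-add exactly at isolated larger hop sizes (the periodic Hann window at half its length, for example) that this loop never reaches.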