ISCA Archive ICSLP 1992

Generation of natural sounding speech stimuli by means of linear cepstral interpolation

Arjan van Hessen

This paper describes a method for the generation of sets of natural sounding speech stimuli, slowly changing from one speech signal to another. Stimulus continua created with this method were used in a large number of psycho-physical identification and discrimination experiments [1]. Two recorded speech stimuli between which a continuum is made are first analyzed according to the Sine Wave Generation method [2,3,4,5,6,7,8]. This results in a set of parameters per frame, containing the frequency, and the amplitude and phase of vocal tract and vocal source, at the 50 major peaks of the Short-Time FFT spectrum. Because the vocal tract amplitudes in each frame comprise the information about the spectral envelope, modifying these amplitudes results in a modified spectral envelope, and thus in a different "timbre".

Linear interpolation between the spectral envelope amplitudes (SE-amplitudes) of the two recorded speech sounds results in a set of spectral envelopes that slowly change from one sound to the other. Replacing the original SE-amplitudes of one of the two original stimuli (the mother stimulus) with those of the interpolated set results, after resynthesis, in a set of stimuli that differ only in timbre; they slowly change from one sound to the other.
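
The interpolation and replacement steps described above can be illustrated with a minimal sketch. It is not the author's implementation: the names `interpolate_envelopes` and `build_stimuli`, the dictionary key `se_amplitudes`, and the representation of a stimulus as a list of frames are all hypothetical. The sketch also assumes that the two stimuli have already been time-aligned and that their envelopes are sampled at the same peaks in every frame, and it interpolates the amplitude values directly; the abstract does not specify the amplitude scale or the domain in which the interpolation is carried out.

```python
from typing import Dict, List

Frame = List[float]      # SE-amplitudes at the spectral peaks of one frame (assumed layout)
Envelope = List[Frame]   # one Frame per analysis frame


def interpolate_envelopes(env_a: Envelope, env_b: Envelope, n_steps: int) -> List[Envelope]:
    """Return n_steps envelopes changing linearly from env_a to env_b."""
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2 to include both endpoints")
    if len(env_a) != len(env_b) or any(
        len(fa) != len(fb) for fa, fb in zip(env_a, env_b)
    ):
        raise ValueError("envelopes must have the same number of frames and peaks")

    continuum = []
    for k in range(n_steps):
        w = k / (n_steps - 1)  # weight runs from 0 (env_a) to 1 (env_b)
        continuum.append(
            [[(1 - w) * a + w * b for a, b in zip(fa, fb)]
             for fa, fb in zip(env_a, env_b)]
        )
    return continuum


def build_stimuli(mother: Dict, continuum: List[Envelope]) -> List[Dict]:
    """Copy the mother stimulus once per step, replacing only its SE-amplitudes."""
    return [{**mother, "se_amplitudes": env} for env in continuum]


# Toy example: one frame with two peaks, three-step continuum.
env_a = [[0.0, 2.0]]
env_b = [[4.0, 6.0]]
steps = interpolate_envelopes(env_a, env_b, 3)
mother = {"frequencies": [[100.0, 200.0]], "se_amplitudes": env_a}
stimuli = build_stimuli(mother, steps)
```

Each entry of `stimuli` keeps every parameter of the mother stimulus except the SE-amplitudes, which mirrors the claim that the resulting stimuli differ only in timbre. Resynthesis from these parameter sets is outside the scope of the sketch.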

High-quality speech is obtained because the stimuli are resynthesized with all their original parameters; only the SE-amplitudes are modified. The speech sounds created in this way contain all the speaker-specific characteristics of the "mother stimulus" and sound very natural because no important information is lost.