Reconstructing Neutral Speech from Tracheoesophageal Speech

Abinay Reddy N, Achuth Rao MV, G. Nisha Meenakshi, Prasanta Kumar Ghosh

In this work, we propose a tracheoesophageal (TE) speech to neutral speech conversion system using data collected from a laryngectomee. In laryngectomees, who lack vocal folds, it is the vibration of the esophagus that gives rise to a low-frequency pitch during speech production. This pitch manifests as impulse-like noise in the recorded speech. We propose a method that first ‘whisperizes’ the TE speech and then applies linear predictive coding (LPC) based synthesis, which uses pitch derived from the energy contour. To perform ‘whisperization’, we model the LPC residual signal as the sum of white noise and impulses introduced by the esophageal vibrations. We model the impulses with a Bernoulli-Gaussian distribution and the white noise with a Gaussian distribution. The strengths and locations of the impulses are estimated using Gibbs sampling so that the impulse-like noise can be removed from the speech to obtain whispered speech. A subjective evaluation via a listening test reveals that the ‘whisperization’ step in the proposed method aids in synthesizing more natural-sounding neutral speech. A separate listening test shows that listeners prefer the speech synthesized by the proposed method about 93% (absolute) more often than that of the best baseline scheme.
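
To make the residual model concrete, the following is a minimal Python sketch of the generative side only, assuming each residual sample is the sum of a Bernoulli-Gaussian impulse term and Gaussian white noise. The function name, parameter names, and default values are illustrative assumptions, not taken from the paper, and the Gibbs-sampling estimation step is not shown.

```python
import random


def simulate_residual(n, impulse_prob=0.02, impulse_std=5.0,
                      noise_std=1.0, seed=0):
    """Draw n samples of r[t] = q[t] * a[t] + w[t].

    q[t] ~ Bernoulli(impulse_prob)   marks whether an impulse is present
    a[t] ~ Normal(0, impulse_std)    impulse strength (hypothetical scale)
    w[t] ~ Normal(0, noise_std)      white-noise component
    All parameter values here are assumptions chosen for illustration.
    """
    rng = random.Random(seed)
    indicators = []
    residual = []
    for _ in range(n):
        q = 1 if rng.random() < impulse_prob else 0  # impulse location
        a = rng.gauss(0.0, impulse_std)              # impulse strength
        w = rng.gauss(0.0, noise_std)                # white noise
        indicators.append(q)
        residual.append(q * a + w)
    return indicators, residual


if __name__ == "__main__":
    q, r = simulate_residual(1000)
    print("impulses drawn:", sum(q), "of", len(r), "samples")
```

Under this model, subtracting the impulse term from the residual, once its locations and strengths have been estimated, leaves only the noise-like component, which is the intuition behind the ‘whisperization’ step.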

doi: 10.21437/Interspeech.2018-1907

Cite as: N, A.R., Rao MV, A., Meenakshi, G.N., Ghosh, P.K. (2018) Reconstructing Neutral Speech from Tracheoesophageal Speech. Proc. Interspeech 2018, 1541-1545, doi: 10.21437/Interspeech.2018-1907

