Acoustic model adaptation can mitigate the degradation in recognition accuracy caused by speaker or environment mismatch. While many methods exist for speaker or environment adaptation, far less attention has been paid to methods that compensate for both sources of mismatch simultaneously. We recently proposed an algorithm called factored adaptation that jointly estimates speaker and environment transforms in a manner that facilitates the reuse of transforms across sessions. For example, a speaker transform estimated in one environment can later be reused even if the speaker's environment changes. In this paper, we introduce a new factored adaptation algorithm that uses a combination of feature-space and model-space transforms. We describe an iterative EM algorithm for transform estimation that also incorporates speaker and environment clustering in cases where the speaker or environment labels are unknown. On a large-vocabulary voice search task, the proposed method consistently outperforms conventional adaptation.
Index Terms: speaker adaptation, environment adaptation, robustness, factored transforms, acoustic factorization