ISCA Archive Interspeech 2023

Directional Speech Recognition for Speaker Disambiguation and Cross-talk Suppression

Ju Lin, Niko Moritz, Ruiming Xie, Kaustubh Kalgaonkar, Christian Fuegen, Frank Seide

With advances in mobile computing, smart glasses are becoming powerful enough to generate real-time closed captions of live conversations. Such a system must distinguish the conversation partner's speech from the wearer's, and in public places it must not transcribe speech from unrelated bystanders, both to avoid confusion and to honor privacy. We propose an end-to-end modeling approach that leverages the smart glasses' microphone array. We go beyond using beamforming to improve target-speaker SNR: we feed multiple audio channels simultaneously to a single ASR model as a basis for speaker-attributed transcription and suppression of bystander cross-talk. Our proposed multi-channel directional ASR model processes multiple beamformer outputs for different steering directions simultaneously and combines this with serialized output training. Under room-acoustics and noise simulation, we demonstrate near-perfect wearer/conversation-partner disambiguation and suppression of cross-talk speech from non-target directions.
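To make the idea of "multiple beamformer outputs for different steering directions" concrete, the following is a minimal sketch of a bank of beamformers whose outputs are stacked as parallel input channels. It assumes a simple far-field delay-and-sum beamformer, which the abstract does not specify; the function names, the microphone geometry, and the speed-of-sound constant are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def steering_delays(mic_positions, azimuth_rad, speed_of_sound=343.0):
    """Arrival-time offsets (seconds) per microphone for a far-field source.

    mic_positions: (num_mics, 3) array in meters (hypothetical geometry).
    A microphone lying further toward the source hears the wavefront
    earlier, hence the negative sign.
    """
    direction = np.array([np.cos(azimuth_rad), np.sin(azimuth_rad), 0.0])
    return -(np.asarray(mic_positions) @ direction) / speed_of_sound


def delay_and_sum(signals, sample_rate, delays):
    """Align channels for one steering direction and average them.

    signals: (num_mics, num_samples). The delay compensation is applied
    in the frequency domain, so it acts as a circular shift.
    """
    num_samples = signals.shape[1]
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / sample_rate)
    spectra = np.fft.rfft(signals, axis=1)
    # Multiplying by exp(+j*2*pi*f*tau) advances a channel by tau seconds,
    # undoing the arrival delay assumed for this steering direction.
    phase = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft((spectra * phase).mean(axis=0), n=num_samples)


def beamformer_bank(signals, sample_rate, mic_positions, azimuths_rad):
    """Stack one beamformed signal per steering direction.

    Returns an array of shape (num_directions, num_samples) that a
    multi-channel model could consume as parallel input channels.
    """
    return np.stack([
        delay_and_sum(signals, sample_rate,
                      steering_delays(mic_positions, azimuth))
        for azimuth in azimuths_rad
    ])
```

In this sketch, a source located at one of the steering azimuths is reinforced in the corresponding output channel and attenuated in the others, which is the kind of directional cue a downstream ASR model could use to attribute or suppress speech.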