ISCA Archive Interspeech 2022

End-to-End Dependency Parsing of Spoken French

Adrien Pupier, Maximin Coavoux, Benjamin Lecouteux, Jerome Goulian

Research efforts in syntactic parsing have focused on written texts. As a result, speech parsing is usually performed on transcriptions, either in unrealistic settings (gold transcriptions) or on predicted transcriptions. Parsing speech from transcriptions, though straightforward to implement with out-of-the-box tools for Automatic Speech Recognition (ASR) and dependency parsing, has two important limitations. First, relying on transcriptions leads to error propagation due to recognition mistakes. Second, many acoustic cues that are important for parsing (prosody, pauses, ...) are no longer available in transcriptions. To address these limitations, we introduce wav2tree, an end-to-end dependency parsing model whose only input is the raw signal. Our model builds on a pretrained wav2vec2 encoder with a CTC loss to perform ASR. We extract a token segmentation from the CTC layer to construct a vector representation for each predicted token. We then use these token representations as input to a generic parsing algorithm. The whole model is trained end-to-end with a multitask objective (ASR, parsing) to reduce error propagation. Our experiments on the Orfeo treebank of spoken French show that direct parsing from speech is feasible: wav2tree outperforms a pipeline approach based on wav2vec (for ASR) and FlauBERT (for parsing).
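To make the described architecture concrete, the sketch below shows one plausible way to wire the components the abstract mentions: a pretrained wav2vec2 encoder, a CTC head for the ASR objective, mean-pooling of the frames assigned to each predicted token, and a biaffine arc scorer over the resulting token vectors. This is a minimal illustrative sketch, not the authors' implementation; the library choices (PyTorch, Hugging Face transformers), the specific checkpoint name, the pooling strategy, and all class and variable names are assumptions.

```python
# Minimal sketch of a wav2tree-style model (illustrative; assumes PyTorch and
# Hugging Face transformers; not the authors' code).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class Wav2TreeSketch(nn.Module):
    def __init__(self, vocab_size, hidden=768, parse_dim=256):
        super().__init__()
        # Pretrained speech encoder producing one vector per acoustic frame.
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        # CTC head over output symbols for the ASR objective.
        self.ctc_head = nn.Linear(hidden, vocab_size)
        self.ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
        # Biaffine scorer for dependency arcs (head selection per token).
        self.arc_dep = nn.Linear(hidden, parse_dim)
        self.arc_head = nn.Linear(hidden, parse_dim)
        self.biaffine = nn.Parameter(torch.zeros(parse_dim, parse_dim))

    def pool_tokens(self, frames, token_spans):
        # Average the frames that the CTC segmentation assigns to each token.
        # token_spans: list of (start_frame, end_frame) pairs, one per token,
        # assumed to be derived from the CTC layer's alignment.
        return torch.stack([frames[s:e].mean(dim=0) for s, e in token_spans])

    def forward(self, waveform, token_spans):
        # waveform: (1, num_samples) raw signal.
        frames = self.encoder(waveform).last_hidden_state.squeeze(0)  # (T, hidden)
        ctc_logits = self.ctc_head(frames)                            # (T, vocab)
        tokens = self.pool_tokens(frames, token_spans)                # (n, hidden)
        dep = self.arc_dep(tokens)                                    # (n, parse_dim)
        head = self.arc_head(tokens)                                  # (n, parse_dim)
        arc_scores = dep @ self.biaffine @ head.t()                   # (n, n)
        return ctc_logits, arc_scores
```

In an end-to-end training loop, the multitask objective would simply sum the CTC loss computed on `ctc_logits` and a parsing loss (e.g., cross-entropy over `arc_scores` against gold head indices), so that gradients from the parser also update the speech encoder.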