Current speech synthesizers typically lack backchannel tokens. Those
synthesizers that do include backchannels typically support only a limited
set of stereotypical functions, which does not mirror the subtleties
of backchannels in spontaneous conversation. If we want to build an
artificial listener that can display varying degrees of attentiveness,
we need a speech synthesizer with more fine-grained control over the
prosodic realization of its backchannels.
In the current study
we used a corpus of three-party face-to-face discussions to sample
backchannels produced under varying conversational dynamics. We wanted
to understand i) which prosodic cues are relevant for the perception
of varying degrees of attentiveness, ii) how much of a difference is
necessary for listeners to perceive a difference in attentiveness, and iii)
whether a preliminary classifier could be trained to distinguish between
more and less attentive backchannel tokens.
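
To make aim iii) concrete, the following is a minimal sketch of such a binary classifier. The feature set (mean F0, F0 slope, intensity, duration), the model choice (regularized logistic regression), and the placeholder data are illustrative assumptions, not the study's actual setup.

```python
# Sketch: classify backchannel tokens as more vs. less attentive from prosodic features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder data: one row per backchannel token,
# columns = [mean_f0_hz, f0_slope_st_per_s, mean_intensity_db, duration_s].
X = rng.normal(size=(200, 4))
# Placeholder labels: 1 = perceived as more attentive, 0 = less attentive.
y = rng.integers(0, 2, size=200)

# Standardize the prosodic features, then fit a regularized logistic regression.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Cross-validated accuracy as a first check of whether the classes are separable.
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.2f}")
```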