The evaluation of conversational-speech translation systems raises many technical issues. To stimulate the discussion, some general problems and proposals are briefly introduced here; they will be complemented by the presentations given by the invited panelists.
Speech translation requires careful consideration of the goal of the task itself. While broadcast news translation, for example, can be treated similarly to written text translation, conversational speech admits different notions of translation. For this task, professional human translators typically refer to three "interpreting modalities": simultaneous, consecutive and liaison. Simply put, all modalities require the human interpreter to listen to a given amount of speech, to recount what has been said, to listen again, and so on. Probably the least ambitious scenario for automatic spoken language translation (SLT) is simultaneous interpreting, which typically requires the human to translate at very short intervals, e.g. every few seconds, or even in real time. Simultaneous interpreting is physically very demanding, and its strict time constraints leave interpreters less able to exploit their linguistic and domain knowledge. For both reasons, users accept translations that are less fluent and nearly literal.
Given that speech translation relies on automatic speech recognition (ASR), the task should be tailored to the ASR accuracy that can be attained. In the past, interlingua-based systems have been applied to mimic the way a liaison interpreter works, e.g. at a meeting or appointment. In particular, the interpreter is assumed to be familiar with the subject under discussion and to use psychological skills to facilitate communication. While the mediator metaphor seemed appropriate, especially in the presence of noisy input, interlingua approaches have shown little ability to cope with poor speech recognition performance and have performed significantly worse than purely data-driven translation models.
Nevertheless, any plan for speech translation evaluation should take into account progress in the area of speech recognition and scale up the difficulty of the tasks considered accordingly.
Human and automatic evaluation should take into account important differences between written and spoken language. In practice, how should input sentences containing disfluencies and syntactic errors be treated? What kind of human translations should be taken as target references? The simultaneous interpreting scenario suggests putting more emphasis on adequacy than on fluency. Moreover, appropriate reference translations could be obtained by transcribing human interpreters working in realistic conditions.