The field of dialogue evaluation is still at a very early stage of development. This paper surveys relevant work and outlines the approach to evaluation developed in the SUNDIAL project. The approach evaluates a system against a battery of metrics, divided between those that treat the system as a black box and those that look inside at its components (treating it as a glass box). Some of these metrics require subjective judgement, so they cannot be fully automated. We argue that this is a reasonable price to pay for a well-rounded evaluation of a spoken dialogue system.
Keywords: Spoken dialogue systems, Evaluation, Black box metrics, Glass box metrics.