A challenging application of text-to-speech synthesis is the reading of names and addresses, for example in the automation of reverse directory services. This task is subject to rather severe conditions: the proper pronunciation of names is particularly difficult, and the telephone network may degrade the speech signal acoustically. Nevertheless, very high segmental intelligibility is needed to avoid user rejection.
A project has been started at CSELT to develop an experimental reverse directory service on the Italian telephone network, exploiting the capabilities of a diphone-based text-to-speech synthesis system augmented with specialized name pronunciation rules.
This paper describes the intelligibility evaluation of a consistent database of surnames and addresses from the Italian telephone directory, comparing natural speech and text-to-speech synthesis in two conditions: 16 kHz as a high-quality reference, and 8 kHz PCM as the telephone standard. The segmental intelligibility of the synthetic speech is discussed for several phonetic contexts and compared with the performance of natural speech. The results are also interpreted in the context of the application.