For a long time, it has been possible to navigate text files with tools that are now familiar even to beginners. Multimedia has given a new role to audio and video, media that are now accessible in digital form on the Web. Fast access to audio and video information requires new tools that will enrich future search engines. Joint indexing of audio and video increases the recall rate of audio and/or visual events, each medium supporting the other for better detection. This paper presents an overview of different tools for indexing and of the efforts to develop efficient technology.
A typical application is audio-visual speech, where the cooperation of the media improves the recognition of a message in a noisy environment by lip reading.
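Audio-visual cooperation of this kind is often realised by decision-level fusion: each modality produces per-class scores, and the weight given to the acoustic scores is reduced as the acoustic SNR drops, so that lip reading dominates in noise. The sketch below illustrates the principle only; the class names, scores, and SNR-to-weight mapping are invented for the example and do not come from any particular system.

```python
# Hypothetical late-fusion sketch for audio-visual speech recognition.
# Each modality yields per-class scores; the audio weight shrinks as the
# acoustic SNR drops, so visual (lip-reading) evidence takes over in noise.
# All numbers are illustrative.

def fuse_scores(audio_scores, visual_scores, snr_db):
    """Weighted sum of per-class scores; the weight depends on audio SNR."""
    # Illustrative ramp: fully trust audio above 20 dB, ignore it below 0 dB.
    w = min(1.0, max(0.0, snr_db / 20.0))
    return {c: w * audio_scores[c] + (1.0 - w) * visual_scores[c]
            for c in audio_scores}

def decide(scores):
    """Pick the class with the highest fused score."""
    return max(scores, key=scores.get)

# In clean speech the audio evidence dominates...
clean = fuse_scores({"ba": 0.9, "ga": 0.1}, {"ba": 0.4, "ga": 0.6}, snr_db=20)
# ...while in heavy noise the visual evidence decides.
noisy = fuse_scores({"ba": 0.9, "ga": 0.1}, {"ba": 0.4, "ga": 0.6}, snr_db=0)
```

With these toy scores the fused decision flips from the audio-preferred class in clean conditions to the visually-preferred class in noise, which is exactly the complementarity the text describes.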
The limited bandwidth of GSM, as well as voice transmission over IP, requires the detection of non-speech segments by a voice activity detector (VAD) during transmission. "Silences" are then regenerated at the receiver when the audio message is reconstructed. VADs are also used in speech enhancement.
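The simplest VADs compare the short-term energy of each frame against an estimated noise floor. The following minimal sketch (not the GSM or any standardised algorithm; the threshold ratio and toy frames are assumptions for illustration) labels a frame as speech when its energy clearly exceeds that floor:

```python
# Minimal energy-based VAD sketch (illustrative only, not a standard
# GSM/ITU algorithm): a frame is speech when its short-term energy
# exceeds a fixed multiple of an estimated noise floor.

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def vad(frames, noise_floor, ratio=4.0):
    """Return one boolean per frame: True = speech, False = 'silence'."""
    return [frame_energy(f) > ratio * noise_floor for f in frames]

# Toy signal: two low-energy "silence" frames around a louder speech frame.
silence = [0.01, -0.02, 0.015, -0.01]
speech = [0.5, -0.6, 0.55, -0.4]
floor = frame_energy(silence)                  # noise floor estimated on silence
flags = vad([silence, speech, silence], floor)  # [False, True, False]
```

At the receiver, frames flagged as non-speech would simply not be transmitted and comfort noise would be generated in their place.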
In the automatic analysis of broadcast news, it is necessary to separate speech from music.
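One classical cue for this separation is the variance of the zero-crossing rate (ZCR) across frames: speech alternates voiced and unvoiced segments, so its ZCR fluctuates more than that of most music. The discriminator below is a toy sketch built on that single feature; the threshold and the synthetic frames are assumptions for illustration, not a published system.

```python
# Illustrative speech/music discriminator based on the variance of the
# zero-crossing rate (ZCR) over frames. Real systems combine many more
# features; the threshold here is an invented toy value.

def zcr(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (len(frame) - 1)

def zcr_variance(frames):
    rates = [zcr(f) for f in frames]
    mean = sum(rates) / len(rates)
    return sum((r - mean) ** 2 for r in rates) / len(rates)

def classify(frames, threshold=0.01):
    """High ZCR variance -> speech; steady ZCR -> music."""
    return "speech" if zcr_variance(frames) > threshold else "music"

low_zcr = [1, 1, 1, 1, -1, -1, -1, -1]     # voiced-like: few sign changes
high_zcr = [1, -1, 1, -1, 1, -1, 1, -1]    # fricative-like: many sign changes
speech_like = [low_zcr, high_zcr, low_zcr, high_zcr]   # alternating frames
music_like = [high_zcr, high_zcr, high_zcr, high_zcr]  # steady frames
```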
Detection of speaker turns also plays a critical role in the analysis of the discourse, and can be used to improve speech detection by adaptation of the speech unit models. It should be possible without knowing the number of different speakers intervening in the recorded speech and without knowing anything about their voice characteristics. Identity search of speakers is of course important too: the words of a given speaker can then be automatically found in the database. A recent technique under study is eigenvoices: the speaker space is described in terms of projections of speaker characteristic vectors onto a subspace (obtained initially by PCA, though the vector basis is later retrained).
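The PCA step of the eigenvoice idea can be sketched in a few lines: stack the training speakers' characteristic vectors, take the leading principal directions as the "eigenvoices", and represent a new speaker by its coordinates in that subspace. The dimensions and random data below are toy assumptions; real systems use much larger model supervectors and retrain the basis afterwards, as the text notes.

```python
import numpy as np

# Eigenvoice sketch (toy assumption: 4-dimensional "speaker vectors").
# The speaker space is spanned by the leading principal components of
# the training speakers' vectors; a new speaker is represented by its
# low-dimensional projection onto that basis.

rng = np.random.default_rng(0)
speakers = rng.normal(size=(10, 4))   # 10 training speakers, dimension 4

mean = speakers.mean(axis=0)
centered = speakers - mean
# PCA via SVD: the rows of Vt are the principal directions ("eigenvoices").
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
eigenvoices = Vt[:2]                  # keep the 2 leading eigenvoices

def project(speaker_vec):
    """Coordinates of a speaker in the eigenvoice subspace."""
    return eigenvoices @ (speaker_vec - mean)

coords = project(speakers[0])         # compact 2-dim speaker representation
```

The compact coordinates can then be compared between recordings to decide whether two segments come from the same speaker.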
Search of keywords is a preliminary step towards topic detection: the accumulation of words belonging to a given domain is used to select significant parts of the data. Another technique, which seems the most straightforward, is the conversion of the whole speech file into a text file: in that case many existing tools for text files can be used, but speaker information is completely lost. Moreover, the lexicon must be completely specified (more than 100,000 words), contrary to systems where a phoneme lattice is generated (N-best search), in which any sequence of phonemes can be searched for. Important projects have been devoted to indexing (among others THISL, a European project) and have improved LVSIR. Many labs have worked on the Broadcast News Hub 4 database.
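The advantage of the lattice approach can be seen in a small sketch: if each time slot keeps its N-best phoneme hypotheses, any keyword, expressed as a phoneme string, can be looked up by checking whether its phonemes can be read off consecutive slots. The lattice contents and phoneme symbols below are invented for illustration.

```python
# Illustrative keyword search in a phoneme lattice: each time slot holds
# the N-best phoneme hypotheses, and a keyword (a phoneme sequence) is
# matched if its phonemes appear in consecutive slots. The lattice and
# phoneme symbols are made-up examples, not real recogniser output.

def keyword_in_lattice(lattice, keyword):
    """lattice: list of sets of candidate phonemes, one set per time slot.
    Returns the slot index where the keyword starts, or -1 if absent."""
    k = len(keyword)
    for start in range(len(lattice) - k + 1):
        if all(keyword[i] in lattice[start + i] for i in range(k)):
            return start
    return -1

lattice = [{"s", "z"}, {"p", "b"}, {"iy", "ih"}, {"ch", "jh"}]
hit = keyword_in_lattice(lattice, ["s", "p", "iy", "ch"])  # found at slot 0
miss = keyword_in_lattice(lattice, ["m", "ah", "n"])       # not present: -1
```

Because the search operates on phonemes rather than words, no 100,000-word lexicon has to be fixed in advance: any new keyword can be converted to its phoneme sequence and searched for directly.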