Identifying people in broadcast video is by nature a multimodal task: a person can be identified from biometric information (face or voice) or from a reference to their identity in the overlaid text or the speech content. In the framework of the French evaluation program Repere, this paper presents a method for identifying speakers in videos without any a priori models, based only on the overlaid text that is often used to introduce guests or journalists when they first appear in a given TV show. We show that Entity Linking improves speaker identification performance by reducing ambiguities in OCR transcriptions and by making it possible to add biometric constraints to the multimodal fusion process. All the methods presented are evaluated on the Repere video corpus, which covers five different shows (news, talk shows, magazines) from two French TV channels.
Index Terms: OCR, Named Entity, Entity Linking, Multimodal Fusion