ISCA Archive Interspeech 2016

Computational Approaches to Linguistic Code Switching

Mona Diab, Pascale Fung, Julia Hirschberg, Thamar Solorio

Code-switching (CS) is the phenomenon by which multilingual speakers switch back and forth between their common languages in written or spoken communication. CS may occur at the inter-utterance level, the intra-utterance level (mixing of words from multiple languages in the same utterance), and even the morphological level (mixing of morphemes from different languages). CS presents serious challenges for language technologies such as Automatic Speech Recognition (ASR), Language Modeling, Parsing, Machine Translation (MT), Information Retrieval (IR) and Extraction (IE), Keyword Search, and semantic processing. A prime example is acoustic and language modeling in ASR: models trained on one language quickly break down when given mixed-language input. The lack of basic tools such as language models, part-of-speech (POS) taggers, and parsers trained on such mixed-language data makes downstream tasks even more challenging. Even for problems that are largely considered solved for monolingual corpora, such as Language Identification or POS Tagging, performance degrades at a rate proportional to the amount and level of language mixing present in the data.

This special event aims to bring together researchers interested in solving the CS problem and to raise community awareness of the (limited) resources available and the work currently underway for the study of CS, with particular emphasis on work in the speech community. The format will consist of a short introduction from the organizers followed by discussion. We held a workshop on CS in conjunction with EMNLP 2014, developing a shared text-based task for this purpose. We received 18 regular workshop submissions and accepted 8. The goal of this event is to engage the speech processing community now working in this area and to encourage new research by those now working primarily with monolingual corpora.

We will solicit participation from researchers working in speech processing for the analysis and/or processing of CS data. Topics of relevance to the event will include the following:

  • Methods for improving ASR acoustic and language models in code-switched data
  • Domain/dialect/genre adaptation techniques applied to CS data processing
  • Challenges of language identification in CS data
  • Speech-to-speech translation in CS data
  • Keyword search in CS data
  • Cross-lingual approaches to CS
  • Development of corpora to support research on CS data
  • Crowdsourcing approaches for the annotation of code-switched data