This paper proposes a human-computation-based scheme for transcribing speech corpora. The core idea is to implement a Web-based language learning system that collects orthographic and phonetic labels from a large number of language learners and then applies selection criteria to adopt the most frequently entered labels as the transcriptions of the corpora. The scheme is essentially a form of distributed knowledge acquisition. Its benefit is that the transcription task becomes neither tedious nor costly, because the work is spread across many learners as part of their learning activity. The design of a system for transcribing Min Nan speech corpora is described in detail.
Index Terms— Speech transcription, southern Min (Min Nan) language, distributed knowledge acquisition, Web-based language learning
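As a rough illustration of the label selection step summarized above, the following Python sketch picks the most frequently entered label for a single utterance. The abstract does not state the selection criteria, so the ones used here are assumptions: a minimum vote count (`min_votes`) and a strict majority share (`min_share`). The function name `select_transcription` and the tone-numbered example labels are likewise hypothetical and are not taken from the described system.

```python
from collections import Counter


def select_transcription(labels, min_votes=3, min_share=0.5):
    """Return the most commonly entered label, or None if no label qualifies.

    Assumed criteria (not from the paper): the winning label needs at least
    `min_votes` votes and strictly more than `min_share` of all valid votes.
    """
    # Collapse runs of whitespace so trivially different inputs count as one label.
    normalized = [" ".join(label.split()) for label in labels]
    # Ignore empty submissions.
    normalized = [label for label in normalized if label]
    if not normalized:
        return None

    winner, votes = Counter(normalized).most_common(1)[0]
    share = votes / len(normalized)
    if votes >= min_votes and share > min_share:
        return winner
    return None


# Hypothetical labels entered by four learners for the same utterance.
learner_labels = ["li2 ho2", "li2  ho2", "li2 ho2", "li ho"]
accepted = select_transcription(learner_labels)
```

With the default thresholds, three of the four learners agree after whitespace normalization, so `accepted` holds their shared label. An utterance with too few or too divided answers yields `None` and would remain open for further learner input.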