ISCA Archive Interspeech 2014

Using a hybrid approach to build a pronunciation dictionary for Brazilian Portuguese

Gustavo Mendonça, Sandra Aluisio

This paper describes the method employed to build a machine-readable pronunciation dictionary for Brazilian Portuguese. The dictionary uses a hybrid approach to grapheme-to-phoneme conversion, combining manual transcription rules with machine learning algorithms. Its word list was compiled from the Portuguese Wikipedia dump: Wikipedia articles were transformed into plain text and tokenized, and word types were extracted. A language identification tool was developed to detect loanwords in the data. Syllable boundaries and stress were identified for each word. The transcription task was carried out in a two-step process: i) words were submitted to a set of transcription rules, in which predictable graphemes (mostly consonants) were transcribed; ii) a machine learning classifier was used to predict the transcription of the remaining graphemes (mostly vowels). The method was evaluated through 5-fold cross-validation; results show an F1-score of 0.98. The dictionary and all the resources used to build it were made publicly available.
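
The two-step transcription process described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the rule table `RULES`, the helper names `apply_rules` and `transcribe`, and the stand-in `toy_classifier` are all hypothetical and chosen only to show how rule-based output and classifier predictions might be combined.

```python
# Hypothetical illustration of a rules-then-classifier pipeline.
# The rule table and the classifier are toy assumptions, not the
# rules or the model used in the paper.

# Step (i): a toy table of graphemes assumed to be predictable.
RULES = {"p": "p", "b": "b", "f": "f", "v": "v"}


def apply_rules(word):
    """Pair each grapheme with its rule-based phone, or None if no rule applies."""
    return [(g, RULES.get(g)) for g in word]


def toy_classifier(word, index):
    """Stand-in for a trained classifier: simply echoes the grapheme.

    A real classifier would predict the phone from features such as
    neighbouring graphemes, syllable position and stress.
    """
    return word[index]


def transcribe(word, classifier=toy_classifier):
    """Step (ii): fill every grapheme left open by the rules using the classifier."""
    partial = apply_rules(word)
    return [
        phone if phone is not None else classifier(word, i)
        for i, (_, phone) in enumerate(partial)
    ]


if __name__ == "__main__":
    print(apply_rules("fava"))
    print(transcribe("fava"))
```

In this sketch the rules handle only the graphemes listed in the toy table, and everything else is deferred to the classifier, mirroring the division of labour between predictable graphemes (mostly consonants) and the remaining ones (mostly vowels).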


doi: 10.21437/Interspeech.2014-319

Cite as: Mendonça, G., Aluisio, S. (2014) Using a hybrid approach to build a pronunciation dictionary for Brazilian Portuguese. Proc. Interspeech 2014, 1278-1282, doi: 10.21437/Interspeech.2014-319

@inproceedings{mendonca14_interspeech,
  author={Gustavo Mendonça and Sandra Aluisio},
  title={{Using a hybrid approach to build a pronunciation dictionary for Brazilian Portuguese}},
  year=2014,
  booktitle={Proc. Interspeech 2014},
  pages={1278--1282},
  doi={10.21437/Interspeech.2014-319},
  issn={2308-457X}
}