ISCA Archive IberSPEECH 2022

CORAA NURC-SP Minimal Corpus: a manually annotated corpus of Brazilian Portuguese spontaneous speech

Vinícius G. Santos, Caroline Adriane Alves, Bruno Baldissera Carlotto, Bruno Angelo Papa Dias, Lucas Rafael Stefanel Gris, Renan de Lima Izaias, Maria Luiza Azevedo de Morais, Paula Marin de Oliveira, Rafael Sicoli, Flaviane Romani Fernandes Svartman, Marli Quadros Leite, Sandra Maria Aluísio

With the advent of technology, the availability of linguistic data in digital format has been increasingly encouraged, to facilitate its use not only in different areas of Linguistics but also in related areas such as natural language processing. Inspired by a protocol for digitizing the NURC (‘Cultured Linguistic Urban Norm’) project collection, one of the most influential in Brazilian Linguistics, this paper presents the text-to-speech alignment process of the NURC-São Paulo Minimal Corpus. This subcorpus comprises 21 audio files and audio-aligned multilevel transcripts segmented according to linguistically motivated intonation units (≈18 hours, ≈155k words), covering three text genres. The dataset, currently used to evaluate methods for processing the entire NURC-SP corpus, is publicly available on the Portulan Clarin repository [CC BY-NC-ND 4.0] (https://hdl.handle.net/21.11129/0000-000F-73CA-C).