ISCA Archive SIGUL 2023

A Transformer-Based Orthographic Standardiser for Scottish Gaelic

Junfan Huang, Beatrice Alex, Michael Bauer, David Salvador-Jasin, Yuchao Liang, Robert Thomas, William Lamb

The transition from rule-based to neural-based architectures has made it more difficult for low-resource languages like Scottish Gaelic to participate in modern language technologies. The performance of deep-learning approaches correlates with the availability of training data, and low-resource languages have limited data reserves by definition. Historical and non-standard orthographic texts could be used to supplement training data, but manual conversion of these texts is expensive and time-consuming. This paper describes the development of a neural-based orthographic standardisation system for Scottish Gaelic and compares it to an earlier rule-based system. The best performance yielded a precision of 93.92, a recall of 92.20 and a word error rate of 11.01. This was obtained using a transformer-based mixed teacher model, which was trained with augmented data.
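For readers unfamiliar with the word error rate quoted above, the following is a minimal sketch of the conventional definition: the word-level edit distance (substitutions, deletions and insertions) between a reference and a system output, divided by the number of reference words and expressed as a percentage. The abstract does not specify how the authors computed their scores, so the function name `word_error_rate`, the whitespace tokenisation and the placeholder tokens are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate (in percent) of hypothesis against reference.

    Assumption: tokens are separated by whitespace. The paper's own
    tokenisation and scoring script are not described in the abstract.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# Placeholder tokens, not real Gaelic data: one substitution in four words.
print(word_error_rate("w1 w2 w3 w4", "w1 wX w3 w4"))
```

Under this definition a lower score is better, and the value can exceed 100 when the output contains many inserted words.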


doi: 10.21437/SIGUL.2023-23

Cite as: Huang, J., Alex, B., Bauer, M., Salvador-Jasin, D., Liang, Y., Thomas, R., Lamb, W. (2023) A Transformer-Based Orthographic Standardiser for Scottish Gaelic. Proc. 2nd Annual Meeting of the ELRA/ISCA SIG on Under-resourced Languages (SIGUL 2023), 108-112, doi: 10.21437/SIGUL.2023-23

@inproceedings{huang23_sigul,
  author={Junfan Huang and Beatrice Alex and Michael Bauer and David Salvador-Jasin and Yuchao Liang and Robert Thomas and William Lamb},
  title={{A Transformer-Based Orthographic Standardiser for Scottish Gaelic}},
  year=2023,
  booktitle={Proc. 2nd Annual Meeting of the ELRA/ISCA SIG on Under-resourced Languages (SIGUL 2023)},
  pages={108--112},
  doi={10.21437/SIGUL.2023-23}
}