A tonal language is one in which the meaning of a word is determined not only by its consonants and vowels but also by the pitch, or tone, with which it is pronounced. Mispronunciation Detection and Diagnosis (MD&D) for tonal languages is challenging because tones are difficult to detect correctly. Relatively little research has addressed tonal languages, and most of it focuses on Mandarin. Furthermore, no datasets or source code are publicly available for the task. This work constructs and publishes a Vietnamese dataset for MD&D experiments and proposes an end-to-end model that uses pitch analysis to detect and diagnose mispronunciations in tonal languages, with a focus on Vietnamese. Experiments show that the proposed model achieves relative improvements of 7.1% in phone error rate and 7.4% in detection accuracy over a state-of-the-art baseline.