Non-native mispronunciation verification is an important component of computer-aided language learning (CALL) systems. However, because it is impractical to collect and manually label large amounts of non-native speech, data sparsity makes it difficult to train an accurate acoustic model directly on non-native data with supervised approaches. In this paper, we propose a pre-training approach for non-native acoustic modeling in mispronunciation verification, based on self-supervised learning with multi-target contrastive coding, that exploits abundant raw speech resources from two native languages. The model is designed to learn representations of the discrepancies in phonetic structure within and across languages, and across speakers, by making predictions that are contrasted against different targets. In addition, a regularization term is incorporated that reconstructs the original speech from the shared components. Experiments on the Japanese part of the BLCU inter-Chinese speech corpus show that the proposed approaches improve non-native acoustic modeling for both phone recognition and mispronunciation verification.