This study proposes an approach to Japanese grapheme-to-phoneme (G2P) conversion that combines the Transformer framework with Bidirectional Encoder Representations from Transformers (BERT) to exploit external resources such as dictionaries. Conventional Transformer-based methods cannot reference specific dictionary readings because the model's intermediate processes and weights cannot be modified directly. This study therefore employs a dual-Transformer strategy to improve the pronunciation of proper nouns, numerals, and counter words: the first Transformer incorporates the external data, and the second Transformer employs BERT to predict accent sandhi. Combining Transformer-based techniques with dictionary integration enables accurate, user-controllable pronunciation of proper nouns, numerals, and counter words, contributing to the ongoing development of text-to-speech technology.
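To make the two-stage design concrete, the following is a minimal, hypothetical sketch of such a pipeline. The model calls are stubs standing in for the trained Transformer and BERT components, and all function names, the placeholder accent notation, and the sample dictionary entries are illustrative assumptions, not the paper's actual implementation. It shows only the control flow: an external dictionary reading overrides the first (G2P) stage, and the second stage annotates accent sandhi.

```python
# Hypothetical sketch of the dual-Transformer G2P pipeline.
# Stage 1: grapheme-to-phoneme conversion, where an external
#          dictionary reading takes priority over the model output.
# Stage 2: accent sandhi prediction (in the paper, a BERT model).
# All models are stubbed; names and notation are illustrative.

from typing import Dict


def transformer_g2p(text: str) -> str:
    """Stub for the first Transformer: maps graphemes to phonemes.
    A real system would run a trained seq2seq model here."""
    default_readings = {"東京": "トーキョー"}  # toy model output
    return default_readings.get(text, text)


def g2p_stage(text: str, user_dict: Dict[str, str]) -> str:
    """Stage 1: prefer a user-supplied dictionary reading;
    otherwise fall back to the learned model."""
    if text in user_dict:        # external data overrides the model
        return user_dict[text]
    return transformer_g2p(text)


def accent_sandhi_stage(phonemes: str) -> str:
    """Stage 2: stub for BERT-based accent sandhi prediction.
    Here it merely appends a placeholder accent-nucleus tag."""
    return phonemes + "[1]"


def tts_frontend(text: str, user_dict: Dict[str, str]) -> str:
    """Chain the two stages, as in the dual-Transformer design."""
    return accent_sandhi_stage(g2p_stage(text, user_dict))


# Without a dictionary entry, the model's default reading is used;
# with one, the user-specified reading propagates to the output.
print(tts_frontend("東京", {}))                       # トーキョー[1]
print(tts_frontend("東京", {"東京": "トウキョウ"}))    # トウキョウ[1]
```

The key design point sketched here is that the dictionary override happens at the pipeline level rather than inside the model, which is why the first stage can honor arbitrary readings without retraining.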