Automatic prosody evaluation models for second-language (L2) read speech fall into two categories: reference-based and reference-free. Reference-based models refer to native speakers' utterances of the read text, while reference-free models do not; conventional reference-free models do not even take the uttered text into account. We propose an automatic prosody evaluation model that takes the uttered text into account by estimating native speakers' prosodic patterns with a Transformer encoder. The Transformer encoder used in FastSpeech 2 estimates a sequence of native speakers' prosodic features at the phoneme-segment level, and a subsequent neural network module evaluates an L2 learner's utterance by comparing the learner's prosodic feature sequence with the estimated native-speaker sequence. We evaluated the model using Spearman's rank correlation between objective and subjective scores on L2 English sentences read aloud by Japanese university students. The experimental results showed that our model achieved a higher subjective-objective score correlation than a conventional reference-free model, and even exceeded the inter-rater correlation.
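To make the evaluation protocol concrete, the sketch below illustrates the two ingredients the abstract mentions: scoring an utterance by its closeness to an estimated native prosodic sequence, and correlating those objective scores with subjective ratings via Spearman's rank correlation. Everything here is illustrative and not the paper's actual method: the feature values are toy numbers, the negative mean absolute difference stands in for the learned comparison module, and the hand-rolled `spearman` (valid only without tied ranks) stands in for a library routine such as `scipy.stats.spearmanr`.

```python
def prosody_score(learner_feats, native_feats):
    """Hypothetical closeness score: negative mean absolute difference between
    a learner's phoneme-segment prosodic features and the sequence estimated
    for native speakers. Higher (closer to 0) means more native-like."""
    diffs = [abs(l - n) for l, n in zip(learner_feats, native_feats)]
    return -sum(diffs) / len(diffs)

def spearman(x, y):
    """Spearman's rank correlation via 1 - 6*sum(d^2)/(n(n^2-1)).
    Assumes no tied values in either list."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy phoneme-segment features (e.g., F0, energy, duration) for three
# learner utterances of the same text, and the native estimate for that text.
learners = [
    [1.0, 2.0, 3.0],   # close to the native estimate
    [1.5, 2.5, 2.0],   # moderately off
    [3.0, 0.5, 1.0],   # far from the native estimate
]
native_estimate = [1.1, 2.0, 2.9]

objective = [prosody_score(l, native_estimate) for l in learners]
subjective = [5, 3, 1]  # hypothetical human ratings on a 5-point scale

rho = spearman(objective, subjective)
print(rho)  # → 1.0 (objective and subjective rankings agree perfectly)
```

In this toy case the objective scores rank the utterances in the same order as the human ratings, so the correlation is 1.0; the experiments in the paper report this statistic over real learner utterances and raters.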