This study investigated the relationship between teacher-selected stimulus sentences and machine-suggested ones in terms of the correlation between human ratings and GOP-based machine scores. Using shadowed speech of 55 sentences recorded by 125 Japanese learners of English, we examined which combinations of the 55 sentences could maximize the correlation between automatic scores and human ratings. An experienced teacher selected 10 sentences based on criteria such as sentence length, grammar, pronunciation, and prosodic features. The shadowings of these 10 sentences were manually rated by two native speakers of English, who focused on pronunciation, prosody, and lexical access (whether each word or phrase is identified and then shadowed adequately). The same shadowed utterances were automatically assessed with a DNN-based Goodness of Pronunciation (GOP) procedure, and a high and significant correlation (r = 0.738, p < .01) was found between the manual ratings and the automatic scores. Next, three sentences were selected from the original 55 by greedy search so that the correlation between their DNN-GOP scores and the manual ratings of the 10 teacher-selected sentences was maximized, and the top-ranked three-sentence combinations were listed. Far fewer sentences were shared between the teacher-selected set and the machine-suggested sets than expected, indicating that the teacher's strategies for selecting stimulus sentences were suboptimal for maximizing the correlation. An examination of the machine-suggested sets revealed that certain sentences appeared frequently and seemed to contribute to maximizing the correlation. The present study therefore compared the 10 teacher-selected sentences and the teacher's selection criteria with the machine-suggested sets, and discussed what kinds of sentences should be chosen to improve the reliability of automatic assessment.
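The abstract does not give the exact GOP formulation used; a minimal sketch in Python of a commonly used DNN-based GOP score, assuming frame-level phone posteriors from a DNN acoustic model and a forced alignment (both hypothetical interfaces, not the paper's actual pipeline), might look as follows:

```python
import numpy as np

def dnn_gop(posteriors: np.ndarray, alignment: list[tuple[int, int, int]]) -> float:
    """Utterance-level GOP score, averaged over phones.

    posteriors: (T, P) frame-level phone posteriors from a DNN acoustic model.
    alignment:  forced-alignment segments as (start_frame, end_frame, phone_id).
    Both inputs are assumed interfaces for illustration only.
    """
    phone_scores = []
    for start, end, phone in alignment:
        # One common DNN-based GOP of a phone: the mean log posterior of the
        # aligned phone over its frames (length-normalized).
        frame_logp = np.log(posteriors[start:end, phone] + 1e-10)
        phone_scores.append(frame_logp.mean())
    # Utterance score: average of the per-phone GOP values.
    return float(np.mean(phone_scores))
```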
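The greedy search described above can likewise be sketched; the data shapes (125 learners, 55 sentences, 3 selected) follow the abstract, while the use of Pearson correlation and the mean over selected sentences as the combined machine score are assumptions:

```python
import numpy as np
from scipy.stats import pearsonr

def greedy_select(gop: np.ndarray, ratings: np.ndarray, k: int = 3) -> list[int]:
    """Greedily pick k sentence indices whose averaged GOP scores
    correlate best with the learners' manual ratings.

    gop:     (n_learners, n_sentences) DNN-GOP scores, e.g. (125, 55).
    ratings: (n_learners,) human ratings, e.g. averaged over the 10 rated sentences.
    """
    selected: list[int] = []
    for _ in range(k):
        best_r, best_j = -1.0, None
        for j in range(gop.shape[1]):
            if j in selected:
                continue
            # Candidate machine score: mean GOP over the selected sentences plus j.
            score = gop[:, selected + [j]].mean(axis=1)
            r, _ = pearsonr(score, ratings)
            if r > best_r:
                best_r, best_j = r, j
        selected.append(best_j)
    return selected
```

With k = 3 and 55 candidates, the greedy loop evaluates at most 3 × 55 correlations, whereas exhaustively ranking all three-sentence combinations, as the listing of top-ranked sets suggests, requires C(55, 3) = 26,235 evaluations, which is still tractable at this scale.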