This paper proposes a new method for detecting misrecognized utterances based on a ROVER-like voting scheme. Although the ROVER approach is effective in improving recognition accuracy, it has two serious problems from a practical point of view: 1) it is difficult to construct multiple automatic speech recognition (ASR) systems, and 2) the computational cost increases with the number of ASR systems. To overcome these problems, the proposed method employs only a single acoustic engine but uses multiple language models (LMs), consisting of a baseline (main) LM and several sub LMs. The sub LMs are generated from clustered sentences and are used to rescore the word lattice produced with the main LM. As a result, the computational cost is greatly reduced compared with running multiple ASR systems. In experiments, the proposed method achieved 18-point higher precision than the baseline at a 10% loss of recall, and 22-point higher precision at a 20% loss of recall.
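To make the voting idea concrete, the following is a minimal sketch of how agreement between the main-LM hypothesis and the sub-LM rescored hypotheses could be turned into a misrecognition flag. The abstract does not specify the voting rule, so everything here is an assumption made for illustration: the utterance-level exact-match criterion, the function names `agreement_ratio` and `is_misrecognized`, and the `threshold` parameter are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def agreement_ratio(main_hyp: Sequence[str],
                    sub_hyps: List[Sequence[str]]) -> float:
    """Fraction of sub-LM hypotheses identical to the main-LM hypothesis.

    Assumption: agreement is measured at the utterance level by exact
    word-sequence match. The paper's actual voting rule may differ.
    """
    if not sub_hyps:
        # With no sub-LM votes there is no evidence of disagreement.
        return 1.0
    main_words = list(main_hyp)
    matches = sum(1 for hyp in sub_hyps if list(hyp) == main_words)
    return matches / len(sub_hyps)


def is_misrecognized(main_hyp: Sequence[str],
                     sub_hyps: List[Sequence[str]],
                     threshold: float = 0.5) -> bool:
    """Flag the utterance when too few sub LMs agree with the main LM.

    `threshold` is a hypothetical tuning parameter: raising it flags more
    utterances, which would typically trade recall for precision.
    """
    return agreement_ratio(main_hyp, sub_hyps) < threshold
```

For example, if two of three sub-LM hypotheses match the main hypothesis, the agreement ratio is 2/3, so the utterance is accepted with `threshold=0.5` but flagged with `threshold=0.8`. Sweeping such a threshold is one plausible way to obtain different precision/recall operating points like those reported above.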