It is well known that HMMs with only the basic structure cannot adequately capture the correlations among successive frames. In our previous work, we introduced segmental unit HMMs to address this problem and demonstrated their effectiveness; integrating delta-cepstrum and delta-delta-cepstrum features into the segmental unit HMMs was also found to improve recognition performance. In this paper, we investigate further refinements of the models using a mixture of PDFs and/or context dependency, where, for a given syllable, only the preceding vowel is treated as context information. Recognition experiments showed that the accuracy rate improved by 23%, which clearly indicates the effectiveness of these refinements. The proposed syllable-based HMM also outperformed a triphone model.
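As a rough illustration of the feature side of this approach, the sketch below appends delta and delta-delta coefficients to each cepstral frame and then concatenates successive frames into segmental vectors. It is not the configuration used in this work: the regression formula, the window of 2 frames, the segment length of 4, and the function names `delta`, `add_dynamic_features`, and `stack_segments` are all assumptions chosen for illustration.

```python
def delta(frames, window=2):
    """Regression-based delta over +/- `window` frames (assumed formula).

    Frames beyond either end are replaced by the first or last frame.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(num_frames):
        row = []
        for d in range(dim):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, num_frames - 1)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def add_dynamic_features(cepstra, window=2):
    """Append delta and delta-delta coefficients to every cepstral frame."""
    d1 = delta(cepstra, window)
    d2 = delta(d1, window)  # delta of the delta sequence
    return [c + a + b for c, a, b in zip(cepstra, d1, d2)]


def stack_segments(frames, seg_len=4):
    """Concatenate `seg_len` successive frames into one segmental vector.

    The segment length is a hypothetical value, not taken from the paper.
    """
    return [
        [value for frame in frames[t:t + seg_len] for value in frame]
        for t in range(len(frames) - seg_len + 1)
    ]


# Toy input: a one-dimensional "cepstrum" that rises linearly over six frames.
ramp = [[float(t)] for t in range(6)]
feats = add_dynamic_features(ramp)
segments = stack_segments(feats, seg_len=4)
```

Each segmental vector here spans four frames of static and dynamic coefficients, so a single observation carries information about how the spectrum changes across neighboring frames, which is the correlation a frame-by-frame HMM does not model.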