This paper addresses the definition and modeling of robust context-dependent units for flexible-vocabulary recognition. It proposes a new technique for tuning the acoustic resolution of the models, and discusses the advantages of representing phonetic transcriptions as a sequence of stationary context-independent phonemes and diphone-transition coarticulation units, rather than with the classical diphone or triphone units. When the two techniques are combined, the recognition rate of a speaker-independent recognizer with a vocabulary of 600 surnames increases from 91.2% to 96%, while using fewer than one third of the densities of the original models.
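To make the unit representation concrete, the following minimal sketch expands a phonetic transcription into alternating stationary phoneme units and transition units, and contrasts it with a conventional triphone expansion. The function names, the `>` and `-`/`+` unit labels, the `#` boundary symbol, and the example transcription are all hypothetical choices for illustration; the abstract does not specify the notation or how word boundaries are handled.

```python
from typing import List


def to_stationary_transition_units(phonemes: List[str]) -> List[str]:
    """Expand a phoneme sequence into stationary and transition units.

    Assumption: each phoneme contributes one context-independent stationary
    unit, and each adjacent pair contributes one transition unit between them.
    """
    units: List[str] = []
    for i, phoneme in enumerate(phonemes):
        units.append(phoneme)  # stationary, context-independent part
        if i + 1 < len(phonemes):
            # transition (coarticulation) unit between neighbouring phonemes
            units.append(f"{phoneme}>{phonemes[i + 1]}")
    return units


def to_triphones(phonemes: List[str], boundary: str = "#") -> List[str]:
    """Expand the same sequence into classical left-right context triphones."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


if __name__ == "__main__":
    word = ["r", "o", "s", "i"]  # hypothetical transcription of a surname
    print(to_stationary_transition_units(word))
    print(to_triphones(word))
```

Under these assumptions, a word of n phonemes yields n stationary units and n - 1 transition units. Each unit depends on at most two phonemes, so the unit inventory grows at most quadratically with the phoneme set, whereas the triphone inventory can grow cubically.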