This paper investigates methods for modeling inter-word and inter-phrase context in continuous Japanese speech recognition. We found that compiling a network of context-dependent phonetic models that captures inter-word or inter-phrase context reduces recognition errors by 32% compared with models that do not account for inter-word context. However, this approach significantly increases the number of phonetic models required to cover the vocabulary. To limit this increase, we clustered the inter-word/phrase contexts into a small number of classes. With one class for consonant inter-word contexts and two classes for vowel contexts, recognition accuracy on digit strings was virtually equal to that of the unclustered models, while the number of phonetic models required was reduced by more than 50%.
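
To illustrate why clustering the cross-word context shrinks the model inventory, the following minimal sketch counts distinct word-boundary models for a toy digit lexicon, first with the neighboring word's boundary phone used directly as context, then with that phone mapped to one consonant class and two vowel classes. Everything here is an assumption made for illustration: the romanized lexicon, the treatment of the moraic nasal `N` as a consonant, the particular split of vowels into `V1` and `V2`, and the names `TOY_LEXICON`, `cluster_context`, and `count_boundary_models`. None of it is the paper's phone set, class assignment, or network compiler.

```python
# Toy, simplified romanized pronunciations of Japanese digits (an assumption,
# not the phone set used in the paper). Every word has at least two phones.
TOY_LEXICON = {
    "ichi": ["i", "ch", "i"],
    "ni": ["n", "i"],
    "san": ["s", "a", "N"],
    "yon": ["y", "o", "N"],
    "go": ["g", "o"],
    "roku": ["r", "o", "k", "u"],
    "nana": ["n", "a", "n", "a"],
    "hachi": ["h", "a", "ch", "i"],
    "kyu": ["k", "y", "u"],
    "zero": ["z", "e", "r", "o"],
}

VOWELS = {"a", "i", "u", "e", "o"}


def identity_context(phone):
    """Unclustered case: the neighboring phone itself is the context."""
    return phone


def cluster_context(phone):
    """Map a neighboring phone to one of three classes.

    One class for all consonants and two for vowels, as in the abstract.
    Which vowels share a class is a hypothetical choice made for this sketch.
    """
    if phone not in VOWELS:
        return "C"
    return "V1" if phone in {"i", "e"} else "V2"


def count_boundary_models(lexicon, context_of):
    """Count distinct context-dependent models needed at word boundaries.

    Only the cross-word side of each boundary phone is passed through
    context_of; the within-word neighbor is kept as the phone itself.
    """
    # Contexts that any following word can present (its first phone) and
    # that any preceding word can present (its last phone).
    next_contexts = {context_of(phones[0]) for phones in lexicon.values()}
    prev_contexts = {context_of(phones[-1]) for phones in lexicon.values()}

    models = set()
    for phones in lexicon.values():
        # Word-initial phone: left context comes from the previous word.
        for left in prev_contexts:
            models.add(("initial", left, phones[0], phones[1]))
        # Word-final phone: right context comes from the next word.
        for right in next_contexts:
            models.add(("final", phones[-2], phones[-1], right))
    return len(models)


if __name__ == "__main__":
    unclustered = count_boundary_models(TOY_LEXICON, identity_context)
    clustered = count_boundary_models(TOY_LEXICON, cluster_context)
    print("unclustered boundary models:", unclustered)
    print("clustered boundary models:  ", clustered)
```

The counts printed by this sketch describe only the toy lexicon and should not be read as the reduction reported above. The point is structural: the number of boundary models scales with the number of distinct cross-word contexts, so collapsing those contexts into three classes bounds the growth regardless of vocabulary size.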