This paper investigates discrete-token-based cross-utterance speech context modelling for Zipformer-Transducer (Z-T) systems. The efficacy and efficiency of modelling preceding, current and future speech utterance contexts, using concatenation or pooling projection of Z-T encoder embeddings, are extensively demonstrated on the 1000-hr GigaSpeech-M and DementiaBank Pitt elderly speech datasets against comparable contextual Z-T baselines using filterbank or continuous WavLM features. The best-performing discrete-token-based contextual Z-T system outperforms the non-contextual baseline by statistically significant average WER reductions of 0.39% and 1.41% absolute (3.4% and 3.4% relative) on the two tasks, respectively. Model training time speedup ratios of up to 4.36x are obtained over continuous WavLM feature-based contextual Z-T systems, while retaining up to 98.0% of their WER reductions over non-contextual baselines.