We explore knowledge distillation from nonstreaming to streaming Transformer–Transducer (T–T) models. A streaming T–T truncates future context, which degrades recognition quality relative to the nonstreaming T–T. In this work, we explore knowledge distillation that minimizes the distance between the internal representations of the nonstreaming and streaming T–T models in all Transformer layers. In our experiments, we compared two methods: minimizing the L2 distance between hidden vectors and minimizing the L2 distance between heads. All experiments were conducted on the public LibriSpeech corpus. The results showed that knowledge distillation based on hidden vector similarity outperforms knowledge distillation based on multi-head similarity. We observed 3.5% and 2.1% relative reductions in word error rate compared with the original streaming T–T on the test-clean and test-other sets, respectively.
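To make the two distillation objectives concrete, the following is a minimal sketch of a layer-wise L2 distance between a nonstreaming (teacher) and a streaming (student) model. It is an illustration under stated assumptions, not the implementation used in this work: the function name `layerwise_l2_distance`, the tensor shapes, the averaging over frames, and the unweighted sum over layers are all hypothetical choices. A squared L2 distance would be an equally plausible variant.

```python
import numpy as np


def layerwise_l2_distance(teacher_layers, student_layers):
    """Sum over layers of the mean L2 distance between paired representations.

    Each list entry holds one layer's representation. The last axis is the
    feature axis; all leading axes (frames, heads, ...) are averaged over.
    Assumption: teacher and student have the same depth and shapes.
    """
    if len(teacher_layers) != len(student_layers):
        raise ValueError("teacher and student must have the same number of layers")

    total = 0.0
    for teacher, student in zip(teacher_layers, student_layers):
        teacher = np.asarray(teacher, dtype=np.float64)
        student = np.asarray(student, dtype=np.float64)
        if teacher.shape != student.shape:
            raise ValueError("paired representations must have the same shape")
        # L2 norm of the difference along the feature axis, then mean over the rest.
        total += float(np.mean(np.linalg.norm(teacher - student, axis=-1)))
    return total


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
num_layers, frames, d_model, num_heads = 2, 5, 8, 4
d_head = d_model // num_heads

# Hidden-vector variant: one (frames, d_model) array per layer.
teacher_hidden = [rng.normal(size=(frames, d_model)) for _ in range(num_layers)]
student_hidden = [rng.normal(size=(frames, d_model)) for _ in range(num_layers)]
hidden_loss = layerwise_l2_distance(teacher_hidden, student_hidden)

# Head variant: one (num_heads, frames, d_head) array per layer.
teacher_heads = [rng.normal(size=(num_heads, frames, d_head)) for _ in range(num_layers)]
student_heads = [rng.normal(size=(num_heads, frames, d_head)) for _ in range(num_layers)]
head_loss = layerwise_l2_distance(teacher_heads, student_heads)
```

In a training setup, a term like this would typically be added to the streaming model's transducer loss with a weighting factor, while the nonstreaming model's parameters stay fixed; both the weighting and the freezing are assumptions here rather than details stated above.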