A conventional sequence-to-sequence voice conversion (seq2seq VC) model, i.e., an attentional encoder-decoder, can be trained without the pre-alignment of speech sequences normally used to handle the different utterance lengths of the source and target speakers. However, when the alignments produced by attention are not monotonic, dropped and repeated speech segments occur and the linguistic content is not preserved. To address this issue, we propose VC-T, a novel streaming VC framework based on a neural transducer (RNN-T); RNN-T has proven effective in automatic speech recognition because it yields alignments that are robust against collapse. We also introduce an alignment design scheme for VC-T training. Experiments show that our offline and streaming VC-T variants outperform two modern seq2seq parallel VC baselines while achieving a lower character error rate thanks to the proposed robust alignment. Our VC-T also achieves better naturalness, avoiding the drastic degradation suffered by the conventional alternatives, especially in the streaming setting.