This paper presents a voice conversion (VC) method that uses the recently proposed recurrent temporal restricted Boltzmann machine (RTRBM), one for each speaker, with the goal of capturing high-order temporal dependencies in an acoustic sequence. Our algorithm starts by training two RTRBMs separately, one for the source speaker and one for the target speaker, each on speaker-dependent training data. Since each RTRBM attempts to discover abstractions at each time step, as well as the temporal dependencies in the training data, we expect the models to represent speaker-specific latent features in high-order spaces. In our approach, a neural network (NN) converts these features, which emphasize speaker-specific characteristics, from those of the source speaker to those of the target speaker, so that the entire network (the two RTRBMs and the NN) forms a deep recurrent neural network and can be fine-tuned. Through VC experiments, we confirmed that our method performs well in comparison to conventional VC methods such as Gaussian mixture model (GMM)-based approaches, especially in terms of objective criteria.
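
The conversion pass described above can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes mean-field recurrent hidden activations in the style of the standard RTRBM formulation, Gaussian visible units with unit variance, a single-layer NN between the two hidden spaces, and randomly initialized (untrained) weights. All class names, function names, and dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class RTRBM:
    """Hypothetical container for RTRBM parameters (random, untrained)."""

    def __init__(self, n_vis, n_hid, rng):
        self.W = 0.01 * rng.standard_normal((n_hid, n_vis))  # visible-to-hidden weights
        self.U = 0.01 * rng.standard_normal((n_hid, n_hid))  # hidden-to-hidden recurrence
        self.b_h = np.zeros(n_hid)
        self.b_v = np.zeros(n_vis)

    def encode(self, V):
        """Mean-field hidden activations for a sequence V of shape (T, n_vis)."""
        n_hid = self.W.shape[0]
        H = np.zeros((V.shape[0], n_hid))
        h_prev = np.zeros(n_hid)
        for t, v in enumerate(V):
            # Each hidden state depends on the current frame and the previous hidden state.
            h_prev = sigmoid(self.W @ v + self.U @ h_prev + self.b_h)
            H[t] = h_prev
        return H

    def decode(self, H):
        """Mean of the visible units given hidden activations (assumed unit variance)."""
        return H @ self.W + self.b_v


def convert(src, tgt, W_nn, b_nn, V_src):
    """Source frames -> source latent -> NN -> target latent -> target frames."""
    H_src = src.encode(V_src)
    H_tgt = sigmoid(H_src @ W_nn + b_nn)  # assumed single-layer NN between latent spaces
    return tgt.decode(H_tgt)


rng = np.random.default_rng(0)
src = RTRBM(n_vis=24, n_hid=48, rng=rng)   # source-speaker model
tgt = RTRBM(n_vis=24, n_hid=48, rng=rng)   # target-speaker model
W_nn = 0.01 * rng.standard_normal((48, 48))
b_nn = np.zeros(48)

V_src = rng.standard_normal((10, 24))      # 10 frames of placeholder acoustic features
V_converted = convert(src, tgt, W_nn, b_nn, V_src)
```

Because the three stages are composed as one differentiable forward pass, the stacked model can in principle be fine-tuned end to end, which is the property the abstract refers to. Training (for example, contrastive divergence for each RTRBM followed by backpropagation through time) is omitted here.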