Mono-to-stereo audio generation converts single-channel (mono) audio into two-channel stereo audio, which plays a crucial role in enhancing spatial perception and auditory immersion. Existing mono-to-stereo methods include rule-based, simulation-based, and deep learning-based approaches; these methods require expert knowledge or explicit positional information to produce specific stereo effects, which limits their scalability and generalization. To address these challenges, we propose DiffStereo, an end-to-end diffusion transformer-based model that generates stereo audio conditioned on mono audio. The contributions of DiffStereo are twofold. First, DiffStereo synthesizes stereo audio directly from a mono waveform input in an end-to-end fashion, requiring no human intervention or prior knowledge. Second, DiffStereo achieves objective ratings competitive with existing methods and consistently better subjective ratings, validating the effectiveness of our end-to-end approach.