Auto-regressive speech generation models have recently demonstrated an exceptional ability to produce realistic and contextually appropriate spoken language, and concerns about their potential misuse have grown accordingly. To mitigate these risks, we introduce a pioneering statistical watermarking framework tailored to auto-regressive speech generation models. The framework integrates the statistical algorithms of existing watermarking techniques so that audio outputs remain traceable and accountable without compromising audio quality. We also identify the re-encoded mismatch, a significant obstacle to maintaining detection accuracy that arises when audio outputs are re-encoded for verification. Through comprehensive experiments, we validate the detectability of our watermark and examine in detail how the re-encoded mismatch affects watermark detection efficiency.