Although methods based on large language models (LLMs) have achieved remarkable progress in zero-shot text-to-speech (TTS) synthesis, they suffer from robustness issues such as mispronunciation, word skipping, and word repetition. To address these challenges, we propose incorporating phoneme position prediction into the LLM-based TTS model. Concretely, conditioned on an input phoneme sequence, our model autoregressively predicts acoustic codes and, synchronously, the positions of their corresponding phonemes within the input sequence. This mechanism enforces an accurate and complete alignment between acoustic codes and phonemes. Experimental results demonstrate that our system significantly reduces phoneme skipping and repetition errors compared with strong baselines, achieving a 52.7% relative reduction in character error rate while maintaining comparable performance on zero-shot TTS evaluations.
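
To make the joint prediction concrete, the sketch below shows one plausible way to realize it: a causal decoder over acoustic codes with two output heads, one over the code vocabulary and one pointer-style head over the encoded phoneme sequence. This is a minimal illustration under our own assumptions, not the paper's architecture; all module names, layer sizes, and the attention-based position head are hypothetical.

```python
# Minimal sketch (assumed design, not the authors' implementation): a decoder
# step that predicts the next acoustic code and the phoneme position it
# realizes, from a shared hidden state. Names and hyperparameters are made up.
import torch
import torch.nn as nn

class JointCodePositionDecoder(nn.Module):
    def __init__(self, d_model=512, n_codes=1024, n_layers=6, n_heads=8):
        super().__init__()
        self.code_emb = nn.Embedding(n_codes, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.code_head = nn.Linear(d_model, n_codes)   # next acoustic code
        self.pos_query = nn.Linear(d_model, d_model)   # pointer over phonemes

    def forward(self, codes, phoneme_memory):
        # codes: (B, T) previously generated acoustic code tokens
        # phoneme_memory: (B, L, d_model) encoded input phoneme sequence
        x = self.code_emb(codes)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.decoder(x, phoneme_memory, tgt_mask=causal)
        code_logits = self.code_head(h)                # (B, T, n_codes)
        # Position head: attention-style scores over the phoneme sequence, so
        # each step also predicts which input phoneme is currently being spoken.
        pos_logits = self.pos_query(h) @ phoneme_memory.transpose(1, 2)  # (B, T, L)
        return code_logits, pos_logits
```

In a setup like this, the position head would be trained with a cross-entropy loss against aligned phoneme indices, and at inference the predicted position could be constrained to advance monotonically through the input sequence, which is one way the explicit alignment could suppress skipped or repeated phonemes.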