Reinforcement learning (RL) is becoming a popular tool for building dialogue managers. This paper addresses two issues that arise when applying RL to this task. First, we propose two methods for detecting MDP violations; both rely on computing Q scores while the policy is being tested. Second, we investigate policy convergence. To do so, we use a dialogue task in which the only source of variability is the dialogue policy itself, which lets us observe how and when convergence occurs as training progresses. This work should help dialogue designers build effective policies and gauge how much training is necessary.
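As a rough illustration only (not the paper's exact procedure), the Python sketch below shows one way Q scores collected during test dialogues could be used to flag candidate MDP violations: the learned Q value for each visited state-action pair is compared against the empirical discounted return that actually followed it, and large systematic gaps suggest the state representation may not be Markovian. The function names, the tabular `q_table`, and the discount factor are hypothetical choices made for the example.

```python
# Hypothetical sketch of an MDP-violation check based on Q scores at test time.
# Assumes a tabular q_table dict with an entry for every visited (state, action).
from collections import defaultdict

GAMMA = 0.95  # assumed discount factor for the illustration


def empirical_returns(episode, gamma=GAMMA):
    """episode: list of (state, action, reward) tuples from one test dialogue.
    Returns a list of (state, action, discounted return from that step)."""
    out, g = [], 0.0
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        out.append((state, action, g))
    return list(reversed(out))


def q_discrepancies(episodes, q_table):
    """Average |Q(s, a) - observed return| per (state, action) over test dialogues.
    Large averages flag state-action pairs where the Markov assumption looks suspect."""
    gaps = defaultdict(list)
    for ep in episodes:
        for s, a, ret in empirical_returns(ep):
            gaps[(s, a)].append(abs(q_table[(s, a)] - ret))
    return {sa: sum(v) / len(v) for sa, v in gaps.items()}
```

In practice one would run the fixed policy over many test dialogues, rank state-action pairs by this discrepancy, and inspect the worst offenders as likely spots where the state representation drops information the environment actually depends on.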