An Auto-Regressive eXogenous (ARX) model combined with descriptive models of the glottal source waveform has been adopted to separate the vocal tract and the voicing source more accurately. However, such methods cannot easily be applied to voices produced by other speech production mechanisms, such as esophageal voice. We previously proposed the Voicing Source Hidden Markov Model (VS-HMM) and an accompanying parameter estimation method. The states of the VS-HMM are concatenated in a ring topology to represent the periodicity of the glottal source. We refer to the model that combines the VS-HMM with an Auto-Regressive (AR) filter as the AR-HMM. In this paper, we extend the conventional AR-HMM, which has a fixed ring topology, so that the optimum topology for the VS-HMM is generated automatically by the Minimum Description Length-based Successive State Splitting (MDL-SSS) algorithm. The aim is to estimate the vocal tract and the voicing source simultaneously and accurately from a voice excited by an unknown, aperiodic voicing source, such as an esophageal voice. Experimental results using synthesized pseudo-esophageal voices confirmed that the proposed AR-HMM approach separates the vocal tract characteristics and the voicing source more accurately than either the conventional AR-HMM with its fixed ring topology or the linear prediction (LP) method.
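
To make the fixed ring topology concrete, the following is a minimal illustrative sketch, not the estimation method of this paper. It assumes that each VS-HMM state emits a Gaussian excitation sample, that each state either stays or advances to the next state on the ring, and that the AR filter follows the convention x[n] = sum_k a_k x[n-k] + e[n]. The function names, the `stay_prob` parameter, and all numerical values are hypothetical.

```python
import numpy as np


def ring_transition_matrix(n_states, stay_prob=0.5):
    """Transition matrix in which state i stays or moves to state (i + 1) mod n_states."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        A[i, i] += stay_prob
        A[i, (i + 1) % n_states] += 1.0 - stay_prob
    return A


def synthesize(A, means, stds, ar_coeffs, n_samples, seed=0):
    """Drive an all-pole AR filter with an excitation emitted by the ring HMM.

    Assumed convention: x[n] = sum_k ar_coeffs[k-1] * x[n-k] + e[n],
    where e[n] is drawn from the Gaussian of the current state.
    """
    rng = np.random.default_rng(seed)
    n_states = A.shape[0]
    p = len(ar_coeffs)
    x = np.zeros(n_samples + p)  # p leading zeros serve as initial conditions
    state = 0
    for n in range(n_samples):
        e = rng.normal(means[state], stds[state])
        past = x[n:n + p][::-1]  # most recent sample first
        x[n + p] = np.dot(ar_coeffs, past) + e
        state = int(rng.choice(n_states, p=A[state]))
    return x[p:]


# Hypothetical example: a 4-state ring with one pulse-like state.
A = ring_transition_matrix(4, stay_prob=0.7)
signal = synthesize(A, means=[1.0, 0.0, 0.0, 0.0], stds=[0.1, 0.05, 0.05, 0.05],
                    ar_coeffs=[0.5, -0.2], n_samples=200)
```

In this sketch the topology is fixed in advance by `ring_transition_matrix`; the extension described above would instead let MDL-SSS determine the number of states and their connections from the data.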