Vocal tract resonances play a central role in the perception and analysis of speech. Here we consider the canonical task of estimating such resonances from an observed acoustic waveform, and formulate it as a statistical model-based tracking problem. In this vein, Deng and colleagues recently showed that a robust linearization of the formant-to-cepstrum map enables the effective use of a Kalman filtering framework. We extend this model both to account for the uncertainty of speech presence by way of a censored likelihood formulation, as well as to explicitly model formant cross-correlation via a vector autoregression, and in doing so retain a conditionally linear and Gaussian framework amenable to efficient estimation schemes. We provide evaluations using a recently introduced public database of formant trajectories, for which results indicate improvements from twenty to over 30% per formant in terms of root mean square error, relative to a contemporary benchmark formant analysis tool.