This paper proposes a method for extracting prosodic features from a speech signal by leveraging auxiliary linguistic information. A prosodic feature extractor called statistical phrase/accent command estimation (SPACE) has recently been proposed. It is based on a statistical model formulated as a stochastic counterpart of the Fujisaki model, a well-founded mathematical model of the control mechanism of vocal fold vibration. The key idea of SPACE is to model a sequence of phrase/accent command pairs as the output sequence of a path-restricted hidden Markov model (HMM), so that estimating the state transitions amounts to estimating the phrase/accent commands. Since the phrase and accent commands are related to linguistic information, we can expect to improve the command estimation accuracy by using that linguistic information as auxiliary input for the inference. To model the relationship between the phrase/accent commands and linguistic information, we construct a deep neural network (DNN) that maps linguistic feature vectors to the state posterior probabilities of the HMM. Thus, given a pitch contour and linguistic information, we can estimate the phrase/accent commands via state decoding. We call this method “DNN-SPACE.” Experimental results showed that using linguistic information was effective in improving the command estimation accuracy.
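
To make the idea of state decoding over a path-restricted HMM concrete, the following is a minimal sketch of Viterbi decoding in which per-frame state posteriors are used as frame scores. It is an illustration under stated assumptions, not the DNN-SPACE implementation: the three-state topology, the transition probabilities, and the posterior values are all hypothetical, the posteriors stand in for the output of a trained DNN, and the sketch omits the pitch-contour likelihood that the full method would also take into account.

```python
import numpy as np


def viterbi_decode(log_scores, log_trans, log_init):
    """Return the highest-scoring state path.

    log_scores: (T, K) per-frame log scores for each state
    log_trans:  (K, K) log transition probabilities; -inf forbids a transition
    log_init:   (K,)   log initial-state probabilities
    """
    T, K = log_scores.shape
    delta = np.empty((T, K))
    backptr = np.zeros((T, K), dtype=int)
    delta[0] = log_init + log_scores[0]
    for t in range(1, T):
        # cand[i, j]: best score of ending in state i at t-1, then moving to j
        cand = delta[t - 1][:, None] + log_trans
        backptr[t] = np.argmax(cand, axis=0)
        delta[t] = cand[backptr[t], np.arange(K)] + log_scores[t]
    # Trace the best path backwards from the best final state
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path


# Hypothetical toy topology (NOT the topology used in the paper):
# 0 = no command, 1 = phrase command (one frame), 2 = accent command.
trans = np.array([
    [0.8, 0.1, 0.1],  # no command -> any state
    [1.0, 0.0, 0.0],  # phrase command must return to "no command"
    [0.3, 0.0, 0.7],  # accent command may continue or end
])
init = np.array([1.0, 0.0, 0.0])

# Hypothetical per-frame state posteriors, standing in for DNN outputs
posteriors = np.array([
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.30, 0.60, 0.10],
    [0.20, 0.10, 0.70],
    [0.10, 0.10, 0.80],
    [0.80, 0.10, 0.10],
])

with np.errstate(divide="ignore"):  # log(0) = -inf is intended here
    log_trans = np.log(trans)
    log_init = np.log(init)

path = viterbi_decode(np.log(posteriors), log_trans, log_init)
print(path)
```

Because forbidden transitions carry a log probability of negative infinity, the decoded path can never use them, whatever the frame scores suggest. This is the sense in which the path restriction constrains the estimated command sequence.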