Input features that capture speech dynamics have frequently been proposed to improve recognition accuracy. A broad class of such features can be obtained by applying a linear projection to a window spanning successive feature vectors. When the projection is optimized according to a maximum likelihood criterion, it can be compared directly with conventional modeling schemes. On a large acoustic training database of conversational telephone speech, phoneme errors were reduced by 5.5% and word errors by 6% using maximum likelihood temporal features. Training on smaller databases led to undertraining, and no significant improvements in error rates were observed.
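
As an illustration of the feature class described above, the following minimal sketch stacks a window of successive feature vectors around each frame and applies a single linear projection to the result. The function names `stack_frames` and `project`, the 13-dimensional input, the context of 4 frames, and the 39-dimensional output are assumptions chosen for the example and are not taken from the experiments. The projection matrix is filled with random values as a placeholder; the maximum likelihood estimation of the projection is not shown.

```python
import numpy as np


def stack_frames(features, context):
    """Concatenate each frame with `context` neighbours on either side.

    features: array of shape (num_frames, dim)
    returns:  array of shape (num_frames, (2 * context + 1) * dim)
    """
    num_frames, _ = features.shape
    # Repeat the first and last frames so that every frame has a full window.
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    # Slice `offset` holds, for each frame t, the frame at t + offset - context.
    windows = [padded[offset:offset + num_frames]
               for offset in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)


def project(stacked, projection):
    """Apply a linear projection (out_dim x window_dim) to every stacked frame."""
    return stacked @ projection.T


# Hypothetical sizes, for illustration only.
rng = np.random.default_rng(0)
features = rng.standard_normal((100, 13))   # 100 frames of 13-dim features
context = 4                                 # window of 2 * 4 + 1 = 9 frames

stacked = stack_frames(features, context)   # shape (100, 117)

# Placeholder projection: a trained system would estimate this matrix
# under a maximum likelihood criterion instead of drawing it at random.
projection = rng.standard_normal((39, stacked.shape[1]))

temporal_features = project(stacked, projection)   # shape (100, 39)
```

Under these assumptions, each row of `temporal_features` is a linear function of nine consecutive input frames, so the choice of projection matrix determines which aspects of the speech dynamics the features capture.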