We present an extension to the SRILM toolkit for training maximum entropy language models with N-gram features. The extension uses a hierarchical parameter estimation procedure to make training time and memory consumption feasible for moderately large training data (hundreds of millions of words). Experiments on two speech recognition tasks indicate that models trained with our implementation perform as well as or better than N-gram models built with interpolated Kneser-Ney discounting.