Spoken language systems often rely on static speech recognizers. When the underlying models are dynamic, training is usually performed using unsupervised methods. In this work, we explore an alternative approach that uses human computation to provide on-the-fly crowd-supervised training. Although the framework we describe is applicable to any stochastic model for which the training data can be generated by nonexperts, we demonstrate its utility on the lexicon and language model of a speech recognizer in a cinema voice-search domain. We show how an initially shaky system can achieve over a 10% absolute improvement in word error rate (WER), entirely without expert intervention. We then analyze how these gains were made.