We describe a multi-domain conversational test set developed for IBM's Superhuman speech recognition project, together with our 2002 benchmark system for this task. By using multi-pass decoding, unsupervised adaptation, and combination of hypotheses from systems with diverse feature sets and acoustic models, we achieve a word error rate of 32.0% on data drawn from voicemail messages, two-person conversations, and multiple-person meetings.
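
For readers unfamiliar with the metric, word error rate is conventionally defined as the number of word substitutions, deletions, and insertions in a minimum-cost alignment of the hypothesis to the reference transcript, divided by the number of reference words. The sketch below illustrates that standard definition only. It is not the scoring tool used for the reported result, it assumes simple whitespace tokenization with no text normalization, and the function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference length.

    Assumption: words are separated by whitespace and compared exactly;
    real scoring pipelines usually normalize the text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (word-level Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("me" -> "be") and one deletion ("tomorrow") over 5 words.
example = word_error_rate("please call me back tomorrow", "please call be back")
```

Under this definition, a word error rate of 32.0% corresponds to roughly one error for every three reference words. Because insertions count as errors, the value can exceed 100% when the hypothesis is much longer than the reference.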