On-device end-to-end (E2E) models are required to handle long-tail vocabulary and a large number of acoustic conditions. With a finite amount of training data, some of these conditions and vocabulary words remain unseen during training, which often leads to recognition errors. Text-based contextual biasing is intended to mitigate this problem, yet it works well only when sufficient textual context is provided and when the speech signal is well modeled by the ASR system. In this work, we propose to extend biasing to operate directly in the audio domain. We address a scenario where audio samples and the associated transcriptions are available, as is the case with manually corrected voice typing. We propose to directly compare incoming audio embeddings against a list of Audio Exemplars (AE), each associated with a text correction. We demonstrate the effectiveness of our approach by correcting the outputs of a production-quality RNNT model, which yields relative WER reductions of 21.7% (one-shot) and 33.7% (multi-shot) on the Wiki-Names data set.
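
To make the core idea concrete, the sketch below illustrates one simple way to match an incoming audio embedding against a list of audio exemplars and return the associated text correction. It is not the paper's implementation; the class, function, and threshold (`AudioExemplar`, `match_exemplar`, `threshold=0.8`) are illustrative assumptions, and the similarity measure is plain cosine similarity.

```python
# Minimal sketch, assuming fixed-size audio embeddings and a cosine-similarity match;
# all names and the threshold value are hypothetical, not the paper's method.
from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class AudioExemplar:
    embedding: np.ndarray  # fixed-size embedding of the exemplar utterance
    correction: str        # text correction associated with this exemplar


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def match_exemplar(
    incoming_embedding: np.ndarray,
    exemplars: List[AudioExemplar],
    threshold: float = 0.8,  # assumed similarity threshold, tuned on held-out data
) -> Optional[str]:
    """Return the correction of the closest exemplar if it is similar enough."""
    best_score, best_correction = -1.0, None
    for ex in exemplars:
        score = cosine_similarity(incoming_embedding, ex.embedding)
        if score > best_score:
            best_score, best_correction = score, ex.correction
    return best_correction if best_score >= threshold else None


if __name__ == "__main__":
    # Toy example: a near-duplicate of a stored exemplar triggers its correction,
    # which could then replace the corresponding span of the ASR hypothesis.
    rng = np.random.default_rng(0)
    exemplars = [AudioExemplar(rng.standard_normal(128), "Siobhan")]
    query = exemplars[0].embedding + 0.01 * rng.standard_normal(128)
    print(match_exemplar(query, exemplars))  # -> "Siobhan"
```

In practice, the retrieval step would operate on embeddings produced by the same encoder used for recognition, and the accepted correction would be spliced into the RNNT hypothesis rather than printed as shown here.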