In this study, we apply a combination of face and speaker identification techniques to the task of multi-modal (i.e., multi-biometric) user authentication for mobile or variable-environment applications. Audio-visual data was collected using a web camera connected to a laptop computer in three different environments: a quiet indoor office, a busy indoor cafe, and a noisy outdoor street intersection. Experiments demonstrated the benefits of a multi-modal approach, even when both input modalities suffer from difficult environmental conditions or a poor match between training and testing conditions. Across twelve different training and testing conditions, user authentication equal error rates were reduced by an average of 19% compared with the best individual biometric in each condition, and by 36% compared with an audio-only system.
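
To make the reported metric concrete, the following is a minimal sketch of how an equal error rate (EER) can be estimated from verification scores and how a percentage reduction between two EERs can be computed. It is illustrative only: the weighted-sum fusion rule, the function names (`equal_error_rate`, `fuse_scores`, `relative_reduction`), the reading of the reported percentages as relative reductions, and the toy scores are all assumptions introduced here, not the fusion method or data used in this study.

```python
import numpy as np


def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold. The EER is taken
    at the threshold where the false-accept and false-reject rates are closest.
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejects
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0


def fuse_scores(face, voice, w=0.5):
    """Weighted-sum score fusion (an assumed rule, chosen for simplicity)."""
    return w * np.asarray(face, dtype=float) + (1.0 - w) * np.asarray(voice, dtype=float)


def relative_reduction(baseline_eer, new_eer):
    """Fractional reduction of new_eer with respect to baseline_eer."""
    return (baseline_eer - new_eer) / baseline_eer


# Toy scores (made up for illustration): each modality fails on a different trial.
face_genuine, face_impostor = [0.9, 0.4], [0.6, 0.1]
voice_genuine, voice_impostor = [0.4, 0.9], [0.1, 0.6]

eer_face = equal_error_rate(face_genuine, face_impostor)
eer_voice = equal_error_rate(voice_genuine, voice_impostor)
eer_fused = equal_error_rate(
    fuse_scores(face_genuine, voice_genuine),
    fuse_scores(face_impostor, voice_impostor),
)
```

In this toy case the two modalities make complementary errors, so the fused scores separate genuine users from impostors better than either modality alone. Real data would generally show a smaller gain, and the fusion weight `w` would need to be tuned on held-out data.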