The use of technological devices and of face and speaker biometric recognition systems is becoming increasingly common in people's daily lives, which has motivated a great deal of research into effective and robust systems. However, although face and voice recognition are mature technologies, some challenges still require further research when Deep Neural Networks (DNNs) are employed in these systems. In this manuscript, we present an overview of the main findings of Victoria Mingote’s thesis, in which different approaches to address these issues are proposed. The advances presented follow two streams of research. First, in the representation learning part, we propose several approaches to obtain robust representations of the signals for text-dependent speaker verification systems. Second, in the metric learning part, we introduce new loss functions that train DNNs to directly optimize the goal task for text-dependent speaker, language, and face verification, as well as multimodal diarization.