The work presented in this article falls within text-dependent speaker recognition. In our framework, each speaker owns and pronounces a secret phrase. Recognition therefore involves two tasks: verification of the spoken text (Text Validation) and verification of the speaker's identity (Speaker Verification, SV). These tasks are usually carried out in tandem by two different systems. Maintaining two systems adds complexity and can reduce reliability. In this paper, we propose to use a Self-Supervised Learning (SSL) model to develop a unified system capable of performing both tasks simultaneously. The proposed approach builds two models on a common SSL model and uses a teacher-student paradigm to integrate textual constraints into the SV part, without requiring lexical labels during its training phase. Evaluation on different datasets demonstrates the effectiveness of the approach.