Among the various kinds of information conveyed by spoken utterances, linguistic information about the meaning the speaker intends to express and individuality information about the speaker are the most basic and important for human communication. The human brain stores models of both kinds of information, and people recognize them easily, clearly, and simultaneously. People have common-sense knowledge about the human voice; using it, they can capture the characteristics of a speaker's voice from an extremely short utterance and predict how that speaker would sound uttering new words or sentences. Using this skill, people can separate the voices of many speakers talking simultaneously or in sequence and understand the content of each utterance. Although much research has been conducted on technologies for recognizing the speaker of an utterance, for automatically adapting recognition models to speakers to improve speech recognition accuracy, and for separating and extracting multiple superimposed utterances, their performance remains far below human ability. It is important to clarify the principle of speaker embedding, by which people model and use the individuality of speech, and to incorporate it into speech and speaker recognition systems in a semi-supervised or self-supervised manner.