The objective of speaker verification is to accept or reject the claim
that the input speech belongs to an enrolled speaker. Traditionally,
systems based on i-vectors or on speaker embeddings such as d-vectors,
which represent speaker information, have shown high performance when
paired with similarity metrics at the back end. Recently, end-to-end
systems have been proposed that build on earlier speaker-embedding
approaches and need no additional processing after extraction. Among the
various models, CNN-based end-to-end systems show state-of-the-art
performance. A CNN-based model is trained to classify multiple speakers,
and speaker embeddings are then extracted.
In this paper, we propose deep speaker embeddings based on shortcut
connections for an end-to-end speaker verification system. We construct
a modified ResNet-18 model in which the activation outputs of the
bottleneck architecture have shortcut connections to the speaker
embeddings. The deep speaker embeddings are extracted through joint
end-to-end training. The model was constructed without other
sophisticated methods such as length normalization or additive margin
softmax loss. When we tested the proposed model on VoxCeleb1, a data set
recorded under unconstrained conditions, it achieved an equal error rate
(EER) of 3.03% with high-dimensional deep speaker embeddings. This is
state-of-the-art performance for an end-to-end speaker verification
model on VoxCeleb1.
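
For readers unfamiliar with the metric, the following is a minimal sketch of how an EER can be computed from trial scores. It is not the evaluation code used in this work. The function names `cosine_similarity` and `equal_error_rate`, the use of cosine similarity as the trial score, and the toy scores are all assumptions made for illustration. The EER is the operating point at which the false acceptance rate equals the false rejection rate; on a finite trial list the two rarely match exactly, so the sketch takes the threshold where they are closest and averages them.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (assumed scoring rule)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER (as a fraction) by sweeping every observed score as a threshold.

    A trial is accepted when its score is >= the threshold.
    target_scores: scores of same-speaker trials.
    nontarget_scores: scores of different-speaker trials.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap = float("inf")
    eer = None
    for t in thresholds:
        # False acceptance rate: different-speaker trials that are accepted.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # False rejection rate: same-speaker trials that are rejected.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Toy scores (hypothetical, not taken from VoxCeleb1).
toy_target = [0.9, 0.8, 0.7, 0.4]
toy_nontarget = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(toy_target, toy_nontarget)
print(f"EER = {toy_eer:.2%}")
```

On a real trial list, each score would come from comparing the embedding of an enrollment utterance with that of a test utterance, and the sweep would usually be done with a sorted cumulative count rather than the quadratic loop used here for clarity.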