We have developed a sequential output model with speaker vector constraints for the joint modeling of multi-talker automatic speech recognition (ASR) and speaker diarization. The conventional approach to this joint modeling, called SOMSRED, enables the estimation of speaker embeddings from fully overlapped speech by discretizing the speaker embedding space and treating the speaker embeddings as tokens. However, because of this discretization, the predicted speaker embeddings are less distinctive than those obtained directly from non-overlapping speech. To address this problem, we add a training objective that optimizes the speaker embeddings in continuous space, without discretization. Experimental results show that the proposed method avoids overfitting to the discretized speaker tokens and outperforms SOMSRED in both ASR and speaker embedding performance.
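To make the added objective concrete, the sketch below combines the usual cross-entropy loss over discretized speaker tokens with an auxiliary loss computed directly on the continuous embeddings. This is a minimal PyTorch sketch under our own assumptions: the function and tensor names are hypothetical, and the cosine-similarity term is one plausible choice of continuous objective, not necessarily the one used in the paper.

```python
# Hypothetical sketch: discrete speaker-token loss plus an auxiliary
# continuous-space embedding loss (cosine similarity is an assumption here).
import torch
import torch.nn.functional as F

def joint_speaker_loss(token_logits, token_targets, pred_embeddings,
                       ref_embeddings, alpha=1.0):
    """Combine the discrete speaker-token loss with a continuous embedding loss.

    token_logits:    (batch, num_speaker_tokens) scores over the token codebook
    token_targets:   (batch,) indices of the quantized speaker tokens
    pred_embeddings: (batch, dim) continuous speaker embeddings from the model
    ref_embeddings:  (batch, dim) reference embeddings, e.g. extracted by a
                     speaker encoder from non-overlapping speech
    """
    # Conventional objective: predict the discretized speaker token.
    ce = F.cross_entropy(token_logits, token_targets)
    # Added objective in continuous space: pull predicted embeddings toward
    # the reference embeddings without any discretization step.
    cos = 1.0 - F.cosine_similarity(pred_embeddings, ref_embeddings, dim=-1).mean()
    return ce + alpha * cos

# Toy usage with random tensors.
B, V, D = 4, 256, 192
loss = joint_speaker_loss(torch.randn(B, V), torch.randint(0, V, (B,)),
                          torch.randn(B, D), torch.randn(B, D))
```

The weight `alpha` balancing the two terms is likewise an illustrative hyperparameter.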