This paper considers applying speaker diarization (SD) to the output tokens of automatic speech recognition (ASR). We formulate the task as a sequence classification problem, in which we estimate the correct speaker label for each ASR output token from a sequence of token-level speaker embeddings and a set of candidate speaker profiles. To leverage information from the ASR model, we use the recently proposed t-vector for speaker embedding estimation. Whereas previous studies classified t-vectors using cosine similarities with ad hoc post-processing, we propose a sequence classification model that exploits the sequential nature of the task more effectively. To handle a variable number of speakers, we use a classification model inspired by transformer-based target-speaker voice activity detection. We conduct experiments on the AMI meeting corpus in both speaker identification and diarization settings and show the effectiveness of our approach.
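
For concreteness, the cosine-similarity classification attributed above to previous studies can be sketched as follows. This is a minimal illustration only, not the proposed sequence classification model and not the implementation of any prior work: the function names `cosine` and `classify_tokens`, the three-dimensional toy vectors, and the omission of any post-processing are all assumptions made for the example. Each token-level embedding is simply assigned to the candidate speaker profile with the highest cosine similarity, independently of its neighbors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_tokens(token_embeddings, speaker_profiles):
    """Assign each token embedding to the index of the most similar speaker profile.

    Hypothetical helper: every token is scored independently, so no
    sequential context and no post-processing is applied.
    """
    labels = []
    for emb in token_embeddings:
        scores = [cosine(emb, prof) for prof in speaker_profiles]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Toy data (assumed for illustration): two speaker profiles, three token embeddings.
profiles = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
tokens = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.7, 0.3, 0.0]]
labels = classify_tokens(tokens, profiles)
print(labels)
```

Because each token is labeled in isolation, a sketch of this kind has no way to use the labels of neighboring tokens, which is the limitation that motivates treating the task as sequence classification.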