Spontaneous speech often contains disfluent acoustic features such as fillers and hesitations, which are major causes of errors during automatic speech recognition (ASR). In this paper, we propose a method called "disfluency labeling" to address this problem. Our proposed method replaces disfluent phenomena in the transcriptions of the speech data used for training with two types of labels, filler (#) and hesitation (@), and trains an end-to-end ASR model on this data, which makes it possible to recognize disfluent acoustic phenomena as recognition targets, just like characters. In addition, by removing the disfluency labels included in the recognition results, the words that the speaker actually intended to say can be extracted from the disfluent speech. The results of our evaluation experiments show that both the character and sentence error rates were reduced on all of the ASR test sets when disfluency labeling was applied, compared to the baseline method. The proposed method also outperformed other methods intended to reduce disfluency-related errors, even when more disfluent, spontaneous dialog speech was used. This study shows that explicit learning of two disfluent features, fillers and hesitations, is effective for spontaneous speech recognition.
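The labeling scheme described above can be illustrated with a minimal sketch. The paper specifies only the two label characters, filler (#) and hesitation (@); the word lists and function names below are illustrative assumptions, not the authors' actual preprocessing pipeline.

```python
# Label characters as described in the paper.
FILLER_LABEL = "#"
HESITATION_LABEL = "@"

# Illustrative inventories of disfluent tokens (assumed for this sketch;
# the paper's actual annotation criteria are not reproduced here).
FILLERS = {"uh", "um", "er"}
HESITATIONS = {"th-", "wha-"}

def label_transcript(tokens):
    """Replace disfluent tokens in a training transcript with labels,
    so the end-to-end model learns them as explicit recognition targets."""
    out = []
    for tok in tokens:
        if tok in FILLERS:
            out.append(FILLER_LABEL)
        elif tok in HESITATIONS:
            out.append(HESITATION_LABEL)
        else:
            out.append(tok)
    return out

def remove_labels(hypothesis):
    """Strip disfluency labels from an ASR hypothesis to recover
    the words the speaker actually intended to say."""
    return [t for t in hypothesis if t not in (FILLER_LABEL, HESITATION_LABEL)]

tokens = "uh i want th- that one".split()
labeled = label_transcript(tokens)   # ['#', 'i', 'want', '@', 'that', 'one']
cleaned = remove_labels(labeled)     # ['i', 'want', 'that', 'one']
```

Training on the labeled transcripts and then discarding the labels at inference time is what lets the model account for disfluent acoustics without letting them corrupt the final text.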