To address tone classification in conversational speech, we propose a transformer-based encoder network that classifies the tones of an utterance syllable by syllable. Using only F0 and rhythmic information, the interaction encoder first consolidates contour representations. By jointly predicting word tones with the aid of perceived judgments of reduction degree, the architecture improves automatic recognition of the underlying syllable tones. With these enhancements, the proposed model proves robust in our experiments and achieves a 12% increase in tone classification accuracy.
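To make the architecture concrete, the sketch below illustrates one possible realization of the described setup: a transformer encoder over per-syllable F0 and rhythmic features with two output heads, one for the underlying syllable tone and one for the perceived reduction degree, trained jointly. This is not the authors' implementation; all layer sizes, feature dimensions, and class counts are illustrative assumptions.

```python
# Minimal sketch (assumed hyperparameters, not the paper's exact model):
# a transformer encoder over per-syllable F0/rhythmic features with joint
# tone and reduction-degree prediction heads.
import torch
import torch.nn as nn


class ToneTransformer(nn.Module):
    def __init__(self, feat_dim=4, d_model=128, n_heads=4, n_layers=4,
                 n_tones=5, n_reduction_levels=3):
        super().__init__()
        # Project per-syllable F0/duration features into the model dimension.
        self.input_proj = nn.Linear(feat_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        # Contextual encoder over the syllable sequence of an utterance.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Joint prediction heads: underlying tone and reduction degree.
        self.tone_head = nn.Linear(d_model, n_tones)
        self.reduction_head = nn.Linear(d_model, n_reduction_levels)

    def forward(self, feats, pad_mask=None):
        # feats: (batch, n_syllables, feat_dim); pad_mask: (batch, n_syllables)
        h = self.encoder(self.input_proj(feats), src_key_padding_mask=pad_mask)
        return self.tone_head(h), self.reduction_head(h)


if __name__ == "__main__":
    model = ToneTransformer()
    feats = torch.randn(2, 10, 4)            # 2 utterances, 10 syllables each
    tone_logits, red_logits = model(feats)
    tone_targets = torch.randint(0, 5, (2, 10))
    red_targets = torch.randint(0, 3, (2, 10))
    ce = nn.CrossEntropyLoss()
    # Multi-task objective: tone classification plus reduction-degree prediction.
    loss = ce(tone_logits.reshape(-1, 5), tone_targets.reshape(-1)) + \
           ce(red_logits.reshape(-1, 3), red_targets.reshape(-1))
    loss.backward()
```

In this sketch the reduction-degree head acts as an auxiliary task whose supervision comes from perceived judgments, so its gradients shape the shared contour representations used by the tone head.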