Current tone classification studies mainly focus on training classifiers with intrinsic features of isolated segments, typically syllables. Most of these works do not rely on fundamental frequency (f0) alone but exploit additional information such as spectrograms, MFCCs, or energy to improve model accuracy. However, the greater challenge in tone classification lies in modeling the complex f0 variations caused by tonal coarticulation and the interactions among tones in continuous speech. To tackle this issue, we first restrict the input to the sequence of f0 samples in an utterance. We then propose a transformer-based network with an extendable BERT-style input architecture and a joint learning technique that consolidates the contour representations of consecutive tones. By further leveraging information related to speech rhythm in the utterance, our experiments show that the proposed J-ToneNet is highly robust for read speech.
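To make the described setup concrete, the sketch below shows one plausible instantiation in PyTorch: a transformer encoder over per-syllable f0 contours with a BERT-style [CLS] token and a joint head over consecutive tone pairs. All names, dimensions, and the adjacent-pair scheme (`JointToneClassifier`, `f0_len`, `pair_head`) are illustrative assumptions for exposition, not the actual J-ToneNet configuration.

```python
# Minimal sketch (assumed architecture, not the paper's exact model):
# a transformer encoder classifies tones from raw f0 samples of
# consecutive syllables, with a joint head over adjacent tone pairs.
import torch
import torch.nn as nn


class JointToneClassifier(nn.Module):
    def __init__(self, n_tones=5, f0_len=32, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        # Project each syllable's fixed-length f0 contour to the model dimension.
        self.f0_proj = nn.Linear(f0_len, d_model)
        # BERT-style learned [CLS] token and positional embeddings.
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos = nn.Embedding(128, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Per-syllable tone head plus a joint head over adjacent syllable pairs,
        # standing in for the joint learning over consecutive tones.
        self.tone_head = nn.Linear(d_model, n_tones)
        self.pair_head = nn.Linear(2 * d_model, n_tones * n_tones)

    def forward(self, f0):                       # f0: (batch, syllables, f0_len)
        x = self.f0_proj(f0)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1)
        x = x + self.pos(torch.arange(x.size(1), device=x.device))
        h = self.encoder(x)[:, 1:]               # drop [CLS], keep syllable states
        tone_logits = self.tone_head(h)          # (batch, syllables, n_tones)
        pairs = torch.cat([h[:, :-1], h[:, 1:]], dim=-1)
        pair_logits = self.pair_head(pairs)      # joint logits for adjacent tones
        return tone_logits, pair_logits


# Usage: a batch of 2 utterances, 6 syllables each, 32 f0 samples per syllable.
model = JointToneClassifier()
tone_logits, pair_logits = model(torch.randn(2, 6, 32))
```

Training such a model jointly, i.e. summing a cross-entropy loss on `tone_logits` and one on `pair_logits`, is one way the contour representations of consecutive tones could be consolidated as the abstract describes.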