Connectionist temporal classification (CTC) has been successfully used in speech recognition. It learns the alignments between speech frames and label sequences automatically, without requiring pre-generated frame-level labels. While this property shortens the training pipeline, it can be a disadvantage for frame-level system combination, because the learned alignments may be inaccurate. In this paper, a novel Dynamic Time Warping (DTW) based position calibration algorithm is proposed for joint decoding with two CTC-based acoustic models. Furthermore, joint decoding of CTC and conventional hybrid NN-HMM models is also explored. Experiments on a large-vocabulary Mandarin speech recognition task show that the proposed joint decoding achieves a significant and consistent character error rate reduction, both when combining two CTC-based models and when combining CTC with hybrid models.