Robustness is crucial for automatic speech recognition systems deployed in real-world environments. Speech enhancement/separation algorithms are commonly applied to noisy speech before recognition. However, such algorithms typically introduce distortions that the acoustic model has not seen during training. In this study, we propose a novel joint training approach to alleviate this distortion problem. At the training stage, we first concatenate a speech separation DNN, a filterbank, and an acoustic model DNN to form a single deeper network, and then train all of them jointly. In this way, the separation frontend and the filterbank learn to provide the kind of enhanced speech the acoustic model expects, while the linguistic information encoded in the acoustic model can in turn benefit the frontend and the filterbank. Besides the commonly used log mel-spectrogram features, we also add more robust features for acoustic modeling. Our system obtains a 14.1% average word error rate on the noisy and reverberant CHiME-2 corpus (track 2), outperforming the previous best result by 8.4% relative.
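To make the concatenation idea concrete, the following is a minimal PyTorch sketch of the three-stage pipeline trained end-to-end. All module names, layer sizes, feature dimensions, and the random stand-in data are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch: a separation DNN, a learnable filterbank, and an
# acoustic-model DNN concatenated into one network and trained jointly
# with the ASR loss. Sizes and names are assumptions for illustration.
import torch
import torch.nn as nn

class SeparationDNN(nn.Module):
    """Estimates a T-F mask applied to the noisy magnitude spectrogram."""
    def __init__(self, n_fft_bins=257, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_fft_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_fft_bins), nn.Sigmoid(),  # mask in [0, 1]
        )

    def forward(self, noisy_mag):                  # (batch, frames, bins)
        return self.net(noisy_mag) * noisy_mag     # masked spectrogram

class LearnableFilterbank(nn.Module):
    """Mel-like filterbank as a trainable linear projection."""
    def __init__(self, n_fft_bins=257, n_filters=40):
        super().__init__()
        self.proj = nn.Linear(n_fft_bins, n_filters, bias=False)

    def forward(self, mag):
        return torch.log(self.proj(mag).clamp(min=1e-6))  # log compression

class AcousticDNN(nn.Module):
    """Frame-level senone classifier on top of the filterbank features."""
    def __init__(self, n_filters=40, hidden=2048, n_senones=2000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_filters, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_senones),
        )

    def forward(self, feats):
        return self.net(feats)                     # senone logits

# Concatenate all three stages so gradients from the ASR loss reach
# the separation frontend and the filterbank.
model = nn.Sequential(SeparationDNN(), LearnableFilterbank(), AcousticDNN())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One joint training step on random stand-in data; real training would
# use noisy spectrograms paired with frame-level senone alignments.
noisy = torch.rand(8, 100, 257)                    # (batch, frames, bins)
labels = torch.randint(0, 2000, (8, 100))          # senone targets
logits = model(noisy)
loss = criterion(logits.reshape(-1, 2000), labels.reshape(-1))
optimizer.zero_grad()
loss.backward()                                    # gradients flow end-to-end
optimizer.step()
```

Because the three stages form a single computation graph, the cross-entropy gradient from the acoustic model propagates back through the filterbank into the separation frontend, which is what allows the frontend to produce the enhanced speech the recognizer expects rather than an enhancement optimized in isolation.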