Accurate voiced/unvoiced information is crucial for estimating the pitch of a target speech signal in severe nonstationary noise environments. Nevertheless, state-of-the-art pitch estimators based on deep neural networks (DNNs) lack a dedicated mechanism for robustly detecting voiced and unvoiced segments of the target speech in noisy conditions. In this work, we propose an end-to-end deep learning-based pitch estimation framework that jointly detects voiced/unvoiced segments and predicts pitch values for the voiced regions of the ground-truth speech. We show empirically that the proposed framework is significantly more robust than state-of-the-art DNN-based pitch detectors in nonstationary noise settings. Our results suggest that jointly training voiced/unvoiced detection and voiced pitch prediction can significantly improve pitch estimation performance.
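
To make the idea of joint training concrete, the sketch below shows one plausible form of a combined objective: a binary cross-entropy term for voiced/unvoiced detection over all frames, plus a pitch error term computed only on frames labeled voiced. This is an illustrative assumption, not the loss used in this work; the function name `joint_vuv_pitch_loss`, the octave-scale pitch error, and the `pitch_weight` parameter are all hypothetical choices made for the example.

```python
import math


def joint_vuv_pitch_loss(vuv_prob, pitch_pred_hz, vuv_true, pitch_true_hz,
                         pitch_weight=1.0):
    """Hypothetical joint loss: voicing BCE + masked pitch error.

    vuv_prob      : predicted probability that each frame is voiced
    pitch_pred_hz : predicted pitch per frame in Hz (must be > 0 on voiced frames)
    vuv_true      : ground-truth voicing labels (1 = voiced, 0 = unvoiced)
    pitch_true_hz : ground-truth pitch in Hz (ignored on unvoiced frames)
    pitch_weight  : assumed scalar balancing the two terms
    """
    eps = 1e-7  # clamp probabilities to avoid log(0)

    # Voicing term: binary cross-entropy averaged over all frames.
    bce = 0.0
    for p, y in zip(vuv_prob, vuv_true):
        p = min(max(p, eps), 1.0 - eps)
        bce -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    bce /= len(vuv_true)

    # Pitch term: mean absolute error in octaves, voiced frames only.
    voiced = [i for i, y in enumerate(vuv_true) if y == 1]
    if voiced:
        pitch_err = sum(
            abs(math.log2(pitch_pred_hz[i] / pitch_true_hz[i])) for i in voiced
        ) / len(voiced)
    else:
        pitch_err = 0.0

    return bce + pitch_weight * pitch_err
```

Because the pitch term is masked by the ground-truth voicing labels, whatever the model outputs as pitch on unvoiced frames does not contribute to the loss, while the voicing term is still trained on every frame.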