We propose neural networks for predicting the response timing of spoken dialog systems. Response timing varies depending on the dialog context. Conventionally, this context-dependent response timing is estimated directly from acoustic event sequences and word sequences extracted from past utterances. Because these sequences vary widely, large amounts of training data are required to build reliable models, yet no large dialog database annotated with response timings is available. The proposed method estimates the dialog act of each utterance as an auxiliary task and uses its intermediate states, in addition to acoustic and linguistic features, for response timing estimation. Since dialog acts have significantly less variation than word sequences and are closely related to response timing, we expect to build a highly reliable model even with small training data. We evaluate our approach on the HarperValleyBank corpus. The experimental results show that the proposed approach is more effective than a conventional approach that does not use per-utterance dialog act information.
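As a rough illustration of the architecture described above, the sketch below shows one way such a multi-task model could be wired up in PyTorch: a shared encoder over acoustic and linguistic features feeds both an auxiliary dialog-act classifier and a response timing head that also consumes the dialog-act branch's intermediate state. The class name, layer sizes, fusion strategy, and loss weighting are illustrative assumptions, not the exact model described in the paper.

```python
# Minimal sketch (assumptions, not the authors' exact architecture):
# a shared encoder over acoustic + linguistic features, an auxiliary
# dialog-act head, and a timing head that reuses the dialog-act head's
# intermediate state as an additional feature.
import torch
import torch.nn as nn


class TimingWithDialogActModel(nn.Module):
    def __init__(self, acoustic_dim=40, linguistic_dim=300,
                 hidden_dim=128, num_dialog_acts=10):
        super().__init__()
        # Shared encoder over concatenated acoustic and linguistic features.
        self.encoder = nn.LSTM(acoustic_dim + linguistic_dim, hidden_dim,
                               batch_first=True)
        # Auxiliary dialog-act branch; its hidden layer doubles as a feature
        # for the timing branch.
        self.da_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.da_out = nn.Linear(hidden_dim, num_dialog_acts)
        # Timing branch sees the encoder state and the dialog-act state.
        self.timing_out = nn.Linear(hidden_dim * 2, 1)

    def forward(self, feats):
        # feats: (batch, time, acoustic_dim + linguistic_dim)
        enc, _ = self.encoder(feats)
        da_state = torch.relu(self.da_hidden(enc))   # intermediate DA state
        da_logits = self.da_out(da_state)             # auxiliary DA prediction
        timing_logit = self.timing_out(torch.cat([enc, da_state], dim=-1))
        return timing_logit.squeeze(-1), da_logits


def multitask_loss(timing_logit, timing_label, da_logits, da_label,
                   aux_weight=0.5):
    # Main task: per-frame binary decision of whether to respond now;
    # auxiliary task: dialog-act classification. aux_weight is assumed.
    timing_loss = nn.functional.binary_cross_entropy_with_logits(
        timing_logit, timing_label)
    da_loss = nn.functional.cross_entropy(
        da_logits.flatten(0, 1), da_label.flatten())
    return timing_loss + aux_weight * da_loss
```

Training both heads jointly lets the small amount of timing-annotated data benefit from the lower-variation dialog-act labels, which is the intuition behind the auxiliary task.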