To apply scene-aware interaction technology to real-time dialog systems, we propose an online low-latency response generation framework for scene-aware interaction in a video question answering setting. This paper extends our prior work on low-latency video captioning into a novel approach that optimizes the timing of answer generation under a trade-off between generation latency and answer quality. For video QA, the timing detector is responsible for detecting when the question-relevant event has been observed, rather than determining when the system has seen enough to generate a general caption as in the video captioning case. Our audio-visual scene-aware dialog system built for the 10th Dialog System Technology Challenge was extended with this low-latency function. Experiments with the MSRVTT-QA and AVSD datasets show that our approach achieves between 97% and 99% of the answer quality of the upper bound given by a pre-trained Transformer that uses the entire video clip, while using less than 40% of the frames from the beginning of the clip.
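As a rough illustration of the online decision process sketched above, the following is a minimal Python sketch, not the actual implementation: it assumes hypothetical `timing_detector` and `answer_generator` modules standing in for the trained models, and a `threshold` parameter that trades latency against answer quality.

```python
# Illustrative sketch of an online answer-timing loop (not the paper's code).
# `timing_detector` and `answer_generator` are hypothetical stand-ins for the
# trained modules; `threshold` trades latency against answer quality.

from typing import Callable, Iterable, Sequence, Tuple


def answer_online(
    frames: Iterable,                                    # video frames arriving in order
    question: str,
    timing_detector: Callable[[Sequence, str], float],   # score: enough seen to answer?
    answer_generator: Callable[[Sequence, str], str],
    threshold: float = 0.9,                              # higher -> later, usually better answers
) -> Tuple[str, int]:
    """Return (answer, number_of_frames_used)."""
    seen = []
    for frame in frames:
        seen.append(frame)
        # Decide whether the question-relevant event has been observed so far.
        if timing_detector(seen, question) >= threshold:
            break
    # Generate the answer from only the frames observed up to the decision point.
    return answer_generator(seen, question), len(seen)
```

Sweeping the threshold over a validation set would trace out the latency-quality trade-off, e.g. the fraction of frames consumed versus answer quality relative to a full-clip upper bound.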