This paper presents a weakly-supervised multichannel neural speech separation method for distant speech recognition (DSR) of real conversational speech mixtures. A blind source separation (BSS) method called neural full-rank spatial covariance analysis (FCA) can precisely separate multichannel speech mixtures by using a deep spectral model without any supervision. The neural FCA, however, requires that the number of sound sources is fixed and known in advance. This requirement complicates its utilization for a front-end system of DSR for multi-speaker conversations, in which the number of speakers changes dynamically. In this paper, we propose an extension of neural FCA to handle a dynamically changing number of sound sources by taking temporal voice activities of target speakers as auxiliary information. We train a source separation network in a weakly-supervised manner using a dataset of multichannel audio mixtures and their voice activities. Experimental results with the CHiME-6 dataset, whose task is to recognize conversations at dinner parties, show that our method outperformed a conventional BSS-based system in word error rates.