Various neural network architectures have been proposed in recent years for the task of multi-channel speech separation. Among them, the filter-and-sum network (FaSNet) performs end-to-end time-domain filter-and-sum beamforming and has been shown to be effective for both ad-hoc and fixed microphone array geometries. However, whether such an explicit beamforming operation is a necessary and valid formulation remains unclear. In this paper, we investigate the beamforming operation and show that it is not necessary. To further improve performance, we replace the explicit waveform-level filter-and-sum operation with an implicit feature-level filter-and-sum operation over a context of features. A feature-level normalized cross correlation (fNCC) feature is also proposed to better match the implicit operation and further improve performance. Experimental results on a simulated ad-hoc microphone array dataset show that the proposed modification to FaSNet, which we refer to as the implicit filter-and-sum network (iFaSNet), achieves better performance than the explicit FaSNet with a similar model size and faster training and inference.
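For readers unfamiliar with the terminology, the following is a minimal sketch of what an explicit waveform-level filter-and-sum operation and a normalized cross correlation look like in isolation. It is an illustration only, not the FaSNet or iFaSNet implementation: the function names `filter_and_sum` and `ncc` are hypothetical, the filters are passed in as fixed values (in FaSNet they are estimated by a network), and `ncc` is the generic cosine-similarity form rather than the exact fNCC definition, which this abstract does not give.

```python
import math


def filter_and_sum(channels, filters):
    """Convolve each channel with its own FIR filter and sum the results.

    channels: list of equal-length sample lists, one per microphone.
    filters:  list of FIR tap lists, one per microphone (assumed given here).
    Returns a causal output with the same length as the input channels.
    """
    n = len(channels[0])
    out = [0.0] * n
    for x, h in zip(channels, filters):
        for t in range(n):
            acc = 0.0
            for k, hk in enumerate(h):
                if t - k >= 0:
                    acc += hk * x[t - k]
            out[t] += acc
    return out


def ncc(a, b, eps=1e-8):
    """Normalized cross correlation (cosine similarity) of two vectors.

    The same formula applies to waveform windows or to feature vectors;
    eps is an assumed small constant that avoids division by zero.
    """
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b + eps)


# Two microphones: the first is passed through unchanged,
# the second is delayed by one sample before summation.
mixture = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
taps = [[1.0], [0.0, 1.0]]
beamformed = filter_and_sum(mixture, taps)
```

In this sketch the summation happens directly on waveforms. The implicit variant described in the abstract instead applies the filtering and summation to learned features, which this example does not attempt to reproduce.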