End-to-end (E2E) automatic speech recognition (ASR) systems often rely on pre-trained hidden Markov model (HMM) systems for word timing estimation (WTE), because E2E models cannot predict word boundaries themselves. However, training an HMM is difficult for low-resource languages, for which phonetic transcriptions are scarce. This creates strong demand for HMM-free WTE methods, particularly for multilingual ASR systems. In this paper, we propose a novel framework that performs WTE without any HMM or phonetic labels. Specifically, the proposed method trains an alignment network on the outputs of the E2E ASR encoder and a voice activity detection module to generate frame-level subword labels. In our experiments, the proposed method outperforms previous HMM-free WTE methods in a multilingual scenario. Notably, on the Fleurs dataset, we obtain a 57% relative improvement over previous work in terms of accumulated averaging shift across five languages.
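
As an illustration of how frame-level subword labels can yield word timings, the following minimal sketch groups consecutive frames into words. It is not the authors' implementation. The frame shift, the SentencePiece-style word-start marker, the `<sil>` non-speech label, and the names `WordTiming` and `word_timings` are all assumptions introduced here for the example.

```python
from dataclasses import dataclass

# All three constants are assumptions made for this sketch, not values from the paper.
FRAME_SHIFT_S = 0.04   # hypothetical duration of one encoder frame, in seconds
WORD_START = "\u2581"  # SentencePiece-style marker on the first subword of a word
NON_SPEECH = "<sil>"   # hypothetical label for frames with no speech


@dataclass
class WordTiming:
    word: str
    start: float  # seconds
    end: float    # seconds


def word_timings(frame_labels, frame_shift=FRAME_SHIFT_S):
    """Collapse one-label-per-frame subword labels into word start/end times."""
    words = []
    pieces, start, end = [], 0.0, 0.0
    prev = None
    for i, label in enumerate(frame_labels):
        if label == NON_SPEECH:
            prev = label
            continue
        is_new_subword = label != prev
        # A new subword carrying the word-start marker closes the current word.
        if is_new_subword and label.startswith(WORD_START) and pieces:
            words.append(WordTiming("".join(pieces), start, end))
            pieces = []
        if is_new_subword:
            if not pieces:
                start = i * frame_shift
            text = label[len(WORD_START):] if label.startswith(WORD_START) else label
            pieces.append(text)
        end = (i + 1) * frame_shift  # word end tracks the last speech frame seen
        prev = label
    if pieces:
        words.append(WordTiming("".join(pieces), start, end))
    return words


frames = ["<sil>", "\u2581he", "\u2581he", "llo", "<sil>", "\u2581wor", "ld", "ld"]
timings = word_timings(frames)
```

With these assumed labels, `timings` holds two entries, one for "hello" and one for "world", each spanning only its speech frames. One limitation of the sketch is that two identical subwords in adjacent frames are merged into a single subword, so a repeated word with no intervening non-speech frame would not be split.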