Adaptive time-scale modification (ATSM) adjusts audio speed and improves upon previous systems by tailoring the scale to each phoneme in two steps: phoneme positioning via the Montreal Forced Aligner (MFA) and reconstruction with an adaptive speaking rate. However, ATSM’s phoneme-specific rate is fixed regardless of the sentence, and MFA struggles to align phonemes precisely in synthetic speech. Motivated by this, we propose a fully neural-network-based ATSM (Neural ATSM) that dynamically controls each phoneme’s speaking rate so that it varies from sentence to sentence. It predicts phoneme-level rates with a speaking rate predictor and flexibly adapts the scales to the sentence context using Gaussian upsampling and an attention mechanism, while enforcing feature similarity with a soft dynamic time warping (Soft-DTW) loss. We also integrate a variational autoencoder (VAE) and flow models to enhance the time-scaled signals. Experimental results show that Neural ATSM outperforms ATSM on both real and synthesized speech.
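To make the rate-adaptation step more concrete, the sketch below illustrates Gaussian upsampling of phoneme-level features to the frame level, assuming PyTorch, per-phoneme durations expressed in frames, and a fixed Gaussian width. The function name `gaussian_upsample` and all tensor shapes are illustrative assumptions, not the actual Neural ATSM implementation described in the paper.

```python
import torch

def gaussian_upsample(h, durations, sigma=1.0):
    """Expand phoneme-level features to frame level with Gaussian weights.

    h:         (B, N, D) phoneme-level hidden features
    durations: (B, N)    per-phoneme durations in frames (assumed output
                         of a speaking rate / duration predictor)
    sigma:     fixed Gaussian width; some systems predict a per-phoneme
               width instead, which is omitted here for simplicity
    """
    B, N, D = h.shape
    T = int(durations.sum(dim=1).max().item())            # total output frames
    ends = torch.cumsum(durations, dim=1)                  # (B, N) span ends
    centers = ends - 0.5 * durations                        # (B, N) span midpoints
    t = torch.arange(T, device=h.device).view(1, T, 1)      # (1, T, 1) frame index
    # Attention-like weights of each output frame over the phonemes,
    # computed from the squared distance to each phoneme's center.
    logits = -((t - centers.unsqueeze(1)) ** 2) / (2 * sigma ** 2)  # (B, T, N)
    w = torch.softmax(logits, dim=-1)
    return torch.bmm(w, h)                                   # (B, T, D)
```

Because the weights are a softmax over phonemes, the mapping from phoneme-level to frame-level features is differentiable in the predicted durations, which is what allows a rate predictor to be trained end to end with a frame-level objective such as a Soft-DTW loss.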