Modern smartphones possess hardware for audio acquisition and can perform speech-processing tasks such as speaker recognition and health assessment. However, energy consumption remains a concern, especially for resource-intensive DNNs. Prior work has improved DNN energy efficiency by using compact models or by reducing the dimensionality of speech features; both approaches lower energy consumption during DNN inference but not during speech acquisition. This paper proposes integrating a masking kernel into gradient descent during DNN training to learn the most energy-efficient speech length and sampling rate for windowing, a common step in sample construction. To make these energy-related parameters learnable, a masking function with non-zero derivatives is combined with a low-pass filter. The proposed approach reduces the combined energy consumption of data collection and inference by 57% while remaining competitive with speaker-recognition and traumatic-brain-injury-detection baselines.
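The abstract does not specify the exact masking function, so the following is only an illustrative sketch of the general idea: a sigmoid-shaped soft mask over the time axis whose cutoff (the effective speech length) has a non-zero derivative, allowing gradient descent to adjust it. All names and the `sharpness` parameter are hypothetical, not taken from the paper.

```python
import numpy as np

def soft_length_mask(n_samples, length, sharpness=0.1):
    """Soft mask that is ~1 before `length` samples and ~0 after.

    Unlike a hard rectangular window, this sigmoid mask is
    differentiable with respect to `length`, so a training loop
    can learn the speech length via gradient descent.
    """
    t = np.arange(n_samples)
    # Clip the exponent for numerical stability.
    z = np.clip(sharpness * (t - length), -50.0, 50.0)
    return 1.0 / (1.0 + np.exp(z))

def d_mask_d_length(n_samples, length, sharpness=0.1):
    """Analytic gradient of the mask w.r.t. `length` (non-zero near the cutoff)."""
    m = soft_length_mask(n_samples, length, sharpness)
    return sharpness * m * (1.0 - m)

# Example: mask 1 s of 16 kHz audio down to an effective 0.5 s.
sr = 16000
x = np.random.randn(sr)
mask = soft_length_mask(sr, length=8000)
masked = x * mask  # samples past ~8000 are attenuated toward zero
```

A hard cutoff would have a zero derivative almost everywhere, so no gradient signal could reach the length parameter; the sigmoid's smooth transition region is what makes the parameter trainable. A low-pass filter with a learnable cutoff could play the analogous role for the sampling rate.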