Speech representations learned with self-supervised learning (SSL) have the potential to significantly improve the performance of a number of audio applications, especially when labeled data from the deployment domain are scarce. Despite these successes, SSL training methods are compute- and memory-intensive and require large investments in computing infrastructure, putting them out of reach of most institutions. Building efficient model architectures is therefore essential for the wide-scale adoption of SSL in speech technologies. CNN-based Acoustic Feature Extractors (AFEs), which are widely used to encode acoustic waveforms, remain a major efficiency bottleneck. This work proposes replacing CNN-based AFEs with more efficient alternatives and demonstrates that SSL pre-training time and memory consumption can be reduced by a factor of two to three compared to existing methods, while preserving performance on speech-, command-, and speaker-recognition tasks.