The Speech Transmission Index (STI) is a crucial metric for evaluating speech intelligibility, but its standard measurement method is too complicated for real-time applications. Although recently proposed deep-learning-based STI estimation schemes can effectively address this problem, existing methods still fall short of covering the full range of STI scenarios. This paper presents eSTImate, an end-to-end deep learning system for real-time blind STI estimation that integrates the tasks of STI estimation and speech enhancement through a feature pyramid auxiliary learning architecture and incorporates multi-head attention mechanisms. The proposed model achieves state-of-the-art performance, with a mean absolute error of 0.016 and a root mean square error of 0.021 on a constructed dataset that covers the whole range of STI, highlighting its potential to provide accurate and consistent real-time STI estimation across diverse real-world scenarios.
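
For readers unfamiliar with the two error measures reported above, the following is a minimal sketch of how mean absolute error and root mean square error between predicted and reference STI values are conventionally computed. The function names `mae` and `rmse` and the sample values are illustrative assumptions; they are not taken from the eSTImate implementation or its dataset.

```python
import math


def _check(pred, true):
    # Both sequences must be non-empty and of equal length.
    if len(pred) != len(true) or len(pred) == 0:
        raise ValueError("pred and true must be non-empty and equally long")


def mae(pred, true):
    """Mean absolute error between predicted and reference STI values."""
    _check(pred, true)
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)


def rmse(pred, true):
    """Root mean square error between predicted and reference STI values."""
    _check(pred, true)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))


if __name__ == "__main__":
    # Hypothetical values for illustration only (STI lies in [0, 1]).
    reference = [0.10, 0.50, 0.90]
    predicted = [0.12, 0.49, 0.87]
    print(f"MAE  = {mae(predicted, reference):.3f}")
    print(f"RMSE = {rmse(predicted, reference):.3f}")
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same data, and the gap between the two widens when a few estimates are far off.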