In this paper, we propose a computational measure of audio quality for user-generated multimedia (UGM) that aligns with the human perceptual system. To this end, we first extend the previously proposed IIT-JMU-UGM Audio dataset with samples covering more diverse contexts, content, distortion types, and distortion intensities, along with implicitly distorted audio that reflects realistic scenarios. We conduct subjective testing on the extended database of 2075 audio clips to obtain a mean opinion score for each sample. We then introduce transformer-based learning to the domain of audio quality assessment; the model is trained on three key audio features: Mel-frequency cepstral coefficients (MFCCs), chroma, and the Mel-scaled spectrogram. The proposed non-intrusive transformer-based model is compared against state-of-the-art methods and outperforms Simple RNN, LSTM, and GRU models by over 4%. The database and the source code will be made public upon acceptance.
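
The abstract does not describe the feature-extraction pipeline itself; as a minimal illustrative sketch (not the authors' released code), the three input features named above can be computed per clip with the librosa library. The file path, sample rate, and coefficient counts below are hypothetical choices, not values reported by the paper.

```python
import librosa

def extract_features(path, sr=16000, n_mfcc=13, n_mels=128):
    """Compute the three features named in the abstract for one audio clip.

    sr, n_mfcc, and n_mels are illustrative defaults; the paper's
    actual settings are not stated in the abstract.
    """
    y, sr = librosa.load(path, sr=sr)  # mono waveform resampled to sr

    # Mel-frequency cepstral coefficients: shape (n_mfcc, frames)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Chroma features from the short-time Fourier transform: shape (12, frames)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)

    # Mel-scaled spectrogram converted to decibels: shape (n_mels, frames)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    )
    return mfcc, chroma, mel
```

These frame-level features would then be combined and fed to the transformer; how the model fuses them is detailed in the paper rather than the abstract.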