Word-level prominence plays a crucial role in spoken language understanding, as it helps interpret the speaker’s intent. However, automatic word-level prominence detection remains underexplored, particularly in non-native speech, where prominence patterns vary more widely due to native-language influence. While prominence is determined by acoustic features such as energy, duration, and pitch, aggregating statistics of these features at the word level yields only a global representation that misses finer suprasegmental variation. In this study, we treat the syllable as the suprasegmental unit and propose a methodology that jointly models syllable-level prominence variations and their contribution to word prominence using sequential neural networks, enabling word-level prominence representations to be learned with minimal labeled data. Our method outperforms an unsupervised n-gram-based baseline by 24.01% and a supervised SVM by 5.73%, demonstrating its effectiveness over both approaches.
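The abstract only states that syllable prominence and its contribution to word prominence are modeled jointly with sequential neural networks; the exact architecture is not given. The sketch below is one minimal, hypothetical instantiation of that idea: a simple recurrent pass over per-syllable acoustic features (here assumed to be energy, duration, and pitch), producing a prominence score per syllable and an attention-weighted pooling of those syllables into a single word-level prominence probability. All weight names and dimensions are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def word_prominence_sketch(syllable_feats, Wx, Wh, w_out, w_att):
    """Hypothetical sequential model for word prominence.

    syllable_feats: (n_syllables, n_features) array, one row per
        syllable (e.g. energy, duration, pitch statistics).
    Wx, Wh: input and recurrent weights of a plain tanh RNN.
    w_out: projects each hidden state to a syllable prominence logit.
    w_att: projects hidden states to values pooled by attention.
    Returns (word_probability, per_syllable_logits).
    """
    h = np.zeros(Wh.shape[0])
    hidden_states = []
    for x in syllable_feats:                 # sequential pass over syllables
        h = np.tanh(Wx @ x + Wh @ h)
        hidden_states.append(h)
    H = np.stack(hidden_states)              # (n_syllables, hidden_dim)

    syll_logits = H @ w_out                  # syllable-level prominence scores
    att = np.exp(syll_logits - syll_logits.max())
    att /= att.sum()                         # softmax attention over syllables

    word_logit = att @ (H @ w_att)           # syllables' weighted contribution
    word_prob = 1.0 / (1.0 + np.exp(-word_logit))
    return word_prob, syll_logits

# Illustrative usage with random (untrained) weights:
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 3))              # 3 syllables x 3 acoustic features
p, logits = word_prominence_sketch(
    feats,
    Wx=0.1 * rng.normal(size=(8, 3)),
    Wh=0.1 * rng.normal(size=(8, 8)),
    w_out=rng.normal(size=8),
    w_att=rng.normal(size=8),
)
```

Because the syllable logits are produced alongside the word probability, a single labeled word can supervise both levels at once, which is consistent with the claim that useful representations emerge from minimal labeled data.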