Automatic syllable stress detection is typically performed at the syllable level using stress-related acoustic features. The stress placed on a syllable is influenced not only by its own characteristics but also by its context within the word. However, traditional stress detection methods overlook the contextual acoustic factors that influence stress placement. To address this issue, we study sequential modeling approaches that incorporate syllable dependencies into automatic syllable stress detection using a masking strategy. These approaches take the sequence of syllables in a word as input and predict the corresponding sequence of stress labels. We explore several sequential models, including RNNs, LSTMs, GRUs, and attention networks. We conduct experiments on the ISLE corpus, which comprises English speech from non-native speakers. In these experiments, all sequential models yield a significant performance improvement over the state-of-the-art non-sequential baseline (DNN).
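The abstract does not specify how the masking strategy is implemented. As a minimal sketch of one common interpretation, assuming that words with different syllable counts are padded to a common length and that a binary mask excludes the padded positions from the loss, the idea could look as follows. The function names `pad_words` and `masked_bce`, the two-dimensional toy features, and the binary cross-entropy loss are all hypothetical choices made for illustration, not details taken from the paper.

```python
from math import log


def pad_words(words, pad_value=0.0):
    """Pad each word (a list of per-syllable feature vectors) to the same length.

    Returns the padded batch and a mask with 1 for real syllables, 0 for padding.
    Hypothetical helper: the padding scheme is an assumption, not the paper's.
    """
    max_len = max(len(word) for word in words)
    feat_dim = len(words[0][0])
    padded, mask = [], []
    for word in words:
        n_pad = max_len - len(word)
        padded.append([list(syl) for syl in word]
                      + [[pad_value] * feat_dim for _ in range(n_pad)])
        mask.append([1] * len(word) + [0] * n_pad)
    return padded, mask


def masked_bce(probs, labels, mask):
    """Mean binary cross-entropy computed over real (unmasked) syllables only."""
    total, count = 0.0, 0
    for p_row, y_row, m_row in zip(probs, labels, mask):
        for p, y, m in zip(p_row, y_row, m_row):
            if m:  # skip padded positions so they do not affect the loss
                total -= y * log(p) + (1 - y) * log(1 - p)
                count += 1
    return total / count


# Toy batch: a two-syllable word and a one-syllable word, two features each.
words = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]]
padded, mask = pad_words(words)

# Stand-in for the per-syllable stress probabilities a sequential model would output.
probs = [[0.9, 0.1], [0.8, 0.5]]
labels = [[1, 0], [1, 0]]  # 1 = stressed, 0 = unstressed; last entry is padding
loss = masked_bce(probs, labels, mask)
```

In this sketch the padded syllable of the second word contributes nothing to the loss, so a sequential model trained on the batch would be penalized only for its predictions on real syllables.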