Estimating speech articulatory movements from speech acoustics is known as Acoustic to Articulatory Inversion (AAI). Recently, transformer-based AAI models have been shown to achieve state-of-art performance. However, in transformer networks, the attention is applied over the whole utterance, thereby needing to obtain the full utterance before the inference, which leads to high latency and is impractical for streaming AAI. To enable streaming during inference, evaluation could be performed on non-overlapping chucks instead of full utterance. However, due to a mismatch of attention receptive field during training and evaluation, there could be a drop in AAI performance. To overcome this scenario, in this work we perform experiments with different attention masks and use context from previous predictions during training. Experiments results revealed that using the random start mask attention with the context from previous predictions of transformer encoder performs better than the baseline results.