Recently, the introduction of reinforcement learning methods and the Embodied Joint Embedding (EmJEm) approach has made it feasible to infer articulatory movements from arbitrary utterances without supervision. However, the quality of re-synthesized utterances remains unsatisfactory, and the inferred articulatory movements have not been directly evaluated to verify that they are physiologically meaningful. In this work, we extend the EmJEm approach to address these problems of unsupervised acoustic-to-articulatory inversion (AAI). VocalTractLab is adopted as the articulatory synthesizer, and a novel architecture for the articulatory inference network is proposed. To obtain physiologically meaningful articulatory trajectories, a smoothness constraint is introduced as an articulatory prior during training. Experiments show that the proposed approach re-synthesizes utterances with state-of-the-art quality while effectively smoothing the articulatory trajectories. We directly compare the unsupervisedly inferred articulatory trajectories with recorded articulatory data from the HPRC database and find that the inferred trajectories correlate relatively well with the recorded ones. This encouraging result demonstrates the practical potential of unsupervised AAI methods.