Lip-to-speech synthesis in the wild remains challenging due to the limited information available in the visual signal. While self-supervised models have shown promising results in relatively high-quality lip-to-speech synthesis, their computational demands make them impractical for edge devices. To address this issue, we introduce LightL2S, a novel multi-speaker lip-to-speech system designed for ultra-low complexity and edge deployment. To reduce computational cost, we adopt the far more efficient MoViNet architecture as the visual encoder in place of the conventional ResNet-18. Furthermore, we introduce Zipformer blocks that efficiently learn prosodic information and quantized self-supervised audio representations from the MoViNet output features. Finally, we employ a differentiable digital signal processing (DDSP) vocoder to synthesize speech. Experimental results demonstrate that LightL2S generates speech of reasonable quality even at a computational complexity of only 0.8 GMACs.
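To make the described pipeline concrete, the following PyTorch sketch outlines the three stages named above (visual encoder, Zipformer blocks, DDSP vocoder). The `MoViNetEncoder`, `ZipformerBlock`, and `DDSPVocoder` classes here are simplified hypothetical stand-ins for illustration, not the paper's actual MoViNet, Zipformer, or DDSP implementations; all layer sizes and heads are assumptions.

```python
# A minimal sketch of the three-stage LightL2S pipeline, assuming PyTorch.
# All modules below are simplified placeholders, not the authors' code.
import torch
import torch.nn as nn


class MoViNetEncoder(nn.Module):
    """Stand-in for the efficient MoViNet visual encoder (video -> frame features)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(32, dim, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool spatial dims, keep time axis
        )

    def forward(self, video):                    # video: (B, 3, T, H, W)
        feats = self.net(video)                  # (B, dim, T, 1, 1)
        return feats.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, dim)


class ZipformerBlock(nn.Module):
    """Stand-in for a Zipformer block; here plain self-attention + feed-forward."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                        # x: (B, T, dim)
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)
        return self.norm2(x + self.ffn(x))


class DDSPVocoder(nn.Module):
    """Stand-in for the DDSP vocoder: frame features -> waveform samples."""
    def __init__(self, dim=256, hop=160):
        super().__init__()
        self.proj = nn.Linear(dim, hop)          # hop waveform samples per frame

    def forward(self, x):                        # x: (B, T, dim)
        return self.proj(x).flatten(1)           # (B, T * hop)


class LightL2S(nn.Module):
    def __init__(self, dim=256, n_blocks=4, n_codes=512):
        super().__init__()
        self.encoder = MoViNetEncoder(dim)
        self.blocks = nn.Sequential(*[ZipformerBlock(dim) for _ in range(n_blocks)])
        self.code_head = nn.Linear(dim, n_codes)  # predicts quantized SSL audio units
        self.pitch_head = nn.Linear(dim, 1)       # predicts prosody (e.g., F0 per frame)
        self.vocoder = DDSPVocoder(dim)

    def forward(self, video):
        h = self.blocks(self.encoder(video))     # (B, T, dim)
        return self.vocoder(h), self.code_head(h), self.pitch_head(h)


# Example: a batch of one 16-frame 64x64 RGB lip crop sequence.
wav, codes, f0 = LightL2S()(torch.randn(1, 3, 16, 64, 64))
```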