In this paper, we propose a novel multi-stream framework for automatic cued speech recognition (ACSR) that directly processes the upper-body region, addressing hand-lip asynchrony without requiring explicit segmentation or synchronization. Our model integrates two distinct modalities: (i) an appearance-based stream leveraging ResNet18 for feature extraction and (ii) a skeletal stream based on a modulated graph convolutional network (GCN). For graph construction, we incorporate, for the first time in ACSR, 3D pose parameters inferred from the PIXIE model. Both modalities are coupled with a temporal convolution module that captures short-range dynamics and a BiGRU encoder that models long-term sequence dependencies. In addition, we introduce an alignment module that combines connectionist temporal classification (CTC) with two auxiliary losses, improving each modality's performance and enabling effective late fusion during inference. Our model achieves state-of-the-art performance across three benchmark datasets, demonstrating its effectiveness.
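To make the inference-time pipeline concrete, the following is a minimal sketch of CTC-based late fusion over two streams. It assumes the two modalities each emit per-frame log-posteriors over the same symbol set, that fusion is a weighted average of these scores, and that decoding is greedy best-path CTC; the fusion weight, blank index, and greedy decoder are illustrative assumptions, not the paper's exact inference procedure.

```python
# Hypothetical sketch: late fusion of frame-level log-posteriors from an
# appearance stream and a skeletal stream, then greedy CTC decoding.
# The fusion weight `w` and blank index are assumptions for illustration.

BLANK = 0  # assumed CTC blank symbol index

def fuse(log_probs_a, log_probs_b, w=0.5):
    """Weighted average of two streams' per-frame log-posteriors.

    Each input is a list of frames; each frame is a list of scores,
    one per output symbol (including blank).
    """
    return [[w * a + (1 - w) * b for a, b in zip(fa, fb)]
            for fa, fb in zip(log_probs_a, log_probs_b)]

def greedy_ctc_decode(log_probs):
    """Best-path CTC decoding: per-frame argmax, collapse repeats, drop blanks."""
    path = [max(range(len(frame)), key=frame.__getitem__) for frame in log_probs]
    out, prev = [], None
    for p in path:
        if p != prev and p != BLANK:
            out.append(p)
        prev = p
    return out

# Usage: fuse the two streams' scores, then decode one label sequence.
# fused = fuse(appearance_log_probs, skeletal_log_probs)
# labels = greedy_ctc_decode(fused)
```

Because fusion happens on the output scores rather than on intermediate features, each stream can be trained and improved independently, which is what the per-modality auxiliary losses exploit.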