Continuous fingerspelling recognition from video is crucial for real-time sign language (SL) interpretation, enhancing accessibility. Despite progress driven by deep learning, challenges persist, especially in signer-independent (SI) scenarios, owing to the high variability of signing. To address these challenges, we propose a novel bimodal approach that integrates appearance and skeletal information, focusing solely on the signing hand. Our system relies on two core modules: (a) a 3D-CNN that captures spatial features while adapting to motion variations, and (b) a modulated spatio-temporal graph convolutional network (ST-GCN), based on a 3D joint-rotation parameterization, that models skeletal features. The two modality streams are fused and fed to a BiGRU encoder with CTC decoding. To further enhance representation capacity, we introduce an alignment mechanism relying on two auxiliary losses. Through ensemble fusion and language-model integration, our method achieves superior performance across three SI fingerspelling datasets.
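For illustration, the following is a minimal PyTorch sketch of the described bimodal pipeline: an appearance branch (3D-CNN), a skeletal branch standing in for the modulated ST-GCN, fusion of the two streams into a BiGRU encoder, and a CTC objective. All module names, layer sizes, and tensor shapes are illustrative assumptions rather than the paper's exact configuration, and the auxiliary alignment losses, ensemble fusion, and language model are omitted.

```python
# Sketch only: dimensions, class count, and layer choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AppearanceBranch(nn.Module):
    """3D-CNN over cropped hand clips: (B, 3, T, H, W) -> (B, T, D)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                 # pool spatially only
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),      # keep temporal axis
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, x):
        x = self.conv(x)                             # (B, 64, T, 1, 1)
        x = x.squeeze(-1).squeeze(-1)                # (B, 64, T)
        return self.proj(x.transpose(1, 2))          # (B, T, D)

class SkeletalBranch(nn.Module):
    """Stand-in for the modulated ST-GCN over joint-rotation features:
    (B, T, J, R) -> (B, T, D). A real ST-GCN would apply graph
    convolutions over the hand-joint adjacency instead of an MLP."""
    def __init__(self, joints=21, rot_dim=6, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joints * rot_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, x):
        B, T, J, R = x.shape
        return self.net(x.reshape(B, T, J * R))

class BimodalCTC(nn.Module):
    def __init__(self, num_classes=27, feat_dim=256, hidden=256):
        super().__init__()                            # 26 letters + blank (assumed)
        self.app = AppearanceBranch(feat_dim)
        self.skel = SkeletalBranch(feat_dim=feat_dim)
        self.encoder = nn.GRU(2 * feat_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames, joints):
        # Fuse appearance and skeletal features frame-wise, then encode.
        fused = torch.cat([self.app(frames), self.skel(joints)], dim=-1)
        enc, _ = self.encoder(fused)                 # (B, T, 2*hidden)
        return F.log_softmax(self.head(enc), dim=-1)

# Toy usage: 2 clips of 16 frames, 21 hand joints with 6-D rotations.
model = BimodalCTC()
frames = torch.randn(2, 3, 16, 64, 64)
joints = torch.randn(2, 16, 21, 6)
log_probs = model(frames, joints).transpose(0, 1)    # (T, B, C) for CTC
targets = torch.randint(1, 27, (2, 5))               # dummy label sequences
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    torch.full((2,), 16, dtype=torch.long),          # input lengths
    torch.full((2,), 5, dtype=torch.long),           # target lengths
)
```

At inference time, the per-frame log-probabilities would be decoded with a CTC beam search, optionally rescored by the language model mentioned above.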