To address the limitations of existing end-to-end heterogeneous change detection methods, which often neglect modality-specific feature differences and struggle to balance local detail with global semantics, this paper proposes a multi-scale heterogeneous change detection network (MHCDNet) with cross-modal fusion for heterogeneous remote sensing imagery, built on an encoder-decoder architecture. In the encoder, a remote sensing foundation model constructs multi-scale feature representations for the multi-modal images. A feature enhancement module (FEM) then strengthens textural and structural information: its bottleneck structure with multi-scale convolutions enhances detail in the features of each modality while suppressing noise interference. To account for the differences between modalities and fuse shallow heterogeneous features efficiently, a selective cross-modal fusion module (SCFM) learns dynamic weights that adaptively fuse the multi-modal features, capturing complementary information between modalities and improving the robustness and representational capacity of the fused features. To model the spatiotemporal context of deep heterogeneous features, a cross-modal cross-attention fusion module (CCFM) applies both spatial and channel attention to capture inter-modal spatiotemporal correlations, further improving the reliability of the fused features. Finally, an adaptive up-sampling module (AUM) aligns and fuses encoder and decoder features, compensating for the detail lost during decoding and accumulating change information; a change head composed of three convolutional layers and up-sampling modules then generates the change maps.
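The selective fusion idea behind the SCFM can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, not the paper's implementation: the gate here is a single hypothetical projection `w_gate` producing one weight per modality from globally pooled context, whereas the actual SCFM learns its weights end-to-end within the network.

```python
import numpy as np

def selective_fusion(feat_a, feat_b, w_gate):
    """Toy sketch of selective cross-modal fusion.

    feat_a, feat_b: (C, H, W) feature maps from the two modalities.
    w_gate: (2, 2*C) matrix standing in for a learned gating layer.
    Returns a weighted combination of the two modality features.
    """
    stacked = np.concatenate([feat_a, feat_b], axis=0)   # (2C, H, W)
    pooled = stacked.mean(axis=(1, 2))                   # (2C,) global context vector
    logits = w_gate @ pooled                             # (2,) one logit per modality
    weights = np.exp(logits) / np.exp(logits).sum()      # softmax -> dynamic weights
    # Adaptive fusion: each modality contributes according to its weight.
    return weights[0] * feat_a + weights[1] * feat_b
```

With a zero gate both logits are equal, so each modality receives weight 0.5; during training the gate would learn to favour the more informative modality per scene.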
To verify the effectiveness of the proposed method, experiments are conducted on two large-scale flood change detection datasets, CAU-Flood and Ombria. The results show that, compared with other methods, MHCDNet achieves the best accuracy metrics on both datasets while markedly reducing false alarms and missed detections, yielding the best visual results. Ablation studies verify the effectiveness of each module in MHCDNet, and a model complexity analysis shows that MHCDNet has low computational complexity, achieving the best balance between accuracy and efficiency.