Acta Geodaetica et Cartographica Sinica ›› 2026, Vol. 55 ›› Issue (2): 328-343. doi: 10.11947/j.AGCS.2026.20250331

• Photogrammetry and Remote Sensing •

Heterogeneous remote sensing image flood change detection based on multi-scale cross-modal feature fusion

Daifeng PENG1,2, Xuelian LIU1, Mengfei LU1, Haiyan GUAN1

  1. School of Remote Sensing and Geomatics Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China
  2. Technology Innovation Center for Integrated Applications in Remote Sensing and Navigation, Ministry of Natural Resources, Nanjing 210044, China
  • Received: 2025-09-04 Revised: 2026-01-16 Published: 2026-03-13
  • About author: PENG Daifeng (1988—), male, PhD, associate professor, specializing in intelligent interpretation of remote sensing imagery. E-mail: daifeng@nuist.edu.cn
  • Supported by:
    The National Natural Science Foundation of China (Nos. 42371449, 41801386)

Abstract:

Existing end-to-end heterogeneous change detection methods often neglect modality-specific feature differences and struggle to balance local detail with global semantics. To address these limitations, this paper proposes a multi-scale heterogeneous change detection network (MHCDNet) with cross-modal fusion for heterogeneous remote sensing imagery, built on an encoder-decoder architecture. In the encoder, a remote sensing foundation model constructs multi-scale feature representations of the multi-modal images. A feature enhancement module (FEM), which adopts a bottleneck structure with multi-scale convolutions, enriches the textural and structural detail of each modality's features while suppressing noise interference. To account for differences between multimodal features and fuse shallow heterogeneous features efficiently, a selective cross-modal fusion module (SCFM) learns dynamic weights that adaptively fuse multi-modal features, capturing complementary inter-modal information and strengthening the representational capacity of the fused features. To model the spatiotemporal context of deep heterogeneous features, a cross-modal cross-attention fusion module (CCFM) applies both spatial and channel attention to capture inter-modal spatiotemporal correlations, improving the robustness and reliability of the fused features. Finally, an adaptive up-sampling module (AUM) aligns and fuses encoder and decoder features, compensating for the loss of detail information during decoding and accumulating change information; a change head composed of three convolutional layers and up-sampling modules then generates the change maps.
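The dynamic-weight fusion idea behind the SCFM can be illustrated with a minimal pure-Python sketch. This is not the paper's implementation: the gating scores below are hand-set stand-ins for logits that a small learned network would predict from the features themselves, and the feature vectors are toy values; the sketch only shows how softmax-normalized weights yield an adaptive convex combination of the two modality features.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def selective_fusion(feat_opt, feat_sar, score_opt, score_sar):
    """Fuse two modality feature vectors with dynamic weights.

    score_opt / score_sar are stand-ins (assumptions, not the paper's
    design) for per-modality gating logits a learned sub-network would
    produce; softmax turns them into weights summing to 1, so the fused
    feature is a convex combination of the two inputs.
    """
    w_opt, w_sar = softmax([score_opt, score_sar])
    return [w_opt * o + w_sar * s for o, s in zip(feat_opt, feat_sar)]

# Toy 4-dim features from a hypothetical optical branch and SAR branch.
optical = [0.9, 0.1, 0.4, 0.7]
sar = [0.2, 0.8, 0.5, 0.3]
# A higher optical score shifts the fused feature toward the optical branch.
fused = selective_fusion(optical, sar, score_opt=1.0, score_sar=0.0)
```

Because the weights are softmax-normalized, every fused element lies between the corresponding optical and SAR values; in the full module the gating would be computed per spatial location and channel rather than once per vector.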
To verify the effectiveness of the proposed method, experiments are conducted on two large-scale flood change detection datasets, CAU-Flood and Ombria. The results show that MHCDNet achieves the best accuracy metrics on both datasets among the compared methods, markedly reducing false alarms and missed detections and yielding the best visual results. Ablation studies verify the contribution of each module, and a model complexity analysis shows that MHCDNet has low computational complexity, achieving the best balance between accuracy and efficiency.

Key words: heterogeneous change detection, feature enhancement, selective cross-modal fusion, cross-modal cross-attention fusion, adaptive up-sampling
