Acta Geodaetica et Cartographica Sinica, 2025, Vol. 54, Issue (5): 853-872. doi: 10.11947/j.AGCS.2025.20240244

• Photogrammetry and Remote Sensing •

Visual-language joint representation and intelligent interpretation of remote sensing geo-objects: principles, challenges and opportunities

Haifeng LI1, Wang GUO1, Mengwei WU1, Chengli PENG1, Qing ZHU2, Yu LIU3, Chao TAO1

    1.School of Geosciences and Info-physics, Central South University, Changsha 410083, China
    2.Faculty of Geosciences and Engineering, Southwest Jiaotong University, Chengdu 610031, China
    3.Institute of Remote Sensing and Geographic Information System, Peking University, Beijing 100871, China
  • Received: 2024-06-18 Revised: 2025-03-27 Online: 2025-06-23 Published: 2025-06-23
  • Contact: Chao TAO E-mail: lihaifeng@csu.edu.cn; kingtaochao@csu.edu.cn
  • About author: LI Haifeng (1980-), male, PhD, professor, specializes in general-purpose multimodal spatio-temporal foundation models and multimodal spatio-temporal memory models. E-mail: lihaifeng@csu.edu.cn
  • Supported by:
    Hunan Natural Science Funds for Distinguished Young Scholar (2022JJ10072); The National Natural Science Foundation of China (42471419)

Abstract:

Intelligent interpretation of remote sensing imagery relies primarily on visual models that learn a mapping between remote sensing images and semantic labels. However, because the available annotations cover only a limited set of categories, such models struggle to capture the deep semantics of geo-objects and the relations among them, and therefore fail to develop a broader understanding of world knowledge. Large language models (LLMs), which are highly capable of encoding human knowledge expressed through language, offer a possible way to address this limitation. Guiding visual models with LLMs can significantly broaden their capacity for knowledge acquisition and drive a paradigm shift in remote sensing interpretation, from surface-level semantic matching toward a deeper understanding of world knowledge. Building on this insight, this paper presents a systematic analysis of geo-object concept representation in remote sensing. By examining both the intension and the extension of geo-object concepts, it reveals the limitations of relying solely on the visual modality to represent complex geo-object characteristics. It then elaborates on the theoretical significance and practical value of integrating the visual and language modalities to enhance concept representation. Furthermore, it investigates the inherent challenges of modality alignment under this new paradigm and reviews representative solution strategies. The paper also explores how the paradigm fosters the emergence of novel capabilities in remote sensing interpretation models, analyzes the mechanisms underlying these capabilities, and discusses their practical implications. Finally, it summarizes the new opportunities and challenges for intelligent remote sensing interpretation within this conceptual framework.

Key words: remote sensing image intelligent interpretation, geo-object concept representation, visual-language models, emergence of capabilities, visual-language joint representation
