Acta Geodaetica et Cartographica Sinica ›› 2025, Vol. 54 ›› Issue (5): 853-872. doi: 10.11947/j.AGCS.2025.20240244

• Photogrammetry and Remote Sensing •


Visual-language joint representation and intelligent interpretation of remote sensing geo-objects: principles, challenges and opportunities

Haifeng LI1, Wang GUO1, Mengwei WU1, Chengli PENG1, Qing ZHU2, Yu LIU3, Chao TAO1

  1. School of Geosciences and Info-physics, Central South University, Changsha 410083, China
    2. Faculty of Geosciences and Engineering, Southwest Jiaotong University, Chengdu 610031, China
    3. Institute of Remote Sensing and Geographic Information System, Peking University, Beijing 100871, China
  • Received: 2024-06-18 Revised: 2025-03-27 Online: 2025-06-23 Published: 2025-06-23
  • Contact: Chao TAO E-mail: lihaifeng@csu.edu.cn; kingtaochao@csu.edu.cn
  • About author: LI Haifeng (1980—), male, PhD, professor; his research focuses on general-purpose multimodal spatio-temporal foundation models and multimodal spatio-temporal memory models. E-mail: lihaifeng@csu.edu.cn
  • Supported by:
    Hunan Natural Science Funds for Distinguished Young Scholars (2022JJ10072); the National Natural Science Foundation of China (42471419)


Abstract:

Intelligent interpretation of remote sensing imagery currently relies primarily on visual models that establish a mapping between remote sensing images and semantic labels. However, because the available annotations cover only a limited set of categories, such models struggle to capture the deep semantics of geo-objects and their interrelations, and therefore fail to develop a broader understanding of world knowledge. The emergence of large language models (LLMs), which powerfully encode the human knowledge expressed in language, offers a way to overcome this limitation. Guiding visual models with LLMs can significantly broaden their capacity for knowledge acquisition and drive a shift in remote sensing interpretation from surface-level semantic matching toward deeper understanding of world knowledge. This paper argues that visual-language joint multimodal interpretation models will trigger a new round of paradigm change in remote sensing. Building on this insight, it presents a systematic analysis of geo-object concept representation in remote sensing. By examining both the intension and extension of geo-object concepts, it reveals the limitations of relying solely on the visual modality to represent complex geo-object characteristics, and it elaborates the theoretical significance and practical value of integrating the visual and language modalities to enhance concept representation. The paper then investigates the modality alignment problem inherent in this new paradigm and reviews representative solution strategies. It further explores how the paradigm fosters the emergence of novel capabilities in remote sensing interpretation models, analyzes the mechanisms underlying these capabilities, and discusses their practical implications. Finally, it summarizes the opportunities and challenges for intelligent remote sensing interpretation within this conceptual framework.
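
As a concrete illustration of the modality alignment problem referenced above, the sketch below shows one representative alignment strategy: CLIP-style contrastive learning that projects remote sensing image features and textual geo-object descriptions into a shared embedding space. This is a minimal, assumed example for exposition only, not the method of this paper; the class name, feature dimensions, and the symmetric InfoNCE loss are all illustrative choices.

# Illustrative sketch only (PyTorch): CLIP-style contrastive alignment between
# image features and text features; all names and dimensions are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVisionLanguageAligner(nn.Module):
    def __init__(self, img_feat_dim=2048, txt_feat_dim=768, embed_dim=512):
        super().__init__()
        # Projection heads map each modality into a shared embedding space.
        self.img_proj = nn.Linear(img_feat_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_feat_dim, embed_dim)
        # Learnable temperature (initialized near log(1/0.07), as in CLIP-style setups).
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, img_feats, txt_feats):
        # L2-normalize so dot products become cosine similarities.
        img_emb = F.normalize(self.img_proj(img_feats), dim=-1)
        txt_emb = F.normalize(self.txt_proj(txt_feats), dim=-1)
        logits = self.logit_scale.exp() * img_emb @ txt_emb.t()
        # Matched image-text pairs lie on the diagonal of the similarity matrix.
        targets = torch.arange(img_feats.size(0), device=img_feats.device)
        # Symmetric InfoNCE: image-to-text and text-to-image cross-entropy.
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random features standing in for encoder outputs.
if __name__ == "__main__":
    aligner = ToyVisionLanguageAligner()
    img_feats = torch.randn(8, 2048)  # e.g. visual backbone features for 8 image patches
    txt_feats = torch.randn(8, 768)   # e.g. text/LLM encoder features for 8 captions
    print(aligner(img_feats, txt_feats).item())

Minimizing this loss pulls paired image and text embeddings together and pushes mismatched pairs apart, which is one common realization of the visual-language joint representation discussed in the abstract.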

Key words: intelligent interpretation of remote sensing images, geo-object concept representation, visual-language models, emergence of capabilities, visual-language joint representation

CLC number: