Controllable generation of high-resolution optical remote sensing image explicitly guided by spatio-temporal information

doi:10.11947/j.AGCS.2026.20250529

Abstract

The visual appearance of land-cover objects in high-resolution optical remote sensing images varies markedly with season and region. Improving the spatio-temporal controllability of generative models, so that they accurately reproduce object features under a specified spatio-temporal context, remains a key challenge. Existing methods are limited both in how they encode multi-source spatio-temporal information and in how the encoded spatio-temporal features interact with and guide visual features, which makes it difficult to model the mapping between spatio-temporal conditions and the visual appearance of land-cover objects accurately. To address this problem, this paper proposes a framework for spatio-temporally controllable generation of high-resolution optical remote sensing images. First, a multi-source spatio-temporal information encoding strategy that accounts for attribute differences is designed; it uses heterogeneous frequency encoding and independent projections to transform the different kinds of spatio-temporal information into accurate, decoupled feature representations, thereby modeling the distinct properties of each. Second, an interaction guidance mechanism between spatio-temporal and visual features is designed on the basis of decoupled attention. It employs an independent, parallel attention branch for deep interaction between spatio-temporal and visual features, so that spatio-temporal information constrains generation without interfering with text guidance. In addition, low-rank adaptation is adopted to transfer domain knowledge efficiently by optimizing only the low-rank decomposition matrices, thereby preserving the pre-trained generative priors of the base model.
Experiments on a large-scale dataset covering seven typical regions in China demonstrate that the proposed method outperforms state-of-the-art methods by 46.69% and 14.67% in terms of spatio-temporal distribution consistency and structural-textural consistency, respectively. These results confirm the controllability and generalization potential of the proposed framework across diverse spatio-temporal scenarios.

Key words: spatio-temporal intelligence, remote sensing image, generation model, diffusion model, deep learning, spatio-temporal information

CLC Number: P237

Tiandong SHI, Ling ZHAO, Wenhao ZHAO, Ji QI, Hao CUI, Chengli PENG, Xinchang ZHANG. Controllable generation of high-resolution optical remote sensing image explicitly guided by spatio-temporal information[J]. Acta Geodaetica et Cartographica Sinica, 2026, 55(5): 894-908.

URL: http://xb.chinasmp.com/EN/10.11947/j.AGCS.2026.20250529

http://xb.chinasmp.com/EN/Y2026/V55/I5/894


Method	FID	MS-SSIM	CLIP Score
Text encoding method	6.8580	0.1504	23.9486
CRS-Diff	7.0270	0.1969	25.5957
DiffusionSat	6.4850	0.1670	25.5877
Text2Earth	13.5234	0.1899	24.7872
Proposed method	3.4572	0.2258	25.7705
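
The improvement figures quoted in the abstract appear to be relative gains over the strongest baseline on each metric in the comparison table above. The short Python check below tests that reading. Taking the best baseline per metric (DiffusionSat for FID, CRS-Diff for MS-SSIM) and treating FID as lower-is-better and MS-SSIM as higher-is-better are assumptions made here, not statements from the source.

```python
# Values copied from the comparison table; OURS_* is the proposed method.
# Assumption: FID is lower-is-better and MS-SSIM is higher-is-better here.
FID = {"Text encoding": 6.8580, "CRS-Diff": 7.0270,
       "DiffusionSat": 6.4850, "Text2Earth": 13.5234}
MS_SSIM = {"Text encoding": 0.1504, "CRS-Diff": 0.1969,
           "DiffusionSat": 0.1670, "Text2Earth": 0.1899}
OURS_FID, OURS_MS_SSIM = 3.4572, 0.2258


def relative_gain(baseline, ours, lower_is_better):
    """Percentage improvement of `ours` relative to `baseline`."""
    diff = baseline - ours if lower_is_better else ours - baseline
    return 100.0 * diff / baseline


best_fid = min(FID.values())          # strongest baseline on FID
best_ms_ssim = max(MS_SSIM.values())  # strongest baseline on MS-SSIM

fid_gain = relative_gain(best_fid, OURS_FID, lower_is_better=True)
ms_ssim_gain = relative_gain(best_ms_ssim, OURS_MS_SSIM, lower_is_better=False)

print(f"FID gain over best baseline:     {fid_gain:.2f}%")
print(f"MS-SSIM gain over best baseline: {ms_ssim_gain:.2f}%")
```

Under this reading the FID gain is about 46.69% and the MS-SSIM gain about 14.68%, which agrees with the abstract's 14.67% up to rounding of the tabulated values.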

Module	FID	MS-SSIM	CLIP Score
Without independent projection module	4.8604	0.1716	25.7702
With independent projection module	3.4572	0.2258	25.7705

Encoding strategy	FID	MS-SSIM	CLIP Score
Standard MLP encoding	4.5306	0.2115	25.7365
Heterogeneous frequency encoding	3.4572	0.2258	25.7705
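
To make the encoding ablation above more concrete, the following Python sketch shows one way that frequency encoding with per-attribute settings and independent projections could look. The source does not give the formulation, so everything here is an illustrative assumption: the sinusoidal feature form, the attribute names, the band counts, `TOKEN_DIM`, and the random matrices standing in for learned projection layers.

```python
import math
import random

# Hypothetical per-attribute settings: spatial and temporal attributes use a
# different number of frequency bands, which is how "heterogeneous" is read here.
ATTRIBUTES = {
    "longitude": {"num_bands": 6, "base": 2.0},
    "latitude": {"num_bands": 6, "base": 2.0},
    "day_of_year": {"num_bands": 3, "base": 2.0},
}
TOKEN_DIM = 8  # hypothetical width of each condition token


def frequency_encode(value, num_bands, base):
    """Return [sin(f_k * v), cos(f_k * v)] for f_k = pi * base**k."""
    feats = []
    for k in range(num_bands):
        freq = math.pi * base ** k
        feats.extend([math.sin(freq * value), math.cos(freq * value)])
    return feats


def make_projection(in_dim, out_dim, rng):
    """Random stand-in for a learned linear layer (out_dim x in_dim)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(weights, feats):
    """Apply the linear map `weights` to the feature vector `feats`."""
    return [sum(w * x for w, x in zip(row, feats)) for row in weights]


rng = random.Random(0)
# One independent projection per attribute, so the features stay decoupled.
PROJECTIONS = {
    name: make_projection(2 * cfg["num_bands"], TOKEN_DIM, rng)
    for name, cfg in ATTRIBUTES.items()
}


def encode_condition(lon_deg, lat_deg, day_of_year):
    """Map raw spatio-temporal inputs to one token per attribute."""
    normalized = {
        "longitude": lon_deg / 180.0,              # in [-1, 1]
        "latitude": lat_deg / 90.0,                # in [-1, 1]
        "day_of_year": 2.0 * day_of_year / 365.0,  # one year = one period
    }
    tokens = {}
    for name, cfg in ATTRIBUTES.items():
        feats = frequency_encode(normalized[name], cfg["num_bands"], cfg["base"])
        tokens[name] = project(PROJECTIONS[name], feats)
    return tokens


tokens = encode_condition(lon_deg=112.9, lat_deg=28.2, day_of_year=200)
print({name: len(vec) for name, vec in tokens.items()})
```

Because the day-of-year input is scaled so that one year spans a full period of the lowest frequency, days 0 and 365 map to the same features, which a plain MLP on the raw value would have to learn on its own.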

Attention mechanism	FID	MS-SSIM	CLIP Score
Standard cross-attention	7.8485	0.2133	25.3455
Decoupled attention	3.4572	0.2258	25.7705
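
The attention ablation above contrasts a single cross-attention with a decoupled design in which spatio-temporal features get their own parallel branch. The sketch below illustrates that idea in plain Python under several assumptions not taken from the source: the two branches share the visual queries, their outputs are merged by addition, and a hypothetical `st_scale` weight controls the strength of the spatio-temporal branch. Learned query, key, and value projections are omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(dim)])
    return outputs


def decoupled_attention(visual_q, text_kv, st_kv, st_scale=1.0):
    """Text branch plus an independent spatio-temporal branch (assumed additive)."""
    text_out = attention(visual_q, *text_kv)  # text-guided branch, unchanged
    st_out = attention(visual_q, *st_kv)      # parallel spatio-temporal branch
    return [[t + st_scale * s for t, s in zip(t_row, s_row)]
            for t_row, s_row in zip(text_out, st_out)]


# Toy inputs: two visual queries, two text tokens, one spatio-temporal token.
visual_q = [[1.0, 0.0], [0.0, 1.0]]
text_kv = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
st_kv = ([[0.5, 0.5]], [[10.0, 20.0]])

text_only = decoupled_attention(visual_q, text_kv, st_kv, st_scale=0.0)
guided = decoupled_attention(visual_q, text_kv, st_kv, st_scale=1.0)
print(text_only)
print(guided)
```

Setting `st_scale` to zero recovers the text-only result exactly, which is one way to read the claim that the extra branch does not interfere with text-guided generation.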

Adaptation rank	FID	MS-SSIM	CLIP Score
128	8.1614	0.1861	25.5640
256	6.5672	0.2038	25.7460
512	4.5748	0.2248	25.7670
1024	3.4572	0.2258	25.7705
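
For context on the rank ablation above, the sketch below shows the standard low-rank adaptation update, y = W x + (alpha / r) B A x, and how the number of trainable parameters per adapted matrix grows with the rank. The ranks are taken from the table. The layer width `HIDDEN_DIM` is a hypothetical value chosen only for illustration, since the source does not state the base model's dimensions.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters of one LoRA pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def lora_forward(weight, lora_a, lora_b, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, only A and B train."""
    base = matvec(weight, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


HIDDEN_DIM = 3072  # hypothetical layer width, not taken from the source
RANKS = [128, 256, 512, 1024]  # ranks listed in the ablation table

full = HIDDEN_DIM * HIDDEN_DIM
for rank in RANKS:
    count = lora_param_count(HIDDEN_DIM, HIDDEN_DIM, rank)
    print(f"rank {rank:4d}: {count:9d} trainable ({100.0 * count / full:.1f}% of W)")

# Tiny forward pass: B starts at zero, so the adapted layer initially matches
# the frozen base layer and the pre-trained behavior is untouched.
weight = [[1.0, 2.0], [3.0, 4.0]]
lora_a = [[0.5, -0.5]]    # rank 1, d_in 2
lora_b = [[0.0], [0.0]]   # d_out 2, rank 1, zero-initialized
y = lora_forward(weight, lora_a, lora_b, [1.0, 1.0], alpha=1.0, rank=1)
print(y)
```

The zero initialization of B is the usual LoRA convention. It means training starts from the base model's outputs, consistent with the abstract's point about preserving the pre-trained generative priors.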