Acta Geodaetica et Cartographica Sinica, 2025, Vol. 54, Issue (12): 2233-2246. doi: 10.11947/j.AGCS.2025.20250293

• Photogrammetry and Remote Sensing •

Small-sample classification of hyperspectral images based on mixed CNN-ViT feature optimization

Jin ZHANG1, Fan FENG1, Chenguang DAI1, Zhenchao ZHANG1, Ying YU1, Bing LIU2

  1. Institute of Geospatial Information, Information Engineering University, Zhengzhou 450001, China
  2. Institute of Data and Target Engineering, Information Engineering University, Zhengzhou 450001, China
  • Received: 2025-07-17 Revised: 2025-11-13 Online: 2026-01-15 Published: 2026-01-15
  • Contact: Fan FENG E-mail: zhangjrs0802@163.com; fengrs1991@163.com
  • About author: ZHANG Jin (1994-), female, PhD candidate, majoring in deep learning and remote sensing image classification. E-mail: zhangjrs0802@163.com
  • Supported by:
    The National Natural Science Foundation of China (42071340); Program of Song Shan Laboratory (included in the management of the Major Science and Technology Program of Henan Province) (221100211000-04)

Abstract:

Hyperspectral image classification is a key technology for fine-grained recognition of ground objects. With advances in imaging technology, the spatial resolution of hyperspectral images acquired by UAV platforms has improved significantly, bringing both new opportunities and new challenges for fine-grained land-cover classification. However, under small-sample conditions, existing deep neural networks still fail to learn sufficiently comprehensive features from high-spatial-resolution hyperspectral images. To address this issue, this paper proposes a mixed feature optimization method that combines convolutional neural networks (CNN) and the vision Transformer (ViT) and comprises three components: adaptive spatial-spectral feature learning, bidirectional feature integration, and multi-segment feature interaction enhancement. First, multi-scale 3D spatial-spectral features and 2D local self-attention features are incorporated into a cascaded residual structure to achieve global-local multi-scale spatial-spectral feature extraction, enriching the learned features. Then, spatial and channel features are integrated from two directions to capture correlations across both dimensions, so that the features extracted by the CNN and the ViT complement and enhance each other. The resulting multi-stage features are fused and fed into a factorized second-order pooling layer, which addresses the large discrepancies and insufficient interaction among them. Finally, the fine-grained fused features are passed to a fully connected layer for classification. Small-sample classification experiments were conducted on three high-spatial-resolution hyperspectral image datasets, namely LongKou, HanChuan, and HongHu, using only five training samples per land-cover class. The proposed method achieved classification accuracies of 94.00%, 83.24%, and 87.63%, respectively, demonstrating its effectiveness under small-sample conditions.
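
The abstract does not give the formulation of the factorized second-order pooling layer. As an illustration only, the sketch below shows one common low-rank form of second-order pooling, which may differ from the authors' layer. The function names, the projection matrices U and V, and the sizes N, C, and R are all assumptions introduced here. The point it demonstrates is that the Hadamard product of two low-rank projections, summed over spatial positions, equals a bilinear form on the second-order statistic X^T X without ever building that C x C matrix.

```python
import math
import random


def factorized_second_order_pool(features, U, V):
    """Low-rank second-order pooling of one feature map (illustrative sketch).

    features: list of N feature vectors (one per spatial position), each of length C.
    U, V:     C x R projection matrices (hypothetical learned weights).
    Returns z with z[k] = sum_n (u_k . x_n) * (v_k . x_n) = u_k^T (X^T X) v_k,
    computed without forming the C x C matrix X^T X.
    """
    C, R = len(U), len(U[0])
    z = [0.0] * R
    for x in features:
        for k in range(R):
            pu = sum(x[c] * U[c][k] for c in range(C))
            pv = sum(x[c] * V[c][k] for c in range(C))
            z[k] += pu * pv
    return z


def normalize(z, eps=1e-12):
    """Signed square root followed by L2 normalization (a common post-processing choice)."""
    s = [math.copysign(math.sqrt(abs(v)), v) for v in z]
    norm = math.sqrt(sum(v * v for v in s)) + eps
    return [v / norm for v in s]


def explicit_second_order_pool(features, U, V):
    """Reference version that builds the full C x C matrix G = X^T X first."""
    C, R = len(U), len(U[0])
    G = [[sum(x[i] * x[j] for x in features) for j in range(C)] for i in range(C)]
    return [
        sum(U[i][k] * G[i][j] * V[j][k] for i in range(C) for j in range(C))
        for k in range(R)
    ]


rng = random.Random(0)
N, C, R = 9, 6, 4  # 3x3 patch positions, 6 channels, rank-4 descriptor (illustrative sizes)
X = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(N)]
U = [[rng.uniform(-1, 1) for _ in range(R)] for _ in range(C)]
V = [[rng.uniform(-1, 1) for _ in range(R)] for _ in range(C)]

z_fact = factorized_second_order_pool(X, U, V)   # factorized path
z_full = explicit_second_order_pool(X, U, V)     # explicit second-order path
descriptor = normalize(z_fact)                   # vector a classifier layer could consume
```

In this form the cost per position scales with C x R rather than C x C, which is the usual motivation for factorizing second-order statistics when the fused feature dimension is large.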

Key words: hyperspectral image classification, mixed convolutional network, local self-attention, factorized second-order pooling, multi-feature optimization, small sample
