Acta Geodaetica et Cartographica Sinica ›› 2025, Vol. 54 ›› Issue (4): 675-687.doi: 10.11947/j.AGCS.2025.20240310

• China's 3D Realistic Model Construction •

Cross-modal contrastive masked autoencoder pre-training for 3D real-scene point cloud

Qingdong WANG1, Tengfei WANG2, Li ZHANG1

  1. Chinese Academy of Surveying and Mapping, Beijing 100036, China
    2.School of Geodesy and Geomatics, Wuhan University, Wuhan 430079, China
  • Received:2024-07-29 Published:2025-05-30
  • Contact: Li ZHANG E-mail:wangqd@casm.ac.cn;zhangl@casm.ac.cn
  • About author:WANG Qingdong (1986—), male, PhD, associate researcher, majors in intelligent processing of 3D point clouds. E-mail: wangqd@casm.ac.cn
  • Supported by:
    The National Key Research and Development Program of China(2023YFB3907600);The Fundamental Research Funds for CASM(AR2424)

Abstract:

Existing single-modality pre-trained models remain susceptible to the disorder and sparsity of point clouds, making it difficult to meet the requirements of the diverse downstream tasks in 3D real-scene construction. To further enhance the performance of pre-trained models, this paper proposes a cross-modal contrastive masked autoencoder pre-training method in which the 2D image modality assists the 3D point cloud modality, building on multi-modal learning, contrastive learning, and masked autoencoding pre-training theories. The network consists of two branches: an intra-modal branch, a contrastive masked autoencoder architecture that learns more comprehensive feature information, and a cross-modal branch, a 2D/3D cross-modal contrastive learning architecture that improves robustness to disordered and sparse data. To verify the effectiveness of the proposed method, we conduct a series of downstream-task experiments, including masked point cloud reconstruction, classification, few-shot classification, and segmentation, on the ShapeNet, ModelNet40, and ScanObjectNN datasets. The results indicate that the proposed method exhibits superior transferability compared with existing methods.
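The paper itself provides no code; the following is a minimal, illustrative sketch of the two ingredients the abstract names: random masking of a point cloud (as in a masked autoencoder) and a symmetric InfoNCE contrastive loss aligning point cloud embeddings with paired 2D image embeddings. All names, shapes, and the stand-in linear encoder are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(z_a, z_b, tau=0.07):
    """Symmetric InfoNCE loss; matched rows of z_a and z_b are positive pairs."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau          # (N, N) cosine-similarity logits
    idx = np.arange(len(z_a))           # positives sit on the diagonal
    def ce(l):                          # cross-entropy with diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[idx, idx].mean()
    return 0.5 * (ce(logits) + ce(logits.T))   # average both directions

def mask_points(points, ratio=0.6):
    """Randomly drop `ratio` of the points; return visible points and masked indices."""
    n = len(points)
    perm = rng.permutation(n)
    n_vis = int(n * (1 - ratio))
    return points[perm[:n_vis]], perm[n_vis:]

# Toy batch: 4 point clouds of 128 points, paired with 32-d image embeddings
points = rng.normal(size=(4, 128, 3))
img_emb = rng.normal(size=(4, 32))      # output of a hypothetical 2D image encoder

# Stand-in point encoder: mean-pool the visible points, project to 32-d.
# A real model would use a transformer encoder and also decode the masked
# points for a Chamfer-style reconstruction loss; only the cross-modal
# contrastive term is shown here.
W = rng.normal(size=(3, 32))
pc_emb = np.stack([mask_points(p)[0].mean(axis=0) @ W for p in points])

loss = info_nce(pc_emb, img_emb)
print(f"cross-modal InfoNCE loss: {loss:.4f}")
```

In the full method this contrastive term would be summed with the intra-modal reconstruction loss, so the encoder is trained to both rebuild masked geometry and match the paired image view.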

Key words: point cloud, self-supervised learning, masked autoencoder, contrastive learning, pre-training, 3D real scene
