Acta Geodaetica et Cartographica Sinica ›› 2025, Vol. 54 ›› Issue (4): 675-687. doi: 10.11947/j.AGCS.2025.20240310

• Real-Scene 3D China Construction •


Cross-modal contrastive masked autoencoder pre-training for 3D real-scene point clouds

Qingdong WANG1, Tengfei WANG2, Li ZHANG1

  1. Chinese Academy of Surveying and Mapping, Beijing 100036, China
    2. School of Geodesy and Geomatics, Wuhan University, Wuhan 430079, China
  • Received: 2024-07-29; Published: 2025-05-30
  • Corresponding author: Li ZHANG. E-mail: wangqd@casm.ac.cn; zhangl@casm.ac.cn
  • About the author: WANG Qingdong (1986—), male, PhD, associate researcher; his research focuses on intelligent processing of 3D point clouds. E-mail: wangqd@casm.ac.cn
  • Supported by:
    The National Key Research and Development Program of China (2023YFB3907600); The Fundamental Research Funds for the Chinese Academy of Surveying and Mapping (AR2424)


Abstract:

Existing single-modality pre-trained models remain susceptible to the unordered and sparse nature of point clouds, making it difficult to meet the requirements of diverse downstream tasks in 3D real-scene construction. To further enhance the performance of pre-trained models, this paper proposes a cross-modal contrastive masked autoencoder pre-training method that uses the 2D image modality to assist the 3D point cloud modality, building on multi-modal learning, contrastive learning, and masked autoencoding. The network consists of two branches: an intra-modal branch, a contrastive masked autoencoder architecture that learns more comprehensive feature information, and a cross-modal branch, a 2D/3D cross-modal contrastive learning architecture that improves robustness to unordered and sparse data. To verify the effectiveness of the proposed method, we conduct a series of downstream-task experiments, including masked point cloud reconstruction, classification, few-shot classification, and segmentation, on datasets such as ShapeNet, ModelNet40, and ScanObjectNN. The results indicate that the proposed method exhibits superior transferability compared with existing methods.
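Cross-modal contrastive objectives of the kind described here are commonly implemented as a symmetric InfoNCE loss between paired embeddings of the two modalities. The following is a minimal NumPy sketch of that general idea, not the authors' implementation; the function name, batch shapes, and temperature value are illustrative assumptions.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    z_a, z_b: (N, D) arrays; row i of z_a and row i of z_b form a
    positive pair (e.g. a 3D point-cloud embedding and the embedding
    of its corresponding 2D image); all other rows act as negatives.
    """
    # L2-normalise so the dot product is a cosine similarity
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (N, N) similarity matrix

    def cross_entropy(l):
        # cross-entropy with the diagonal (matched pair) as the target
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # average the 3D-to-2D and 2D-to-3D directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In a setting like the paper's, `z_a` could be pooled 3D token embeddings and `z_b` projected image embeddings; perfectly aligned pairs yield a low loss, while mismatched pairings yield a high one.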

Key words: point cloud, self-supervised learning, masked autoencoder, contrastive learning, pre-training, 3D real scene
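Masked point-cloud reconstruction, one of the downstream tasks listed above, is typically scored with a Chamfer distance between predicted and ground-truth point sets, since point clouds are unordered. A hedged sketch of that standard metric (not necessarily the exact loss used in this paper):

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3).

    For each point in one set, find its nearest neighbour in the other
    set; average both directions. Order-invariant, so it suits
    unordered point-cloud reconstruction targets.
    """
    # pairwise Euclidean distances, shape (N, M)
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

The metric is zero only when every point in each set coincides with a point in the other, which is why it serves as a reconstruction loss for masked patches.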
