Acta Geodaetica et Cartographica Sinica (测绘学报)

• Academic Papers •

An Improved High-Dimensional Data Clustering Algorithm Based on Similarity Preservation and Feature Transformation

WANG Jiayao (王家耀), XIE Mingxia (谢明霞), GUO Jianzhong (郭建忠), CHEN Ke (陈科)

  • Received: 2010-03-10 Revised: 2010-09-27 Online: 2011-06-25 Published: 2011-06-25
  • Corresponding author: XIE Mingxia (谢明霞)

The Research on High Dimensional Data Clustering Based on Improved Feature Transformation

  • Received: 2010-03-10 Revised: 2010-09-27 Online: 2011-06-25 Published: 2011-06-25

Abstract:

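This paper studies three aspects, namely similarity measurement, feature transformation, and the generation of a dimensionality-reduction converter, and proposes a high-dimensional data clustering algorithm based on improved feature transformation. First, the similarity measure function HDsim designed in the paper is used to compute the similarity matrix of the objects in the high-dimensional space; the similarity matrix is converted into a distance matrix, the neighborhood graph corresponding to the distance matrix is built with the nearest-neighbor method, and the shortest-path distance matrix is obtained with the Floyd shortest-path algorithm. Second, the two-dimensional mapping of the data objects in the high-dimensional space (converting the high-dimensional data into two-dimensional data such that the Euclidean distances between objects in the two-dimensional space approach the shortest-path distances between the same objects in the high-dimensional space) is cast as an optimization problem, a corresponding fitness function is designed, and the problem is solved with a genetic algorithm. Finally, k-means clustering is performed on the reduced two-dimensional coordinates, and an RBF neural network is trained on the value pairs (coordinates of a data point in the high-dimensional space, coordinates of the same point in the reduced two-dimensional space). The trained network is saved; when a new data object arrives, the network maps it into two dimensions, and its cluster membership is determined by its distance to each cluster center. Cluster analysis of the iris and zoo data sets from the UCI machine learning repository verifies the effectiveness of the proposed high-dimensional data clustering algorithm.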
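The first stage described above (distance matrix, neighborhood graph, Floyd shortest paths) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define HDsim, so plain Euclidean distance stands in for the HDsim-derived distance matrix, and the function names `knn_graph` and `floyd_shortest_paths` are hypothetical.

```python
import numpy as np

def knn_graph(dist, k):
    """Keep each object's k nearest neighbours; every other edge becomes inf."""
    n = dist.shape[0]
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)
    for i in range(n):
        order = np.argsort(dist[i])
        neighbours = order[order != i][:k]          # drop the object itself
        graph[i, neighbours] = dist[i, neighbours]
        graph[neighbours, i] = dist[i, neighbours]  # keep the graph symmetric
    return graph

def floyd_shortest_paths(graph):
    """Floyd-Warshall: all-pairs shortest-path distances on the graph."""
    d = graph.copy()
    for m in range(d.shape[0]):
        # allow paths that pass through intermediate vertex m
        d = np.minimum(d, d[:, m, None] + d[None, m, :])
    return d

# Assumption: Euclidean distance replaces the HDsim-based distance matrix.
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.5], [0.0, 1.5]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
geodesic = floyd_shortest_paths(knn_graph(dist, k=2))
print(geodesic[0, 2], dist[0, 2])  # path length along the graph vs. straight line
```

Objects that are not connected through the neighborhood graph keep a distance of `inf`, so in practice `k` has to be large enough for the graph to be connected.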

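Key words: feature transformation, high-dimensional data clustering, similarity, dimensionality reduction, genetic algorithm, radial basis function neural network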

Abstract:

This paper studies similarity measurement, feature transformation, and the design of a dimensionality-reduction converter, and proposes a high-dimensional data clustering algorithm. First, the similarity matrix of the high-dimensional data is computed with the similarity measure function HDsim designed in the paper and converted into a distance matrix. A neighborhood graph is constructed from the distance matrix by nearest-neighbor search, and the shortest-path distance matrix is obtained with the Floyd algorithm. Then, the dimensionality-reduction process is cast as an optimization problem, a fitness function is designed, and the problem is solved with a genetic algorithm. Finally, the reduced data are clustered with k-means, and the value pairs of high-dimensional coordinates and their reduced 2D coordinates are used to train an RBF neural network, which is then saved. A new object is mapped to 2D by the trained network, and its cluster membership is determined from its distance to each current cluster center. Cluster analysis of the iris and zoo data sets from the UCI machine learning repository demonstrates the validity of the proposed algorithm.
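The optimization step can be illustrated with a toy genetic algorithm. The abstract does not give the fitness function or the genetic operators, so everything below is an assumption made for illustration: the fitness is taken to be the sum of squared gaps between 2D Euclidean distances and the target distances, and the operators (size-2 tournament selection, uniform crossover over whole points, Gaussian mutation, one elite per generation) and all names are hypothetical.

```python
import numpy as np

def pairwise_euclidean(points):
    """Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def mapping_fitness(coords_2d, target_dist):
    """Assumed fitness: sum of squared gaps between 2D distances and
    target (shortest-path) distances. Lower is better."""
    gap = pairwise_euclidean(coords_2d) - target_dist
    upper = np.triu_indices(len(coords_2d), k=1)  # count each pair once
    return float((gap[upper] ** 2).sum())

def evolve_mapping(target_dist, pop_size=40, generations=200, sigma=0.1, seed=0):
    """Toy genetic algorithm searching for 2D coordinates of all objects."""
    rng = np.random.default_rng(seed)
    n = target_dist.shape[0]
    pop = rng.normal(size=(pop_size, n, 2))  # each individual is an (n, 2) layout
    for _ in range(generations):
        scores = np.array([mapping_fitness(ind, target_dist) for ind in pop])
        children = [pop[np.argmin(scores)].copy()]  # elitism: keep the best layout
        while len(children) < pop_size:
            # tournament of size 2 picks each parent
            a, b = rng.integers(pop_size, size=2)
            mother = pop[a] if scores[a] < scores[b] else pop[b]
            a, b = rng.integers(pop_size, size=2)
            father = pop[a] if scores[a] < scores[b] else pop[b]
            mask = rng.random((n, 1)) < 0.5  # inherit whole points from one parent
            child = np.where(mask, mother, father)
            children.append(child + rng.normal(scale=sigma, size=child.shape))
        pop = np.array(children)
    scores = np.array([mapping_fitness(ind, target_dist) for ind in pop])
    return pop[np.argmin(scores)]
```

Passing a shortest-path distance matrix as `target_dist` yields one 2D coordinate pair per object; those coordinates are what the abstract then feeds to k-means and uses as training targets for the RBF network.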

Key words: feature transformation, high dimensional data clustering, similarity measure, dimensionality reduction, genetic algorithm, RBF neural network