Acta Geodaetica et Cartographica Sinica, 2016, Vol. 45, Issue (5): 616-622. doi: 10.11947/j.AGCS.2016.20150181

• Cartography and Geoinformation •

A Bootstrapping Method for Open Geo-entity Relation Extraction

YU Li (余丽)1,2, LU Feng (陆锋)1,3, LIU Xiliang (刘希亮)1

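  1. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101;
    2. University of Chinese Academy of Sciences, Beijing 100101;
    3. Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing, Jiangsu 210023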
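  • Received: 2015-04-07 Revised: 2016-02-02 Online: 2016-05-20 Published: 2016-05-30
  • Corresponding author: LU Feng E-mail: luf@lreis.ac.cn
  • About the author: YU Li (1986-), female, PhD candidate; research interest: Internet spatial information search. E-mail: yul@lreis.ac.cn
  • Supported by:
    National Natural Science Foundation of China (41271408); National 863 Program of China (2013AA120305)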

A Bootstrapping Based Approach for Open Geo-entity Relation Extraction

YU Li1,2, LU Feng1,3, LIU Xiliang1   

  1. State Key Lab of Resources and Environmental Information System, The Institute of Geographic Sciences and Natural Resources Research, Beijing 100101, China;
    2. University of Chinese Academy of Sciences, Beijing 100101, China;
    3. Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing 210023, China
  • Received: 2015-04-07 Revised: 2016-02-02 Online: 2016-05-20 Published: 2016-05-30
  • Supported by:
    The National Natural Science Foundation of China (No. 41271408); The National High-Tech Research and Development Program of China (863 Program) (No. 2013AA120305)

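Abstract: Extracting spatial and semantic relations between geo-entities from Web text demands high timeliness and strong robustness. This paper proposes a method for automatically extracting open geo-entity relations: bootstrapping is used to gather statistics on the part-of-speech, position, and distance features of terms, from which the weight of each term in its context is computed; the keywords that describe the geo-entity relation are then identified and finally organized into structured instances. Experiments were conducted with Baidu Baike and Stanford CoreNLP. The results show that the method automatically mines part of the lexical features of natural language, requires neither domain expert knowledge nor a large annotated corpus, and is suited to information extraction tasks in which the relation types are unknown. Compared with the classical frequency-based statistical methods Frequency, TF-IDF, and PPMI, precision and recall improve by about 5% and 23%, respectively.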

Keywords: text mining, geo-entity, relation extraction, quantitative evaluation, bootstrapping

Abstract: Extracting spatial and semantic relations between two geo-entities from Web texts calls for robust and timely solutions. This paper puts forward a novel approach. Firstly, the characteristics of terms (part-of-speech, position and distance) are analyzed by means of bootstrapping. Secondly, the weight of each term is calculated and the keyword is picked out as the clue to the geo-entity relation. Thirdly, the geo-entity pairs and their keywords are organized into structured information. Finally, an experiment is conducted with Baidu Baike and Stanford CoreNLP. The study shows that the presented method can automatically explore part of the lexical features and find additional relational terms, requiring neither domain expert knowledge nor large-scale annotated corpora. Moreover, compared with three classical frequency statistics methods, namely Frequency, TF-IDF and PPMI, precision and recall are improved by about 5% and 23%, respectively.

Key words: text mining, geo-entities, relation extraction, quantitative evaluation, bootstrapping

CLC number:
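
For readers unfamiliar with the three frequency-based baselines named in the abstract (Frequency, TF-IDF and PPMI), the following is a minimal sketch of how each might score candidate relation terms within a context. It is not the authors' bootstrapping method, whose weighting formula is not given on this page. The function name `score_terms`, the input layout (one token list per geo-entity-pair context, with the entity names already removed) and the toy tokens are all assumptions made for illustration.

```python
import math
from collections import Counter


def score_terms(contexts):
    """Score every term in every context with three baseline statistics.

    contexts: list of token lists, one per entity-pair context
              (hypothetical input layout, assumed for this sketch).
    Returns one dict per context: term -> {"frequency", "tfidf", "ppmi"}.
    """
    n_ctx = len(contexts)
    total = sum(len(c) for c in contexts)                     # all token occurrences
    term_count = Counter(t for c in contexts for t in c)      # corpus-wide counts
    doc_freq = Counter(t for c in contexts for t in set(c))   # contexts containing term

    scores = []
    for ctx in contexts:
        row = {}
        for term, n in Counter(ctx).items():
            # TF-IDF: relative frequency in the context times log inverse context frequency
            tfidf = (n / len(ctx)) * math.log(n_ctx / doc_freq[term])
            # PPMI between term and context: max(0, log2 p(t,c) / (p(t) p(c)))
            p_tc = n / total
            p_t = term_count[term] / total
            p_c = len(ctx) / total
            ppmi = max(0.0, math.log2(p_tc / (p_t * p_c)))
            row[term] = {"frequency": n, "tfidf": tfidf, "ppmi": ppmi}
        scores.append(row)
    return scores


# Toy contexts (invented for illustration, not data from the paper)
contexts = [
    ["flows", "through"],
    ["is", "the", "capital", "of"],
    ["flows", "through", "the", "city"],
]
for row in score_terms(contexts):
    print(sorted(row, key=lambda t: row[t]["ppmi"], reverse=True))
```

In this sketch a term that occurs in many contexts, such as a function word, receives lower TF-IDF and PPMI scores than a term concentrated in one context, which is the usual motivation for preferring these statistics over raw frequency.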