Structured POI Data Extraction from Internet News
POI (Point of Interest) data is a key resource for GPS applications. Manual POI collection is expensive and time-consuming. This paper presents a novel approach that automatically extracts structured POI data from Internet news articles. The procedure consists of filtering out noisy news documents using POI linguistic features, performing lexical analysis on the remaining texts with ICTCLAS2010, identifying time expressions and the full names of POI locations and organizations, extracting the relationships between entities, and producing structured data for a given POI event based on an extraction model. The POI extraction model is computed from term frequency and word distance, without any syntactic analysis, scenario templates, or relationship induction. Consistency and validity checks are applied to refine the results. In an open test on 1,000 news articles, precision was 97.30% and recall was 75.48%. The approach has been applied in industrial POI collection. POI-oriented event extraction is shown to be effective.
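The abstract states only that the extraction model is computed from term frequency and word distance; it gives no formula. The following is a minimal sketch of one way such a score could look, not the authors' model. The function name `rank_candidates`, the input layout, and the scoring rule (frequency divided by one plus the token distance to an event trigger word) are all assumptions made for illustration.

```python
from collections import Counter


def rank_candidates(trigger_index, candidates):
    """Rank candidate entity mentions for one POI event trigger.

    trigger_index: token position of the event trigger word (assumed input).
    candidates: list of (entity_text, token_index) pairs found in the text.
    Returns (entity_text, score) pairs, highest score first.
    """
    # Term frequency: how often each candidate entity is mentioned.
    freq = Counter(text for text, _ in candidates)

    scores = {}
    for text, idx in candidates:
        # Word distance: number of tokens between mention and trigger.
        distance = abs(idx - trigger_index)
        # Hypothetical scoring rule: frequent and nearby mentions score higher.
        score = freq[text] / (1 + distance)
        # Keep the best-scoring mention of each entity.
        scores[text] = max(scores.get(text, 0.0), score)

    # Sort by descending score, then by name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy example: trigger word at token 3, three candidate mentions.
ranked = rank_candidates(
    trigger_index=3,
    candidates=[("Starbucks", 2), ("Wangfujing", 7), ("Starbucks", 9)],
)
```

In this toy input, "Starbucks" is mentioned twice and once directly beside the trigger, so it ranks above "Wangfujing". A score of this kind needs no parser or template, which is in line with the abstract's claim that no syntactic analysis is used.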
information extraction; extraction model; relation extraction; POI; ICTCLAS2010
Hua-Ping Zhang, Qian Mo, He-Yan Huang
School of Computer Sciences Institute of Computing Technology, Beijing, P.R.C 100081; Beijing Technology and Business University, Beijing, P.R.C 100048
International conference
2010 4th International Universal Communication Symposium (IUCS 2010)
Beijing
English
116-121
2010-10-18 (date first posted online on the Wanfang platform; not the paper's publication date)