TITLE EXTRACTION FROM LOOSELY STRUCTURED DATA RECORDS

Abstract:

In this paper, we present a novel method for extracting titles from Loosely Structured Data Records (LSDRs). We first automatically identify the format of the titles and then extract them accordingly. For a Web page whose title occurs in all the Data Records, we select, from the candidate titles, the one with the greatest length of shared content as the accurate title. For a Web page whose title occurs before the first Data Record, the candidate title with the greatest length of differing content is taken as the accurate title. Our experiments demonstrate that the automatic algorithm is robust and effective on two databases collected from the Internet.
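
The two selection rules in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define how "same" and "different" content are measured, so the sketch assumes a longest-common-substring measure, and every name in it (`shared_length`, `pick_repeated_title`, `pick_leading_title`, `reference_texts`) is hypothetical.

```python
from difflib import SequenceMatcher


def shared_length(a: str, b: str) -> int:
    """Length of the longest contiguous run of characters common to a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def pick_repeated_title(candidates: list[str], other_records: list[str]) -> str:
    """Case 1 (assumed reading): the title recurs in every data record.

    Each candidate is scored by the amount of its content that is shared
    with *all* of the other records (the minimum over records), and the
    candidate with the largest shared length is returned.
    """
    def score(candidate: str) -> int:
        return min(shared_length(candidate, record) for record in other_records)

    return max(candidates, key=score)


def pick_leading_title(candidates: list[str], reference_texts: list[str]) -> str:
    """Case 2 (assumed reading): the title sits before the first data record.

    Each candidate is scored by how much of it is NOT covered by its best
    match in the reference texts. What the candidates should be compared
    against is an assumption here (for example, boilerplate shared by
    other pages of the same site).
    """
    def score(candidate: str) -> int:
        best_overlap = max(shared_length(candidate, ref) for ref in reference_texts)
        return len(candidate) - best_overlap

    return max(candidates, key=score)


if __name__ == "__main__":
    # Hypothetical forum thread: fields of the first post, text of later posts.
    first_record_fields = ["alice", "Install guide for Linux", "Posted 2008-01-02"]
    later_records = [
        "bob Re: Install guide for Linux Posted 2008-01-03",
        "carol Re: Install guide for Linux Posted 2008-01-04",
    ]
    print(pick_repeated_title(first_record_fields, later_records))

    # Hypothetical page header: text segments found before the first record.
    leading_segments = ["Home | Forums | Login", "Installing Python on Linux"]
    site_boilerplate = ["Home | Forums | Login | Register"]
    print(pick_leading_title(leading_segments, site_boilerplate))
```

In the first case the field repeated across all records scores highest; in the second, the segment that overlaps least with the assumed boilerplate does. A real system would also need the record segmentation and title-format detection steps, which the abstract mentions but does not describe.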

Keywords: Title extraction; Structured data records; Forum data; Loosely structured data records

Authors: YI-PU WU, XUE-JIE ZHANG, QING LI, JING CHEN

Affiliations: Department of Computer Science and Engineering, Yunnan University, Kunming 650091, China; Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong

Conference type: International conference

Conference name: 2008 International Conference on Machine Learning and Cybernetics

Conference location: Kunming

Conference language: English

Pages: 2623-2628

Online publication date: 2008-07-12 (date first posted on the Wanfang platform; not the paper's publication date)
