Data Extraction and Cleansing of Semi-Structured Chinese Texts

Abstract:

The rapid growth of data mining generates an ever-increasing demand for automatic information extraction from Chinese texts. However, existing approaches in this domain focus on well-structured Chinese texts and therefore have difficulty dealing with semi-structured Chinese texts, which do not conform to strict syntactic structures. In this paper, we propose an approach to semi-automatic data extraction and cleansing for such texts. Preliminary experimental results show that, with modest manual intervention, it can effectively extract information from raw semi-structured Chinese texts collected from e-business applications.

Keywords: data extraction; data cleansing; semi-structured text; Chinese; manual intervention

Authors: Wei-Heng ZHU, Shun LONG

Affiliations: Dept. of Computer Science, Jinan University, Guangzhou, P.R. China; Guangdong Emergency Technology Research Center of Risk Evaluation and Prewarning on Public Network S

Conference type: International conference

Conference name: 2011 International Conference on Business Management and Electronic Information (BMEI2011)

Conference location: Guangzhou

Conference language: English

Pages: 1-4

Online publication date: 2011-05-13 (date first made available on the Wanfang platform; not the paper's publication date)
