Data Extraction and Cleansing of Semi-Structured Chinese Texts
The rapid growth of data mining generates an ever-increasing demand for automatic information extraction from Chinese texts. However, existing approaches in this domain focus on well-structured Chinese texts and therefore have difficulty dealing with semi-structured Chinese texts, which do not conform to strict syntactic structures. We propose in this paper an approach to semi-automatic data extraction and cleansing for these texts. Preliminary experimental results show that, with modest manual intervention, it can effectively extract information from raw semi-structured Chinese texts collected from e-business applications.
data extraction; data cleansing; semi-structured text; Chinese; manual intervention
Wei-Heng ZHU, Shun LONG
Dept. of Computer Science, Jinan University, Guangzhou, P.R. China; Guangdong Emergency Technology Research Center of Risk Evaluation and Prewarning on Public Network S
International conference
Guangzhou
English
1-4
2011-05-13 (date first made available online on the Wanfang platform; not the paper's publication date)