An N-Gram-and-Wikipedia Joint Approach to Natural Language Identification
Natural Language Identification is the process of detecting and determining in which language or languages a given piece of text is written. As one of the key steps in Computational Linguistics/Natural Language Processing (NLP) tasks such as Machine Translation, Multi-lingual Information Retrieval, and Processing of Language Resources, Natural Language Identification has drawn widespread attention and extensive research, making it one of the few relatively well-studied sub-fields of NLP. However, various problems in this field remain far from resolved. Current non-computational approaches require that researchers possess sufficient prior linguistic knowledge about the languages to be identified, while current computational (statistical) approaches demand a large-scale training set for each language to be identified. The drawbacks of both are apparent: few computer scientists are equipped with sufficient knowledge of Linguistics, and the training sets may grow ever larger in pursuit of higher accuracy and the ability to process more languages. Moreover, faced with multi-lingual documents on the Internet, neither approach produces satisfactory results. To address these problems, this paper proposes a new approach to Natural Language Identification. It exploits N-Gram frequency statistics to segment a piece of text in a language-specific fashion, and then takes advantage of Wikipedia to determine the language used in each segment. Multiple experiments demonstrate that this approach yields satisfactory results, especially on multi-lingual documents.
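The abstract describes the pipeline only at a high level. The following Python sketch illustrates the general idea under stated assumptions; it is not the authors' implementation. The character tri-gram size, the cosine-similarity threshold, the sentence-level windows, and the use of the MediaWiki `opensearch` endpoint as the Wikipedia lookup are all assumptions made for illustration.

```python
# Illustrative sketch only: n-gram profiles of adjacent windows are compared
# (in the spirit of TextTiling) to place language-change boundaries, and a
# hypothetical Wikipedia title lookup then votes on the language of each
# segment. Parameters and the lookup strategy are assumptions, not the paper's.
import json
import urllib.parse
import urllib.request
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Character n-gram frequency profile of a chunk of text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(p, q):
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def segment(sentences, threshold=0.25):
    """Cut the sentence sequence where adjacent n-gram profiles diverge.

    The threshold is an assumed, tunable value."""
    segments, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(ngram_profile(prev), ngram_profile(cur)) < threshold:
            segments.append(" ".join(current))
            current = []
        current.append(cur)
    segments.append(" ".join(current))
    return segments


def wikipedia_votes(segment_text, languages=("en", "zh", "de", "fr")):
    """Count, per Wikipedia language edition, how many words of the segment
    match an article title via the opensearch API (an assumed lookup)."""
    votes = Counter()
    for word in segment_text.split()[:5]:  # a few words suffice for a sketch
        for lang in languages:
            url = (f"https://{lang}.wikipedia.org/w/api.php?action=opensearch"
                   f"&search={urllib.parse.quote(word)}&limit=1&format=json")
            with urllib.request.urlopen(url, timeout=10) as resp:
                titles = json.load(resp)[1]
            if titles:
                votes[lang] += 1
    return votes


if __name__ == "__main__":
    sentences = [
        "The quick brown fox jumps over the lazy dog.",
        "Machine translation needs reliable language identification.",
        "Die Sprache dieses Satzes ist offensichtlich Deutsch.",
        "Sprachidentifikation ist ein gut untersuchtes Problem.",
    ]
    for seg in segment(sentences):
        print(seg, "->", wikipedia_votes(seg).most_common(1))
```

In this sketch the segmentation step needs no training data at all, and the per-segment decision is delegated entirely to Wikipedia, which mirrors the abstract's claim of avoiding both hand-crafted linguistic knowledge and ever-growing monolingual training sets.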
N-Gram; Wikipedia; TextTiling Algorithm; Natural Language Identification
Xi Yang, Wenxin Liang
School of Software, Dalian University of Technology, Dalian 116620, China
Document type: International conference paper
Conference: 2010 4th International Universal Communication Symposium (IUCS 2010)
Location: Beijing
Language: English
Pages: 331-338
Online date: 2010-10-18 (date the record first went online on the Wanfang platform, not necessarily the paper's publication date)