An N-Gram-and-Wikipedia Joint Approach to Natural Language Identification
Natural Language Identification is the process of detecting and determining in which language or languages a given piece of text is written. As one of the key steps in Computational Linguistics/Natural Language Processing (NLP) tasks such as Machine Translation, Multi-lingual Information Retrieval, and Processing of Language Resources, Natural Language Identification has drawn widespread attention and extensive research, making it one of the few relatively well-studied sub-fields of NLP. However, various problems in this field remain far from resolved. Current non-computational approaches require that researchers possess sufficient prior linguistic knowledge about the languages to be identified, while current computational (statistical) approaches demand a large-scale training set for each language to be identified. The drawbacks of both are apparent: few computer scientists are equipped with sufficient knowledge of Linguistics, and the training sets may grow ever larger in pursuit of higher accuracy and the ability to process more languages. Moreover, faced with multi-lingual documents on the Internet, neither approach produces satisfactory results. To address these problems, this paper proposes a new approach to Natural Language Identification. It exploits N-Gram frequency statistics to segment a piece of text in a language-specific fashion, and then takes advantage of Wikipedia to determine the language used in each segment. Multiple experiments demonstrate that this approach yields satisfactory results, especially on multi-lingual documents.
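The abstract describes the pipeline only at a high level. The following Python sketch illustrates the general idea under stated assumptions; it is not the authors' implementation. The character tri-gram size, the cosine-similarity threshold, the sentence-level windows, and the use of the MediaWiki `opensearch` endpoint as the Wikipedia lookup are all assumptions made for illustration.

```python
# Illustrative sketch only: n-gram profiles of adjacent windows are compared
# (in the spirit of TextTiling) to place language-change boundaries, and a
# hypothetical Wikipedia title lookup then votes on the language of each
# segment. Parameters and the lookup strategy are assumptions, not the paper's.
import json
import urllib.parse
import urllib.request
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Character n-gram frequency profile of a chunk of text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(p, q):
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def segment(sentences, threshold=0.25):
    """Cut the sentence sequence where adjacent n-gram profiles diverge.

    The threshold is an assumed, tunable value."""
    segments, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(ngram_profile(prev), ngram_profile(cur)) < threshold:
            segments.append(" ".join(current))
            current = []
        current.append(cur)
    segments.append(" ".join(current))
    return segments


def wikipedia_votes(segment_text, languages=("en", "zh", "de", "fr")):
    """Count, per Wikipedia language edition, how many words of the segment
    match an article title via the opensearch API (an assumed lookup)."""
    votes = Counter()
    for word in segment_text.split()[:5]:  # a few words suffice for a sketch
        for lang in languages:
            url = (f"https://{lang}.wikipedia.org/w/api.php?action=opensearch"
                   f"&search={urllib.parse.quote(word)}&limit=1&format=json")
            with urllib.request.urlopen(url, timeout=10) as resp:
                titles = json.load(resp)[1]
            if titles:
                votes[lang] += 1
    return votes


if __name__ == "__main__":
    sentences = [
        "The quick brown fox jumps over the lazy dog.",
        "Machine translation needs reliable language identification.",
        "Die Sprache dieses Satzes ist offensichtlich Deutsch.",
        "Sprachidentifikation ist ein gut untersuchtes Problem.",
    ]
    for seg in segment(sentences):
        print(seg, "->", wikipedia_votes(seg).most_common(1))
```

In this sketch the segmentation step needs no training data at all, and the per-segment decision is delegated entirely to Wikipedia, which mirrors the abstract's claim of avoiding both hand-crafted linguistic knowledge and ever-growing monolingual training sets.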
N-Gram; Wikipedia; TextTiling Algorithm; Natural Language Identification
Xi Yang, Wenxin Liang
School of Software, Dalian University of Technology, Dalian 116620, China
Document type: International conference paper
Conference: 2010 4th International Universal Communication Symposium (IUCS 2010)
Location: Beijing
Language: English
Pages: 331-338
Online date: 2010-10-18 (date the record first went online on the Wanfang platform, not necessarily the paper's publication date)