Conference Topics

Stochastic Arabic Hybrid Diacritizer

This paper introduces a two-layer stochastic system that automatically diacritizes raw Arabic text. The first layer determines the most likely diacritics by choosing the sequence of full-form Arabic word diacritizations with maximum marginal probability, using an A* lattice search and m-gram probability estimation. When full-form words are out-of-vocabulary (OOV), the system falls back on a second layer, which factorizes each Arabic word into its possible morphological constituents (prefix, root, pattern and suffix) and then uses m-gram probability estimation and A* lattice search to select among the possible factorizations and obtain the most likely diacritization sequence. While the second layer has better coverage of possible Arabic forms, the first layer yields better disambiguation results for the same size of training corpus, especially when inferring syntactic (case-ending) diacritics. The hybrid system presented here combines the advantages of both layers. The paper details the workings of both layers and the architecture of the hybrid system. The proposed system is compared with the best-performing system known to the authors, that of Habash et al. [9], using their training and testing corpus. The word error rates are 5.5% for morphological diacritization and 9.4% for syntactic diacritization for Habash et al., against only 3.1% for morphological diacritization and 9.4% for syntactic diacritization for the proposed system.
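The two-layer idea in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the transliterated toy vocabulary `CANDIDATES`, the `BIGRAM` probabilities (a bigram stands in for the general m-gram model), the `FLOOR` smoothing value, and the `factorize_fallback` helper, which strips a single prefix rather than performing the full prefix/root/pattern/suffix factorization. The search is A* with a zero heuristic (that is, uniform-cost search), which is admissible because every step cost is a non-negative negative log-probability.

```python
import heapq
import math

# Hypothetical toy data: undiacritized word -> full-form diacritizations.
CANDIDATES = {
    "ktb": ["kataba", "kutiba"],
    "Alwld": ["Alwaladu", "Alwalada"],
}

# Hypothetical bigram probabilities P(current | previous).
BIGRAM = {
    ("<s>", "kataba"): 0.6,
    ("<s>", "kutiba"): 0.3,
    ("<s>", "wakataba"): 0.2,
    ("kataba", "Alwaladu"): 0.7,
    ("kataba", "Alwalada"): 0.2,
    ("kutiba", "Alwaladu"): 0.1,
    ("kutiba", "Alwalada"): 0.1,
}
FLOOR = 1e-6  # assumed probability for unseen bigrams


def step_cost(prev, cur):
    """Negative log-probability of moving from prev to cur."""
    return -math.log(BIGRAM.get((prev, cur), FLOOR))


def factorize_fallback(word):
    """Stand-in for the second layer: handle an OOV word by splitting off
    a prefix ("w", vocalized here as "wa") and reusing the known stem."""
    if word.startswith("w") and word[1:] in CANDIDATES:
        return ["wa" + stem for stem in CANDIDATES[word[1:]]]
    return [word]  # nothing known: leave the word undiacritized


def diacritize(words):
    """Return (best diacritized sequence, total cost) over the lattice."""
    # First layer where the word is in vocabulary, fallback otherwise.
    lattice = [CANDIDATES.get(w) or factorize_fallback(w) for w in words]
    frontier = [(0.0, 0, ("<s>",))]  # (cost so far, position, path)
    settled = {}  # (position, last word) -> cheapest cost expanded so far
    while frontier:
        cost, pos, path = heapq.heappop(frontier)
        if pos == len(lattice):
            return list(path[1:]), cost
        state = (pos, path[-1])
        if state in settled and settled[state] <= cost:
            continue  # already expanded this state at no greater cost
        settled[state] = cost
        for cand in lattice[pos]:
            new_cost = cost + step_cost(path[-1], cand)
            heapq.heappush(frontier, (new_cost, pos + 1, path + (cand,)))
    return [], math.inf


if __name__ == "__main__":
    print(diacritize(["ktb", "Alwld"]))
    print(diacritize(["wktb"]))
```

In this sketch the fallback only supplies extra lattice candidates; in the system the abstract describes, the second layer also scores sequences of morphological constituents with its own m-gram model, which the sketch does not attempt to reproduce.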

Mohsen RASHWAN; Mohammad AL BADRASHINY; Mohamed ATTIA; Sherif ABDOU; Ahmed RAFEA

Department of Electronics & Electrical Communications, Faculty of Eng., Cairo Univ., Egypt; The Engineering Company for the Development of Computer Systems RDI; The Engineering Company for the Development of Computer Systems; Faculty of Computers & Information, Cairo Univ., Egypt; Department of Computer Science, American University in Cairo (AUC), Egypt

International conference

International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE 2009)

Dalian

English

1-8

2009-09-24 (date first made available online on the Wanfang platform; not the paper's publication date)