Conference paper

Sentence Alignment for Web Page Text based on Vector Space Model

Parallel web pages contain noisy, non-parallel sentences. Web page structure is subject to some limitations for the sentence alignment task on web page text. The most straightforward way of aligning sentences is to use a translation lexicon. However, a major obstacle to this approach is the lack of a dictionary for training. This paper presents a method for automatically aligning Mongolian-Chinese parallel text on the Web via a vector space model. The vector space model is an algebraic model for representing any object as a vector of identifiers, such as index terms. In the statistically based vector space model, a sentence is conceptually represented by a vector of keywords extracted from the text. The extracted keywords are composed of content words, known as terms, and the weight of a term in a sentence vector can be determined by the tf-idf method. CHI is used to compute the association between bilingual words. Once the term weights are determined, the similarity between sentence vectors is computed via the cosine measure. The experimental results indicate that the method is accurate and efficient enough to apply without human intervention.
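The weighting and similarity steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact tf-idf variant, so raw term frequency times log(N/df) is assumed here; CHI is assumed to mean the chi-square statistic over a 2x2 co-occurrence table; and the function names `tfidf_vectors`, `cosine`, and `chi_square` are hypothetical. The sketch also assumes both sentences are already expressed over a shared term space, since the abstract does not describe how bilingual associations are folded into the vectors.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Map each tokenized sentence to a sparse {term: weight} vector.

    Assumed weighting: raw term frequency * log(N / document frequency),
    where each sentence counts as one "document".
    """
    n = len(sentences)
    df = Counter(term for sent in sentences for term in set(sent))
    vectors = []
    for sent in sentences:
        tf = Counter(sent)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def chi_square(a, b, c, d):
    """Chi-square association for a word pair from a 2x2 contingency table.

    a: aligned units containing both words
    b: units containing only the source-language word
    c: units containing only the target-language word
    d: units containing neither word
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


if __name__ == "__main__":
    # Toy, hypothetical data: three already-tokenized sentences.
    sents = [["web", "page", "text"], ["web", "sentence"], ["vector", "model"]]
    vecs = tfidf_vectors(sents)
    print(cosine(vecs[0], vecs[1]))   # sentences 0 and 1 share "web"
    print(cosine(vecs[0], vecs[2]))   # no shared terms
    print(chi_square(10, 0, 0, 10))   # words that always co-occur
```

Sparse dictionaries are used instead of dense arrays because sentence vectors have very few non-zero terms; a term that occurs in every sentence receives weight zero under this weighting, so it does not contribute to the similarity.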

Parallel web page; sentence alignment; vector space model; Mongolian scripts; Chinese scripts

GuanHong Zhang, Odbal

Department of Computer Science and Technology, Key Lab of Network and Intelligent Information Process, Hefei Institutes of Physical Science of Chinese Academy of Sciences, Anhui, Hefei 230031

International conference

2011 International Conference on Opto-Electronics Engineering and Information Science (ICOEIS 2011)

Xi'an

English

1737-1740

2011-12-23 (date first posted online on the Wanfang platform; not the paper's publication date)