Detecting the Same Text in Different Languages

Abstract:

Compression-based similarity distances have the main drawback that the objects being compared must share the same coding scheme. Two texts that are translations of each other are significantly similar yet share no literal coincidences. In this article, we present an algorithm that compares the redundancy structure of the data as extracted by a Lempel-Ziv compression scheme. Each text is represented as a graph, and two texts are considered similar under our measure if they have the same referential topology when compressed. We give empirical evidence that this measure detects similarity between data coded in different languages.
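The idea of comparing referential topology rather than literal content can be illustrated with a small sketch. This is not the authors' algorithm: it assumes a naive greedy LZ77-style parse, builds a graph whose edges link each copied phrase to the earlier phrase it copies from, and compares two texts by the distribution of those back-reference jumps. All function names (`lz77_phrases`, `reference_graph`, `topology_distance`) and the `min_len` parameter are hypothetical choices made for illustration.

```python
from collections import Counter


def lz77_phrases(text, min_len=3):
    """Naive greedy LZ77-style parse (quadratic time, for illustration only).

    Returns a list of (start, length, source) tuples; source is None for a
    literal character, otherwise the start position of the earlier match.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_src = 0, None
        for j in range(i):
            k = 0
            # Matches may overlap the current position, as in LZ77.
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_src = k, j
        if best_len >= min_len:
            phrases.append((i, best_len, best_src))
            i += best_len
        else:
            phrases.append((i, 1, None))
            i += 1
    return phrases


def reference_graph(text, min_len=3):
    """Return (number_of_phrases, edges); an edge (p, q) means phrase p
    copies from a position that lies inside the earlier phrase q."""
    phrases = lz77_phrases(text, min_len)
    owner = []  # owner[pos] = index of the phrase covering position pos
    for idx, (_, length, _) in enumerate(phrases):
        owner.extend([idx] * length)
    edges = [(idx, owner[src])
             for idx, (_, _, src) in enumerate(phrases) if src is not None]
    return len(phrases), edges


def topology_distance(a, b, min_len=3):
    """Total variation distance between back-reference jump distributions.

    An assumed, simplified stand-in for a topology comparison: 0.0 means the
    two jump profiles coincide, 1.0 means they share nothing.
    """
    def profile(text):
        _, edges = reference_graph(text, min_len)
        counts = Counter(p - q for p, q in edges)
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()} if total else {}

    p, q = profile(a), profile(b)
    if not p and not q:
        return 0.0
    if not p or not q:
        return 1.0
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))
```

Because the parse relies only on which characters are equal to which, recoding a text with a one-to-one character substitution (for example via `str.translate`) leaves the graph unchanged, so `topology_distance` between a text and its recoded copy is zero even though the two share no literal substrings. Real translations change word order and length, so a practical measure would need to tolerate approximate rather than exact agreement.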

Authors: Kostadin Koroutchev, Manuel Cebrián

Affiliation: Depto. de Ingeniería Informática, Universidad Autónoma de Madrid, 28049 Madrid, Spain

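Conference type: International conference

Conference name: 2006 IEEE International Conference on Information Theory (Proceedings of 2006 IEEE Information Theory Workshop ITW06)

Conference location: Chengdu

Conference language: English

Pages: 337-341

Online publication date: 2006-10-22 (date first posted on the Wanfang platform; this does not indicate the paper's publication date)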
