Finding optimal threshold for correction error reads in DNA assembling

Abstract:

Background: DNA assembly is the problem of determining the nucleotide sequence of a genome from its substrings, called reads. In experiments, the reads may contain errors, which affect the performance of DNA assembly algorithms. Existing algorithms, e.g. ECINDEL and SRCorr, correct erroneous reads by considering the number of times each length-k substring of the reads appears in the input. They treat length-k substrings that appear at least M times as correct substrings and correct the erroneous reads based on these substrings. However, since the threshold M is chosen without any solid theoretical analysis, these algorithms cannot guarantee their error-correction performance.

Results: In this paper, we propose a method to calculate the probabilities of false positives and false negatives when determining whether a length-k substring is correct using threshold M. Based on these probabilities, we find the optimal threshold M that minimizes the total errors (false positives and false negatives). Experimental results on both real and simulated data showed that our calculation is correct and that we can reduce the total number of erroneous substrings by 77.6% and 65.1% compared to ECINDEL and SRCorr, respectively.
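The thresholding idea in the abstract can be illustrated with a small sketch. It is not the paper's method: the paper derives false positive and false negative probabilities analytically, whereas this sketch simply counts them against a known set of true length-k substrings, which is only available for simulated data. The function names (`count_kmers`, `solid_kmers`, `total_errors`, `best_threshold`) and the toy genome and reads are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how many times each length-k substring occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, m):
    """Return the substrings treated as correct: those seen at least m times."""
    return {kmer for kmer, c in counts.items() if c >= m}


def total_errors(counts, true_kmers, m):
    """False positives plus false negatives for threshold m.

    Assumes true_kmers (the substrings of the real genome) is known,
    which holds only in simulation.
    """
    accepted = solid_kmers(counts, m)
    false_pos = len(accepted - true_kmers)   # erroneous substrings accepted
    false_neg = len(true_kmers - accepted)   # correct substrings rejected
    return false_pos + false_neg


def best_threshold(counts, true_kmers, max_m):
    """Scan m = 1..max_m and return the smallest m with the fewest total errors."""
    return min(range(1, max_m + 1),
               key=lambda m: total_errors(counts, true_kmers, m))


# Toy data (hypothetical): three error-free reads and one read with a substitution.
k = 3
genome = "ACGTTGCA"
reads = ["ACGTTGCA", "ACGTTGCA", "ACGTTGCA", "ACGATGCA"]

true_kmers = {genome[i:i + k] for i in range(len(genome) - k + 1)}
counts = count_kmers(reads, k)
chosen_m = best_threshold(counts, true_kmers, max_m=5)
print(chosen_m, total_errors(counts, true_kmers, chosen_m))
```

On this toy input, a threshold of 1 accepts the three substrings introduced by the erroneous read, while a threshold of 4 starts rejecting correct substrings; the scan picks a value in between. Real data has no reference set to count against, which is why a probabilistic calculation of the two error rates, as the abstract describes, is needed to choose M.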

Authors: Francis Y.L. Chin, Henry C.M. Leung, Wei-Lin Li, Siu-Ming Yiu

Affiliations: Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong; The State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, 10

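Conference type: International conference

Conference name: The 7th Asia-Pacific Bioinformatics Conference

Conference location: Beijing

Conference language: English

Pages: 153-161

Online publication date: 2009-01-01 (date first posted on the Wanfang platform; not the paper's publication date)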
