Towards Comprehensive Structural Motif Mining for Better Fold Annotation in the Twilight Zone of Sequence Dissimilarity
Here we report a novel graph database mining method called APGM (APproximate Graph Mining) and demonstrate the application to protein structure pattern identification and structure classification. We present a theoretic framework, offer a practical software implementation for incorporating prior domain knowledge, such as substitution matrix as studied here, and devise an efficient algorithm to identify approximate matched frequent subgraphs. By doing so, we significantly expanded the analytical power of sophisticated data mining algorithms in dealing with large volume of complicated and noisy protein structure data. The biologic motivation of our study is to recognize common structure patterns inimmunoevasins, proteins mediating virus evasion of host immune defense. We investigated two immunologically relevant protein domain families: the Immunoglobulin V set and the Immunoglobulin C1 set. We collected proteins from SCOP release 1.69. For each family we created a culled set of proteins with maximal pairwise sequence identity percentage below 30% by using PISCES server. We combined these proteins and randomly selected proteins to create training and testing data set. We compared our method with one exact graph mining method MGM on classification accuracy. For Immunoglobulin C1 set,the classification based on feature identified by MGM only can reach 73%, while APGM is between 69% ~ 91%. For Immunoglobulin V set, since the exact match method cannot mine any meaningful patterns, it fails in classification, while by using APGM, we have the accuracy about 78%. Our experimental study, using both viral and non-viral proteins, demonstrates the efficiency and efficacy of the proposed method. And without loss of generality,choice of appropriate compatibility matrices allows our method to be easily employed in any domain where subgraph labels have some uncertainty.
Yi Jia Vincent Buhr Jintao Zhang Jun Huan Leonidas N.Carayannopoulos
国际会议
The 7th Asia-Pacific Bioinformatics Conference(第七届亚太生物信息学大会)
北京
英文
874
2009-01-01(万方平台首次上网日期,不代表论文的发表时间)