Conference topic

PCA-based population structure inference with generic clustering algorithms

Background: Handling genotype data typed at hundreds of thousands of loci is very time-consuming, and population structure inference is no exception. Therefore, we propose to apply PCA to the genotype data of a population, select the significant principal components using the Tracy-Widom distribution, and assign the individuals to one or more subpopulations using generic clustering algorithms.

Results: We investigated K-means, soft K-means and spectral clustering and compared them with STRUCTURE, a model-based algorithm specifically designed for population structure inference. Moreover, we investigated methods for predicting the number of subpopulations in a population. The results on four simulated datasets and two real datasets indicate that our approach performs comparably to STRUCTURE. For the simulated datasets, STRUCTURE and soft K-means with BIC produced identical predictions of the number of subpopulations. We also showed that, for the real datasets, BIC is a better index than likelihood for predicting the number of subpopulations.

Conclusions: Our approach has the advantage of being fast and scalable, while STRUCTURE is very time-consuming because it relies on MCMC for parameter estimation. Therefore, we suggest choosing the algorithm according to the intended application of population structure inference.
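The pipeline outlined in the abstract (PCA on the genotype matrix, followed by a generic clustering step) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes NumPy, genotypes coded as 0/1/2 allele counts, a toy simulated dataset, and plain K-means. The number of retained components is fixed by hand here, whereas the paper selects it with a Tracy-Widom test. The function names `pca_scores` and `kmeans` are hypothetical.

```python
import numpy as np

def pca_scores(genotypes, n_components):
    """Project individuals (rows) onto the top principal components."""
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)  # center each locus (column)
    # SVD of the centered matrix; U * S gives the principal component scores
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-means (Lloyd's algorithm) with random initial centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster is empty
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data (an assumption for illustration): two subpopulations of 20
# individuals each, 100 loci, with allele frequencies 0.2 and 0.8.
rng = np.random.default_rng(1)
genotypes = np.vstack([
    rng.binomial(2, 0.2, size=(20, 100)),
    rng.binomial(2, 0.8, size=(20, 100)),
])

scores = pca_scores(genotypes, n_components=1)  # component count chosen by hand
labels, _ = kmeans(scores, k=2)
print(labels)
```

Clustering in the reduced principal component space rather than on the raw loci is what keeps this kind of approach cheap. The soft K-means and spectral clustering variants named in the abstract, and the BIC-based choice of the number of subpopulations, are not covered by this sketch.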

Chih Lee, Ali Abdool, Chun-Hsi Huang

Computer Science and Engineering Department, University of Connecticut, 371 Fairfield Road, Storrs, CT 06269, USA

International conference

The 7th Asia-Pacific Bioinformatics Conference

Beijing

English

761-771

2009-01-01 (date first made available online on the Wanfang platform; not the paper's publication date)