Conference Papers

Discovery of Unknown Bacteria: A Metagenomic Analysis

The North Railroad Avenue Plume (NRAP) Superfund Site in Española, New Mexico, is a drinking water aquifer contaminated with tetrachloroethene (also known as perchloroethylene, or PCE). PCE is a possible carcinogen, so purifying this aquifer is important. Fortunately, microbial communities evolve to use these contaminants for energy. Understanding the metabolic pathways and genetic information of the microorganisms that contribute to purification is therefore essential. Since environmental microbes cannot be cultured in laboratories, tests are conducted on site. To aid microbial growth, Emulsified Vegetable Oil (EVO) and other amendments are added to the aquifer. PCE and its breakdown products (trichloroethylene (TCE), cis- and trans-1,2-dichloroethene (DCE), vinyl chloride, and ethane), together with water quality parameters such as dissolved oxygen, temperature, pH, and redox potential, are monitored periodically. DNA from all microbes found at the site is also extracted periodically. It is important to sequence this DNA and to study the genes responsible for bioremediation.
Solexa 1G is a Massively Parallel Sequencing by Synthesis (MPSS) instrument that produces fragments of ~200 base pairs, with paired-end reads of about 36-50 bp. The challenge is to align these reads, based on their overlaps, to recover the entire sequence of each individual microbe. There are about 10^9 reads originating from different microbes. A few of these microbes have already been sequenced, and their reads are easy to remove by BLASTing them against the database. A large number of reads still remain to be aligned. Paired-end information is used to perform this alignment with the best available string matching algorithms. Even so, these comparisons would take a remarkably long time, so supercomputers with exceptionally high computing power are used to make the computation feasible and faster.
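
As a minimal illustration of what aligning reads by their overlaps involves, the sketch below finds the longest exact suffix-prefix overlap between two reads and merges them. It is not the authors' alignment method: the function names, the `min_len` threshold, and the example reads are assumptions made for illustration, and a practical assembler would tolerate mismatches, use paired-end constraints, and index reads instead of comparing them pairwise.

```python
from typing import Optional


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    `min_len` is an illustrative threshold and must be at least 1.
    """
    max_len = min(len(a), len(b))
    # Try the longest candidate overlap first and shrink until one matches.
    for length in range(max_len, min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def merge_reads(a: str, b: str, min_len: int = 3) -> Optional[str]:
    """Merge `b` onto the end of `a` if they overlap; otherwise return None."""
    n = suffix_prefix_overlap(a, b, min_len)
    if n == 0:
        return None
    # Keep all of `a`, then append the part of `b` beyond the overlap.
    return a + b[n:]


if __name__ == "__main__":
    # Hypothetical reads: the last 4 bases of the first match the first 4 of the second.
    print(merge_reads("ACGTTGCA", "TGCAGGT"))
```

Each pairwise comparison is cheap, but the number of read pairs grows quadratically with the number of reads, which is why reducing the set of candidate comparisons matters at this scale.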
The goal is to find exactly the read that follows the current read in the sequence, while greatly reducing mismatches caused by repeats. To minimize comparisons, we apply machine learning techniques. Since the genomes known today represent far less than 1% of all microbial genomes, the available training data may be insufficient for supervised learning methods such as the multi-class support vector machine (SVM). Because the number of different microbes in the sequence data is unknown and a self-organizing map (SOM) requires its architecture to be defined a priori, we use a growing hierarchical self-organizing map (GHSOM). As classification of microbial communities can be improved by using features extracted from the transition matrices of a Markov process instead of word frequencies, we combine Markov transition features with GHSOM. Once clusters are formed, reads are compared only within their own cluster. This algorithm enables fast alignment and assembly of our metagenomic data. Once the sequences are aligned, feature mining techniques are used to find the microbes responsible for biodegradation. A gene-centric approach will be more revealing for identifying the metabolic pathways involved in biodegradation. Here, the features are genes whose proportions change in response to the addition of EVO.
A simulation program, Shedder, has been developed to imitate the shotgun sequencing approach and to provide important information about preprocessing and essential requirements. It can be used to find the minimum number of reads required to obtain complete sequences. Alignment software will be benchmarked against simulated Solexa paired-end read data produced by Shedder.
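
The Markov transition features mentioned above can be illustrated with a small sketch. It assumes a first-order Markov model over the four bases and flattens the estimated transition probabilities into a 16-dimensional vector per read, which could then be passed to a clustering method such as GHSOM. The model order, the function name, and the handling of ambiguous bases are assumptions made for illustration, not details given in the abstract.

```python
from typing import List

BASES = "ACGT"


def transition_features(read: str) -> List[float]:
    """Estimate first-order transition probabilities P(next | previous) for a read.

    Returns a flat list of 16 values ordered as AA, AC, AG, AT, CA, ..., TT.
    Transitions involving characters outside ACGT (for example N) are skipped.
    """
    index = {base: i for i, base in enumerate(BASES)}
    counts = [[0] * 4 for _ in range(4)]
    read = read.upper()

    # Count how often each base is followed by each other base.
    for prev, nxt in zip(read, read[1:]):
        if prev in index and nxt in index:
            counts[index[prev]][index[nxt]] += 1

    # Normalize each row so it sums to 1; rows with no observations stay at 0.
    features: List[float] = []
    for row in counts:
        total = sum(row)
        features.extend((c / total if total else 0.0) for c in row)
    return features


if __name__ == "__main__":
    # Hypothetical read in which A is always followed by C and C by A.
    vector = transition_features("ACACAC")
    print(vector)
```

Unlike raw word frequencies, each row of this vector is conditioned on the preceding base, so the features describe how the sequence moves between bases and not just which bases it contains.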

D. Suryakumar, R. Chilakapati, Q. Liu, A. H. Sung

New Mexico Tech New Mexico Tech Institute for Complex Additive Systems Analysis, Socorro, NM

International conference

The 7th Asia-Pacific Bioinformatics Conference

Beijing

English

828

2009-01-01 (date first posted online on the Wanfang platform; not the paper's publication date)