The informatics of genetic sequence variations

Gabor Marth

Boston College

Genetic variations are important because they underlie phenotypic differences, can serve as heritable landmarks for tracking population history, and, when functional, may cause or predispose to disease. In this talk I will highlight some of our past and ongoing research projects on several different aspects of genetic variation.

The discovery of polymorphic sites in the human genome requires accurate and efficient computer tools. With collaborators at Washington University we have developed a computer program, PolyBayes, for all the essential steps of polymorphism discovery on the genome scale: organizing and aligning a very large number of DNA reads using the genome reference sequence as a substrate, identifying duplicated sequence copies that would otherwise give rise to false-positive predictions, and discriminating between true polymorphic sites and sequencing errors. The original algorithm was designed for large-scale discovery projects in clonal sequences.

An important emerging application of polymorphism discovery, the search for mutations in DNA re-sequencing data from individuals, requires detecting heterozygous positions in Sanger sequencing traces from PCR products of diploid genomic DNA. We have developed machine-learning algorithms for accurate heterozygote detection in this data type.

New, very high-throughput sequencing machines such as the 454 Life Sciences pyrosequencer promise to reduce sequencing cost by orders of magnitude. The flowgrams produced by pyrosequencing machines differ substantially from Sanger sequencing reads because all bases within a mononucleotide run are incorporated in a single flow cycle, rather than one base per cycle. Base calling for flowgrams is complicated because non-incorporated nucleotides often produce significant signal, and because the signal intensity produced by a mononucleotide run is not a linear function of the number of bases in the run. Because the error rate of the native base-calling software is unacceptably high for SNP discovery, we have developed our own 454 base caller that performs much better.
Finally, to demonstrate the utility of genetic variations for population and evolutionary genetics, I will present our analysis of genome-wide distributions of the most abundant type of genetic variation, single-nucleotide polymorphisms (SNPs). We use SNP distributions observed in present-day DNA samples (SNP density and the allele frequency spectrum) to infer the history of human effective population size, using a population genetic modeling and data fitting approach.
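
As a rough illustration of what fitting an allele frequency spectrum involves, the sketch below compares an observed spectrum with the expectation under the standard neutral model of a constant-size population, in which the expected number of SNPs whose derived allele appears i times in a sample of n chromosomes is proportional to 1/i. The constant-size model is used here only as a baseline assumption; the demographic models fitted in the talk are not specified in this abstract. The function names and the example counts are hypothetical.

```python
# Comparing an observed allele frequency spectrum with a neutral,
# constant-population-size expectation. A minimal sketch under stated
# assumptions, not the modeling approach used in the talk.


def expected_sfs(n_chromosomes, theta=1.0):
    """Expected SNP counts for derived-allele counts 1..n-1: theta / i."""
    return [theta / i for i in range(1, n_chromosomes)]


def normalize(counts):
    """Rescale counts so that they sum to 1."""
    total = float(sum(counts))
    return [c / total for c in counts]


def sfs_misfit(observed_counts):
    """Chi-square-style distance between observed and expected proportions.

    observed_counts[i - 1] is the number of SNPs whose derived allele is
    seen i times, so the sample size is len(observed_counts) + 1.
    """
    n_chromosomes = len(observed_counts) + 1
    observed = normalize(observed_counts)
    expected = normalize(expected_sfs(n_chromosomes))
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for a sample of 5 chromosomes.
print(sfs_misfit([12, 6, 4, 3]))   # proportional to 1, 1/2, 1/3, 1/4
print(sfs_misfit([20, 3, 1, 1]))   # excess of rare alleles
```

In a fitting approach, a misfit score of this kind would be evaluated for spectra predicted under candidate demographic histories, and the history giving the smallest misfit would be preferred. An excess of rare alleles relative to the 1/i baseline, as in the second example, is the kind of departure that population expansion can produce.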
