From The MarthLab

PolyBayes

Genetic variations are landmarks that allow us to track our genetic ancestry, and their structure within the genome informs us about the molecular and demographic forces that have shaped it. For medical research the most important polymorphisms are disease-causing variants, but non-functional polymorphisms are also useful as markers for linkage and association studies. There is a growing need to find rare, medically important alleles in deep alignments of clonal sequences and diploid sequence traces; to identify large numbers of markers for mapping studies in humans, model organisms, and plants; and to discover informative polymorphisms for pathogen strain identification. With the tremendous sequencing capacity at large sequencing centers and an anticipated jump in sequencing speed, medical re-sequencing projects that map the genetic changes leading to and occurring during cancer development are gearing up. This volume of sequence data will require completely automated, versatile, yet highly accurate software for polymorphism discovery and genotype determination, and such software does not exist today. The detection of single-nucleotide polymorphisms (SNPs) and short insertions/deletions (INDELs) in DNA sequences is challenging because one must align and compare sequences from varied sources and differentiate true polymorphisms from sequencing errors.

Building on our existing software, PolyBayes, first developed at Washington University, we are currently developing a general polymorphism discovery tool that meets these challenges. We organize fragmentary sequences by layering them upon the genome reference sequence; discard paralogous sequences from similar, duplicated genome regions; and use base quality values in a rigorous Bayesian scheme to compare sequences of arbitrary quality. Specifically, we propose methods to align multi-exon genes, and novel methods for paralog filtering based either on complete mapping information or on genome distributions of sequence divergence.
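To illustrate how base quality values can enter a Bayesian comparison, here is a minimal sketch for a single alignment column. It is not the PolyBayes model itself. The two-hypothesis setup (monomorphic site versus a site with two equally likely alleles), the uniform base priors, the default prior SNP rate of 0.001, and all function names are assumptions made for this example. The only standard ingredient is the Phred convention that a quality value Q corresponds to an error probability of 10^(-Q/10).

```python
from itertools import combinations

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred quality value to a base-call error probability."""
    return 10 ** (-q / 10)


def base_likelihood(called, true_base, err):
    """P(called base | true base), spreading the error evenly over 3 wrong bases."""
    return 1 - err if called == true_base else err / 3


def snp_posterior(calls, quals, prior_snp=0.001):
    """Posterior probability that one alignment column is polymorphic.

    calls: string of called bases, one per aligned read (hypothetical input format)
    quals: Phred quality values, one per called base
    prior_snp: assumed prior probability that a site is polymorphic
    """
    errs = [phred_to_error(q) for q in quals]

    # Hypothesis 0: all reads share one true base (uniform over A, C, G, T).
    like_mono = 0.0
    for b in BASES:
        p = 1.0
        for c, e in zip(calls, errs):
            p *= base_likelihood(c, b, e)
        like_mono += p / len(BASES)

    # Hypothesis 1: two alleles, each read samples either one with probability 0.5.
    pairs = list(combinations(BASES, 2))
    like_snp = 0.0
    for a, b in pairs:
        p = 1.0
        for c, e in zip(calls, errs):
            p *= 0.5 * base_likelihood(c, a, e) + 0.5 * base_likelihood(c, b, e)
        like_snp += p / len(pairs)

    # Bayes' rule over the two hypotheses.
    weighted_snp = prior_snp * like_snp
    return weighted_snp / (weighted_snp + (1 - prior_snp) * like_mono)
```

Under these assumptions, a mismatching base with a low quality value barely moves the posterior, while the same mismatch supported by high-quality base calls raises it substantially. For example, `snp_posterior("AAAG", [40, 40, 40, 40])` is larger than `snp_posterior("AAAG", [40, 40, 40, 5])`.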

We will develop new algorithms for the difficult problem of INDEL detection; integrate accurate heterozygote polymorphism detection in diploid traces into our software to enable individual genotyping; enhance sensitivity to detect rare alleles; and include a new measure to estimate the true positive rate of our candidate polymorphism predictions. We will implement a fast, reliable, full-functionality discovery tool that is free for academic research, performs well in large discovery projects yet can run on desktop computers, and is easily accessible to biologists in small or medium-sized laboratories.

SNP Discovery / Data Mining

PolyBayes was first applied to SNP discovery in human EST data aligned to genomic clone sequences (publication reference 8). We used this method to mine The SNP Consortium reads at Washington University (publication reference 7). The method was then adapted to the analysis of the overlapping regions of large-insert (BAC) clones that were sequenced for the public human reference genome, and produced over 500,000 high-quality candidate SNPs (publication reference 3) and over 100,000 deletion-insertion type polymorphisms (DIPs) (publication reference 5). Subsets of both the candidate SNPs and diallelic DIPs have been experimentally characterized in diverse population samples (publication references 5 and 6). Such a large data set has made it possible, for the first time, to explore the genomic distribution of human variability on the scale of the entire genome.


Contributors