1. Efficient Approach to Unique Single-Nucleotide Polymorphism Discovery
- Author
Patricia Taillon-Miller, Ellen E. Piernot, and Pui-Yan Kwok
- Subjects
Genetics; Polymorphism, Genetic; Base Sequence; Molecular Sequence Data; Single-nucleotide polymorphism; Hydatidiform Mole; Sequence Analysis, DNA; Tag SNP; Biology; Molecular Inversion Probe; DNA sequencing; SNP genotyping; Pregnancy; Methods; Humans; Human genome; Female; Genotyping; Genetics (clinical); Alleles; SNP array
- Abstract
Single-nucleotide polymorphisms (SNPs) are the most frequently found DNA sequence variations in the human genome (Taillon-Miller et al. 1998). It has been argued that a dense set of SNP markers can be used to identify genetic factors associated with complex disease traits (Risch and Merikangas 1996; Collins et al. 1997). Advocates of these approaches suggest that some 100,000 or more SNP markers (at 30-kb intervals or up to five markers per gene) will be needed in population studies to detect genetic factors with moderate effects in the complex traits being investigated (Collins et al. 1997). Several efforts, sponsored by both the National Human Genome Research Institute and private industry, have been launched to develop SNP markers with the goal of achieving the numbers needed for association studies within the next 3 years (Marshall 1997, 1998; Wang et al. 1998). Because all high-throughput genotyping methods capable of handling large numbers of markers and samples require precise knowledge of the DNA sequence surrounding the SNP markers, and the usefulness of the markers is determined by their heterozygosity in the population, any SNP discovery approach must involve the determination of DNA sequence and allele frequencies. Furthermore, most high-throughput genotyping methods also require a genomic DNA amplification step, making it necessary to develop sequence-tagged sites (STSs) that amplify only the DNA fragments containing the SNPs and nothing else from the rest of the genome. This is not a trivial concern because there are many repetitive elements and duplicated regions in the genome in which near identical sequences are found on different chromosomal regions. If the DNA fragments amplified by the PCR (Saiki et al. 1988) came from different parts of the genome but were near identical, the DNA sequence differences might be erroneously considered alleles of an SNP, leading to highly confusing results when the genotyping experiments were performed. 
The use of computer programs such as REPEAT MASKER (A.F.A. Smit and P. Green, unpubl.) has made it a simple task to avoid developing SNP markers from common repetitive regions such as those containing Alu or L1 elements. What is more difficult to detect is the presence of a putative SNP in a duplicated region of the genome. In this report, we demonstrate the utility of a SNP screening approach that yields the DNA sequence and allele frequency information while screening out duplications with minimal cost and effort. The approach combines the use of a complete hydatidiform mole (CHM) as a homozygous DNA reference sample and a pooled DNA sequencing strategy for SNP identification and allele frequency estimation (Kwok et al. 1994). A CHM is usually a 46, XX homozygote formed by the fertilization of an empty ovum by a single haploid sperm, which later duplicates its chromosomes to give a diploid tumor (Lawler et al. 1991). The worldwide incidence of hydatidiform moles is 1/1000 pregnancies (Grimes 1984). We have reported previously that the CHM can be used as a homozygous DNA reference in SNP marker development (Taillon-Miller et al. 1997). In the course of screening anonymous STSs for SNPs, we have noticed that the DNA from this CHM1 can be used to identify false-positive SNPs that are the result of amplification of duplicated regions of the genome as in the case of multigene families or low-frequency repeats. In this study we show that in every case in which the CHM sequence contains a heterozygous base, it is the result of duplication, and the sequence differences are not in fact allelic. In regions of the genome in which high-quality, large-scale sequencing is being performed, we have shown that the most efficient and cost-effective approach to SNP identification is comparison of the consensus sequences of the overlapping regions of the large-insert clones being sequenced (Taillon-Miller et al. 1998). 
In regions in which no such overlapping sequences are available, one has to develop STSs and screen the DNA fragments amplified from multiple individuals for DNA sequence variations (Kwok et al. 1996; Wang et al. 1998). We have advocated a sequence comparison approach consisting of obtaining the DNA sequences from four individuals (eight chromosomes) plus a pooled DNA sample for allele frequency estimation (Kwok et al. 1994, 1996). This strategy allows one to identify, with >85% probability, all the SNPs with >20% allele frequency for the minor allele (Kwok et al. 1994). SNPs developed by use of the population pool method of estimating frequencies have been confirmed by subsequent genotyping of the markers in every individual present in the pool, and the frequencies have been shown to be accurate to within ±5% (Kwok et al. 1994). With the advent of two new classes of dye-labeled dideoxy chain terminators (the dRhodamine and the energy-transfer BigDye terminators) that have improved spectral properties and give more even peaks in cycle sequencing (Zakeri et al. 1998), we show in this study that one can reduce the number of samples used in each screening experiment from five to just two (CHM and pooled sample) and still identify all the SNPs found with the previous approach. Reducing the number of sequencing reactions required to identify SNPs from anonymous STSs and screening out duplications undetected by computer filters greatly lowers the reagent and labor cost of SNP development.
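The decision logic of the two-sample screen (CHM plus pooled sample) can be illustrated with a minimal Python sketch. It is not the authors' software. The `SiteCall` structure, the 10% peak-height cutoff, the category names, and the use of relative peak height as a stand-in for allele frequency are all assumptions made for illustration. The only rule taken from the abstract is that a mixed base in the homozygous CHM trace points to amplification of a duplicated region rather than a true SNP.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class SiteCall:
    """Hypothetical per-position peak heights (base -> height) from two traces."""
    chm_peaks: Dict[str, float]   # trace from the homozygous CHM reference
    pool_peaks: Dict[str, float]  # trace from the pooled population sample


def bases_present(peaks: Dict[str, float], min_fraction: float = 0.1) -> Set[str]:
    """Return bases whose share of total signal reaches min_fraction (assumed cutoff)."""
    total = sum(peaks.values())
    if total <= 0:
        return set()
    return {base for base, height in peaks.items() if height / total >= min_fraction}


def classify_site(site: SiteCall, min_fraction: float = 0.1) -> str:
    """Classify one position using only the CHM and pooled traces."""
    chm = bases_present(site.chm_peaks, min_fraction)
    pool = bases_present(site.pool_peaks, min_fraction)
    if not chm or not pool:
        return "no_call"
    if len(chm) > 1:
        # A homozygous sample should show one base; two suggest a duplicated locus.
        return "likely_duplication"
    if len(pool) > 1:
        return "candidate_snp"
    if chm != pool:
        return "chm_differs_from_pool"
    return "monomorphic"


def minor_allele_fraction(peaks: Dict[str, float]) -> float:
    """Rough minor-allele estimate: second-highest peak over total signal.

    Assumes peak height scales with allele frequency in the pool.
    """
    total = sum(peaks.values())
    heights = sorted(peaks.values(), reverse=True)
    if total <= 0 or len(heights) < 2:
        return 0.0
    return heights[1] / total
```

For example, `classify_site(SiteCall({"A": 100.0}, {"A": 70.0, "G": 30.0}))` falls in the `candidate_snp` branch, while the same mixed signal in the CHM trace would be flagged as `likely_duplication`. In practice, peak heights would need calibration against known genotypes before being read as frequencies.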
- Published
- 1999