Genetics, Vol. 166, 2001-2006, April 2004, Copyright © 2004

Recovering Frequencies of Known Haplotype Blocks From Single-Nucleotide Polymorphism Allele Frequencies

Itsik Pe'er (a) and Jacques S. Beckmann (a, b)
(a) Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel 76100
(b) Département de Génétique Médicale, Lausanne-CHUV, Lausanne 1011, Switzerland

Corresponding author: Itsik Pe'er, Whitehead Institute, MIT Center for Genome Research, 1 Kendall Square, Bldg. 300, Cambridge, MA 02139-1561. E-mail: peer@broad.mit.edu

Communicating editor: P. J. OEFNER


*  ABSTRACT

Prospects for large-scale association studies rely on economical methods and powerful analysis. Representing available SNPs by small subsets and measuring allele frequencies on pooled DNA samples each improve genotyping cost effectiveness, while haplotype analysis may highlight associations in otherwise underpowered studies. This manuscript provides the mathematical framework to integrate these methodologies.


SINGLE-NUCLEOTIDE polymorphisms (SNPs) are the markers of choice for high-throughput association studies aimed at dissecting complex traits (RISCH and MERIKANGAS 1996; KWOK 2001). Several cost-saving strategies have been proposed to enable SNP genotyping on a whole-genome scale. First, there is no need to type each of the millions of available SNPs: once redundant markers are discarded, only a small, representative set, the tagging SNPs, needs to be typed (JOHNSON et al. 2001; JUDSON et al. 2002). Second, samples can be pooled, e.g., cases and controls in separate pools, so that allele frequencies are measured once per pool and the cost per sample is greatly reduced (SHAM et al. 2002).

Two complementary analytical approaches for large-scale association studies are currently advocated, focusing on single markers and on haplotypes, respectively. Estimates of their relative power differ (LONG and LANGLEY 1999; ZOLLNER and VON HAESELER 2000; AKEY et al. 2001; BADER 2001; DALY et al. 2001; KAPLAN and MORRIS 2001). To illustrate this debate, there are situations in which haplotypes perform at best as well as single SNPs (e.g., when the causative SNP is included in the analyses) or are inferior to SNPs (e.g., if the causative SNP allele is found on more than one haplotype). On the other hand, there are situations in which haplotype analyses contribute increased power (e.g., when neither the causative SNP nor any other SNP allele on the same haplotype is included in the analyses). Consider, for example, the three haplotypes aaaaCaaaa, atatGtata, and ttttCtttt, and suppose that C/G is causal; then none of the other SNPs will be as strongly associated as the second haplotype, and no haplotype will be as strongly associated as the causal SNP. The debate is further compounded by issues pertaining to the number of degrees of freedom and to multiple testing (AKEY et al. 2001).

Indeed, several studies present simulations (ZOLLNER and VON HAESELER 2000; AKEY et al. 2001) and data (DALY et al. 2001) in which haplotypes highlight associations for which single SNPs are underpowered. In addition, haplotype analyses provide a simplified means of conducting large-scale genetic analysis. Finally, the accumulation of data about haplotype structure (PATIL et al. 2001; COUZIN 2002; GABRIEL et al. 2002; PHILLIPS et al. 2003), as well as more sophisticated methods that group haplotypes together (TEMPLETON et al. 1987; SELTMAN et al. 2003), may improve the power of the haplotype paradigm. More experimental data are needed to determine under what circumstances one approach surpasses the other; in practice, single-SNP and haplotype approaches complement each other, each covering the other's blind spots.

Let us now consider these two approaches in the context of DNA pooling. While pooled single-SNP analysis simply compares allele frequencies between cases and controls (SHAM et al. 2002), the use of haplotype analysis is less trivial: pooling loses the information on which alleles occur together on the same chromosome, obscuring the haplotype content of the sample. The goal of this article is to advocate combining haplotype analysis with DNA pooling and SNP-tagging strategies.

Throughout this article, we assume that a haplotype map (i.e., the list of all haplotypes) is available, at least for the genetic region and population under consideration. We develop cost-effective analytical techniques for such a particular region, which can be extended to complete chromosomes by existing methods (ZHANG et al. 2002).

We first address the selection of haplotype-tagging SNPs in the trivial context of a single-haplotype sample without the added complexity of DNA pooling. The intuitions behind this demonstrative example, as well as the bounds attainable in this simple case, are relevant for the analysis of pooled chromosomes, the main focus of this article.

Consider a block of h known haplotypes along a set of s adjacent SNPs. When typing a single-chromosome sample to determine which of these haplotypes it bears, we can ideally hope that log_2 h tagging SNPs suffice for haplotype determination (see example in Table 1a). In the worst-case scenario (see Appendix A for formal proof), h – 1 tagging SNPs are needed to tell these h haplotypes apart (see Table 1b). Unfortunately, the latter case is very common, occurring whenever haplotypes are perfectly coalescent, i.e., whenever there has been a single ancestral haplotype from which all the others diverged by nonrecurrent mutations without recombination. In all single-chromosome cases, it is simple to recover the haplotype from the allele calls of the tagging SNPs.


 
Table 1. Examples for haplotype structure

We now turn to the bigger challenge of inferring haplotype frequencies from pooled samples. This would also allow calling diploids, which can be regarded as pools of two singleton samples. [It is simple to show that tagging diploids requires at least log_3(h(h + 1)/2) SNPs, because each SNP yields one of three genotype calls and there are h(h + 1)/2 unordered pairs of haplotypes.] Of course, allele frequencies of SNPs that are unique to a specific haplotype immediately give away the frequency of that haplotype, but such SNPs may not exist. In such cases it is counterintuitive that admixtures of individual samples, which on the face of it obscure the individual combinations of SNP alleles, can be used to infer the frequencies of these combinations, i.e., of the haplotypes.
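The counting bounds quoted so far can be evaluated directly. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, and integer arithmetic is used in place of logarithms to avoid rounding issues. It returns the smallest number of SNPs whose possible calls can distinguish all outcomes, for a single chromosome (two calls per SNP) and for a diploid (three calls per SNP), together with the worst-case h – 1.

```python
def min_snps_to_distinguish(n_outcomes: int, calls_per_snp: int) -> int:
    """Smallest k such that calls_per_snp**k >= n_outcomes (no floating-point log)."""
    k, capacity = 0, 1
    while capacity < n_outcomes:
        capacity *= calls_per_snp
        k += 1
    return k


def tagging_bounds(h: int) -> dict:
    """Counting bounds on the number of tagging SNPs for a block of h haplotypes."""
    return {
        # Single chromosome: 2 possible calls per SNP, h possible haplotypes.
        "haploid_lower": min_snps_to_distinguish(h, 2),
        # Diploid: 3 possible calls per SNP, h(h + 1)/2 unordered haplotype pairs.
        "diploid_lower": min_snps_to_distinguish(h * (h + 1) // 2, 3),
        # Worst case for a single chromosome (Claim 1 in Appendix A).
        "haploid_worst_case": h - 1,
    }


print(tagging_bounds(6))
# {'haploid_lower': 3, 'diploid_lower': 3, 'haploid_worst_case': 5}
```

These are lower bounds from counting alone; whether they are attained depends on the haplotype structure of the block.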

We demonstrate that h – 1 SNPs are always necessary and usually sufficient to recover haplotype frequencies from such pools. Sufficiency is guaranteed when the haplotypes in the block all coalesce to a single founder without recombination. We show a simple method to compute haplotype frequencies after typing the tagging SNPs. Finally, we describe how such h – 1 representative SNPs should be chosen.

We first discuss an idealized situation, in which SNP allele frequencies are measured with complete accuracy. Formally, let v be a measured column vector of SNP minor allele frequencies. Consider the unknown vector u of the first h – 1 haplotype frequencies in the current block. The frequency of the remaining haplotype is assumed to complete these to unity. Let M be a (known) binary matrix of size s × (h – 1), where M_ij = 1 if the minor allele of the ith SNP is present in the jth haplotype and M_ij = 0 otherwise. Our basic observation is that

$$ v = M u. \tag{1} $$

An equivalent formulation is suggested by BARRATT et al. (2002). Here we develop this idea and analyze the conditions under which it is applicable.

If and only if the matrix M attains a full rank of h – 1, its inverse matrix M^{-1} can be used to recover u:

$$ u = M^{-1} v. \tag{2} $$

This implies that s must be at least h – 1 for the recovery to be possible. Furthermore, if the original s SNPs are sufficient for recovery of haplotype frequencies, then one can always choose h – 1 tagging SNPs that are sufficient as well (see Appendix A). Interestingly, the coalescent situation, which is the worst case for recovery of a single haplotype, is guaranteed to be the best case for inference of haplotype frequencies, always requiring exactly h – 1 representative SNPs (see Appendix A).
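As an illustration of the forward relation v = Mu and of the recovery u = M^{-1}v, consider the following minimal sketch. The block of h = 4 haplotypes, the matrix M, and the frequencies are hypothetical values chosen for the example, and the sketch assumes that the remaining (hth) haplotype carries none of the minor alleles, so that its column can be dropped. Exact rational arithmetic from the Python standard library is used so that the recovered frequencies can be compared exactly.

```python
from fractions import Fraction


def allele_freqs(M, u):
    """Forward model: SNP allele frequencies v = M u."""
    return [sum(Fraction(m) * f for m, f in zip(row, u)) for row in M]


def solve(A, b):
    """Solve the square system A x = b exactly by Gauss-Jordan elimination."""
    n = len(A)
    aug = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("M is rank deficient; frequencies are not recoverable")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


# Hypothetical block: rows are 3 tagging SNPs, columns are haplotypes 1..3.
# Haplotype 4 (not shown) is assumed to carry the major allele at every SNP.
M = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 1]]
u_true = [Fraction(1, 5), Fraction(1, 10), Fraction(1, 10)]

v = allele_freqs(M, u_true)      # what a pooled measurement would report
u = solve(M, v)                  # recovered frequencies of haplotypes 1..3
last = 1 - sum(u)                # the remaining haplotype completes to unity

print("v =", v)
print("u =", u, "remaining haplotype =", last)
```

Solving the linear system directly, rather than forming M^{-1} explicitly, gives the same result and also reports a rank-deficient M.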

In general, however, h – 1 SNPs may not be sufficient to recover haplotype frequencies. If several SNPs partition the set of haplotypes in exactly the same manner (see, e.g., SNPs 1 and 19 in Table 1a), these SNPs are completely equivalent in terms of information and will not contribute the required h – 1 distinct frequency measurements. Moreover, in some cases in which the rank is smaller than h – 1, even h – 1 distinct, nonequivalent SNPs may not suffice for recovery of frequencies (see Table 2). Fortunately, these cases are the exception rather than the rule. To demonstrate this, we examined all 536 blocks obtained in the four-population data sets of GABRIEL et al. (2002). Only one of these blocks required h SNPs for recovery of its frequencies (see Table 3), and not a single block required more than h SNPs.
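Whether a given set of SNPs suffices can be checked by computing the rank of M. The sketch below is illustrative only: the matrices are hypothetical, the function names are not from the original work, and the hth haplotype is again assumed to carry no minor alleles. It shows that two equivalent SNPs contribute a single independent measurement, so that three typed SNPs may yield a rank of only 2.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix (given as a list of rows) by exact Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(rows[0]) if rows else 0
    r = 0
    for col in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            if rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def recoverable(M):
    """True when M (SNPs x first h - 1 haplotypes) has full column rank h - 1."""
    return rank(M) == len(M[0])


# SNPs 1 and 2 partition the haplotypes identically (equivalent SNPs).
M_equivalent = [[1, 1, 0],
                [1, 1, 0],
                [0, 0, 1]]
# Replacing the redundant SNP by a nonequivalent one restores full rank.
M_distinct = [[1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]]

print(rank(M_equivalent), recoverable(M_equivalent))
print(rank(M_distinct), recoverable(M_distinct))
```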


 
Table 2. h – 1 SNPs may not be sufficient: a demonstrative, theoretic example


 
Table 3. h – 1 SNPs may not be sufficient: an example from real data

So far, we have oversimplified by assuming that genotyping is accurate, yet in reality SNP allele frequencies are measured only up to human error or technology-dependent imprecision of 1–5% (MOHLKE et al. 2002; SHAM et al. 2002). Error may be reduced, e.g., by sampling several pools per population, but it may also be magnified during the computation of haplotype frequencies, as these may be sums and differences of several measured values. We demonstrate, by simulation of measurements in real haplotype blocks, that this magnification is tolerable (Figure 1).



Figure A1. Example of a hypothetical gene tree. The nodes correspond to haplotypes. Yellow nodes denote the h = 6 haplotypes present in the contemporary population. All tree leaves (end nodes) are contemporary, but haplotypes that correspond to some other nodes may also have survived from their divergence until today (in this example there is one such haplotype, in boldface type). Branches correspond to sets of SNPs that distinguish the haplotypes at the two ends of the branch. Each SNP belongs to only one such set. When choosing a subset of h – 1 = 5 SNPs, they belong to five sets that correspond to five branches. The lavender branches are an example of such a corresponding set of branches. Cutting those branches breaks the tree apart into six tree segments, each having a single contemporary haplotype node. When adding SNPs and haplotypes according to their subscript index order, it is clear that the last haplotype added is linearly independent of the previous nonzero ones.

Yet another oversimplification concerns the assumption that all haplotypes are completely known, whereas in practice a small fraction of the samples are expected to have rare, unknown haplotypes (GABRIEL et al. 2002). While this is seemingly an independent problem, it can in fact be considered as a variant of the inaccuracy issue we have addressed above: A SNP allele frequency measured in the presence of some rare haplotypes will approximate its frequency without those haplotypes; the latter just add noise to the measurement. The robustness of haplotype frequency estimates to inaccuracies of 1–5% in measurement of frequencies implies robustness to rare haplotypes with overall frequency of the same magnitude. Naturally, these inaccuracies have an additive effect that needs to be considered.

We have shown that h – 1 nonequivalent SNPs can always be selected so that they determine which of the h haplotypes a single chromosome carries at a locus, and that they are usually also sufficient for recovery of haplotype frequencies from pooled measurements of SNP allele frequencies. Furthermore, this recovery computation usually does not magnify the measurement error more than twofold, providing reasonable accuracy in predicting haplotype frequencies.

The practical implication of these results is that well-powered haplotype analysis becomes affordable through the economical approaches of pooled genotyping and representative SNP tagging. This is expected to greatly improve genotyping applications that rely on allele-frequency estimates, such as gene mapping for complex diseases and pharmacogenetic studies.

This work opens the door for further research into more complex scenarios, where only some haplotypes are known from a preliminary study, say in a different population, or where haplotype block boundaries are somewhat obscure (ZHANG et al. 2003).


*  ACKNOWLEDGMENTS

We gratefully acknowledge the insightful comments of the anonymous reviewers as well as of Shaun Purcell. I.P. has been supported by the Eshkol fellowship from the Ministry of Science, Israel and J.S.B. is a recipient of the Herrmann Mayer chair. Research was supported by a grant from the Henry S. and Anne S. Reich Research Fund for Mental Health.

Manuscript received November 15, 2003; Accepted for publication January 6, 2004.


*  APPENDIX A

PROOF OF CLAIMS
Claim 1. One can always select at most h – 1 tagging SNPs that suffice to tag the h haplotypes.

Proof. By induction. For a single haplotype, one needs no SNPs. Suppose a set of at most j – 2 tagging SNPs distinguishes the first j – 1 haplotypes. Consider the jth haplotype. Either the given representative set distinguishes it from all the rest, or the SNPs in that set are identical for this haplotype and one of the other haplotypes (exactly one, because the set already distinguishes the other haplotypes from each other). In the latter case, we add to the set of tagging SNPs a SNP that distinguishes these two haplotypes from one another. ∎
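The induction in this proof translates directly into a selection procedure. The following minimal sketch (the function name is hypothetical, and the haplotypes are assumed to be pairwise distinct) adds haplotypes one at a time and adds a tagging SNP only when the new haplotype coincides with an earlier one on the SNPs chosen so far. It is applied to the three example haplotypes given in the main text.

```python
def select_tagging_snps(haplotypes):
    """Follow the induction of Claim 1: return indices of at most h - 1 SNPs
    that tell all (pairwise distinct) haplotypes apart."""
    tags = []   # indices of the SNPs chosen so far
    seen = []   # haplotypes already processed
    for hap in haplotypes:
        for other in seen:
            if all(hap[i] == other[i] for i in tags):
                # Collision on the current tags: add the first SNP at which
                # the two haplotypes differ. At most one collision can occur,
                # because the tags already separate all earlier haplotypes.
                tags.append(next(i for i in range(len(hap)) if hap[i] != other[i]))
                break
        seen.append(hap)
    return tags


haps = ["aaaaCaaaa", "atatGtata", "ttttCtttt"]
tags = select_tagging_snps(haps)
signatures = {tuple(h[i] for i in tags) for h in haps}

print(tags)             # [1, 0]
print(len(signatures))  # 3
```

The procedure returns some valid tagging set of size at most h – 1; it does not attempt to find the smallest possible set.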

Claim 2. If the number s of SNPs is less than h – 1 one cannot recover the haplotype frequency vector u.

Proof. Suppose s < h – 1. The rank of M is at most s; therefore it is less than h – 1, and M cannot be inverted. ∎

Claim 3. If the allele frequencies of the original s SNPs are sufficient for recovery of the haplotype frequency vector u, then one can choose h – 1 tagging SNPs, which are sufficient as well.

Proof. The rank of M is h – 1. Thus, consider a set of h – 1 linearly independent rows of M. These rows make up a submatrix M* of rank h – 1. The inverse (M*)^{-1} of this matrix can thus be multiplied by the vector v' of allele frequencies of the SNPs corresponding to these rows to recover the haplotype frequency vector u. ∎
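A set of linearly independent rows, as used in this proof, can be found greedily: scan the SNPs in order and keep a SNP only if its row is not a linear combination of the rows already kept. The sketch below is one possible way to do this, with a hypothetical matrix and function name; it does not claim to be the selection procedure used in the original study.

```python
from fractions import Fraction


def independent_rows(M):
    """Greedily pick indices of linearly independent rows of M (exact arithmetic)."""
    basis = []    # pairs (pivot column, reduced row)
    chosen = []   # indices of the rows of M that were kept
    for idx, row in enumerate(M):
        vec = [Fraction(x) for x in row]
        # Eliminate the components along the rows already kept.
        for pivot_col, b in basis:
            if vec[pivot_col] != 0:
                factor = vec[pivot_col] / b[pivot_col]
                vec = [x - factor * y for x, y in zip(vec, b)]
        pivot = next((c for c, x in enumerate(vec) if x != 0), None)
        if pivot is not None:   # a nonzero remainder means the row is independent
            basis.append((pivot, vec))
            chosen.append(idx)
    return chosen


# Hypothetical block: s = 5 SNPs (rows), h - 1 = 3 haplotypes (columns).
M = [[1, 1, 0],
     [1, 1, 0],   # equivalent to SNP 0
     [0, 0, 1],
     [1, 1, 1],   # sum of SNPs 0 and 2
     [0, 1, 0]]

print(independent_rows(M))  # [0, 2, 4]
```

When the number of rows returned equals h – 1, the corresponding SNPs form a submatrix M* that can be inverted to recover u.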

Claim 4. In a coalescent haplotype block, haplotype frequencies can be computed from the allele frequencies of h – 1 tagging SNPs.

Proof. According to Claim 1, there exists a set S of h – 1 SNPs that may be chosen to distinguish a single haplotype. Consider their corresponding branches of the gene (coalescence) tree T [ref]. These branches partition the tree into h tree segments, each harboring a single haplotype on one of its nodes (see Figure A1). To prove that the rank of M is h – 1, it suffices to make sure that the rows corresponding to the chosen SNPs are linearly independent, as follows.

We order the SNPs s_1, ..., s_{h–1} in S according to the distance of their corresponding branches from the root of T. We arbitrarily set the "0" allele of each SNP as the one present in the root haplotype (see Figure A1). We now start with a tree T_0 that contains only the root (all-zero) haplotype, which we denote h_0; we then grow the tree by adding SNPs and haplotypes one at a time, as we now detail. We add to T_{j–1} the SNP s_j and the subtree that s_j connects to T_{j–1}, calling the resulting tree T_j and the single haplotype in this subtree h_j. For each j = 1, ..., h – 1, the haplotype h_j is the only one in T_j for which s_j has the "1" allele. Therefore h_j is linearly independent of h_1, ..., h_{j–1}. This implies that the matrix whose columns are h_1, ..., h_{h–1} is of rank h – 1. ∎
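The argument can be illustrated on a small tree. The sketch below makes simplifying assumptions that are not in the proof: every node of the tree is a contemporary haplotype, each branch carries exactly one SNP, and nodes are numbered so that a parent always precedes its children (mirroring the ordering by distance from the root). Under these assumptions the matrix M, with the all-zero root haplotype omitted, is upper triangular with a unit diagonal, hence of full rank h – 1. The tree and the function names are hypothetical.

```python
def haplotype_matrix(parent):
    """Build M for a coalescent tree.

    parent[j] is the parent of node j, with parent[0] = None for the root and
    parent[j] < j for every other node. The branch into node j carries SNP j.
    Entry (i, j) of the result is 1 if haplotype j + 1 carries the derived
    ("1") allele of SNP i + 1; the root haplotype is all-zero and omitted.
    """
    n = len(parent)
    carried = [set() for _ in range(n)]
    for j in range(1, n):
        # A haplotype carries every SNP on the path from the root to its node.
        carried[j] = carried[parent[j]] | {j}
    return [[1 if i in carried[j] else 0 for j in range(1, n)]
            for i in range(1, n)]


def is_unit_upper_triangular(M):
    """True if M has ones on the diagonal and zeros below it (determinant 1)."""
    n = len(M)
    return (all(M[i][i] == 1 for i in range(n))
            and all(M[i][j] == 0 for i in range(n) for j in range(i)))


# Hypothetical tree with h = 6 haplotypes (nodes 0..5); node 0 is the root.
parent = [None, 0, 0, 1, 1, 3]
M = haplotype_matrix(parent)

for row in M:
    print(row)
print(is_unit_upper_triangular(M))  # True
```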


*  APPENDIX B

SIMULATION METHODS
Haplotype block structures were represented by binary matrices, whose entries denote alleles of corresponding haplotypes. As discussed in this article, we focus on such matrices that are invertible. Haplotype block structures were therefore emulated by randomizing an invertible matrix M. We assume that the true vector v of SNP allele frequencies is not accurately available to us. Rather, we measure a vector v' = v + e, where e is a vector whose entries are the small measurement errors for each of the typed SNPs. When using M^{-1} to recover the haplotype vector using Equation 2, one actually computes u' = M^{-1}v' = u + M^{-1}e, inducing an error of M^{-1}e (independent of v) in the computed haplotype frequencies. Our simulations examine the error magnification due to this multiplication by M^{-1}. A matrix M was obtained from each block in GABRIEL et al. (2002), and its inverse was multiplied by a vector e of normally distributed measurement errors. The resulting haplotype frequency errors were registered and plotted in Figure 1.
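A simulation of this kind can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used to produce Figure 1: the magnification is summarized here as the ratio of root-mean-square errors, the standard deviation of 2% is an arbitrary choice within the 1–5% range quoted in the text, and the matrix M is hypothetical rather than taken from a real block.

```python
import random


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(map(float, row)) + [float(y)] for row, y in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def error_magnification(M, sigma=0.02, trials=2000, seed=0):
    """Ratio of RMS haplotype-frequency error to RMS SNP-frequency error."""
    rng = random.Random(seed)
    n = len(M)
    sq_in = sq_out = 0.0
    for _ in range(trials):
        e = [rng.gauss(0.0, sigma) for _ in range(n)]   # measurement errors
        err = solve(M, e)                               # M^{-1} e, error in u'
        sq_in += sum(x * x for x in e)
        sq_out += sum(x * x for x in err)
    return (sq_out / sq_in) ** 0.5


# Hypothetical invertible block matrix (3 tagging SNPs x 3 haplotypes).
M = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 1]]
print(round(error_magnification(M), 2))
```

Because the induced error M^{-1}e does not depend on v, the true frequencies need not be simulated at all; only the error vector is drawn.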


*  LITERATURE CITED

AKEY, J., L. JIN, and M. XIONG, 2001 Haplotypes vs single marker linkage disequilibrium tests: What do we gain? Eur. J. Hum. Genet. 9:291-300.

BADER, J. S., 2001 The relative power of SNPs and haplotype as genetic markers for association tests. Pharmacogenomics 2:11-24.

BARRATT, B. J., F. PAYNE, H. E. RANCE, S. NUTLAND, and J. A. TODD et al., 2002 Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design. Ann. Hum. Genet. 66:393-405.

COUZIN, J., 2002 Human genome. HapMap launched with pledges of $100 million. Science 298:941-942.

DALY, M. J., J. D. RIOUX, S. F. SCHAFFNER, T. J. HUDSON, and E. S. LANDER, 2001 High-resolution haplotype structure in the human genome. Nat. Genet. 29:229-232.

GABRIEL, S. B., S. F. SCHAFFNER, H. NGUYEN, J. M. MOORE, and J. ROY et al., 2002 The structure of haplotype blocks in the human genome. Science 296:2225-2229.

JOHNSON, G. C., L. ESPOSITO, B. J. BARRATT, A. N. SMITH, and J. HEWARD et al., 2001 Haplotype tagging for the identification of common disease genes. Nat. Genet. 29:233-237.

JUDSON, R., B. SALISBURY, J. SCHNEIDER, A. WINDEMUTH, and J. C. STEPHENS, 2002 How many SNPs does a genome-wide haplotype map require? Pharmacogenomics 3:379-391.

KAPLAN, N. and R. MORRIS, 2001 Issues concerning association studies for fine mapping a susceptibility gene for a complex disease. Genet. Epidemiol. 20:432-457.

KWOK, P. Y., 2001 Methods for genotyping single nucleotide polymorphisms. Annu. Rev. Genomics Hum. Genet. 2:235-258.

LONG, A. D. and C. H. LANGLEY, 1999 The power of association studies to detect the contribution of candidate genetic loci to variation in complex traits. Genome Res. 9:720-731.

MOHLKE, K. L., M. R. ERDOS, L. J. SCOTT, T. E. FINGERLIN, and A. U. JACKSON et al., 2002 High-throughput screening for evidence of association by using mass spectrometry genotyping on DNA pools. Proc. Natl. Acad. Sci. USA 99:16928-16933.

PATIL, N., A. J. BERNO, D. A. HINDS, W. A. BARRETT, and J. M. DOSHI et al., 2001 Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294:1719-1723.

PHILLIPS, M. S., R. LAWRENCE, R. SACHIDANANDAM, A. P. MORRIS, and D. J. BALDING et al., 2003 Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots. Nat. Genet. 33:382-387.

RISCH, N. and K. MERIKANGAS, 1996 The future of genetic studies of complex human diseases. Science 273:1516-1517.

SELTMAN, H., K. ROEDER, and B. DEVLIN, 2003 Evolutionary-based association analysis using haplotype data. Genet. Epidemiol. 25:48-58.

SHAM, P., J. S. BADER, I. CRAIG, M. O'DONOVAN, and M. OWEN, 2002 DNA pooling: a tool for large-scale association studies. Nat. Rev. Genet. 3:862-871.

TEMPLETON, A. R., E. BOERWINKLE, and C. F. SING, 1987 A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping. I. Basic theory and an analysis of alcohol dehydrogenase activity in Drosophila. Genetics 117:343-351.

ZHANG, K., M. DENG, T. CHEN, M. S. WATERMAN, and F. SUN, 2002 A dynamic programming algorithm for haplotype block partitioning. Proc. Natl. Acad. Sci. USA 99:7335-7339.

ZHANG, K., J. M. AKEY, N. WANG, M. XIONG, and R. CHAKRABORTY et al., 2003 Randomly distributed crossovers may generate block-like patterns of linkage disequilibrium: an act of genetic drift. Hum. Genet. 113:51-59.

ZOLLNER, S. and A. VON HAESELER, 2000 A coalescent approach to study linkage disequilibrium between single-nucleotide polymorphisms. Am. J. Hum. Genet. 66:615-628.



