Abstract
The “matrix coalescent” is a reformulation of the familiar coalescent process of population genetics. It ignores the topology of the gene tree and treats the coalescent as a Markov process describing the decay in the number of ancestors of a sample of genes as one proceeds backward in time. The matrix formulation of this process is convenient when the population changes in size, because such changes affect only the eigenvalues of the transition matrix, not the eigenvectors. The model is used here to calculate the expectation of the site frequency spectrum under various assumptions about population history. To illustrate how this method can be used with data, we then use it in conjunction with a set of SNPs to test hypotheses about the history of human population size.
THE history of population size is a point of general interest in studies of biological variation. Among other things, population size changes can affect levels of heterozygosity, allele frequency, and the extent of linkage disequilibrium (Harpendinget al. 1998; Terwilligeret al. 1998). In humans, these effects are an important consideration in problems ranging from evolutionary biology to gene mapping. Thus, information about long-term population size contributes to the understanding of both ancient human history and modern human biology.
Information about the history of human population size comes from a variety of sources. Archaeology, paleoanthropology, linguistics, and historical documentation are all important. Over the last 25 years, however, genetic evidence has risen to the forefront. By providing information inaccessible through traditional means, genetic data play a key role in inferences about the ancient human past. Central to this role are the theoretical tools of population genetics, which attempt to describe the relationship between demography and genetic diversity. Among these tools, models of the coalescent process have distinguished themselves as a way to extract information about past patterns of population size change from present patterns of genetic variation (Fu and Li 2001).
The coalescent process (Kingman 1982a; Hudson 1990) describes the ancestry of a sample of genes. As we trace the ancestry of each modern gene backward from ancestor to ancestor, we occasionally encounter common ancestors—genes whose descendants include more than one gene in the modern sample. Each time this happens, the number of ancestors decreases by one. Eventually, we reach the gene that is ancestral to the entire modern sample, and the process ends. This process provides a natural description of genetic variation, which can be described both in terms of the topological (or genealogical) relationships among genetic lineages and the genetic distances (or coalescence times) between them.
The coalescent process is an example of a Markov process—a stochastic process in which the probability of moving from one state to another depends only on the state you are in, not on the states you have previously visited. In previous literature, attention has focused on the Markov chain that governs not only the lengths of the intervals between coalescent events but also the topology of the resulting gene genealogy (e.g., Takahata 1988). In this article, we introduce a reduced version of the Markov chain that ignores topology and deals only with the lengths of intervals. Our procedure has a number of advantages, especially in dealing with variation in population size. After introducing the model, we use it to study a set of human single-nucleotide polymorphisms (SNPs).
MODEL
The matrix coalescent: If time is measured backward into the past, and a sample of k lineages is selected t generations before present from a haploid population with size N(t), then the probability that the k sampled lineages have k – 1 distinct ancestors t + 1 generations before present is approximately
A sample of n lineages gathered at the present (t = 0 generations ago) will have a genealogy proceeding from the state of having n distinct lineages to the state of having n – 1 lineages, and so on down to one lineage, at a rate determined by the transition probabilities αn(t), αn–1(t),..., α2(t). In general, the probability, pk(t), of observing k lineages t generations before present where n ≥ k ≥ 1 is described by a system of recurrence equations
In calculations, we exclude terms for the absorbing state, in which there is just a single lineage. This is not restrictive, since we can always calculate
Eigenvalues and eigenvectors: Since A(t) is a triangular matrix, its eigenvalues are equal to its diagonal entries: –α2(t),..., –αn(t). The column eigenvectors of A(t) are defined by the equation A(t)c = cλ, where λ is a scalar—one of the eigenvalues of A—and c a column eigenvector. This equation can be reexpressed (suppressing t) as
Equation 6 also implies that the column eigenvectors are time invariant: Substitute (1) into (6) for the jth column eigenvector to obtain
The row eigenvectors of A are defined by rA =λr, where λ is an eigenvalue of A and r is the corresponding row eigenvector. This equation can be reexpressed as
Before these eigenvectors can be used, they must be normalized so that RC = I, where I is the identity matrix. Since both matrices are upper triangular, this requires only that, for eigenvector j, we ensure that rjcj = 1. Our computer program normalizes the eigenvectors by setting rj = cj = 1.
By expanding the matrix exponential in Equation 5 in diagonal form, we obtain
The second row of plots in Figure 1a shows how pk(t) varies with t for several different values of k and under three population histories: a sudden population increase, a gradual increase, and a gradual increase with periodic cycling.
Expected lengths of coalescent intervals: Let m denote the vector whose kth entry, mk, is the expected duration in generations of the interval during which the process contains k lineages. There is a close relationship between m and p: The kth entry in p(t) is the probability that generation t makes a contribution to the interval during which the process has k lineages. To calculate m, we integrate across p:
—Theoretical expectations. Shown are the relationships between population history, probability of coalescence, expected interval length, and theoretical frequency spectrum under three population histories for samples of n = 5 lineages. (a) Top, population size over time; middle, the probability that there are still k distinct lineages in the genealogy t generations ago, given the population history at top; bottom, expected interval lengths given the population history at top. Dashed vertical lines indicate that no particular branching order is implied for the genealogies. (b) Normalized frequency spectra for the genealogies represented in a. These results were generated using the Maple 5.1 software package using default numerical precision.
The theoretical frequency spectrum of mutations: The frequency spectrum is the distribution describing the relative abundance of alleles occurring i = 1,2,..., n – 1 times in a sample of n homologous genes. Spectra from populations that have increased in size show an overabundance of rare variants relative to populations of constant size, but populations that have decreased show an underabundance (Harpendinget al. 1998; Wooding 1999). The sensitivity of the frequency spectrum to population size change is exploited in several statistical tests of stationarity or neutrality (Tajima 1989; Fu 1997).
A polymorphic nucleotide site is ordinarily present in only two states within a sample, one of which is ancestral and the other derived. The expected fraction, σk, of sites at which the derived allele occurs k times is given by
The probability y(j, k, n) is given by Polya's distribution:
Numerical methods: Equation 5 contains a matrix exponential, and these are notoriously difficult to evaluate numerically (Moler and Loan 1978). It does not help to expand the exponential in terms of eigenvalues and eigenvectors. When the sample size is much over 50, the two eigenvector matrices in Equation 8 will contain very large numbers as well as small ones. Even worse, the entries of each row of C alternate in sign, leading to severe cancellation errors in the matrix product on the right-hand side of Equation 8. With samples of even moderate size, straightforward evaluation of these equations can produce results without any significant digits.
We deal with these problems in two different ways. For population histories of arbitrary complexity, we resort to brute force and use the CLN-1.0.1 programming library (Haible 2000) to perform computations either with high-precision floating-point numbers or with rational numbers. By varying the precision, it is possible to determine how many digits of precision are needed. Some of our calculations were performed with floating-point numbers using 500 decimal digits of precision.
Better alternatives are available when the population's history is piecewise constant. By this we mean that the history is divided into a series of epochs within each of which N(t) is constant. If the number of epochs is large, the piecewise constant model can approximate any history of population size. Even with only a few epochs, it is probably realistic for populations whose sizes are ordinarily held constant by density-dependent population regulation.
For such histories, we use the “uniformization” algorithm of Stewart (1994, Chap. 8), to evaluate equation 5 across a single epoch of the population's history. This makes it possible to project the vector p backward in time epoch by epoch. With this method, double-precision floating-point calculations are able to deal with problems involving samples of at least 1000.
This method for projecting p backward in time also makes it easy to calculate m. Details are given in appendix b.
Statistical methods: Under the assumption that the genealogies of unlinked sites are statistically independent, the log-likelihood of an observed data set (D) given a hypothetical population history (H) is
Application to human SNPs: SNPs are a potentially valuable source of information about population history: They are abundant, they are spread widely across the genome, and they are relatively inexpensive to assay. Most studies of SNPs are focused on their potential epidemiological applications (e.g., Cargillet al. 1999; Halushkaet al. 1999). SNPs have also been exploited as a source of information about the process of natural selection (Sunyaevet al. 2000; Fayet al. 2001). We focus here on human population history, although we include some discussion of selective processes.
Cargill et al. (1999) surveyed SNPs in 196.2 kb of nuclear DNA sequence in 20 Europeans, 14 Asians, 10 African Americans, and 7 African Pygmies. Most of the sequence was from the coding portion of genes implicated in cardiovascular, endocrine, and neuropsychiatric diseases, but some noncoding sequence was sequenced in flanking and intervening regions. Each amplified segment was screened by both DNA sequencing and denaturing high-performance liquid chromatography, and every putative SNP was verified by resequencing (Cargillet al. 1999). In total, 612 SNPs were identified in 106 genes.
The laboratory methodology used by Cargill et al. (1999) avoided some problems such as false positives, but two features of the SNP data made analysis difficult. First, the SNPs were a combination of linked and unlinked loci. Second, different SNP loci were assayed in different numbers of chromosomes. Some SNPs were sampled in 28 chromosomes, for example, while others were sampled in 114. To cope with these problems, Cargill et al. (1999) were forced in some analyses to rely on doubtful assumptions. The matrix coalescent provides an alternative approach. It cannot accomodate sites with varying levels of linkage, but likelihood-ratio tests can take varying sample sizes into account.
To take advantage of the informativeness of unlinked sites and to avoid the confounds associated with partial linkage, we resampled the original data set randomly in three steps:
-
All of the SNPs reported by Cargill et al. (1999) were divided into the three categories reported originally: coding nonsynonymous (cns) and coding synonymous (cs) and noncoding (nc) sites near genes.
-
To minimize linkage between sampled sites, only one randomly chosen SNP from each category was scored for each reported gene. If no SNPs in a category were found in a given gene, then no SNP in that category was chosen from the gene.
View this table:TABLE 1Sampled SNPs
-
The number of sites in each of the frequency categories reported in Cargill et al. (1999; 0–5%, 5–15%, and 15–50%) was tabulated for cns, cs, and nc SNPs using the dbSNP database (Sherry et al. 2000, 2001).
—Pitch plot of deviations from expectation. The deviation of each frequency category (0–5%, 5–15%, and 15–50%) from expectation in each SNP category (cns, cs, and nc is shown). Deviations given are the mean difference from expectation across all sample sizes within a SNP category, since not all loci were sampled in the same number of chromosomes. The hypotheses are as follows: (a) stationarity, constant population size, and selective neutrality; (b) mtDNA (recent), the most recent population expansion not rejected by Rogers (1995) on the basis of mtDNA polymorphism (see model); (c) mtDNA (ancient), the most ancient population expansion not rejected by Rogers (1995); (d) the maximum-likelihood parameters for coding nonsynonymous SNPs (cns); (e) the maximum-likelihood parameters for coding synonymous SNPs (cs); (f) the maximum-likelihood parameters for noncoding SNPs near genes (nc). The probability of the observed data given the history is indicated for hypotheses that were rejected.
Totals of 60 cns loci, 68 cs loci, and 30 nc loci were included in the randomized data set, which was composed of sites from at least 19 different chromosomes (Table 1). The sites within each category, which were always from different genes and often from different chromosomes, were assumed to be unlinked.
SNPs occurring k times could not be distinguished from SNPs occurring n – k times for roughly one-half of the SNPs in the original data set, so theoretical spectra were “folded” at frequency 0.5 in tests here, as described by Harpending et al. (1998).
Likelihoods of hypotheses given the observed frequency spectra were generated for each data set over a series of hypothetical population histories. Although the matrix coalescent can cope with very complicated models of history, it is doubtful that we could estimate more than a few parameters with the data at hand. We have therefore limited our analysis to piecewise-constant population histories containing two history epochs. We define these histories using three parameters: N0 is the population size during the most recent epoch (epoch 0), N1 is that in the earlier epoch (epoch 1), and T is the duration of epoch 0 in generations. Epoch 1 is assumed to have infinite duration. Our analysis loses a degree of freedom because the data are a collection of polymorphic sites and do not inform us about the fraction of sites that are polymorphic within the region of the genome under study. Thus, instead of working directly with the three parameters just defined, we work instead with two: τ= T/N0 and ρ= N0/N1. Here, ρ is a parameter representing the magnitude of population growth and τ is a parameter representing the time of population size change. Each parameter introduced 1 d.f. in likelihood-ratio tests. Maximum-likelihood estimates of ρ and τ were obtained for each SNP category by iterating over a series of values of ρ and τ.
—Maximum-likelihood parameter estimates. Open circles show the parameters of two alternative hypotheses estimated from mitochondrial DNA (see model).
Five hypotheses were tested for each SNP category. First, the maximum likelihood of each category was compared with the category's likelihood under the maximum-likelihood parameters of the other two categories (Figure 2). Then the maximum likelihood of each SNP category was tested against the category's likelihood of three alternatives: (a) stationarity, (b) the most recent population expansion not excluded by Rogers (1995; τ= 4.7 × 10–3, ρ= 1000), and (c) the most ancient population expansion not excluded by Rogers (1995; τ= 2.1 × 10–2, ρ= 1000); (see Figure 2).
Cargill et al. (1999) found that the frequency distribution of cs and nc SNPs differed significantly from that of cns SNPs, and that cns SNPs showed an excess of low-frequency variants. Fay et al. (2001) found differences between the frequency spectra of synonymous and nonsynonymous changes in the Cargill et al. (1999) data set as well. Our parameter estimates confirm these results (Figure 3). The cns category had maximum-likelihood parameters implying recent population growth under the assumption of selective neutrality (τ= 8.6 × 10–6 and ρ= 9900), and the cs and nc categories yielded estimates implying little or no change in population size (ρ= 0.4 for cs and 0.6 for nc).
Maximum-likelihood estimates for the nc data set were not rejected as an explanation for the cs data set at the 0.05 level, but the maximum-likelihood estimates for nc and cs data sets were rejected as an explanation for the cns data set. The maximum-likelihood parameters for the cns data set were almost (but not quite) rejected as an explanation for the nc (p < 0.08). The nc and cs data were indistinguishable, but both could be distinguished from cns (Figure 2). In addition, the cns data showed an excess of low frequency variation relative to expectations under stationarity, as Cargill et al. (1999) also observed.
The failure of likelihood-ratio tests to distinguish between cs and nc categories is a result of their similar ρ estimates. When ρ is near 1 the time of population size change has little effect on the frequency spectrum, and confidence intervals around τ are broad. When ρ is exactly 1 they extend to infinity regardless of sample size. Given the nearness of the nc SNPs to coding regions, the similarity of nc and cs frequency spectra is consistent.
If evolutionary processes in SNPs are neutral, then the three categories should be indistinguishable, yet clearly they are not. The frequency spectrum in cns SNPs differs from that of nc and cs SNPs, and none of the observed spectra is consistent with hypotheses about human population growth inferred from mtDNA.
DISCUSSION
The model introduced here differs from the coalescent theory introduced by Kingman (1982a) in that it ignores the topology of the gene genealogy. This simplified theory has a smaller state space than the classical theory, and it is easy to apply the elementary methods of the theory of Markov chains. This opens up opportunities for the study of populations that vary in size.
Conventional coalescent theory can deal with varying population sizes, as well: One simply uses 1/N(t) as the unit of time in generation t. [This procedure was suggested by Kingman (1982b, p. 31) and has been used by many later authors; we use it in appendix b of this article.] However, this procedure is awkward when mutations are introduced, because mutations occur at a constant rate on the normal (not the rescaled) time-scale. This has complicated efforts to calculate quantities like the expected site frequency spectrum under models of varying population size. Such problems are easier under the formulation introduced here.
The results of this study clearly reject the hypothesis that the cns, cs, and nc SNP data were produced by drift and mutation alone under a model of recent population expansion. The simplest explanation for the present results, taken in isolation, is that human population size has been constant, but some form of selection has affected the cns data. The preponderance of low-frequency polymorphisms in those data is consistent either with purifying selection acting on linked sites or with a selective sweep (Bravermanet al. 1995; Fu 1997). Fay et al. (2001), for example, found evidence for purifying selection in an analysis of the ratios of synonymous and nonsynonymous variants in different frequency categories in the Cargill et al. (1999) data set. Similar patterns of variation have been attributed to weak purifying selection elsewhere (Przeworskiet al. 1999).
Yet the present results should probably not be taken in isolation. Genetic data from substantial human samples involving a variety of genetic systems are now published. These can be divided into two categories: noncoding regions that on a priori grounds ought to be selectively neutral and coding regions (or closely linked introns) that on a priori grounds are more likely to be selected. The presumably neutral systems all show evidence either of population growth or of a selective sweep. (We cannot tell the difference.) The presumably selected systems are all consistent either with neutral evolution under constant population size or with weak balancing selection. To account for this strange pattern, Harpending and Rogers (2000) suggested that population growth did in fact occur during the Late Pleistocene, but that its signature has been obscured in the coding portions of the human genome by pervasive balancing selection. Additional data sets that have appeared since then have been consistent with this hypothesis (Rogers 2001). Thus, it is natural to wonder whether the present data set is also consistent with this hypothesis. Let us consider, then, the possibility that the present data reflect the simultaneous effects of population growth and selection.
In the absence of selection, population growth produces a genealogy without deep branches. Balancing selection has the opposite effect; it may maintain two or more allelic classes for a very long time. Since balancing selection and population growth affect genealogies in opposite ways, each tends to obscure the effect of the other. These countervailing effects, however, would not be reflected equally in our three categories of data. Many mutations would occur on the long branches that separate allelic classes, but only the neutral mutations would survive long. Consequently, these long branches would contribute mainly to the SNPs in our cs and nc categories. This is of interest because mutations that occur on the deepest branches of the genealogy can have intermediate frequencies (i.e., far from 0 or 1). Thus, balancing selection inflates the count of loci with intermediate frequencies, but this effect is visible mainly in the the cs and nc categories. Since mutations on deep branches contribute less to the cns category, balancing selection is less likely to obscure the effect of population growth there. Thus, cns SNPs are more likely to show the elevated count of alleles with extreme frequencies (near 0 or 1) that one associates with a population expansion. The count of extreme-frequency cns SNPs should be additionally elevated by recent deleterious mutations that have not yet been removed by purifying selection. For both reasons, the count of extreme-frequency loci should be larger among cns SNPs than among cs or nc SNPs. This is exactly the pattern that we observe.
There are undoubtedly other ways to explain these data, and there is no good reason for confidence in the hypothesis we just proposed. Our point is merely that the present data are consistent with the view that the human population underwent an expansion whose effects are visible in data from neutral loci but are hidden by balancing selection at protein-coding loci.
Acknowledgments
Henry Harpending, Jon Seger, Stewart Ethier, John Hawks, Pat Corneli, David Witherspoon, Josh Cherry, Pui-Yan Kwok, Brad Demarest, and Lara Carroll provided helpful comments and discussion. Nelson Beebe provided helpful advice on numerical methods. Yun-Xin Fu and two anonymous reviewers provided helpful comments. S.W. was supported by a National Institutes of Health (NIH) Genome Sciences Training Grant (Genome Informatics) to the University of Utah. A.R. was supported by NIH grant GM-59290 to the University of Utah. Software developed for this project is available at http://www.anthro.utah.edu/~rogers/src.
APPENDIX A: THE EXPECTED SITE FREQUENCY SPECTRUM
We assume that mutations are rare enough that the possibility of multiple mutations in a single gene genealogy can be ignored. Let Aj denote the event that exactly one mutation occurs within the portion of the genealogy containing j lineages, B the event that exactly one mutation occurs within the genealogy as a whole, and Pr[Aj|B] the conditional probability of Aj given B. The conditional probability that the mutant site will appear k times within a sample, given B, is
To calculate the unconditional probability of Aj, let Lj denote the length of the jth coalescent interval in a random gene tree. Then jLj is the total branch length associated with that coalescent interval. The conditional probability, given Lj, that a single mutation occurs within this interval is
A similar argument gives
APPENDIX B: CALCULATING EXPECTED INTERVAL LENGTHS UNDER PIECEWISE CONSTANT POPULATION HISTORIES
Our goal in this section is to calculate the vector m, which contains the expected lengths of the intervals between coalescent events. To simplify the problem, we first separate N(t) from A(t) by defining
Suppose now that the population's history is divided into K + 1 epochs within each of which N(t) is constant. We can reexpress F(0, ∞) as a sum of contributions from these epochs:
To recover m from Equation 16, we must right multiply each of the F's by p(0), a process that yields
Our computer program uses the projection methods discussed previously to calculate the probability vectors p̃(v), then subtracts pairs of vectors, and finally applies NiB–1. This last step is easy. For example, if n = 4,
Footnotes
-
Communicating editor: Y.-X. Fu
- Received July 5, 2001.
- Accepted May 10, 2002.
- Copyright © 2002 by the Genetics Society of America