Abstract
To better understand the forces affecting individual alleles, we introduce a method for finding the joint distribution of the frequency of a neutral allele and the extent of variability at closely linked marker loci (the intraallelic variability). We model three types of intraallelic variability: (a) the number of nonrecombinants at a linked biallelic marker locus, (b) the length of a conserved haplotype, and (c) the number of mutations at a linked marker locus. If the population growth rate is known, the joint distribution provides the basis for a test of neutrality by testing whether the observed level of intraallelic variability is consistent with the observed allele frequency. If the population growth rate is unknown but neutrality can be assumed, the joint distribution provides the likelihood of the growth rate and leads to a maximumlikelihood estimate. We apply the method to data from published data sets for four loci in humans. We conclude that the Δ32 allele at CCR5 and a diseaseassociated allele at MLH1 arose recently and have been subject to strong selection. Alleles at PAH appear to be neutral and we estimate the recent growth rate of the European population to be ∼0.027 per generation with a support interval of (0.0170.037). Four of the relatively common alleles at CFTR also appear to be neutral but ΔF508 appears to be significantly advantageous to heterozygous carriers.
THE age of an allele determines both its frequency and the extent of variation at closely linked marker loci. Allele age itself cannot be observed, but, given assumptions about past selection and population growth, it constrains the relationship between frequency and intraallelic variability. A large discrepancy between allele frequency and the extent of intraallelic variability expected under neutrality provides evidence of past selection. For example, at several loci in the major histocompatibility (MHC) region in humans and other species and at selfincompatibility loci in several plant species, alleles are found in low frequencies yet exhibit substantial variation among different copies of each allele (Richmanet al. 1996; Hughes and Yeager 1998). That pattern suggests that balancing selection has retained alleles for much longer times than would be expected if they were neutral. A quite different pattern is found at the CCR5 locus in humans. A 32bp deletion (Δ32) retards the onset of AIDS in heterozygous carriers and provides resistance to infection by human immunodeficiency virus in homozygous individuals. This deletion is at a frequency of >10% in Europeans (Stephenset al. 1998). Yet very strong linkage disequilibrium with two closely linked microsatellite loci indicates the deletion is young, on the order of 1000 years or less, leading Stephens et al. (1998) to conclude that this allele has been subject to strong positive selection.
In this article, we develop a formal theory of the relationship between allele frequency and the extent of intraallelic variability under general assumptions about the demographic history of a population. Our theory can be used in two ways. If the pattern and rate of population growth are assumed to be known, our theory provides a statistical test of neutrality. If the past growth rate is unknown but the allele can be assumed to be neutral, our theory provides a way to estimate the past growth rate.
Our results are obtained by combining two models. The first provides the genealogical history of an allele. The second predicts the extent of intraallelic variability given the intraallelic genealogy. We consider three ways of measuring intraallelic variability: (a) the number of chromosomes carrying the ancestral allele at a linked marker locus; (b) the length of a haplotype shared by all copies of an allele; and (c) the number of mutations at one or more closely linked marker loci.
COALESCENT MODEL OF THE INTRAALLELIC GENEALOGY
We assume that we have a sample of n chromosomes from a randomly mating population of diploid individuals. The history of population size is described by a function N(t), where t = 0 is the present and t indicates the number of generations in the past. The current population size is N(0) = N_{0}. We assume that i chromosomes in the sample carry an allele A, which is distinguished by a mutation that occurred only once, and hence all i copies of A are identical by descent at the site of the defining mutation. The allele age, t_{1}, is the time at which the defining mutation occurred. The i Abearing chromosomes are not necessarily identical at other sites and loci, and the extent of difference among them is the intraallelic variability. How that variability is described and analyzed depends on the data available.
At the site of the mutation defining A, the history of the n chromosomes in the sample can be represented by a gene genealogy that traces ultimately to a single common ancestor. The statistical properties of the gene genealogy are well known (Tavaré 1984). For A to have arisen by mutation t_{1} generations ago and have exactly i descendants now, it must have occurred on an internal branch that creates a subtree with i terminal branches, as shown in Figure 1. That subtree is the intraallelic genealogy, which necessarily has i  1 internal nodes; we denote the times of those nodes by t_{2},..., t_{i} and call them the intraallelic coalescence times. The time t_{k} is the time at which the number of lineages in the intraallelic genealogy increases from k  1 to k (k = 2,..., i), and it is convenient to define t_{i}_{+1} to be 0, the present. In some coalescent models t_{j} is used to denote the coalescence times of the entire gene genealogy. In this article, we do not refer to coalescence times other than of the intraallelic genealogy so there is no danger of confusion with other notation.
The extent of intraallelic variability depends on the intraallelic coalescence times. Those times are random variables whose joint distribution is determined by i, n, and N(t). The coalescent model provides a way to generate random sets of intraallelic coalescence times from the correct distribution. By simulating a large number of replicate sets of intraallelic coalescence times and taking appropriate averages over replicates, we can approximate the distribution of intraallelic variability, given i, n, and N(t).
The intraallelic coalescence times are generated as follows. A neutral gene genealogy is simulated using a method similar to that described by Hudson (1990), but allowing for changes in population size. The procedure is to change the time from t to
For each subtree with i terminal branches, the probability of the observed extent of intraallelic variability is found by combining the intraallelic coalescence times with one of the models described in the next sections. The overall probability is obtained by averaging over a large number of replicates. The average must be weighted by w, the length of the branch connecting that subtree to the rest of the gene genealogy. The reason for using this weighting is that the probability that this subtree represents the genealogical history of A is proportional to the probability that the defining mutation occurred on the branch connecting the subtree to the rest of the gene genealogy, and that probability is proportional to w.
We proceed by describing three different models that generate intraallelic variability. Each model calculates the probability of the data given the set of intraallelic coalescence times, t_{2},..., t_{i}.
LINKED BIALLELIC MARKER LOCUS
Mutation and recombination: We assume that a marker locus is linked to A and that the recombination rate between them is c. There are two alleles at the marker, M and m, subject to reversible mutation with rate μ from M to m and ν from m to M. Initially M is on the chromosome carrying A at t_{2} so there is perfect linkage disequilibrium between M and A; all Abearing chromosomes are MA. Subsequently, mutation and recombination create the mA chromosomes. The extent of intraallelic variability is described by the number of MA chromosomes, j. The mathematical problem is to find the probability distribution of j, given the intraallelic coalescence times and the mutation and recombination rates.
We first consider a single lineage of the intraallelic genealogy. In one generation, the probability that an MA chromosome becomes mA because of mutation and recombination is u = (1  q)c + μ and the probability an mA chromosome becomes MA is v = qc + ν, where q is the frequency of M in the population. Because we are concerned with alleles that arose in the recent past, we assume that q remains constant. Slight random variation in q has a negligible effect on the results. Our analysis would not be appropriate if there were substantial systematic changes in q and that problem would require new theoretical analysis.
Still following a single lineage, the probability that an MA chromosome becomes mA after t generations is
To find the probability distribution of the number of lineages carrying MA in a sample of i chromosomes, we have to account for both the coalescent events and the independent changes on each lineage. Between t_{k} and t_{k}_{+1} there are k lineages (t_{i}_{+1} = 0, the present). When there are k lineages, we can represent the configuration by a vector
At t_{k}_{+1}, one of the k lineages is chosen randomly to give rise to two descendant lineages. The effect of this event can be modeled by multiplying
Given the assumption that initially A was on a chromosome carrying M, p^{(2)} = (0, 0, 1) immediately after t_{2}. Multiplying successively by T^{(}^{k}^{)} and S^{(}^{k}^{)} provides the probabilities of later configurations. In particular, the vector of probabilities that there are j MA chromosomes among the i Abearing chromosomes in the sample today is
Test for neutrality and estimation of population growth rate: Using the above method we can compute p_{j} for a given N(t) and, from that distribution, determine whether the observed number of MA chromosomes is consistent with the assumption of neutrality. If j_{o} is the observed number of MA chromosomes, then the probability P that j is at least as large as j_{o} under neutrality is found by summing over the tail of the distribution,
An alternative is to assume that N(t) is not precisely known and estimate a parameter of it. For example, we could assume that the population has been growing exponentially at rate r in the past to the current size N_{0}, N(t) = N_{0}e^{}^{rt}, and treat r as a parameter to be estimated. If we assume that A is neutral, then p_{j}_{o} regarded as a function of r is the likelihood. A second demographic model we consider is one in which the rate of exponential growth changes at time t^{*} in the past. We describe this model as the double exponential model and define the growth between the present and t^{*} generations in the past to be r_{1} and the growth rate before t^{*} to be r_{2}. For human populations, a reasonable assumption is r_{1} > r_{2}.
Given a single data set, it is impossible to estimate more than one parameter of a model of population growth and hence impossible to know whether the functional form of N(t) assumed is appropriate. But different alleles and different loci all experience the same population growth, so data from different loci can be combined to gain additional information.
LENGTH OF A CONSERVED ANCESTRAL HAPLOTYPE
Model of a continuous chromosome: The general theory of mutation and recombination at several linked marker loci is difficult and not completely developed. At present, a method for efficiently computing configuration probabilities when there are more than two linked markers does not exist, although various approximations have been suggested (Xiong and Guo 1997; Slatkin and Rannala 2000). An alternative is to ignore individual loci and instead model the locations at which recombination has occurred on Abearing chromosomes. McPeek and Strahs (1999) have developed a theory of this type and applied it to disequilibrium mapping. This approach is appropriate for cases in which all copies of A are on chromosomes carrying the same haplotype at several linked marker loci. This conserved haplotype is assumed to have been on the chromosome on which A arose by mutation and persisted on descendant Abearing chromosomes because no recombination occurred within the conserved region. The data are then the size l_{o}, measured in base pairs or in map distance, of the conserved haplotype.
We begin by modeling events on one side of A. At a given position between two bases on the chromosome, the probability that there is a recombination event anywhere on the intraallelic genealogy is the sum of the lengths of the branches (i.e., the tree length, T) multiplied by recombination rate per base pair, denoted by ρ. The probability that there was no recombination between adjacent bases is 1  ρT. The probability that there was no recombination at a distance l bases from A is (1  ρT)^{l} or approximately e^{ρ}^{Tl}. If we let l_{1} indicate the distance on one side of A, the cumulative distribution is exponential
Equation 4 is the cumulative probability that there was no recombination within a distance L on one side of the allele. The total length of an unrecombined segment containing A is the sum of the lengths on the two sides, l = l_{1} + l_{2}. The two lengths can be treated as independent random variables provided that the size of the conserved haplotype is sufficiently small that interference in recombination can be ignored. In a small region of a chromosome, the probability of two or more recombination events in a single generation is so low that interference has little effect. The probability distribution of l is then the convolution of two exponential distributions,
Test for neutrality and estimation of population growth rate: As in the case of a biallelic linked marker, (6) and (7) provide a basis for either a test of neutrality or a method for estimating a parameter of the population growth model. The observed length of the shared ancestral haplotype containing A is l_{o}, and that is a minimum estimate because it is based on the detection of a shared haplotype at polymorphic marker loci near A. To test neutrality under the assumption that N(t) is known, we use (5) and (7) to compute Pr(l ≥ l_{o}) for each set of intraallelic coalescence times and then take the weighted average to obtain an approximation to Pr(l ≥ l_{o}). If that probability is less than the specified significance level, we reject the hypothesis that the allele is neutral. If instead we assume a parameter of N(t) is unknown, we find the likelihood of that parameter as a function of the data. To do so, we approximate Pr(l_{o}) by computing the weighted average of (6) over a large number of replicates.
INFINITE SITES MODEL
A mutation model that is commonly used is the infinite sites model in which a locus is represented by a very large number of completely linked sites. The mutation rate is assumed to be sufficiently small and the number of sites sufficiently large that each site can mutate at most once. Although these assumptions are not valid for most of the nuclear genome, they are convenient for some purposes, particularly when the mutation process at the marker locus or loci is poorly understood. It may be better to assume the infinite sites model and infer the minimum number of mutations that have occurred than to introduce additional uncertainty by assuming an incorrect mutation model. In our applications to PAH and CFTR, the linked markers are microsatellite loci and the data indicate the minimum number of mutations that have occurred at those loci.
Test for neutrality and estimation of population growth rate: Under the infinite sites model, intraallelic variability is described by the number S of mutations at a locus linked to A on the i chromosomes sampled. Given a set of intraallelic coalescence times, (5) provides the intraallelic tree length, T. The distribution of S is Poisson with parameter μT, where μ is the mutation rate at the marker locus. Taking a weighted average of this Poisson distribution gives us Pr(S). If S_{o} is the observed number of segregating sites, then a test of neutrality is obtained by summing the probabilities in the tail of the distribution:
COMBINING INFORMATION ACROSS ALLELES AT A LOCUS
Likelihoodratio test of homogeneity: When data from several alleles at a locus can be analyzed, intraallelic variability is usually assessed at the same linked marker locus or loci. Such a data set provides an opportunity to test whether one of the alleles differs significantly from the others. Because our method provides the likelihood of population growth rate for each allele separately, one test for homogeneity is a likelihoodratio test, which is done as follows. For the mth allele at a locus, let L_{m}(r) be the likelihood of r, given the parameters of the model generating intraallelic variability, i.e., the mutation and/or recombination rates. Let rˆ_{m} be the maximumlikelihood estimate (MLE) of r for allele m. Assuming that the alleles are independent, the joint likelihood of r is
In a test for homogeneity, the null hypothesis is that all alleles are equivalent, meaning that the same r applies to each. The likelihood of the data under the null hypothesis is L_{0} = L_{J}(rˆ_{J}). For allele m, we can consider the alternative hypothesis that a different value of r is needed for it and a second value of r can be used for the rest. The likelihood of the data under the alternative hypothesis is L^{*} = L_{m}(rˆ_{m})L_{}_{m}(rˆ_{}_{m}). Under the null hypothesis, χ^{2} = 2 ln(L_{0}/L^{*}) is asymptotically distributed as a chisquare deviate with 1 d.f., although the asymptotic theory may not be appropriate for small sample sizes.
Estimate of r based on the joint likelihood: If the test of homogeneity does not indicate significant differences among alleles, the joint likelihood provides an MLE of r, rˆ_{J}, and that will be a better estimate than is provided by each allele separately. That estimate still depends on the mutation and/or recombination rates and on the assumption of neutrality. Comparisons of estimates across loci can provide additional information.
APPLICATIONS
CCR5Δ32: As discussed in the Introduction, the frequency of the Δ32 allele at CCR5 exceeds 10% in European populations, yet it appears to be relatively young. Stephens et al. (1998) analyzed two microsatellite markers closely linked to CCR5, one denoted GAAT and the other AFMB. Stephens et al. surveyed 46 chromosomes carrying Δ32 and found that 44 carried the 197 allele at GAAT and 41 carried the 215 allele at AFMB.
We analyze the two markers separately to illustrate the use of our method. For GAAT, i = 46, j_{o} = 44, and q = 0.685 (the frequency of 197 on chromosomes not carrying Δ32). Stephens et al. (1998) estimated the recombination rate between CCR5 and GAAT to be 0.0021 using a radiation hybrid map. They assumed a mutation rate of μ = 0.001 away from 197. Combining these numbers, u = c(1  q) + μ = 0.0021(1  0.685) + 0.001 = 0.00166, the net rate of loss of 197 alleles on Δ32bearing chromosomes. The rate of gain is v = 0.0021 × 0.685 = 0.00144, ignoring any gain by mutation. Using these as the two parameters of the model of a biallelic linked marker locus and assuming r = 0.002 and N_{0} = 2 × 10^{8}, we find the tail probability, P = Pr(j ≥ j_{o}) = 2.2 × 10^{12}. Our analysis strongly supports the conclusion of Stephens et al. (1998) that Δ32 is not neutral.
The parameter values needed to carry out this analysis are not known precisely, so it is important to test the sensitivity of the conclusions to variation in their values. In this case, we can reject neutrality for a wide range of parameter values. The P value is most sensitive to changes in u and v. Even if they are reduced by a factor of 10 (u = 0.000166 and v = 0.000144), P is still only 5.3 × 10^{4}. Changes in the model of growth and the rate of recent growth have relatively little effect. For example, if the double exponential model is used with r_{1} = 0.05, r_{2} = 0.001, and t^{*} = 15 generations, P = 1.7 × 10^{12} with u = 0.00166 and v = 0.00144.
The data for the other marker locus, AFMB, also allow us to reject neutrality. In this case, i = 46, j_{o} = 41, q = 0.521, and Stephens et al.’s (1998) estimate of c is 0.0093 (u = 0.00555 and v = 0.00485, assuming mutation at rate 0.001 away from allele 215). With exponential growth and r = 0.002, P = 3.2 × 10^{9}. Reducing u and v each by a factor of 10 increases P to only 8.9 × 10^{7}. Assuming double exponential growth with r_{1} = 0.05, r_{2} = 0.001, and t^{*} = 15 generations, P = 3 × 10^{9}.
We conclude, then, that data from the two marker loci closely linked to CCR5 provide strong evidence that the Δ32 allele has not been neutral in European populations. Slatkin (2001) describes a way to estimate the selection intensity from these data and estimates the selection coefficient in favor of Δ32 to be at least 0.2, comparable with the value estimated by Stephens et al. (1998).
MLH1: MLH1 is one of several DNA mismatch repair genes associated with hereditary nonpolyposis colorectal cancer (HNPCC). Moisio et al. (1996) found relatively long conserved haplotypes associated with two recurrent alleles at MLH1, denoted MLH1^{*}1 and MLH1^{*}2, in HNPCC families in Finland and concluded that both were relatively young, 1643 generations for MLH1^{*}1 and 521 generations for MLH1^{*}2. Here, we consider MLH1^{*}1 in more detail and show that it has probably been selected.
Moisio et al. (1996) found almost the same haplotype at four microsatellite markers (loci designated 1612, 1611, 1298, and 3527 in their Figure 1) surrounding MLH1^{*}1 on 19 chromosomes from unrelated families. Two of these 19 chromosomes differed at one marker locus each, almost certainly as a result of mutation and not recombination. Moisio et al. estimated the total recombination distance between these markers to be 7.1 cM. If we assume 1 cM is equivalent to 10^{6} bases, the length of the conserved haplotype is l_{o} = 7.1 × 10^{6} bases, and the recombination rate per base, ρ, is 10^{8}. The actual relationship between centimorgans and base pairs does not affect the results of our analysis, because in our model only the product, ρl_{o}, which is the total map length spanned by the conserved haplotype (0.071 in this case), enters the calculations. In our notation, i, the number of independent chromosomes sampled, is 19.
The population of Finland has been relatively isolated and has grown rapidly in the recent past (Peltonenet al. 1995). It was founded roughly 2000 years ago by a small group and then grew to its present size of ∼6 million. Using the result that the effective size of human populations is about onethird of the census size (Felsenstein 1971), we assume a current effective population size of N_{0} = 2 × 10^{6} and an effective size of the founding population of 1000. If there had been continuous exponential growth for 100 generations, r = 0.099. A more accurate model is one that allows for more rapid growth in the recent past. S. K. Service and N. B. Freimer (unpublished data) estimate that the effective population size was ∼300,000 in 1700. That would be consistent with the double exponential model with r_{2} = 0.067 before 1700 (t^{*} = 15 generations) and r_{1} = 0.19 in the past 300 years.
The frequency of MHL1^{*}1 is difficult to measure because it is so rare. Aaltonen et al. (1998) estimated the frequency to be ∼0.0004 based on a screen of individuals with HNPCC. P. Peltomaki (personal communication) found, however, no copies of MLH1^{*}1 in 2351 healthy anonymous blood donors from the “high risk” region described by Moisio et al. (1996), implying that the frequency is <0.0004. In our analysis, we assume p = 0.0001, which leads to a conservative test. A higher frequency makes it more likely to reject neutrality.
With these parameter values, our test strongly rejects neutrality of MLH1^{*}1 for both models of population growth. The tail probabilities under neutrality are P = 4.5 × 10^{7} for exponential growth and P = 1.6 × 10^{5} for double exponential growth. Even if the estimated length of the conserved haplotype is smaller by a factor of two (l_{o} = 3550 kb), neutrality is still rejected: P = 0.0048 for the double exponential model with r_{1} = 0.19, r_{2} = 0.067, and t^{*} = 15. And even allowing for more rapid recent growth, r_{1} = 0.3 (implying a smaller founding population) and assuming l_{o} = 3550 kb, P is still only 0.023.
Another way to explore the robustness of the result is to ask how small l_{o} has to be to not reject neutrality of MLH1^{*}1. We found P = 0.066 if l_{o} = 2000 kb, r_{1} = 0.19, r_{2} = 0.067, and t^{*} = 15. We can conclude that if estimated size of the conserved haplotype, 7.1 cM, is accurate to within a factor of 2, then MLH^{*}1 has been selected in the recent past.
PAH: Defective alleles at the locus PAH are responsible for phenylketonuria (PKU; Bickelet al. 1981). More than 200 causative alleles are known, and many recurrent alleles have been tested for the extent of intraallelic variability at an intronic microsatellite locus. There is no consensus about the effect of alleles associated with PKU on heterozygous carriers. Lewis (1997, p. 248) says that heterozygous women have a lower rate of spontaneous abortion, possibly because of increased resistance to a fungal toxin, but Vogel and Motulsky (1996, p. 289) say that there may be an increased risk of stillbirth and spontaneous abortion.
We analyzed data for 11 recurrent alleles at PAH, listed in Table 1. For each allele, we used only those chromosomes with the same combination of restriction fragment length polymorphism (RFLP) and variable number of tandem repeats (VNTR) haplotypes to ensure a single origin. Intraallelic variability was measured at an intronic tetranucleotide microsatellite. To avoid assuming a particular model of mutation at the microsatellite locus, we used the infinite sites model. For each allele, the value of S_{o} was one less than the number of alleles at the microsatellite locus. The frequency of each allele was estimated from the frequency among PKU chromosomes under the assumption that the overall frequency of PKU chromosomes is 0.01 (Bickelet al. 1981). When available, samples from the same allele in different data sets were pooled. The data were assembled from several published surveys and from the PAH web site, http://www.mcgill.ca/pahdb. Two different mutation rates for the microsatellite marker were assumed: 0.0021, based on the study by Weber and Wong (1993), and 0.00028, based on the study by Chakraborty et al. (1997). When a single growth rate, r = 0.008, was assumed, we rejected neutrality for 9 of 11 alleles under the higher mutation rate but for only 1 of 11 alleles at the lower mutation rate (Table 1). Similar results were obtained for the double exponential model with r_{1} = 0.04, r_{2} = 0.002, and t^{*} = 150.
We proceeded on the assumption that the lower mutation rate, μ = 0.00028, is more appropriate, assuming in effect that these alleles are neutral or nearly so. We calculated the likelihood of r_{1} in the double exponential model (with r_{2} = 0.002 and t^{*} = 150) for each of the 11 alleles. Figure 2 shows four of the likelihood curves obtained. The rest are similar. We tested the set of 11 alleles for homogeneity using the likelihoodratio test described above. On the 11 tests, 3 resulted in values of χ^{2} that would indicate significant heterogeneity at the 5% level assuming a
If we combine information across alleles, then we get a good estimate of growth rate under the assumption that they are neutral. For μ = 0.00028, and assuming the double exponential model with r_{2} = 0.002 and t^{*} = 150 generations, the joint likelihood function leads to an MLE of r_{1} of rˆ_{1} = 0.027 with a support interval of (0.0170.037). Figure 3 shows the joint likelihood for two mutation rates at the microsatellite locus. If μ = 0.0021 and the alleles are neutral, then the average recent growth rate, r_{1}, would have to be ∼0.15 per generation, which is probably too high.
Because we do not know the appropriate mutation rate at the linked marker and because our model of past population growth is necessarily simplified, we cannot draw any strong conclusions about alleles at PAH or about recent growth of the European populations from which these data were obtained. We present these results to illustrate the use of our method. If the mutation rate at the linked microsatellite is close to 0.00028, then these alleles are neutral or nearly so. When data are combined across alleles, the resulting estimate of the recent growth rate of European populations is not obviously wrong. If the mutation rate at the microsatellite locus is 0.0021, then most or all of these alleles are not neutral and the resulting estimate of r_{1} based on the joint likelihood is not accurate because its value reflects both past population growth and past selection. Data from additional markers linked to PAH and data from other loci will be necessary to determine which of these conclusions is more nearly correct.
CFTR: Alleles at CFTR are responsible for cystic fibrosis (CF). More than 800 causative recessive alleles are known. Until recently CF caused death in early childhood. CF is relatively common in Europeans, with a frequency of 1 in 2500 births implying a frequency of heterozygous carriers of roughly 1/25. Approximately twothirds of the causative alleles in northwestern Europe are ΔF508, which is a 3bp deletion resulting in the loss of the 508th amino acid in the CFTR protein (Rommenset al. 1989). An important question about alleles at CFTR, and in particular about ΔF508, is whether their relatively high frequency results from a selective advantage to heterozygous carriers (Morton 1968).
For our analysis we used intraallelic variation at three intronic microsatellite loci that have been studied extensively. Data were accumulated from several published articles and from the online CFTR database, http://www.genet.sickkids.on.ca/cftr. On the basis of the observation that no mutations were detected in 3000 meioses, Kaplan et al. (1994) inferred that the combined mutation rate at the three loci together did not exceed 0.001. We assumed the infinite sites model for the three markers together and inferred the minimum number of mutations, S_{o}, consistent with the data. For all but ΔF508, S_{o} was one less than the number of distinct marker haplotypes associated with each allele. For ΔF508, S_{o} was estimated by using a parsimony algorithm to infer the minimum number of mutations necessary to generate the observed haplotypes. The method was the same as that employed by Morral et al. (1994). To allow for the possibility that S_{o} = 59 is too small, we performed the same calculations for S_{o} = 118.
We tested for neutrality of five alleles at CFTR. Parameters and results are shown in Table 2. Assuming the lower mutation rate, μ = 0.0005, does not reject neutrality for any of the four less frequent alleles. Even for the higher rate, μ = 0.001, neutrality is rejected only for the model of simple exponential growth. In contrast, neutrality of ΔF508 was rejected in all cases, even with S_{o} = 118.
We computed the likelihood of population growth rate for each allele, as shown in Figure 4. The mutation rate, μ = 0.00028, was used because the joint likelihood for the four alleles other than ΔF508 led to an MLE of r_{1} of 0.023, which is roughly the same as that obtained from PAH. If these four alleles at CFTR and the 11 alleles at PAH are neutral, the three markers linked to CFTR have a lower per locus mutation rate than the single marker linked to PAH. Once again, this conclusion is tentative because of our uncertainty both about the neutrality of the alleles and about the mutation rates of the marker loci.
We used the likelihoodratio test of homogeneity to determine whether ΔF508 differs significantly from the others. Table 3 shows the values of χ^{2} obtained in each of the tests of homogeneity performed for the two mutation rates and two values of S_{o}. For all four sets of parameter values, the range of χ^{2} values for the other four alleles is comparable to that found in the tests of homogeneity for the 11 alleles at PAH. And for all four sets of parameter values, χ^{2} is substantially larger for ΔF508, which is consistent with it having experienced different selection. That conclusion is consistent with the tests of neutrality summarized in Table 2. Figure 4 shows one set of likelihood curves that illustrates the extent of difference in the likelihood curves. Our conclusion agrees with that of Wiuf (2001), who used a birthdeath model to approximate the dynamics of ΔF508 and concluded that it was positively selected in the past few thousand years.
As in the case of PAH, our conclusions are tentative because they depend on assumed values of parameters and on a simplified model of past population growth. The results for CFTR are compatible with those for PAH if a lower per locus mutation is assumed for the microsatellite loci linked to CFTR and if only ΔF508 has been subject to significant selection. Additional data will be needed before any firmer conclusions can be drawn.
DISCUSSION AND CONCLUSIONS
We have shown that intraallelic variability combined with allele frequency provides information about the history of selection and population growth experienced by an allele. Our methods, like any methods of historical inference in population genetics, rely on assumptions that cannot be literally true. Populations are not randomly mating and have not undergone continuous exponential growth for long periods of time. The role of theory in this case is to find what models are sufficient to account for observations and to provide tentative conclusions that have to be subject to further testing when new data become available.
One assumption of our analysis is random mating, but that is not an important limitation to the analysis of lowfrequency alleles. The dynamics of alleles in low frequency are well approximated by a birthdeath process, which assumes that each copy of an allele reproduces independently of all others (Wiuf 2000) and hence independently of which subpopulation it and other copies are in. Therefore, our results are applicable to samples pooled from different geographic regions.
Our analysis shows that CCR5Δ32 and MLH1^{*}1 have probably been selected in the recent past. That conclusion is not sensitive to variation in assumed parameter values. Our result for Δ32 agrees with that of Stephens et al. (1998). Our result for MLH1^{*}1 is similar to that obtained by Slatkin (2000) for two lowfrequency alleles at BRCA1. Even though these diseaseassociated alleles at BRCA1 and MLH1 are quite rare, the level of intraallelic variability associated with them is too low to be consistent with neutrality.
Our analysis of PAH and CFTR shows how information can be combined across alleles at a locus and across loci. Differences among alleles at a locus can indicate the possible effect of selection. If alleles at a locus are neutral, then the joint likelihood provides a good estimate of population growth rate. Neutral alleles at different loci will provide independent estimates of growth rate.
Acknowledgments
We thank Dr. P. Peltomaki for information about the frequency of mutants of MLH1 in the Finnish population, and Drs. L. Excoffier and C. Wiuf for helpful comments on an earlier version of this article. This research was supported in part by National Institutes of Health grant GM40282 to M. Slatkin.
Footnotes

Communicating editor: Y.X. Fu
 Received July 15, 2000.
 Accepted February 28, 2001.
 Copyright © 2001 by the Genetics Society of America