Originally published as Genetics Published Articles Ahead of Print on June 11, 2007.
Genetics, Vol. 176, 2441-2449, August 2007, Copyright © 2007
doi:10.1534/genetics.106.068585
Testing the Extreme Value Domain of Attraction for Distributions of Beneficial Fitness Effects
Craig J. Beisel*, Darin R. Rokyta*, Holly A. Wichman* and Paul Joyce*,1
Department of Mathematics,
Department of Biological Sciences and * Initiative for Bioinformatics and Evolutionary Studies (IBEST), University of Idaho, Moscow, Idaho 83844
1 Corresponding author: Department of Mathematics, University of Idaho, 413 Brink Hall, Moscow, ID 83844-1103.
E-mail: joyce{at}uidaho.edu
ABSTRACT
In modeling evolutionary genetics, it is often assumed that mutational effects are assigned according to a continuous probability distribution, and multiple distributions have been used with varying degrees of justification. For mutations with beneficial effects, the distribution currently favored is the exponential distribution, in part because it can be justified in terms of extreme value theory, since beneficial mutations should have fitnesses in the extreme right tail of the fitness distribution. While the appeal to extreme value theory seems justified, the exponential distribution is but one of three possible limiting forms for tail distributions, with the other two loosely corresponding to distributions with right-truncated tails and those with heavy tails. We describe a likelihood-ratio framework for analyzing the fitness effects of beneficial mutations, focusing on testing the null hypothesis that the distribution is exponential. We also describe how to account for missing the smallest-effect mutations, which are often difficult to identify experimentally. This technique makes it possible to apply the test to gain-of-function mutations, where the ancestral genotype is unable to grow under the selective conditions. We also describe how to pool data across experiments, since we expect few possible beneficial mutations in any particular experiment.
ADAPTATION at the molecular level involves the fixation of beneficial mutations over time through natural selection. Although there is an extensive body of theoretical work on the process of natural selection, the overall process of adaptation is not as well characterized theoretically (ORR 2005a). Because beneficial mutations are rare and often have small effects, theoreticians have little information on the raw material upon which natural selection acts. Despite this dearth of data, there has been a recent emergence of work on the theory of adaptation (GILLESPIE 1983, 1984, 1991; ORR 2002, 2003a, 2005b, 2006b; ROKYTA et al. 2006). Building on ideas of a sequence space originally proposed by MAYNARD SMITH (1962, 1970), Gillespie introduced the primary assumption leading to much of the current theoretical work by arguing that extreme value theory (EVT) can help circumvent our lack of information regarding the nature of beneficial mutations. GILLESPIE (1983, 1984, 1991) posited that, given an initial genotype, one can imagine the fitnesses of mutants being drawn from some underlying probability distribution. The majority of these mutations will be neutral, deleterious, or even lethal to the organism and only a small number will be beneficial. The fitness effects assigned to these mutations can thus be assumed to reside in the extreme right tail of the underlying fitness distribution.
Assuming that the fitnesses of interest lie in the right tail of the fitness distribution allows the use of EVT to predict the characteristics of beneficial mutations. Theoretical work thus far, however, has relied on one further assumption. It has always been assumed that the underlying fitness distribution is in the Gumbel domain of attraction. If this is true, EVT shows that the limiting distribution of the tail is exponential. However, this need not be the case. In fact, EVT describes three types of limiting tail distributions (i.e., domains of attraction), and furthermore, not all distributions have even a limiting tail distribution. This choice of the Gumbel domain has been rationalized by arguing that the other two types are biologically unreasonable (ORR 2006a), and some theoretical (ORR 2006a), computational (COWPERTHWAITE et al. 2005), and empirical data (KASSEN and BATAILLON 2006) support this contention.
As the Gumbel domain of attraction is such a prominent component of the current theory of adaptation, it is necessary to provide a thorough empirical test of this assumption. Any attempt to test the Gumbel hypothesis ultimately reduces to a determination of whether or not data from the extreme right tail of a distribution appear to be exponential. Of course, in the construction of any statistical test, the appropriate alternative hypothesis must be determined. Commonly, in tests for exponentiality, the alternative selected is the gamma distribution (e.g., KASSEN and BATAILLON 2006), which has the attractive property that it subsumes the exponential distribution as a special case. However, since there seems to be little reason to doubt that fitnesses of genotypes possessing beneficial mutations can be considered draws from the tail of the fitness distribution, this may not be the most appropriate alternative hypothesis. In fact, the test of interest is whether the Gumbel domain is the correct domain for the unknown fitness distribution. The gamma distribution is in the Gumbel domain, so at best, testing against the gamma might provide information on whether the observed fitness values are indeed drawn from the tail.
According to EVT, there are three domains of attraction: the Gumbel, Fréchet, and Weibull domains (Figure 1). All distributions in the Gumbel domain have the exponential as the limiting distribution of their tail. This domain contains the majority of well-known distributions such as the normal, the exponential, and the gamma. The Fréchet domain contains distributions whose tails are unbounded and heavier than the exponential (e.g., the Cauchy distribution). The Weibull domain (distinct from the Weibull distribution, which in fact belongs to the Gumbel domain) contains distributions with lighter tails than the exponential, which possess a finite upper bound (e.g., the uniform distribution).
There are two standard approaches to EVT. The classical approach considers the distribution of the largest value, the second largest value, etc. This naturally leads to results on the distribution of spacings between consecutive extreme observations. These results are leveraged by Gillespie and Orr in their theory of adaptation, as the spacings can be used to calculate fitness effects. However, when considering alternative models, the distribution of the spacings is not straightforward. A more natural approach is to consider the distribution of values above a threshold, i.e., the wild type. This is often referred to as the "peaks over threshold" approach. Under this framework all three domains can be described by a single family of distributions called the generalized Pareto distribution (GPD) (PICKANDS 1975).
The cumulative distribution function for the GPD is given by

$$F(x \mid \tau, \kappa) = \begin{cases} 1 - \left(1 + \dfrac{\kappa x}{\tau}\right)^{-1/\kappa}, & x \ge 0 \ \text{if } \kappa > 0, \\ 1 - \left(1 + \dfrac{\kappa x}{\tau}\right)^{-1/\kappa}, & 0 \le x < -\tau/\kappa \ \text{if } \kappa < 0, \\ 1 - e^{-x/\tau}, & x \ge 0 \ \text{if } \kappa = 0, \end{cases} \tag{1}$$

and the probability density function by

$$f(x \mid \tau, \kappa) = \begin{cases} \dfrac{1}{\tau}\left(1 + \dfrac{\kappa x}{\tau}\right)^{-(\kappa+1)/\kappa}, & x \ge 0 \ \text{if } \kappa > 0, \\ \dfrac{1}{\tau}\left(1 + \dfrac{\kappa x}{\tau}\right)^{-(\kappa+1)/\kappa}, & 0 \le x < -\tau/\kappa \ \text{if } \kappa < 0, \\ \dfrac{1}{\tau}\, e^{-x/\tau}, & x \ge 0 \ \text{if } \kappa = 0, \end{cases} \tag{2}$$

with scale parameter $\tau$ and shape parameter $\kappa$, commonly referred to as the tail index, which specifies the weight of the tail. Note that the exponential distribution is nested in the GPD as a special case when $\kappa$ = 0, making it an ideal candidate for likelihood-ratio methods. Further, the domain of attraction is exactly determined by the shape parameter $\kappa$, the case $\kappa$ = 0 corresponding exactly to the Gumbel, $\kappa$ > 0 to the Fréchet, and $\kappa$ < 0 to the Weibull domain of attraction (CASTILLO 1988). EVT assures us that these other domains are the best alternative to the Gumbel hypothesis of an exponential tail and therefore will produce the most powerful statistical tests, provided the data lie in the extreme right tail of some underlying distribution falling into one of the three domains of attraction.

In terms of empirical data, we can assume that the threshold is the fitness of the wild type and that the fitness effects of beneficial mutations follow some form of the GPD. We imagine that the methodology we present herein will prove to be most useful for microbial experimental evolution experiments, although the test is applicable to other forms of data. Particularly in viral experimental evolution, it is possible to isolate multiple beneficial mutations arising from the same ancestral genotype, as well as identify the mutations involved (e.g., BULL et al. 2000; ROKYTA et al. 2005). Yet in these experiments, there is an inherent difficulty associated with testing for the domain of attraction for the fitness distribution. As we are assuming that our observations represent extreme values from the tail of a distribution, we can expect only a small number of unique beneficial mutations. However, having only a handful of beneficial mutations implies low statistical power to perform a test for domain of attraction. The usual solution to this problem is to collect more data, but as the number of beneficial mutations is small, this results in little or no improvement in statistical power, since replicating will tend to produce the same mutants. As any test with a low number of observations will be plagued by at best mediocre power, we present methods that allow for an additional increase in power through pooling data across distinct experiments. We also address the issue of missing small-effect mutations.
Since identifying beneficial mutations experimentally involves selecting for them, it may prove difficult to identify those mutations with very small beneficial fitness effects. This bias toward seeing larger-effect mutations could have profound effects on the outcome of the data analysis. Under our statistical framework, accounting for this turns out to require only a simple shift of the data.
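As a concrete illustration of the GPD family described above, the following minimal Python sketch evaluates its distribution and density functions and reports the domain of attraction implied by the shape parameter. The function names and the argument names `tau` (scale) and `kappa` (shape) are assumptions made for this illustration; they are not taken from the authors' software.

```python
import math

def gpd_cdf(x, tau, kappa):
    """CDF of the generalized Pareto distribution (scale tau > 0, shape kappa)."""
    if x <= 0:
        return 0.0
    if kappa == 0:
        return 1.0 - math.exp(-x / tau)      # exponential special case
    z = 1.0 + kappa * x / tau
    if z <= 0:                               # only for kappa < 0: x is past the bound -tau/kappa
        return 1.0
    return 1.0 - z ** (-1.0 / kappa)

def gpd_pdf(x, tau, kappa):
    """Density of the generalized Pareto distribution (scale tau > 0, shape kappa)."""
    if x < 0:
        return 0.0
    if kappa == 0:
        return math.exp(-x / tau) / tau
    z = 1.0 + kappa * x / tau
    if z <= 0:
        return 0.0
    return z ** (-(kappa + 1.0) / kappa) / tau

def domain_of_attraction(kappa):
    """Map the sign of the shape parameter to the EVT domain of attraction."""
    if kappa == 0:
        return "Gumbel"
    return "Frechet" if kappa > 0 else "Weibull"

# kappa = -1 gives the uniform distribution on [0, tau]; kappa = 0 gives the exponential.
print(gpd_cdf(0.5, 1.0, -1.0), domain_of_attraction(-1.0))
```

Treating kappa = 0 as an explicit branch avoids dividing by zero; for shape values very close to zero a numerically careful implementation would probably switch to the exponential form as well.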
STRUCTURE OF THE DATA
Censored data:
Consider the fitness distribution of all one-step mutations from some genotype. Let i be the rank of the wild type. Thus, from the ancestral genotype, there are a total of i – 1 beneficial mutations with selection coefficients in rank order s = s1, s2,..., si–1, which are drawn from an unknown probability density f(s) with cumulative distribution F(s). Thus, s1 is the selection coefficient for the largest-effect mutation, s2 is the selection coefficient for the second largest, etc. Note that selection coefficients are just the fitness difference relative to the wild type normalized to the wild-type fitness. All of what follows works equivalently if selection coefficients are replaced with fitness effects. Now suppose that due to the experimental protocol, it is not possible to observe all possible mutations, only the largest n of the total i – 1, with selection coefficients s1, s2,..., sn, where n < i – 1. In other words, we failed to observe the leftmost selection coefficients sn+1, sn+2,..., si–1 in the collected data set. Denote the observed selection coefficients as sn = (s1, s2,..., sn).
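Using standard results from order statistics (see RICE 1995, p. 100), the distribution of $\mathbf{s}_n = (s_1, s_2, \ldots, s_n)$ depends on i and is given by

$$f(\mathbf{s}_n \mid i) = \frac{(i-1)!}{(i-1-n)!}\,\left[F(s_n)\right]^{\,i-1-n} \prod_{j=1}^{n} f(s_j), \qquad s_1 > s_2 > \cdots > s_n. \tag{3}$$

To remove the dependence on i, we shift the data relative to the smallest observed selection coefficient,

$$x_j = s_j - s_n, \qquad j = 1, 2, \ldots, n-1, \tag{4}$$

and condition on $s_n$. The marginal density of $s_n$ is

$$f(s_n \mid i) = \frac{(i-1)!}{(n-1)!\,(i-1-n)!}\,\left[F(s_n)\right]^{\,i-1-n}\left[1 - F(s_n)\right]^{\,n-1} f(s_n),$$

so the conditional density of the remaining observations is free of i,

$$f(s_1, \ldots, s_{n-1} \mid s_n) = (n-1)! \prod_{j=1}^{n-1} \frac{f(s_j)}{1 - F(s_n)}, \tag{5}$$

or, in terms of the shifted data,

$$g(x_1, \ldots, x_{n-1} \mid s_n) = (n-1)! \prod_{j=1}^{n-1} \frac{f(x_j + s_n)}{1 - F(s_n)}. \tag{6}$$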
Applying (6) to the GPD yields a curious result. As has been previously noted (CASTILLO and HADI 1997), the GPD shape parameter $\kappa$ is stable with respect to shifts in the threshold. Thus, upon shifting the threshold, $\kappa$ remains the same and only the scale $\tau$ changes. For our purposes, we are concerned only with the shape, as it determines the domain of attraction. To see this, suppose that f(s) is given by (2); then

$$\frac{f(x + s_n)}{1 - F(s_n)} = \frac{1}{\tau + \kappa s_n}\left(1 + \frac{\kappa x}{\tau + \kappa s_n}\right)^{-(\kappa+1)/\kappa},$$

which is again a GPD density with the same shape $\kappa$ and with scale $\tau + \kappa s_n$. When $\kappa$ = 0 this result simply states the well-known memoryless property for the exponential. This implies that the likelihood function for shifted data is of the same form as for the unshifted data, differing only by a factor of (n – 1)!, which cancels out in a likelihood ratio and is needed only in the case of ordered observations.
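The shift of the data and the stability of the shape parameter can be checked numerically. The sketch below is illustrative only: the helper names are hypothetical, `tau` and `kappa` denote the GPD scale and shape, and the closed form being checked (a shifted GPD is again a GPD with scale tau + kappa * s_n) follows from the algebra of the GPD density rather than from any code supplied with the article.

```python
import math

def gpd_pdf(x, tau, kappa):
    """GPD density with scale tau and shape kappa (x assumed inside the support)."""
    if kappa == 0:
        return math.exp(-x / tau) / tau
    return (1.0 + kappa * x / tau) ** (-(kappa + 1.0) / kappa) / tau

def gpd_sf(x, tau, kappa):
    """GPD survival function 1 - F(x)."""
    if kappa == 0:
        return math.exp(-x / tau)
    return (1.0 + kappa * x / tau) ** (-1.0 / kappa)

def shift_to_smallest(s):
    """Shift selection coefficients relative to the smallest observed one.

    Returns the n - 1 shifted values, largest first; the smallest observation
    becomes the new threshold and is dropped.
    """
    ordered = sorted(s, reverse=True)
    s_n = ordered[-1]
    return [s_j - s_n for s_j in ordered[:-1]]

def conditional_density(x, s_n, tau, kappa):
    """Density of an exceedance x above the new threshold s_n."""
    return gpd_pdf(x + s_n, tau, kappa) / gpd_sf(s_n, tau, kappa)

tau, kappa, s_n, x = 1.0, 0.3, 0.4, 0.7
lhs = conditional_density(x, s_n, tau, kappa)
rhs = gpd_pdf(x, tau + kappa * s_n, kappa)   # same shape, rescaled
print(abs(lhs - rhs) < 1e-12)
print(shift_to_smallest([0.05, 0.30, 0.12]))
```

The same identity holds for negative shape values as long as both x + s_n and x remain inside the respective supports.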
LIKELIHOOD-RATIO TEST
Likelihood-ratio test:
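After shifting the selection coefficients appropriately (see Equation 4), we can view the data X = (X1, X2,..., Xn–1) as a random sample of n – 1 observations from the GPD. The log-likelihood function under the GPD is given by

$$\ell(\tau, \kappa \mid X) = \begin{cases} -(n-1)\ln\tau - \dfrac{\kappa+1}{\kappa}\displaystyle\sum_{j=1}^{n-1} \ln\!\left(1 + \dfrac{\kappa X_j}{\tau}\right), & \kappa \ne 0, \\ -(n-1)\ln\tau - \dfrac{1}{\tau}\displaystyle\sum_{j=1}^{n-1} X_j, & \kappa = 0. \end{cases} \tag{7}$$

The likelihood is maximized first under the null model where $\kappa$ = 0 and then under the alternative model where $\kappa$ is unrestricted. The LRT statistic is usually calculated on the log scale and with the standard formulation,

$$-2\ln\Lambda = -2\left[\ell(\hat{\tau}_0, \kappa = 0 \mid X) - \ell(\hat{\tau}, \hat{\kappa} \mid X)\right], \tag{8}$$

where $\hat{\tau}_0$ is the maximum-likelihood estimate (MLE) for $\tau$ under the exponential model, and $\hat{\tau}$ and $\hat{\kappa}$ are the maximum-likelihood estimates under the full GPD.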
Although often $-2\ln(\Lambda)$ asymptotically follows a $\chi^2$-distribution, we do not know the sample sizes for which this approximation is appropriate. Instead, the distribution of the test statistic can be generated using parametric bootstrap based specifically on the size of a particular sample. First, the MLEs of the parameters are found under the restricted model, which in our case is the scale parameter of the exponential. A data set is generated under the null model using this estimated parameter. The LRT is performed on this simulated data set and the test statistic $-2\ln(\Lambda)$ is calculated. This procedure for the calculation of the test statistic is replicated to generate an empirical distribution of the test statistic from which an approximation of the P-value is obtained.

The parametric bootstrap approach approximates the distribution of $-2\ln(\Lambda)$ in two ways. Because the approach is based on simulation of data, there is simulation error. However, this error is controllable, as we can obtain any degree of accuracy needed by increasing the number of bootstrap replicates. The second way in which the parametric bootstrap approximates the true distribution of $-2\ln(\Lambda)$ is that we simulate using the estimated parameter $\hat{\tau}$ rather than the unknown true parameter $\tau$. For small sample sizes, the low accuracy of the estimate could affect the approximation. There are ways to adjust the parametric bootstrap approach to account for this error, but these adjustments are not necessary for the problem at hand. In general, the fact that $\tau$ is a scale parameter would imply that the distribution of Xj/$\tau$ is independent of $\tau$. Specifically, under the null model, Xj/$\tau$ follows the standard exponential distribution with mean one. Note that the likelihood of the data under every form of the GPD (Equation 2) is a function of Xj/$\tau$, so the distribution of $\Lambda$ does not depend on $\tau$. (However, […] does depend on […], but the dependence is so weak that it can be ignored, even for small sample sizes).
Care must be taken when applying likelihood theory to the Weibull domain of attraction ($\kappa$ < 0). Here, the truncation point depends on the parameters to be estimated. In the statistical literature this is referred to as a range-dependent model. It is well known that standard asymptotic theory does not apply for range-dependent models. This issue for parameter estimation under maximum likelihood has been previously noted for the GPD (SMITH 1985). However, since we are using parametric bootstrap, we do not rely on asymptotic theory. Also, note that if $\kappa$ < –1, the likelihood can become infinite due to the distribution increasing the weight on the rightmost observation, and therefore the maximum-likelihood estimate does not exist. The problem can be remedied by restricting $\kappa$ > –1, which excludes "reverse" tails (Figure 1) from consideration. If the true value of $\kappa$ is indeed < –1, the likelihood-ratio test nearly always leads to rejection of the null model $\kappa$ = 0. This restriction is conservative and has little effect on the analysis (see Power analysis).
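The likelihood-ratio test and its parametric bootstrap can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: a coarse grid search over the ratio `a = kappa / tau` (a profile likelihood) stands in for the Nelder–Mead optimization the article reports using, the restriction is applied as kappa >= -1, the P-value uses the common (count + 1) / (replicates + 1) convention, and all function names are hypothetical. The input `x` is taken to be the already shifted data.

```python
import math
import random

def exp_loglik_max(x):
    """Maximized log-likelihood under the null (kappa = 0); the MLE of tau is the mean."""
    n = len(x)
    return -n * math.log(sum(x) / n) - n

def gpd_profile_loglik(x, a):
    """Log-likelihood maximized over kappa for a fixed ratio a = kappa / tau.

    Returns None when a is inadmissible or when the implied kappa is below -1.
    """
    n = len(x)
    if a == 0 or any(1.0 + a * xi <= 0 for xi in x):
        return None
    kappa = sum(math.log1p(a * xi) for xi in x) / n   # MLE of kappa given a
    if kappa < -1.0:                                  # likelihood is unbounded below -1
        return None
    tau = kappa / a                                   # always positive
    return -n * math.log(tau) - n * (kappa + 1.0)

def lrt_statistic(x, grid=200):
    """-2 ln(Lambda) for H0: kappa = 0 against an unrestricted GPD shape."""
    n = len(x)
    mean, xmax = sum(x) / n, max(x)
    null = exp_loglik_max(x)
    best = null                                       # the null model is nested in the GPD
    candidates = [-(k / (grid + 1.0)) / xmax for k in range(1, grid + 1)]         # kappa < 0
    candidates += [10 ** (-3 + 5.0 * k / grid) / mean for k in range(grid + 1)]   # kappa > 0
    for a in candidates:
        value = gpd_profile_loglik(x, a)
        if value is not None and value > best:
            best = value
    return 2.0 * (best - null)

def bootstrap_p_value(x, replicates=500, seed=1):
    """Parametric bootstrap P-value: simulate exponential data at the fitted scale."""
    rng = random.Random(seed)
    n, tau_hat = len(x), sum(x) / len(x)
    observed = lrt_statistic(x)
    exceed = 0
    for _ in range(replicates):
        sim = [rng.expovariate(1.0 / tau_hat) for _ in range(n)]
        if lrt_statistic(sim) >= observed:
            exceed += 1
    return (exceed + 1) / (replicates + 1)

shifted = [0.1 * k for k in range(1, 11)]   # evenly spread, bounded-looking toy data
print(lrt_statistic(shifted), bootstrap_p_value(shifted, replicates=200))
```

A grid is crude but keeps the kappa >= -1 restriction trivial to enforce; a production analysis would refine the maximum with a proper optimizer and use many more bootstrap replicates.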
Pooling data across experiments:
The problem of low power is inherent in any statistical test involving extreme values. The nature of extreme value theory dictates that when observing data from the extreme right tail of a distribution, we will have relatively few observations. The observation of a large sample contraindicates this assumption. This presents a problem in an experiment that relies on the observation of adaptive mutations. Not only is there an expectation of a small number of observations, the number of possible observations is actually fixed. This prevents the standard solution of increasing power through the collection of additional observations through replicate experiments. As the number of replicate experiments is increased, data collection suffers from a diminishing return; previously observed mutants will occur more often, culminating at a point where all possible mutants have been observed and no new information can be gained. Alternatively, there may be enough beneficial mutations to achieve a reasonable amount of power, but many of these mutations will be of small effect and require a prohibitive number of replicate experiments before they will be observed.
Seemingly caught in a catch-22 (HELLER 1961), the only hope for increasing power is through pooling data across nonreplicate experiments with different ancestral genotypes or different environmental conditions. In this case the data can now be thought of as an array of observations, where Xj,k represents the jth fitness effect from the kth experiment. Consider a total of m experiments. The formal hypothesis test is of the form
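$$H_0: \kappa_k = 0 \ \text{for all } k = 1, \ldots, m \qquad \text{vs.} \qquad H_A: \kappa_k \ne 0 \ \text{for some } k.$$

The test statistic sums the log-likelihood contributions of the individual experiments,

$$-2\ln\Lambda = -2\sum_{k=1}^{m}\left[\ell(\hat{\tau}_{0,k}, \kappa = 0 \mid X_k) - \ell(\hat{\tau}_k, \hat{\kappa}_k \mid X_k)\right], \tag{9}$$

where $X_k$ are the observed fitnesses for the kth experiment, $(\hat{\tau}_k, \hat{\kappa}_k)$ are the parameter estimates under the GPD for the kth experiment, and $\hat{\tau}_{0,k}$ is the estimate for $\tau_k$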
under the exponential model.

To illustrate the improvement in power that occurs by pooling nonreplicate experiments, consider the following example. Suppose that one performs 10 nonreplicate experiments generated from distinct ancestral genotypes, each of which results in the observation of 10 distinct beneficial mutations. After shifting the threshold relative to the smallest observed, we have nine fitness effects for each of the 10 experiments. For each experiment we are required to estimate two parameters and shift relative to the smallest observation, which reduces the degrees of freedom by 3 per experiment. The effective sample size would then be 70, since 10(10 – 1 d.f. – 2 d.f.) = 70. Now consider the case of observing 73 distinct adaptive mutations in a single experiment. We lose 3 d.f. from the shift of the threshold and the estimation of two model parameters, leaving us again with an effective sample size of 70. Therefore pooling 10 observations from each of 10 experiments is equivalent to observing 73 from a single experiment.
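The bookkeeping in this example can be written out as a one-line calculation. The helper below is hypothetical and only restates the arithmetic in the text: each experiment gives up three degrees of freedom (one for the threshold shift, two for the estimated parameters).

```python
def effective_sample_size(observations_per_experiment, lost_per_experiment=3):
    """Sum of (observations - 3) over experiments, floored at zero per experiment."""
    return sum(max(n - lost_per_experiment, 0) for n in observations_per_experiment)

print(effective_sample_size([10] * 10))  # ten experiments with 10 mutations each
print(effective_sample_size([73]))       # one experiment with 73 mutations
```

Both calls return the same effective sample size, which is the equivalence the example describes.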
Incorporating measurement error:
In microbial evolution experiments, the beneficial mutations are first identified and then the fitness effect of each is estimated from the results of a separate experiment. The precision of the fitness assays associated with this second step can be a source of significant measurement error and could influence the analysis. KASSEN and BATAILLON (2006) appropriately account for this measurement error in their likelihood analysis. It is possible to minimize the effect of this error on the test for domain of attraction by conducting a larger number of replicate fitness assays. However, it is possible that the number of replicates required is too large or not cost effective, so that accounting for measurement error is required. In this case the likelihood equations can be easily extended to account for both normal and lognormal error. The type of error that is operating can be determined by standard methods such as a Q-Q plot against the standard normal.

Here we present an efficient algorithm for estimating the appropriate parameters when measurement error cannot be ignored. Let yij be the jth replicate for the ith largest fitness effect. Suppose that f(x | $\theta$) is the distribution for fitness effects. We use $\theta$ as the generic parameter, where $\theta$ could represent a vector of parameters, for example, the scale and shape parameter of the GPD. Let g(y | x, $\sigma^2$) be the normal density with mean x and variance $\sigma^2$. Let $\bar{y}_i$ be the average of the observed fitness effects under the assumption of equal replications per mutant genotype. Let $\hat{\theta}_0$ be the MLE for $\theta$ when measurement error is ignored. Let $\hat{\sigma}^2$ be the pooled variance based on the observed fitness effects. Now, the likelihood of the data is given by
![]() |
, using a Monte Carlo algorithm as follows. Since $\hat{\sigma}^2$ is the appropriate estimate of $\sigma^2$ we need only calculate $\hat{\theta}$. Calculate $\hat{\theta}_0$, the MLE for $\theta$ when measurement error is ignored. Simulate X1,j, X2,j,..., Xn,j for j = 1,..., N, from f(x | $\hat{\theta}_0$). Order the X's so they match up with the corresponding $\bar{y}_i$. The maximum-likelihood estimate can now be obtained by maximizing the following approximation to the likelihood with respect to $\theta$:
![]() | (10) |
This is a form of importance sampling, in which draws from f(x | $\hat{\theta}_0$) are used to approximate a likelihood over a range of $\theta$'s. Importance sampling is a commonly used Monte Carlo technique in population genetics (e.g., GRIFFITHS and TAVARÉ 1994, Section 7). While KASSEN and BATAILLON (2006) use a numerical technique to approximate the likelihood rather than importance sampling, the two methods are in fact equivalent.
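The importance-sampling idea can be illustrated with a deliberately simplified sketch. Everything specific here is an assumption made for the illustration and is not taken from the article: fitness effects are modeled as exponential with a single scale parameter `tau` (so the generic parameter reduces to tau), the measurement error on each mutant's mean fitness is normal with a known standard error `se`, the maximization is a grid search, and the function names are hypothetical. The essential steps mirror the description above: fit tau ignoring error, simulate and order data sets at that value once, then reweight the same draws to approximate the likelihood at other values of tau.

```python
import math
import random

def normal_logpdf(y, mean, sd):
    """Log density of a normal observation y with the given mean and sd."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (y - mean) ** 2 / (2.0 * sd * sd)

def exp_logpdf(x, tau):
    """Log density of an exponential fitness effect x with scale tau."""
    return -math.log(tau) - x / tau

def approx_loglik(tau, tau0, ybar, se, sims):
    """Importance-sampling estimate of the log-likelihood at tau from draws made at tau0."""
    logs = []
    for xs in sims:
        log_weight = sum(exp_logpdf(x, tau) - exp_logpdf(x, tau0) for x in xs)
        log_error = sum(normal_logpdf(y, x, se) for y, x in zip(ybar, xs))
        logs.append(log_weight + log_error)
    top = max(logs)                                   # log-mean-exp for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in logs) / len(logs))

def mle_with_error(ybar, se, n_sims=1000, seed=1):
    """Grid-search MLE of tau, accounting for normal measurement error."""
    rng = random.Random(seed)
    ybar = sorted(ybar, reverse=True)                 # largest observed mean first
    n = len(ybar)
    tau0 = sum(ybar) / n                              # MLE when error is ignored
    sims = [sorted((rng.expovariate(1.0 / tau0) for _ in range(n)), reverse=True)
            for _ in range(n_sims)]                   # ordered to match ybar
    grid = [tau0 * (0.5 + 0.025 * k) for k in range(41)]   # 0.5*tau0 to 1.5*tau0
    return max(grid, key=lambda tau: approx_loglik(tau, tau0, ybar, se, sims))

print(mle_with_error([1.9, 1.1, 0.6, 0.3, 0.1], se=0.2))
```

Reusing one set of simulated data sets for every candidate value keeps the approximate likelihood smooth in tau, which is what makes it suitable for maximization; the estimate degrades when the standard error is small relative to the spread of the simulated values, so the number of simulated data sets would need to grow in that case.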
To gauge how susceptible the algorithm is to Monte Carlo error, simulations were performed with N = 10,000 under the null hypothesis, where $\kappa$ = 0, $\tau$ = 1, $\sigma$ = 0.1, and a sample size n = 20. The algorithm provided reasonable estimates of the parameters ($\hat{\kappa}$, $\hat{\tau}$). Note the coefficient of variation ($\sigma$/$\tau$ = 0.1) represents measurement error less than what is expected to influence the results of the test (see Figure 4). The Monte Carlo error, due to the approximation in Equation 10, for the estimates of the parameters and the log-likelihood achieved a coefficient of variation <$5 \times 10^{-5}$ and $4 \times 10^{-5}$, respectively. In a second example, we simulated a sample of size n = 20 from the null model with $\kappa$ = 0, $\tau$ = 1, and $\sigma$ = 0.3. Note that the coefficient of variation in this example is much larger ($\sigma$/$\tau$ = 0.3) yet the algorithm still was able to recover reasonable estimates of the parameters ($\hat{\kappa}$ and $\hat{\tau}$). The Monte Carlo errors for the estimates and log-likelihood were again more than three orders of magnitude smaller than the parameters, $1 \times 10^{-5}$ and $1 \times 10^{-4}$, respectively. The results of these two simulations suggest that the algorithm as described is suitable for maximum-likelihood methods with N remaining computationally tractable.
SIMULATION RESULTS
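Let $\lambda$ = –$\kappa$/$\tau$. Note that if $\lambda$ was known then the MLE for $\kappa$ can be written

$$\hat{\kappa}(\lambda) = \frac{1}{n-1}\sum_{j=1}^{n-1} \ln\left(1 - \lambda X_j\right).$$

Substituting $\hat{\kappa}(\lambda)$ and $\hat{\tau}(\lambda) = -\hat{\kappa}(\lambda)/\lambda$ into the log-likelihood for the GPD we arrive at the reparameterization,

$$\ell(\lambda \mid X) = -(n-1)\ln\!\left(-\frac{\hat{\kappa}(\lambda)}{\lambda}\right) - (n-1)\left[\hat{\kappa}(\lambda) + 1\right],$$

a function of the single parameter $\lambda$, which is more reliable and computationally efficient, although under this reparameterization it is not possible to restrict values of $\kappa$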
< –1. All of the calculations in this section were conducted using the freely available statistical package R (R DEVELOPMENT CORE TEAM 2006). Log-likelihoods were optimized with the Nelder–Mead algorithm. Implementations of the test with and without measurement error are available from the author's web site at http://www.uidaho.edu/
joyce/lab page/computer-programs.html.
Power vs. sensitivity:
Two types of statistical analysis based on simulations are presented. We present a power analysis, summarized in Figure 3 and sensitivity analyses, summarized in Figures 2 and 4. Power analysis determines how many data are needed to distinguish between the null model and alternatives. Estimating the power of a statistical test requires simulating many data sets under various alternatives to the null model. In contrast, sensitivity analysis involves simulating data under the null model, where the structure of the data is included in the simulations. In one set of simulations, we add measurement error to each observation and in another we shift each data point by a percentage of the mean, equivalent to censoring the small selection coefficients. In essence, we simulate data under the null hypothesis
$\kappa$ = 0 and then transform this simulated data set to make it appear more like data that might arise in an experiment.
In the language of statistical inference, power analysis is concerned with avoiding type II errors. A type II error occurs when one accepts the null model when an alternative is more appropriate. Power is one minus the probability of a type II error. Sensitivity analysis is concerned with evaluating how much the probability of a type I error is inflated when the structure of the data is ignored. A type I error occurs if we incorrectly reject the null model. If the null model is an appropriate description of the data, but the data include measurement error or exclude small-effect mutations, then failing to account for these effects will inflate the type I error rate.
Power analysis:
The critical value of the test statistic was calculated for sample sizes considered typical of a single experiment or obtainable by pooling experiments, n = 10, 20, and 30. For comparison, larger data sets of size 50 and 100 were also simulated. This calculation was performed using 10 million replicate simulations. For each replicate, a data set of size n was simulated according to the null hypothesis where $\kappa$ = 0 and $\tau$ = 1. The LRT statistics from all such replicates generate an empirical distribution of the test statistic that allows for an approximation of the critical value for a given type I error rate $\alpha$. Simulations were then performed to estimate the power of the test against a GPD alternative at varying values of the shape parameter, –1 ≤ $\kappa$ ≤ 1.
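Simulating data under a GPD alternative, as the power analysis requires, can be done by inverting the distribution function. The sketch below is a minimal illustration with hypothetical function names; `tau` and `kappa` denote the scale and shape of the GPD, and the quantile formula is obtained by solving the GPD distribution function for x.

```python
import math
import random

def gpd_quantile(u, tau, kappa):
    """Inverse of the GPD distribution function at probability u in [0, 1)."""
    if kappa == 0:
        return -tau * math.log(1.0 - u)               # exponential case
    return tau * ((1.0 - u) ** (-kappa) - 1.0) / kappa

def gpd_sample(n, tau, kappa, rng):
    """Draw n independent GPD values by inverse-transform sampling."""
    return [gpd_quantile(rng.random(), tau, kappa) for _ in range(n)]

rng = random.Random(42)
uniform_like = gpd_sample(10, 1.0, -1.0, rng)   # kappa = -1: uniform on [0, tau]
heavy_tailed = gpd_sample(10, 1.0, 1.0, rng)    # kappa = 1: very heavy tail
print(max(uniform_like) < 1.0, min(heavy_tailed) >= 0.0)
```

Power at a given shape value could then be estimated as the fraction of such simulated data sets whose test statistic exceeds the critical value obtained under the null model.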
Note in Figure 3 that for a sample of size 10, $\kappa$ = 0 is virtually indistinguishable from $\kappa$ > 0. For 0 < $\kappa$ < 0.8, the proportion of simulated data sets that reject the Gumbel domain is approximately 0.05, essentially the type I error rate ($\alpha$ = 0.05) of the test when $\kappa$ = 0. When $\kappa$ = 1 the GPD reduces to a distribution with tail properties similar to those of the Cauchy distribution, well known for having a heavy tail. When $\kappa$ = 1, the tail of the distribution is so heavy that both the mean and the variance of the distribution are infinite. However, a sample of size 10 drawn from the GPD with $\kappa$ = 1 will produce samples that are distinguishable from the exponential distribution only 20% of the time. Power increases appreciably for a sample of size 20, but a sample of size 50 is required to get reasonable power. However, if we consider $\kappa$ < 0 then we see that even a sample size as low as 10 has reasonable power. The case of $\kappa$ = –1 corresponds to the uniform distribution, and 60% of all samples of size 10 drawn from the uniform distribution are distinguishable from the exponential under the GPD likelihood-ratio test.
Sensitivity analysis:
Ignoring the fact that small-effect mutations are missing from the data has a major impact on the data analysis. We outline above a simple adjustment to account for this missing data. Figure 2 shows the implications associated with failing to make this adjustment. The type I error increases dramatically as the shift size increases. The effect on the type I error rate is less pronounced when using the GPD alternative over the gamma.

The test is fairly insensitive to ignoring measurement error under both lognormal and normal error structures (Figure 4). A coefficient of variation as high as 20% has virtually no effect on the probability of a type I error. We did not assess the effect of measurement error on power. The measurement error in the fitness assay can be reduced through replication.
CONCLUSION
We envision this test as being most useful for experimental microbial evolution studies, because these systems allow the isolation of a number of beneficial mutations from a single ancestral background and, through sequencing, the identification and verification of those mutations. Previous studies identifying beneficial mutations have been quite labor intensive, involving either long fixation times (ROKYTA et al. 2005) or screening a large number of mutations (SANJUÁN et al. 2004; KASSEN and BATAILLON 2006). However, the framework we have described will allow testing of beneficial mutations identified through gain-of-function experiments (FERRIS et al. 2007). In these types of experiments, microbes are exposed to conditions under which the ancestral genotype cannot grow. For example, BULL et al. (2000) isolated mutants of the phage φX174 capable of growing at 45°, a temperature at which wild type fails to grow. The same type of experiment can be done by isolating, for example, antibiotic resistance mutations in bacteria or host range mutations in a virus. The difficulty with these experiments is that the wild type has a fitness near or at zero, and there may be many mutations that confer a slight advantage but not enough for measurable growth (i.e., colony or plaque formation).

Since the ancestral genotype cannot grow, it might appear that this type of data violates an underlying assumption used to justify EVT: that the wild type is already well adapted to the current environment and thus its fitness along with the fitnesses of all one-step beneficial mutations are in the tail of the fitness distribution. However, we are concerned only with whether or not beneficial mutations are in the tail of the fitness distribution, not whether the ancestral genotype is in the tail. If we think of creating an ordered list containing the fitness of each one-step mutant, listed from largest to smallest, the beneficial mutations that formed plaques or colonies would be at the top of the list. If the beneficial mutations represent a small subset among all possible mutations, then they meet the criteria of being in the extreme right tail of the fitness distribution. For example, if an experiment is replicated 20 times and the same five mutations are observed four times each, then this would suggest that there are only a small number of beneficial mutations. By establishing ahead of time that the number of adaptive changes is relatively small, and by shifting fitnesses appropriately, one can confidently use the tests described here for beneficial mutations observed through gain-of-function experiments.
It is important to note we are not limited in shifting relative to the genotype with the smallest observed fitness and can in fact shift relative to any observation deemed far enough out in the tail to warrant the use of EVT. Thus, not only does the methodology described provide the appropriate test for the type of tail distribution, but also it allows experimentalists to use simple but powerful gain-of-function techniques for isolating beneficial mutations, greatly facilitating the characterization of the distribution of beneficial fitness effects.
LITERATURE CITED
BULL, J. J., M. R. BADGETT and H. A. WICHMAN, 2000 Big-benefit mutations in a bacteriophage inhibited with heat. Mol. Biol. Evol. 17: 942–950.
CASTILLO, E., 1988 Extreme Value Theory in Engineering. Academic Press, New York/London/San Diego.
CASTILLO, E., and A. S. HADI, 1997 Fitting the generalized Pareto distribution to data. J. Am. Stat. Assoc. 92: 1609–1620.
COWPERTHWAITE, M. C., J. J. BULL and L. ANCEL MYERS, 2005 Distributions of beneficial fitness effects. Genetics 170: 1449–1457.
DAVISON, A. C., and R. L. SMITH, 1990 Models for exceedances over high thresholds. J. R. Stat. Soc. Ser. B 52: 393–442.
FERRIS, M. T., P. JOYCE and C. L. BURCH, 2007 High frequency of mutations that expand the host range of an RNA virus. Genetics 176: 1013–1022.
GERRISH, P. J., and R. E. LENSKI, 1998 The fate of competing beneficial mutations in an asexual population. Genetica 102/103: 127–144.
GILLESPIE, J. H., 1983 A simple stochastic gene substitution model. Theor. Popul. Biol. 23: 202–215.
GILLESPIE, J. H., 1984 Molecular evolution over the mutational landscape. Evolution 38: 1116–1129.
GILLESPIE, J. H., 1991 The Causes of Molecular Evolution. Oxford University Press, New York.
GRIFFITHS, R. C., and S. TAVARÉ, 1994 Simulating probability distributions in the coalescent. Theor. Popul. Biol. 46: 307–319.
GRIMSHAW, S. D., 1993 Computing maximum likelihood estimates for the generalized Pareto distribution. Technometrics 35: 185–191.
HELLER, J., 1961 Catch-22. Simon & Schuster, New York.
KASSEN, R., and T. BATAILLON, 2006 Distribution of fitness effects among beneficial mutations before selection in experimental populations of bacteria. Nat. Genet. 38: 484–488.
KIM, Y., and H. A. ORR, 2005 Adaptation in sexuals vs. asexuals: clonal interference and the Fisher–Muller model. Genetics 171: 1377–1386.
MAYNARD SMITH, J., 1962 The limitations of molecular evolution, pp. 252–256 in The Scientist Speculates: An Anthology of Partly-Baked Ideas, edited by I. J. GOOD. Basic Books, New York.
MAYNARD SMITH, J., 1970 Natural selection and the concept of a protein space. Nature 225: 563–564.
ORR, H. A., 2002 The population genetics of adaptation: the adaptation of DNA sequences. Evolution 56: 1317–1330.
ORR, H. A., 2003a The distribution of fitness effects among beneficial mutations. Genetics 163: 1519–1526.
ORR, H. A., 2003b A minimum on the mean number of steps taken in adaptive walks. J. Theor. Biol. 220: 241–247.
ORR, H. A., 2005a The genetic theory of adaptation: a brief history. Nat. Rev. Genet. 6: 119–127.
ORR, H. A., 2005b The probability of parallel evolution. Evolution 59: 216–220.
ORR, H. A., 2006a The distribution of beneficial fitness effects among beneficial mutations in Fisher's geometric model of adaptation. J. Theor. Biol. 238: 279–285.
ORR, H. A., 2006b The population genetics of adaptation on correlated fitness landscapes: the block model. Evolution 60: 1113–1124.
PICKANDS, III, J., 1975 Statistical inference using extreme order statistics. Ann. Stat. 3: 119–131.
R DEVELOPMENT CORE TEAM, 2006 R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna.
RICE, J. A., 1995 Mathematical Statistics and Data Analysis. Duxbury Press, Belmont, CA.
ROKYTA, D. R., P. JOYCE, S. B. CAUDLE and H. A. WICHMAN, 2005 An empirical test of the mutational landscape model of adaptation using a single-stranded DNA virus. Nat. Genet. 37: 441–444.
ROKYTA, D. R., C. J. BEISEL and P. JOYCE, 2006 Properties of adaptive walks on uncorrelated landscapes under strong selection and weak mutation. J. Theor. Biol. 243: 114–120.
ROZEN, D. E., J. A. G. M. DE VISSER and P. J. GERRISH, 2002 Fitness effects of fixed beneficial mutations in microbial populations. Curr. Biol. 12: 1040–1045.
SANJUÁN, R., A. MOYA and S. E. ELENA, 2004 The distribution of fitness effects caused by single-nucleotide substitutions in an RNA virus. Proc. Natl. Acad. Sci. USA 101: 8395–8401.
SMITH, R. L., 1985 Maximum likelihood estimation in a class of nonregular cases. Biometrika 72: 67–90.
Communicating editor: M. K. UYENOYAMA