CodonSubstitution Models for Heterogeneous Selection Pressure at Amino Acid Sites
 ^{*}Department of Biology, University College London, London NW1 2HE, United Kingdom
 ^{†}Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, Massachusetts 02138
 ^{‡}Department of Genetics, University of Cambridge, Cambridge CB2 3EH, United Kingdom
 ^{§}Department of Ecology and Genetics, University of Århus, Ny Munkegade, DK8000 Århus C, Denmark
 Corresponding author: Ziheng Yang, Department of Biology, 4 Stephenson Way, London NW1 2HE, United Kingdom. Email: z.yang{at}ucl.ac.uk
Abstract
Comparison of relative fixation rates of synonymous (silent) and nonsynonymous (amino acidaltering) mutations provides a means for understanding the mechanisms of molecular sequence evolution. The nonsynonymous/synonymous rate ratio (ω = d_{N}/d_{S}) is an important indicator of selective pressure at the protein level, with ω = 1 meaning neutral mutations, ω < 1 purifying selection, and ω > 1 diversifying positive selection. Amino acid sites in a protein are expected to be under different selective pressures and have different underlying ω ratios. We develop models that account for heterogeneous ω ratios among amino acid sites and apply them to phylogenetic analyses of proteincoding DNA sequences. These models are useful for testing for adaptive molecular evolution and identifying amino acid sites under diversifying selection. Ten data sets of genes from nuclear, mitochondrial, and viral genomes are analyzed to estimate the distributions of ω among sites. In all data sets analyzed, the selective pressure indicated by the ω ratio is found to be highly heterogeneous among sites. Previously unsuspected Darwinian selection is detected in several genes in which the average ω ratio across sites is <1, but in which some sites are clearly under diversifying selection with ω > 1. Genes undergoing positive selection include the βglobin gene from vertebrates, mitochondrial proteincoding genes from hominoids, the hemagglutinin (HA) gene from human influenza virus A, and HIV1 env, vif, and pol genes. Tests for the presence of positively selected sites and their subsequent identification appear quite robust to the specific distributional form assumed for ω and can be achieved using any of several models we implement. However, we encountered difficulties in estimating the precise distribution of ω among sites from real data sets.
COMPARISON of synonymous (silent) and nonsynonymous (amino acidaltering) substitution rates provides an important means for studying the mechanisms of DNA sequence evolution (Kimura 1983; Gillespie 1991; Ohta 1993). As synonymous mutations are largely invisible to natural selection (but see Akashi 1995), while nonsynonymous mutations can be under strong selective pressure, comparison of the fixation rates of these two types of mutations provides a powerful tool for understanding the effect of natural selection on molecular sequence evolution. A measure that has featured prominently in such studies is the nonsynonymous/synonymous substitution rate ratio (ω = d_{N}/d_{S}), termed the “acceptance rate” by Miyata and Yasunaga (1980). Here the rates d_{N} and d_{S} are defined as the numbers of nonsynonymous and synonymous substitutions per site, and their ratio ω measures the selective pressure at the protein level. An ω > 1 means that nonsynonymous mutations offer fitness advantages to the protein (individual) and have higher fixation probabilities than synonymous mutations. This is our working definition of positive selection (adaptive molecular evolution) in this article.
The ω ratio has almost always been calculated as an average over all codon (amino acid) sites in the gene and over the entire evolutionary time that separates the sequences. The criterion that this average ω be >1 is a very stringent one for detecting positive selection (e.g., Sharp 1997; Akashi 1999; Crandallet al. 1999). In many proteins, a high proportion of amino acids may be largely invariable (with ω close to 0) due to strong functional constraints. Furthermore, most proteins appear to be under purifying selection most of the time (Li 1997). Adaptive evolution most likely occurs at a few time points and affects a few amino acids. In such cases, the ω ratio averaged over time and over sites will not be significantly >1 even if adaptive molecular evolution has occurred. For example, Endo et al. (1996) used this criterion to perform a database search and identified 17 proteins out of 3595 that may have been under positive selection, at a proportion of only 0.47%. The scarcity of wellestablished cases of molecular adaptation may be partly due to the lack of power of the detection methods.
An alternative approach is to examine the ω ratio over a short evolutionary time (for example, along particular lineages in a phylogeny) or in functionally distinct regions of the gene. Messier and Stewart (1997) used inferred ancestral sequences to identify two lineages in a phylogeny of primates that are probably under diversifying selection for the lysozyme gene. Hughes and Nei (1988) found that the ω ratio is >1 in a region of the major histocompatibility complex (MHC) that codes for the antigen recognition site, while it is <1 in other regions of the gene. When knowledge of functional domains of the protein is unavailable, or when only a few sites are expected to be undergoing positive selection, a promising approach is to devise statistical models that allow for heterogeneous ω ratios among sites (Nielsen and Yang 1998). Such models can be used to test and identify critical amino acids in a protein under positive selection.
Nielsen and Yang (1998) implemented a few simple models that allow for heterogeneous ω ratios among sites. One, the “neutral” model, assumes the existence of two classes of sites: conserved sites, at which nonsynonymous mutations are deleterious and removed (ω = 0), and completely neutral sites at which ω = 1. A “selection” model adds a third class of sites with the underlying ω ratio estimated from the data. These models appear too simple to capture the complexity of the substitution process of various proteins. In particular, the neutral model does not allow for sites with 0 < ω < 1, such as sites at which nonsynonymous mutations are “slightly deleterious.” The selection model with one additional category cannot account for both sites with 0 < ω < 1 and sites with ω > 1.
In this article, we implement a number of models (statistical distributions) for heterogeneous ω ratios among sites. We have two major objectives in fitting those models. The first is to test for the presence of positively selected sites (sites with ω > 1) and to identify such sites along the gene. We find that this type of analysis seems insensitive to the exact distribution assumed for ω. We note that this test of molecular adaptation is still conservative, as it requires that positively selected sites be under diversifying selection along all lineages on the phylogeny. If positive selection affects only a few lineages, and purifying selection dominates during the rest of the evolutionary time, our test will fail.
Our second objective is to understand what distributions best describe the heterogeneous ω ratios among sites in real data. This appears to be a much harder task. Nevertheless, the distribution of ω is closely related to the fitness distribution of new mutations, and knowledge of it is important for testing competing population genetics models of molecular evolution (Kimura 1983; Gillespie 1991; Ohta 1993).
Ten data sets of genes from nuclear, mitochondrial, and viral genomes are analyzed to examine the fit of the models and to accumulate empirical knowledge of the ω distribution among sites. These analyses reveal diversifying selection in several genes for which it was not previously suspected.
DATA
Ten data sets of proteincoding genes are analyzed. Table 1 lists the number of sequences (s) and the sequence length (n) for each data set. The transition/transversion rate ratio (κ), the (average) nonsynonymous/synonymous rate ratio (ω), and the tree length (sum of all branch lengths) are estimated under the simple model of one ω ratio for all sites (Goldman and Yang 1994, model M0 below). These statistics are listed to give an indication of the sequence divergence level and the selective constraint involved in each gene. More details of the data sets follow.
D1: Mitochondrial proteincoding genes from the hominoids: The 12 proteincoding genes on the Hstrand of the mitochondrial genome are concatenated and analyzed as one data set. All the 12 proteins are transmembrane proteins and appear to have similar substitution patterns (Kumar 1996). The seven species are human, common chimpanzee, pygmy chimpanzee, gorilla, Bornean orangutan, Sumatran orangutan, and common gibbon. The data are a subset of the data analyzed by Cao et al. (1998), where the GenBank accession numbers and references are given. The transition/transversion bias is strong, and the genes appear under strong purifying selection with ω at ~0.04.
D2: Vertebrate βglobin genes: The βglobin gene of 17 vertebrate species (human, tarsier, bush baby, hare, rabbit, cow, sheep, pig, elephant seal, rat, mouse, hamster, opossum, duck, chicken, African clawed frog, and western clawed frog) were extracted from the EMBL and GenBank databases. The sequences were aligned by hand, and the alignment appeared straightforward.
D3: Drosophila adh gene: The data set contains alcohol dehydrogenase (adh) gene sequences from 23 species of Drosophila. The data set was described in Nielsen (1997), where GenBank accession numbers for the sequences are given. The transition/transversion rate ratio (1.6) is relatively low for this data set.
D4: Flavivirus Eglycoprotein gene: Twentytwo dengue virus Eglycoprotein gene sequences from the alignment published by Zanotto et al. (1996) were used. The original alignment contains 123 sequences from more divergent groups. The 22 sequences we use are 6, 7, 7, and 2 sequences from the closely related Den 1–4 groups, respectively. The gene appears to be under strong purifying selection (with an average ω ratio of ~0.05).
D5: Human influenza virus type A HA gene: The data set is a subset of the HA1 domain of the hemagglutinin (HA) gene of human influenza viruses A analyzed by Fitch et al. (1997). We selected 28 sequences to represent major groups in the original data set. The HA gene encodes the major surface antigen, which is a target of neutralizing antibodies produced during infection or vaccination.
D6: HIV1 vif gene: A total of 29 isolates from subtype B are used. The HIV1 vif gene encodes an accessory protein that is believed to be essential for pathogenic infection, but its exact function is not known (e.g., Emerman and Malim 1998).
D7: HIV1 pol gene: A total of 23 isolates from subtype B are used. The HIV1 pol gene encodes for three proteins: the reverse transcriptase responsible for reverse transcribing the viral RNA into DNA; the polymerase responsible for cleaving the pol and gag precursor proteins into their final products; and the endonuclease (integrase) responsible for cleaving the host DNA so that the HIV DNA can be inserted. All three proteins are essential for virus replication.
D8: Japanese encephalitis env gene: Sequences from 23 isolates are used.
D9: Tickborne flavivirus NS5 gene: The data are from Kuno et al. (1998). The sequences are highly divergent, and the gene appears to be under strong purifying selection (with the average ω = 0.025).
D10: HIV1 env gene V3 region: HIV1 env gene V3 region from 13 HIV1 isolates with a known transmission history was published by Leitner et al. (1997).
The alignment for the influenza virus HA gene (D5) was kindly provided by Walter Fitch and those for data sets D6–D9 by Edward Holmes.
THEORY
Markov model of codon substitution: Suppose there are n codons (sites) in the sequence. Let the data at site h (h = 1, 2, …, n) be x_{h}; that is, x_{h} is a vector of codons from different sequences at that codon site. Models considered in this article allow the d_{N}/d_{S} (ω) ratio to vary among sites. Let ω^{(}^{h}^{)} be the ratio for site h. The codon substitution model specifies the relative instantaneous substitution rate from codon i to j at site h (h = 1, 2, …, n) as
Likelihood calculation under a model of heterogeneous ω ratios among sites: Using one ω parameter for each site would lead to too many parameters in the model. Instead we use a statistical distribution to account for heterogeneous ω ratios among sites (Nielsen and Yang 1998). Suppose we assume K classes of sites in the sequence, with the proportions and ω ratios given as
We want to calculate the probability of observing data x_{h} at site h. Note that given the ω ratio for the site, the conditional probability of the data, P(x_{h}ω), can be calculated lengths for any phylogenetic tree and branch using Felsenstein's (1981) pruning algorithm (see also Goldman and Yang 1994; Muse and Gaut 1994). The (marginal) probability of the data at the site is then an average of the conditional probability over the above distribution of ω:
After maximumlikelihood (ML) estimates of parameters are obtained, an empirical Bayes approach can be used to infer which class the site is most likely from (Nielsen and Yang 1998). The posterior probability that site h with data x_{h} is from site class k (i.e., that ω^{(}^{h}^{)} = ω_{k}) is
Continuous distributions: In theory, a continuous distribution for ω, say, f(ω), can be used in the likelihood calculation. The sum in Equation 3 will then be replaced by an integral. However, we have not found a feasible way to perform this calculation and instead resort to the use of a discrete distribution as an approximation, as in Yang (1994). We use K = 10 categories in the discrete distribution, each with probability 1/10, and use the median value of ω within each category to represent the distribution of ω ratios within that category. The p's and ω's in Equation 2 are then calculated as functions of parameters in the continuous ω distribution. Let F(ω) be the cumulative distribution function (CDF) of the ω distribution
The probability of the data x_{h} is then approximated by
Statistical distributions implemented: The models considered in this article are summarized in Table 2. The codes given are as used in the paml program (Yang 1997), which now implements all the models described here. All models involve the following parameters: branch lengths in the phylogeny, the transition/transversion rate ratio κ, and base frequencies at the three codon positions. These parameters are not listed in Table 2. Model 0 (M0) assumes one ω for all sites (Goldman and Yang 1994). The neutral model (M1) assumes a proportion p_{0} of conserved sites with ω_{0} = 0 and a proportion p_{1} = 1 − p_{0} of neutral sites with ω_{1} = 1. The selection model (M2) adds an additional class of sites with frequency p_{2} = 1 − p_{0} − p_{1} and with ω_{2} estimated from the data. M1 and M2 were implemented by Nielsen and Yang (1998). A number of new models are implemented in this article to accommodate different shapes of the ω distribution that are likely to occur in real data. The details follow.
M3 (discrete): The discrete model uses an unconstrained discrete distribution to model heterogeneous ω ratios among sites (Equation 2). The ω_{k} values are arranged in increasing order for unique identification. The model with K classes involves K − 1 frequency parameters and K parameters ω_{k}. In this article, K = 3 classes are used.
M4 (freqs): This model fixes ω at several prespecified values and estimates the corresponding frequencies for the site classes. The model with K classes involves K − 1 free frequency parameters p_{k}. We use K = 5, with ω_{k} fixed at 0, ⅓, ⅔, 1, and 3 for k = 0, 1, …, 4.
M5 (gamma): This model assumes a simple gamma distribution
M6 (2gamma): This model uses a mixture of two gamma distributions,
M7 (beta): The beta distribution
M8 (beta&ω): This model adds one extra class of sites to the beta model. A proportion p_{0} of sites have ω drawn from the beta distribution
M9 (beta&gamma): This model uses a mixture of
M10 (beta&gamma+1): This model uses a mixture of a beta and a gamma. However, the gamma is shifted to the right by one unit, so that the gamma distribution accounts for positively selected sites (ω > 1) only. A proportion p_{0} of sites have ω from
M11 (beta&normal>1): This model uses a mixture of a beta distribution
M12 (0&2normal>0): This model assumes a proportion p_{0} of sites with ω = 0 and a proportion (1 − p_{0}) from a mixture of two normal distributions. This mixture is p from
M13 (3normal>0): This model uses a mixture of three normal distributions truncated at ω = 0: p_{0} from
We use 10 categories (K = 10) to approximate the continuous part of the ω distribution, so that 11 categories are actually used in models M8 and M12. An example is shown in Figure 1, where model M6 (2gamma) is discretized.
The continuous mixture distributions involving the beta (M9, M10, and M11) are not smooth at ω = 1. Furthermore, the discretized versions of those models may not be powerful to detect positive selection. Because we use 10 categories in the discrete distribution, each of proportion 10%, there may not be any category with ω > 1, if the proportion of sites in the data under positive selection is substantially <10%. In such a case, the model may not suggest positive selection, even if the parameter estimates suggest a component of positively selected sites (that is, even if p_{1} > 0). It appears important to examine the inferred discrete distribution to determine whether the model suggests sites under selection, as is the case with other continuous mixture models (M5, M6, M12, and M13).
Phylogenetic tree topologies are obtained by ML search using simple nucleotide substitution models. For the purpose of this study, minor differences in the phylogeny do not make any significant difference in estimation of parameters or identification of positively selected sites (see below). The codon frequencies are estimated using the observed nucleotide frequencies at the three codon positions, while other parameters are estimated by numerical maximization of the likelihood function. Independent computer programs were written by Z. Yang and R. Nielsen for error checking.
Likelihoodratio tests to compare models: The LRT may be used to compare models implemented in this article. When two models are nested, twice the loglikelihood difference will be compared with a χ^{2} distribution with the degrees of freedom ν equal to the difference in the number of parameters between the two models. For example, models M2 (selection) and M3 (discrete) are more general than M1 (neutral) and can be compared with M1. When ω_{2} > 1 in M2 or M3, this becomes a test for the presence of positively selected sites. Similarly, M7 (beta) is a special case of models M8 (beta&ω), M9 (beta&gamma), M10 (beta&gamma+1), and M11 (beta&normal>1). Any of those more general models can be compared with M7. When the more general models indicate the presence of sites with ω > 1, the comparison constitutes an LRT of positive selection. We note that the null hypothesis in those tests, M1 (neutral) or M7 (beta), corresponds to fixing one of the parameters at the boundary of the parameter space of the alternative hypothesis; that is, the proportion for the component of positively selected sites (p_{2} in M2 and M3 and p_{1} in M8–M11) is set to zero in the null hypothesis. In such cases, use of the simple χ^{2} distribution is not reliable (Self and Liang 1987), and caution is needed when the test statistic is close to the critical value.
Although the continuous version of model M8 is nested within each of the continuous versions of models M9–M11, this is not the case with the discretized versions used in this article. Those models cannot be compared using a χ^{2} approximation to the test statistic, although in theory the null distribution can be generated by Monte Carlo simulation (Goldman 1993). In comparing those models, we use the Akaike Information Criterion (Akaike 1974) as a guidance, which stipulates that one extra parameter should be counted as an improvement of one loglikelihood unit. However, many of the mixture distributions we implemented (for example, M8–M13) are found to fit data sets analyzed in this article about equally well, leaving us with little power to discriminate among different ω distributions and also rendering formal statistical tests among those models unnecessary.
RESULTS
Likelihood values and parameter estimates obtained from different data sets are listed in Tables 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12. Estimates of branch lengths are not shown, although their sum (the tree length) under model M0 is given in Table 1. Estimates of the transition/transversion rate ratio (κ) are quite homogeneous among models in each data set and thus are not shown in Tables 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12. For example,
Our focus is the distribution of ω among sites, and we are interested in the shape of the ω distribution, as indicated by parameter estimates under different models and the fit of the models to data measured by their loglikelihood values. We discuss tests for the presence of positively selected sites and, when such sites exist, their identification. In the following, we describe results obtained from each data set. The general patterns are summarized in the discussion.
D1: Mitochondrial proteincoding genes from hominoids (Table 3): The mitochondrial genes are highly conserved. The average ω ratio ranges from 0.04 to 0.05 among all models except for M1 (neutral) and M4 (freqs), suggesting that a nonsynonymous mutation has only 4–5% as much chance as a synonymous mutation of being fixed in the population. Models M1 and M4 do not fit the data well and gave much larger and probably unreliable estimates of the average ω ratio. The selective pressure indicated by the ω ratio is highly variable among sites. For example, the model of one ω ratio for all sites (M0) is rejected by a big margin when compared with model M3 (discrete), which allows for three classes of sites with different ω ratios. The LRT statistic for this comparison is 2Δℓ = 2 × [−29690.55 − (−29967.86)] = 554.62, much greater than critical values from a χ^{2} distribution with d.f. = 4.
Parameter estimates from models such as M5 (gamma) and M7 (beta) suggest highly skewed L shapes for the ω distribution, with most amino acids highly conserved or almost invariable. The strictly neutral model (M1) fits the data worse than the oneratio model (M0). The reason appears to be that M1 does not account for sites with positive but very small ω ratios. For similar reasons, M4 (freqs) fits the data poorly, as it does not account for highly conserved sites with 0 < ω < ⅓ (recall that our implementation of this model fixes ω at 0, ⅓, ⅔, 1, and 3). Models M9–M11, which add a continuous component to the beta distribution, do not fit the data as well as model M8, which adds a class of sites with a constant ω ratio.
The discrete model (M3) suggests a small proportion of sites (p_{1} = 0.7%) under positive selection, with ω_{2} = 1.43. This model fits the data significantly better than M0 (oneratio), as mentioned above, or M1 (neutral). Similarly, model M8 (beta&ω) suggests a small proportion of sites (p_{1} = 0.6%) under diversifying selection with ω_{1} = 1.46. The LRT statistic for comparing M7 (beta) and M8 (beta&ω) is 2Δℓ = 2 × [−29690.71 − (−29699.38)] = 18.54. The P value for this comparison is 0.9 × 10^{−4}, in comparison with the χ^{2} distribution with d.f. = 2. M7 is thus rejected in favor of M8. However, the selection model (M2) does not detect positive selection in this data set. This is apparently due to the fact that the strict neutral model (M1) on which it is based is unrealistic, and the extra category added in M2 optimally accounts for deleterious mutations (with ω_{2} = 0.11). Models M10 (beta&gamma+1) and M11 (beta& normal>1) have the same likelihood value and the estimates of parameters under those models appear to suggest presence (at ~0.5%) of sites under positive selection. However, the discretized distributions do not have any category with ω > 1 and do not suggest positive selection; neither is the fit of those two models significantly better than the beta model (M7); 2Δℓ = 4.98 with P = 0.17 with d.f. = 3. The other continuous distributions, such as M5 (gamma), M6 (2gamma), M12 (0& 2normal>1), and M13 (3normal>1), do not have site classes with ω > 1 either. However, this failure is due to those models' insistence on having 10% of sites in a category with ω > 1. The small proportion of positively selected sites is detected only when the proportion is a free parameter estimated from the data, that is, under models M3 (discrete) and M8 (beta&ω). In sum, the models provide consistent evidence for the presence of a small proportion of positively selected sites in the mitochondrial genes.
The Bayes approach (Equation 5) can be used to identify sites potentially under diversifying selection. Model M3 (discrete) suggested 14 positively selected sites with P(ω > 1) > 0.5, while model M8 (beta&ω) located only 12, all included in the 14 found by model M3. If the stricter 95% threshold is used, only 2 sites are identified: one in ATP6 (with the posterior probabilities 0.96 under M3 and 0.94 under M8) and another in ND5 (with the probabilities 0.98 under M3 and 0.96 under M8). Overall, although the two models are constructed very differently, they produced very similar results concerning the inference of the positively selected sites.
D2: βglobin genes from vertebrates (Table 4): The βglobin genes are moderately conserved, with an average ω ratio between 0.31 and 0.36 among all but the worstfitting models. Parameter estimates from models such as M5 (gamma), M7 (beta), and M9 (beta& gamma) suggest that the distribution of ω over sites is Lshaped, with most sites highly conserved. Although better than M0, the strictly neutral model (M1) does not fit the data well. Neither did model M4 (freqs). The continuous mixture distributions (M8–M13) fit the data almost equally well.
Parameter estimates and LRTs suggest presence of sites under diversifying selection (Table 4). For example, the discrete model (M3) indicates ~7.9% of sites are under positive selection with ω_{2} = 1.69. M3 fits the data significantly better than any of models M0 (oneratio), M1 (neutral), and M2 (selection). Similarly to the case of the mitochondrial data set, the selection model (M2) does not detect positive selection, as the strict neutral model it is based on does not allow for sites with 0 < ω < 1. The beta distribution (M7) is rejected when compared with any of M8 (beta&ω), M9 (beta&gamma), M10 (beta&gamma+1), or M11 (beta& normal>1). The LRT statistic is 2Δℓ = 22.18 between M7 and M8 (with P = 0.15 × 10^{−4}, d.f. = 2) and 20.48 between M7 and M10 or M11 (with P = 0.13 × 10^{−3}, d.f. = 3). The discretized versions of models M9, M10, and M11 have one category (10%) with an ω ratio of ~1.7. Except for model M2 (selection), all other models that allow for positively selected sites (M3–M6 and M8–M13) do suggest existence of such sites.
Positively selected sites are identified using the Bayes approach (Equation 5). The different models give very similar lists of positively selected sites, although the posterior probabilities vary somewhat among models. The discrete model (M3) identified more sites under positive selection than other models. At the 99% level, eight sites are identified: 7, 42, 48, 50, 54, 67, 85, and 123. Those sites are also the top eight under model M8 (beta&ω); but at the 99% level, M8 located only sites 7, 50, and 123. To test whether the choice of tree topology has any effect, we also used the star phylogeny to analyze the data under model M3 (discrete). That analysis identified the following sites at the 99% level: 7, 42, 48, 50, 54, 67, 74, 85, 114, 123, 124, and 128. This list is very similar to those obtained using the best tree.
D3: Drosophila adh gene (Table 5): The average estimate of ω is ~0.11 for all the bestfitting models. Parameter estimates from models such as M5 (gamma) and M7 (beta) suggest that the ω distribution has a highly skewed L shape, with most sites highly conserved with very small ω ratios. Model M3 (discrete) fits the data significantly better than models M0 (oneratio), M1 (neutral), or M2 (selection), but none of the models suggest presence of positively selected sites. The beta model (M7) provides a good fit to the data, although slightly worse than M3 (discrete), and adding an extra component to the beta (as in models M8–11) leads to no significant improvement in the loglikelihood score. Other continuous mixture models (M5, M6, M8, and M12–M13) do not suggest positive selection either.
We note that previous studies (e.g., Hudsonet al. 1987) have suggested the operation of balancing selection at one particular amino acid site at the adh locus in Drosophila. Our LRTs, while highlighting the extreme variation in selective pressure among sites, do not suggest existence of sites under diversifying selection. This may be due to the lack of power of our models to detect balancing selection.
D4: Flavivirus Eglycoprotein gene (Table 6): The average estimate of ω is ~0.06 for all the bestfitting models. The ω ratios are highly variable among sites and the ω distribution appears to have a highly skewed L shape, with most sites highly conserved with very small ω ratios. The discrete model (M3) fits the data much better than M0 (oneratio), M1 (neutral), or M2 (selection), but none of the models suggest the existence of sites under diversifying selection. The beta distribution (M7) fits the data better than M3 (discrete). Since M7 also has three fewer parameters than M3, it is the preferred model for this data set. Although models M12 and M13 have marginally higher likelihood values than M7, this improvement is not more than expected given their greater numbers of parameters. Adding an additional component to the beta model to allow for positively selected sites (models M8–M11) provides no significant improvement to the model's fit to data.
D5: Human influenza A virus HA gene HA1 domain (Table 7): The average ω ranges from 0.39 to 0.41 among all models except for M1 (neutral) and M7 (beta), which do not allow for positively selected sites, fit the data badly, and give smaller estimates of ω. The average acceptance rate is <1, indicating that, on average, purifying selection dominates the evolution of the gene. The oneratio model (M0) is easily rejected when compared with all other models, which allow the ω ratio to vary among sites. Distributions of ω among sites estimated under M5 (gamma) and M6 (2gamma) are Lshaped, with heavy tails. The beta distribution (M7) has an interesting U shape, possibly because the model is forced to attempt to account for sites with Δ > 1. The discrete model (M3) fits the data as well as or better than all other models considered; M8 has nearly as high a likelihood value, and uses one fewer parameter.
Models that allow for positively selected sites, that is, M2 (selection), M3 (discrete), M5 (gamma), M6 (2gamma), and M8–M13, all suggest presence of positively selected sites. For example, the selection model (M2) suggests ~1.1% of sites under positive selection with ω_{2} = 5.8. Model M3 (discrete) suggests a large proportion of sites (25.1%) under weak diversifying selection with ω_{1} = 1.28 and a small proportion of sites (0.8%) under strong diversifying selection with ω_{2} = 6.90. Note that the selection model (M2) suggests a large proportion of neutral sites with ω_{1} = 1, and the estimates from M2 and M3 are not very different. Both models have significantly higher likelihood values than models M0 and M1, providing strong evidence for adaptive evolution. Similarly, M8 (beta&ω) suggests ~1.3% of sites under positive selection with ω_{2} = 5.2. The LRT statistic for comparing M7 (beta) and M8 (beta&ω) is 2Δℓ = 2 × 5.43 = 10.86,
Amino acid sites 135 and 226 are identified to be under positive selection at the 95% level by all models that allow for positive selection. Model M3 (discrete) suggested many more sites because the parameter estimates under this model suggest a large proportion of weakly selected sites. At the 50% level, models M5, M6, and M9–M13 suggested the same 23 positively selected sites: 15, 94, 121, 124, 133, 135, 137, 138, 155, 156, 157, 159, 163, 174, 186, 189, 196, 197, 219, 226, 246, 262, and 273.
D6: HIV1 vif gene (Table 8): The pattern for this data set is rather similar to that of the influenza virus HA gene (D5; see Table 7). The average ω ratio over all sites ranges from 0.6 to 0.9 among models. The oneratio model (M0) is easily rejected when compared with any of the moregeneral models that allow for variable ω ratios among sites. Model M1 (neutral) also fits the data set much better than M0. The gamma (M5) and double gamma (M6) models suggested Lshaped distributions for ω, with heavy tails. The beta model (M7) suggests a U shape, possibly because the underlying ω ratio at some sites is >1. Model M3 (discrete) fits the data as well as or better than all other models considered.
All models that allow for positively selected sites do suggest existence of such sites in this gene. For example, the selection model (M2) suggests ~8.5% of sites under positive selection with ω_{2} = 4.2. Model M3 (discrete) suggests a large proportion of sites (32.5%) under weak diversifying selection with ω_{1} = 1.21 and a small proportion of sites (7%) under strong selective pressure with ω_{2} = 4.0. Both models M2 and M3 have significantly higher likelihood values than models M0 (oneratio) or M1 (neutral). Similarly, M8 (beta&ω) suggests that ~9% of sites are under positive selection with ω = 3.4. The LRT statistic for comparing M7 (beta) and M8 (beta& ω) is 2Δℓ = 2 × [(−3370.66) − (−3400.45)] = 59.58,
We plotted the posterior probability distributions, calculated using Equation 5, for sites along the HIV1 vif gene. This was done for models M2 (selection), M3 (discrete), M8 (beta&ω), and M9 (beta&gamma), with results for M3 and M9 shown in Figure 2. Posterior probabilities under M2 and M3 are very similar, but as one may expect, many sites included in the neutral class under M2 are included in the weakly selected class under M3 (results not shown). Plots of M8 and M9 are virtually identical (results not shown). At the 99% level, model M9 identified 10 positively selected sites (31, 33, 39, 63, 92, 101, 109, 122, 127, and 167). At the same level, model M2 identified 5 of the sites only: 31, 39, 122, 127, and 167, while sites 33, 63, 92, 101, and 109 are included at the 95% level. The two models used in Figure 2, M3 and M9, are constructed very differently, and yet the posterior distributions under the two models are highly similar.
D7: HIV1 pol genes (Table 9): The average ω ratio over all sites ranges from 0.20 to 0.27 among models except for M1 (neutral) and M7 (beta), indicating relatively strong purifying selection. The strict neutral model (M1) and the beta model (M7) give lower estimates of the average ω ratio, as they do not allow for sites with ω > 1. M1 fits the data much better than the oneratio model (M0), but is much worse than M7 (beta). There are clearly sites with 0 < ω < 1. Parameter estimates under models such as M5 (gamma) and M6 (2gamma) suggest Lshaped distributions for ω, with heavy tails. The beta model (M7) suggests a U shape, and the peak at ω = 1 is probably caused by sites with ω > 1 in the data. The discrete model (M3) fits the data as well as or better than all other models considered. Simpler nested models (M0, M1, and M2) are all rejected when compared with M3.
The selection model (M2) does not suggest presence of positively selected sites. Similar to the cases of the mitochondrial (D1) and βglobin (D2) genes, this failure appears to be due to the unrealistic nature of the neutral null model (M1), which does not account for sites with 0 < ω < 1. Model M3 (discrete) suggests that a small proportion of sites (1.9%) are under strong diversifying selection with ω_{2} = 4.7. This model fits the data significantly better than models M0, M1, and M2. For example, the LRT statistic for the comparison between M0 and M3 is 2Δℓ = 2 × [(−9363.57) − (−9619.30)] = 2 × 255.73 = 511.46,
All models that allow for positive selection (M3–M6 and M8–M13) pinpoint similar sets of amino acid sites as targets of diversifying selection. For example, at the 99% level, M8 (beta&ω) identified the following sites: 67, 347, 478, 568, 761, and 779, while at the 95% level, four additional sites are identified by that model (sites 3, 14, 41, and 379).
Seibert et al. (1995) examined the HIV1 env, gag, and pol genes for positive selection. They used the method of Nei and Gojobori (1986) to estimate d_{N} and d_{S} and concluded that d_{N} > d_{S} for the env gene but d_{N} < d_{S} in the gag and the pol genes. Our results and those of Seibert et al. (1995) may be reconciled by noting that the tests presented in this article are more powerful to detect positive selection than the method used by Seibert et al. (1995).
D8: Japanese encephalitis env gene (Table 10): This gene is under strong purifying selection, as the average ω ratio over sites is ~0.05 for all models except for M1 (neutral) and M4 (freqs). Models M1 and M4 gave much larger estimates (0.17 and 0.07, respectively), but these do not appear reliable because the two models do not fit the data well. M1 is even poorer than the oneratio model (M0). The discrete model (M3) suggests a small proportion (0.3%) of nearly neutral sites with ω_{2} = 1.04, and there is no evidence for positive selection. Models such as M5 (gamma), M6 (2gamma), and M7 (beta) suggest highly skewed Lshaped distributions for ω, with most sites under strong selective pressure (with ω close to 0) and with very little probability density for the region ω > 1. The discrete model (M3) fits the data as well as or better than all other models considered. Model 8 (beta&ω) fits the data better (although not significantly) than M7 (beta). However, the estimated ω for the additional category of sites is <1. Furthermore, the discretized versions of models M9–M13 do not have any category with ω > 1.
D9: Tickborne flavivirus NS5 gene (Table 11): This gene is the most conserved among the 10 genes analyzed in this article. The average ω ratio over all sites ranges from 0.02 to 0.03 among models, except for M1 (neutral) and M4 (freqs). The latter two models produced larger but unreliable estimates as they fit the data very poorly. The strict neutral model (M1) fits the data even more poorly than the oneratio model (M0). Although the two models have the same number of parameters and are not nested, the loglikelihood difference (Δℓ = 582.48) is huge. Neither the selection (M2) nor discrete (M3) model indicates presence of positively selected sites. The beta mixture models (M8–M11) do not fit the data any better than the simple beta model (M7). Furthermore, none of the 10 categories in the discrete distributions used to approximate the continuous beta mixture models (M9–M11) have an ω ratio >1. Models such as M5 (gamma), M6 (2gamma), and beta (M7) all suggest highly skewed Lshaped distributions for ω. The discrete model (M3) fits the data as well as the best of other models.
D10: HIV1 env gene V3 region (Table 12): The ω ratio averaged over all sites ranges from 0.9 to 1.2 among models, except for models M1 (neutral) and M7 (beta). The latter two models give lower and unreliable estimates as they do not account for positively selected sites. This gene region has the highest overall ω ratios among the 10 data sets analyzed in this article. The HIV1 env gene and in particular the V3 region are well known to be under diversifying selection. The strict neutral model (M1) fits the data much better than the model of one ω for all sites (M0). The discrete model (M3) fits the data about as well as the best of other models considered.
All models that allow for positive selection do suggest a substantial proportion (18–70%) of positively selected sites. Models such as M5 (gamma) and M6 (2gamma) suggest Lshaped distributions for ω with heavy tails. The beta model (M7) suggests a U shape, as the model uses the density at ω ~ 1 to account for sites with ω > 1. Many amino acids are clearly under diversifying selection. For example, the discrete model (M3) suggests ~42% of sites are under relatively weak positive selection with ω_{1} = 1.8 and a further 4.6% of sites are under strong positive selection with ω_{2} = 7.1. Similarly, M8 (beta&ω) suggests that ~20% of sites are under positive selection with ω_{1} = 3.5. Note that the differences between the models may not be as large as they may seem, as we expect it to be difficult to distinguish a small proportion of strongly selected sites from a large proportion of weakly selected sites. The beta distribution in M8 (beta&ω) is Ushaped with a high proportion of sites with ω ~ 1. The LRT statistic for comparing M7 (beta) and M8 (beta&ω) is 2Δℓ = 2 × [(−1106.39) − (−1115.40)] = 2 × 9.01 = 18.02, with P = 0.00012 compared with the χ^{2} distribution with d.f. = 2. The beta mixture models (M9–M11) fit the data slightly better than M8 (beta&ω), and significantly better than the beta model (M7), again suggesting the operation of diversifying selection at some sites.
All models that allow for sites under positive selection identified sites 28, 66, and 87 with high posterior probability supports (≥99%). Model M12 (0&2normal>1) included sites 26 and 51 as well at the 99% level. At the 95% level, M12 suggested six additional sites: 22, 24, 68, 69, 76, and 83.
DISCUSSION
Effects of tree topology: Maximumlikelihood estimation under models in this article relies on the phylogenetic relationship among the sequences. To examine the effect of the phylogeny on the analysis, we used six candidate trees to fit all 14 models (M0–M13) to the vertebrate βglobin genes (D2). The six trees were either inferred from the βglobin data or based on conventional wisdom. The best tree according to the oneratio model (M0) was used to obtain results of Table 4. Results under another tree (the worst of the six trees under model M0) are presented for four models (M0, M3, M7, and M8) in Table 13. The estimates under this tree are highly similar to estimates presented in Table 4. LRTs of positive selection lead to the same conclusions no matter which of the two (or six) trees is used. The inference of sites under positive selection does not seem to be sensitive to the assumed tree topology either. As mentioned before, use of even the star tree generated lists of positively selected sites for the vertebrate βglobin genes that are very similar to those obtained under the best tree. A similar analysis was performed on the HIV vif data set using two candidate trees. Parameter estimates, LRTs, and posterior probability calculations are all highly similar between the candidate trees (results not shown). While the correct tree should obviously be used if it is known, those results suggest that a reasonably good phylogeny may be sufficient for estimating parameters in the ω distribution and for performing LRTs of positive selection.
Computational and theoretical problems: Models implemented in this article appear very useful in testing the existence of positively selected amino acid sites and in identifying such sites when they exist. The different models also appear to produce consistent and convincing results. However, we encountered numerous practical problems. The continuous mixture models M9–M13, and especially the parameterrich model M13 (3normal>0), were found to converge to ML estimates very slowly during the iteration. In some datamodel combinations, the likelihood value was also found to be sensitive to the number of categories (K) used in the discrete approximation. While the likelihood values reported in Tables 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12 are reliable, parameter estimates under some models may not be. In some cases, different values of the parameters gave virtually the same likelihood values, and the likelihood surface appears to be nearly flat. Those computational difficulties appear to a large extent to be caused by our use of the discrete distributions to approximate the continuous ones. Two different continuous distributions (or two sets of parameters for the same continuous distribution) may look very similar after they are discretized. We note that distinguishing statistical distributions is almost always a difficult task, but the discretization appears to have further reduced the power of the method.
We also implemented a different discretization scheme, in which K_{1} = 6 categories are used for the region ω < 1 and K_{2} = 4 categories are used for the region ω ≥ 1. This scheme appears to perform better for data sets with a small proportion of positively selected sites (such as in D7), but worse when the data do not contain positively selected sites (such as in D9). When a large proportion of sites are under diversifying selection (such as in D10), both schemes appear to perform well. Since the new K_{1}K_{2} scheme is not consistently better than the old scheme of K = 10 equalprobability categories, we have not used it for analysis in this article. Future work to devise more efficient approximations to the integral of Equation 6 is highly desirable and may restore some power to distinguish those continuous distributions we implemented.
Models implemented in this article assume that the selective pressure indicated by the ω ratio is identical among evolutionary lineages. They also assume that the nonsynonymous substitution rate is independent of the amino acids being interchanged; that is, at a positively selected site, all amino acids are assumed to be acceptable and all amino acid changes are assumed to be advantageous. By the criterion we use, a site will be considered to be under positive selection only if the nonsynonymous rate, averaged over all lineages in the phylogeny and over all possible amino acidreplacement mutations at the site, is higher than the synonymous rate. Thus LRTs of positive selection implemented in this article are conservative. It appears desirable to develop models that allow the selective pressure to vary both among lineages and among sites; such models may be much more powerful for detecting adaptive molecular evolution than those implemented in this article. In this regard, it should be noted that our models, conservative as they are, identified positive selection in several genes not previously suspected of being under positive selection.
Another assumption made in this article is a constant mutation (synonymous) rate among sites. This assumption may not always be realistic. However, we suggest that synonymous rate variation is unlikely to lead to false conclusions of adaptive evolution by the methods of this article. For example, at mutational hot spots, both synonymous and nonsynonymous rates will be elevated, and if nonsynonymous mutations do not offer a selective advantage, the underlying ω ratio at such sites will not be >1. Similarly, in our formulation, selection at silent sites for translational efficiency, which has been nicely demonstrated in Drosophila by Akashi (1995, 1999) and suggested for mammals and viruses as well, has the sole effect of changing the codon usage pattern (π_{j} in Equation 1). It will not lead to ω > 1 if nonsynonymous mutations do not offer a selective advantage. We note that the ω ratio in our models measures the net effect of selection at the protein level (see Equation 1). Unlike many tests of neutrality suggested in population genetics (see, e.g., Wayne and Simonsen 1998; Fu and Li 1999 for reviews), which may be powerful in rejecting strict neutrality but not so powerful in distinguishing among different forms of natural selection, the LRTs described in this article aim to detect molecular adaptation.
How many genes are under positive selection? Our analysis demonstrated existence of sites under diversifying selection in 6 out of the 10 genes analyzed. The HIV1 env gene (data set D10) is one of the bestknown examples of adaptive evolution (e.g., Holmeset al. 1995; Mindell 1996; Yamaguchi and Gojobori 1997); the selective pressure is presumably the surveillance of the host immune system. Previous analysis (Fitchet al. 1997) also suggested positive selection in the human influenza virus HA gene (data set D5). Besides those two genes, our analysis also detected adaptive evolution in the HIV1 vif and pol genes (data sets D6 and D7) and in the mitochondrial (D1) and βglobin (D2) genes. The ω ratios averaged over all sites are ⪡1 in those data sets, and our models inferred adaptive molecular evolution in spite of this overwhelming effect of purifying selection. The 10 genes analyzed in this article are not a random sample of genes in various organisms. However, it appears likely that molecular adaptation happens much more often than has been recognized. We hope that our inference of sites under diversifying selection may prompt further labbased investigation on the structure and function of the proteins to identify the selective agents.
Recommendations concerning models implemented in this article: We note that the selection model (M2), or the LRT comparing models M1 (neutral) and M2 (selection), does not detect selection in three out of the six data sets in which positive selection is inferred by other models. These three data sets are the mitochondrial genes (D1), the βglobin gene (D2), and the HIV1 pol gene (D7). The lack of power of M2 (selection) appears to be due to the same reason in all three data sets, that is, the existence of a substantial proportion of sites in the gene with 0 < ω < 1 and the failure of M1 (neutral) to account for them. As a result, the extra category added in M2 is forced to account for sites with 0 < ω < 1, and the small proportion of positively selected sites with ω > 1 is incorporated into the class of neutral sites with ω = 1. In such cases, model M3 (discrete) appears much more powerful. Nevertheless, model M2 does detect positive selection in the three other data sets analyzed in this article. It was also found much more powerful than ad hoc pairwise comparisons or sliding window analysis in a recent study of positive selection in the HIV1 nef gene (Zanottoet al. 1999). Thus we suggest that the model be used in real data analysis, with its limitations borne in mind. Models M0–M3 all involve much less computation than other models implemented in this article and may all be successfully fitted to the data.
We also recommend LRTs based on the beta null model (M7). In particular, comparison between M7 (beta) and M8 (beta&ω) appears to provide a powerful test of positive selection. Models M9–M11 can also be compared with M7 to test for positive selection, but those models detect selection only if a substantial proportion of positively selected sites exist in the gene. They often suffer from convergence problems and seldom fit the data better than M8 despite their use of an additional parameter. In using models M9–M11 as well as other continuous distribution models not based on the beta distribution (M5, M6, M12, and M13), it is important to examine the discrete distributions to see whether there is any category with ω > 1. Although we did find data sets for which M12 and M13 gave good fits, M12 often caused serious convergence problems and M13 was even worse and hardly usable.
Data and program availability and program performance: The sequence alignments, the phylogenetic trees used, and extensive lists of positively selected sites and their posterior probabilities inferred under different models will be made available at the anonymous ftp site (ftp://abacus.gene.ucl.ac.uk/pub/YNGP2000/). Models developed in this article are implemented in the codeml program in the paml program package (Yang 1997), which is distributed at the web site http://abacus.gene.ucl.ac.uk/software/paml.html.
Acknowledgments
We thank Eddie Holmes and Walter Fitch for providing some of the data sets analyzed in this article. We thank Joe Bielawski and Simon Whelan for discussions and Daniel Haydon and two anonymous referees for comments. Z.Y. is supported by Biotechnology and Biological Sciences Research Council grant 31/G10434. R.N. is supported by National Science Foundation grant 9815367 to John Wakeley. N.G. is supported by a Wellcome Trust Fellowship in Biodiversity Research. A.K.P. is supported in part by grant 9701412 from the Danish Natural Science Research Council.
Footnotes

Communicating editor: W. Stephan
 Received October 26, 1999.
 Accepted January 17, 2000.
 Copyright © 2000 by the Genetics Society of America