| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
Corresponding author: Ziheng Yang, Department of Biology, 4 Stephenson Way, London NW1 2HE, United Kingdom., z.yang{at}ucl.ac.uk (E-mail)
Communicating editor: W. STEPHAN
| ABSTRACT |
|---|
Comparison of relative fixation rates of synonymous (silent) and nonsynonymous (amino acid-altering) mutations provides a means for understanding the mechanisms of molecular sequence evolution. The nonsynonymous/synonymous rate ratio (
=
) is an important indicator of selective pressure at the protein level, with
= 1 meaning neutral mutations,
< 1 purifying selection, and
> 1 diversifying positive selection. Amino acid sites in a protein are expected to be under different selective pressures and have different underlying
ratios. We develop models that account for heterogeneous
ratios among amino acid sites and apply them to phylogenetic analyses of protein-coding DNA sequences. These models are useful for testing for adaptive molecular evolution and identifying amino acid sites under diversifying selection. Ten data sets of genes from nuclear, mitochondrial, and viral genomes are analyzed to estimate the distributions of
among sites. In all data sets analyzed, the selective pressure indicated by the
ratio is found to be highly heterogeneous among sites. Previously unsuspected Darwinian selection is detected in several genes in which the average
ratio across sites is <1, but in which some sites are clearly under diversifying selection with
> 1. Genes undergoing positive selection include the ß-globin gene from vertebrates, mitochondrial protein-coding genes from hominoids, the hemagglutinin (HA) gene from human influenza virus A, and HIV-1 env, vif, and pol genes. Tests for the presence of positively selected sites and their subsequent identification appear quite robust to the specific distributional form assumed for
and can be achieved using any of several models we implement. However, we encountered difficulties in estimating the precise distribution of
among sites from real data sets.
COMPARISON of synonymous (silent) and nonsynonymous (amino acid-altering) substitution rates provides an important means for studying the mechanisms of DNA sequence evolution (![]()
![]()
![]()
![]()
=
), termed the "acceptance rate" by ![]()
measures the selective pressure at the protein level. An
> 1 means that nonsynonymous mutations offer fitness advantages to the protein (individual) and have higher fixation probabilities than synonymous mutations. This is our working definition of positive selection (adaptive molecular evolution) in this article.
The
ratio has almost always been calculated as an average over all codon (amino acid) sites in the gene and over the entire evolutionary time that separates the sequences. The criterion that this average
be >1 is a very stringent one for detecting positive selection (e.g., ![]()
![]()
![]()
close to 0) due to strong functional constraints. Furthermore, most proteins appear to be under purifying selection most of the time (![]()
ratio averaged over time and over sites will not be significantly >1 even if adaptive molecular evolution has occurred. For example, ![]()
An alternative approach is to examine the
ratio over a short evolutionary time (for example, along particular lineages in a phylogeny) or in functionally distinct regions of the gene. ![]()
![]()
ratio is >1 in a region of the major histocompatibility complex (MHC) that codes for the antigen recognition site, while it is <1 in other regions of the gene. When knowledge of functional domains of the protein is unavailable, or when only a few sites are expected to be undergoing positive selection, a promising approach is to devise statistical models that allow for heterogeneous
ratios among sites (![]()
![]()
ratios among sites. One, the "neutral" model, assumes the existence of two classes of sites: conserved sites, at which nonsynonymous mutations are deleterious and removed (
= 0), and completely neutral sites at which
= 1. A "selection" model adds a third class of sites with the underlying
ratio estimated from the data. These models appear too simple to capture the complexity of the substitution process of various proteins. In particular, the neutral model does not allow for sites with 0 <
< 1, such as sites at which nonsynonymous mutations are "slightly deleterious." The selection model with one additional category cannot account for both sites with 0 <
< 1 and sites with
> 1.
In this article, we implement a number of models (statistical distributions) for heterogeneous
ratios among sites. We have two major objectives in fitting those models. The first is to test for the presence of positively selected sites (sites with
> 1) and to identify such sites along the gene. We find that this type of analysis seems insensitive to the exact distribution assumed for
. We note that this test of molecular adaptation is still conservative, as it requires that positively selected sites be under diversifying selection along all lineages on the phylogeny. If positive selection affects only a few lineages, and purifying selection dominates during the rest of the evolutionary time, our test will fail.
Our second objective is to understand what distributions best describe the heterogeneous
ratios among sites in real data. This appears to be a much harder task. Nevertheless, the distribution of
is closely related to the fitness distribution of new mutations, and knowledge of it is important for testing competing population genetics models of molecular evolution (![]()
![]()
![]()
Ten data sets of genes from nuclear, mitochondrial, and viral genomes are analyzed to examine the fit of the models and to accumulate empirical knowledge of the
distribution among sites. These analyses reveal diversifying selection in several genes for which it was not previously suspected.
| DATA |
|---|
Ten data sets of protein-coding genes are analyzed. Table 1 lists the number of sequences (s) and the sequence length (n) for each data set. The transition/transversion rate ratio (
), the (average) nonsynonymous/synonymous rate ratio (
), and the tree length (sum of all branch lengths) are estimated under the simple model of one
ratio for all sites (![]()
|
D1: Mitochondrial protein-coding genes from the hominoids:
The 12 protein-coding genes on the H-strand of the mitochondrial genome are concatenated and analyzed as one data set. All the 12 proteins are transmembrane proteins and appear to have similar substitution patterns (![]()
![]()
at ~0.04.
D2: Vertebrate ß-globin genes:
The ß-globin gene of 17 vertebrate species (human, tarsier, bush baby, hare, rabbit, cow, sheep, pig, elephant seal, rat, mouse, hamster, opossum, duck, chicken, African clawed frog, and western clawed frog) were extracted from the EMBL and GenBank databases. The sequences were aligned by hand, and the alignment appeared straightforward.
D3: Drosophila adh gene:
The data set contains alcohol dehydrogenase (adh) gene sequences from 23 species of Drosophila. The data set was described in ![]()
D4: Flavivirus E-glycoprotein gene:
Twenty-two dengue virus E-glycoprotein gene sequences from the alignment published by ![]()
ratio of ~0.05).
D5: Human influenza virus type A HA gene:
The data set is a subset of the HA1 domain of the hemagglutinin (HA) gene of human influenza viruses A analyzed by ![]()
D6: HIV-1 vif gene:
A total of 29 isolates from subtype B are used. The HIV-1 vif gene encodes an accessory protein that is believed to be essential for pathogenic infection, but its exact function is not known (e.g., ![]()
D7: HIV-1 pol gene:
A total of 23 isolates from subtype B are used. The HIV-1 pol gene encodes for three proteins: the reverse transcriptase responsible for reverse transcribing the viral RNA into DNA; the polymerase responsible for cleaving the pol and gag precursor proteins into their final products; and the endonuclease (integrase) responsible for cleaving the host DNA so that the HIV DNA can be inserted. All three proteins are essential for virus replication.
D8: Japanese encephalitis env gene:
Sequences from 23 isolates are used.
D9: Tick-borne flavivirus NS-5 gene:
The data are from ![]()
= 0.025).
D10: HIV-1 env gene V3 region:
HIV-1 env gene V3 region from 13 HIV-1 isolates with a known transmission history was published by ![]()
The alignment for the influenza virus HA gene (D5) was kindly provided by Walter Fitch and those for data sets D6D9 by Edward Holmes.
| THEORY |
|---|
Markov model of codon substitution:
Suppose there are n codons (sites) in the sequence. Let the data at site h (h = 1, 2, ... , n) be xh; that is, xh is a vector of codons from different sequences at that codon site. Models considered in this article allow the dN/dS (
) ratio to vary among sites. Let
(h) be the ratio for site h. The codon substitution model specifies the relative instantaneous substitution rate from codon i to j at site h (h = 1, 2, ... , n) as
![]() |
(1) |
Parameter
is the transition/transversion rate ratio and
j is the equilibrium frequency of codon j. Different assumptions can be made concerning
j (![]()
![]()
![]()
![]()
Likelihood calculation under a model of heterogeneous
ratios among sites:
Using one
parameter for each site would lead to too many parameters in the model. Instead we use a statistical distribution to account for heterogeneous
ratios among sites (![]()
ratios given as
![]() |
(2) |
We want to calculate the probability of observing data xh at site h. Note that given the
ratio for the site, the conditional probability of the data, P(xh|
), can be calculated for any phylogenetic tree and branch lengths using ![]()
![]()
![]()
:
![]() |
(3) |
We assume that the substitution process is independent among codon sites, and then the log-likelihood is a sum over all sites in the sequence
![]() |
(4) |
After maximum-likelihood (ML) estimates of parameters are obtained, an empirical Bayes approach can be used to infer which class the site is most likely from (![]()
(h) =
k) is
![]() |
(5) |
The class k that maximizes the posterior probability is the most likely class for the site. When the
values (
k) for some site classes are >1, this approach can be used to identify sites under positive selection. The posterior probabilities corresponding to classes with
> 1 may be summed up to give a probability P(
> 1) for each site. Sites at which this probability is larger than a threshold value (say, 50, 95, or 99%) may be identified as potentially under positive selection.
Continuous distributions:
In theory, a continuous distribution for
, say, f(
), can be used in the likelihood calculation. The sum in Equation 3 will then be replaced by an integral. However, we have not found a feasible way to perform this calculation and instead resort to the use of a discrete distribution as an approximation, as in ![]()
within each category to represent the distribution of
ratios within that category. The p's and
's in Equation 2 are then calculated as functions of parameters in the continuous
distribution. Let F(
) be the cumulative distribution function (CDF) of the
distribution f(
); F(
) = 
0f(x)dx. Let F-1(·) be the inverse CDF; that is, if p = F(
), then
= F-1(p). We then have pk =
and
k = F-1(
) for k = 0, 1, ... , K - 1. The CDF of the
distribution can be calculated in a straightforward manner for all distributions we consider, and we use a linear-search algorithm to obtain the inverse CDF.
The probability of the data xh is then approximated by
![]() |
(6) |
This may be considered a crude way of doing numerical integration.
Statistical distributions implemented:
The models considered in this article are summarized in Table 2. The codes given are as used in the paml program (![]()
, and base frequencies at the three codon positions. These parameters are not listed in Table 2. Model 0 (M0) assumes one
for all sites (![]()
0 = 0 and a proportion p1 = 1 - p0 of neutral sites with
1 = 1. The selection model (M2) adds an additional class of sites with frequency p2 = 1 - p0 - p1 and with
2 estimated from the data. M1 and M2 were implemented by ![]()
distribution that are likely to occur in real data. The details follow.
|
M3 (discrete):
The discrete model uses an unconstrained discrete distribution to model heterogeneous
ratios among sites (Equation 2). The
k values are arranged in increasing order for unique identification. The model with K classes involves K - 1 frequency parameters and K parameters
k. In this article, K = 3 classes are used.
M4 (freqs):
This model fixes
at several prespecified values and estimates the corresponding frequencies for the site classes. The model with K classes involves K - 1 free frequency parameters pk. We use K = 5, with
k fixed at 0, 1/3, 2/3, 1, and 3 for k = 0, 1, ... , 4.
M5 (gamma):
This model assumes a simple gamma distribution
(
, ß) for
among sites. The density function is f(
) =
for
> 0. The CDF, known as the incomplete gamma function ratio, is
![]() |
(7) |
This is calculated using the algorithm of ![]()
M6 (2gamma):
This model uses a mixture of two gamma distributions,
(
0, ß0) and
(
1, ß1), in the proportions p0 and p1 = 1 - p0. The mean of the second gamma distribution is fixed at 1 (ß1 =
1), so that the model has four parameters. The CDF of the mixture distribution can be calculated from its gamma distribution components.
M7 (beta):
The beta distribution
(p, q) has density
![]() |
(8) |
where B(p, q) is the beta function. The beta distribution can take different shapes (e.g., L, J, U, and inverted-U shapes) in the interval (0, 1). The CDF of the distribution, FB(
; p, q), is the incomplete beta function ratio, calculated using the algorithm of ![]()
> 1). The following four models (M8M11) add an extra component, mainly to account for the possible occurrence of positively selected sites.
M8 (beta&
):
This model adds one extra class of sites to the beta model. A proportion p0 of sites have
drawn from the beta distribution
(p, q), and the remaining sites (proportion p1 = 1 - p0) have the same ratio
1. This model can be compared with the beta model (M7) to test for the presence of positive sites using a likelihood-ratio test (LRT; see below).
M9 (beta&gamma):
This model uses a mixture of
(p, q) and
(
, ß), in proportions p0 and p1 = 1 - p0. The CDF of the mixture distribution is an average of the beta and gamma CDFs.
M10 (beta&gamma+1):
This model uses a mixture of a beta and a gamma. However, the gamma is shifted to the right by one unit, so that the gamma distribution accounts for positively selected sites (
> 1) only. A proportion p0 of sites have
from
(p, q) and a proportion p1 = 1 - p0 of sites have
= 1 + x, where x ~
(
, ß). The CDF of
is thus
![]() |
(9) |
M11 (beta&normal>1):
This model uses a mixture of a beta distribution
(p, q) and a normal distribution
(µ,
2), which is left-truncated at 1. Like the shifted gamma in model M10, the truncated normal accounts for positively selected sites only. The CDF for the truncated normal is 1 -
((µ -
)/
)/
((µ - 1)/
), where
(·) is the familiar CDF of
(0, 1). Therefore the CDF of the mixed distribution is
![]() |
(10) |
M12 (0&2normal>0):
This model assumes a proportion p0 of sites with
= 0 and a proportion (1 - p0) from a mixture of two normal distributions. This mixture is p1 from
(1,
21), and (1 - p1) from
(µ2,
21), truncated at
= 0 to disallow values of
< 0. Consequently, the
distribution for M12 takes value 0 with probability p0, is drawn from a truncated normal distribution centered at
= 1 with probability (1 - p0)p1, and is drawn from a truncated normal distribution centered at
= µ2 with probability (1 - p0)(1 - p1). The CDF of the normal
(µ,
2) truncated at 0 is 1 -
((µ -
)/
)/
(µ/
). The CDF for the mixture of the two truncated normal distributions is
![]() |
(11) |
M13 (3normal>0):
This model uses a mixture of three normal distributions truncated at
= 0: p0 from
(0,
20), p1 from
(1,
21), and p2 = 1 - p0 - p1 from
(µ2,
22). The CDF is
![]() |
(12) |
M13 involves six parameters and can represent a smooth distribution with as many as three modes. However, the use of many parameters in M13 was found to cause problems for the optimization algorithm.
We use 10 categories (K = 10) to approximate the continuous part of the
distribution, so that 11 categories are actually used in models M8 and M12. An example is shown in Fig 1, where model M6 (2gamma) is discretized.
|
The continuous mixture distributions involving the beta (M9, M10, and M11) are not smooth at
= 1. Furthermore, the discretized versions of those models may not be powerful to detect positive selection. Because we use 10 categories in the discrete distribution, each of proportion 10%, there may not be any category with
> 1, if the proportion of sites in the data under positive selection is substantially <10%. In such a case, the model may not suggest positive selection, even if the parameter estimates suggest a component of positively selected sites (that is, even if p1 > 0). It appears important to examine the inferred discrete distribution to determine whether the model suggests sites under selection, as is the case with other continuous mixture models (M5, M6, M12, and M13).
Phylogenetic tree topologies are obtained by ML search using simple nucleotide substitution models. For the purpose of this study, minor differences in the phylogeny do not make any significant difference in estimation of parameters or identification of positively selected sites (see below). The codon frequencies are estimated using the observed nucleotide frequencies at the three codon positions, while other parameters are estimated by numerical maximization of the likelihood function. Independent computer programs were written by Z. Yang and R. Nielsen for error checking.
Likelihood-ratio tests to compare models:
The LRT may be used to compare models implemented in this article. When two models are nested, twice the log-likelihood difference will be compared with a
2 distribution with the degrees of freedom
equal to the difference in the number of parameters between the two models. For example, models M2 (selection) and M3 (discrete) are more general than M1 (neutral) and can be compared with M1. When
2 > 1 in M2 or M3, this becomes a test for the presence of positively selected sites. Similarly, M7 (beta) is a special case of models M8 (beta&
), M9 (beta&gamma), M10 (beta&gamma+1), and M11 (beta&normal>1). Any of those more general models can be compared with M7. When the more general models indicate the presence of sites with
> 1, the comparison constitutes an LRT of positive selection. We note that the null hypothesis in those tests, M1 (neutral) or M7 (beta), corresponds to fixing one of the parameters at the boundary of the parameter space of the alternative hypothesis; that is, the proportion for the component of positively selected sites (p2 in M2 and M3 and p1 in M8M11) is set to zero in the null hypothesis. In such cases, use of the simple
2 distribution is not reliable (![]()
Although the continuous version of model M8 is nested within each of the continuous versions of models M9M11, this is not the case with the discretized versions used in this article. Those models cannot be compared using a
2 approximation to the test statistic, although in theory the null distribution can be generated by Monte Carlo simulation (![]()
![]()
distributions and also rendering formal statistical tests among those models unnecessary.
| RESULTS |
|---|
Likelihood values and parameter estimates obtained from different data sets are listed in Table 3 Table 4 Table 5 Table 6 Table 7 Table 8 Table 9 Table 10 Table 11 Table 12. Estimates of branch lengths are not shown, although their sum (the tree length) under model M0 is given in Table 1. Estimates of the transition/transversion rate ratio (
) are quite homogeneous among models in each data set and thus are not shown in Table 3 Table 4 Table 5 Table 6 Table 7 Table 8 Table 9 Table 10 Table 11 Table 12. For example,
ranges from 14.3 to 17.0 among the 14 models for the mitochondrial data set (D1), while for the ß-globin genes (D2),
ranges from 2.0 to 2.4. Estimates of branch lengths and of
from the models of heterogeneous
among sites (M1M13) are usually slightly larger than estimates from the model of one
for all sites (M0). This is probably due to the insufficient correction for multiple nonsynonymous substitutions at the same site by the one-ratio model (M0).
|
|
|
|
|
|
|
|
|
|
Our focus is the distribution of
among sites, and we are interested in the shape of the
distribution, as indicated by parameter estimates under different models and the fit of the models to data measured by their log-likelihood values. We discuss tests for the presence of positively selected sites and, when such sites exist, their identification. In the following, we describe results obtained from each data set. The general patterns are summarized in the DISCUSSION.
D1: Mitochondrial protein-coding genes from hominoids (Table 3):
The mitochondrial genes are highly conserved. The average
ratio ranges from 0.04 to 0.05 among all models except for M1 (neutral) and M4 (freqs), suggesting that a nonsynonymous mutation has only 45% as much chance as a synonymous mutation of being fixed in the population. Models M1 and M4 do not fit the data well and gave much larger and probably unreliable estimates of the average
ratio. The selective pressure indicated by the
ratio is highly variable among sites. For example, the model of one
ratio for all sites (M0) is rejected by a big margin when compared with model M3 (discrete), which allows for three classes of sites with different
ratios. The LRT statistic for this comparison is 2
= 2 x [-29690.55 - (-29967.86)] = 554.62, much greater than critical values from a
2 distribution with d.f. = 4.
Parameter estimates from models such as M5 (gamma) and M7 (beta) suggest highly skewed L shapes for the
distribution, with most amino acids highly conserved or almost invariable. The strictly neutral model (M1) fits the data worse than the one-ratio model (M0). The reason appears to be that M1 does not account for sites with positive but very small
ratios. For similar reasons, M4 (freqs) fits the data poorly, as it does not account for highly conserved sites with 0 <
< 1/3 (recall that our implementation of this model fixes
at 0, 1/3, 2/3, 1, and 3). Models M9M11, which add a continuous component to the beta distribution, do not fit the data as well as model M8, which adds a class of sites with a constant
ratio.
The discrete model (M3) suggests a small proportion of sites (p1 = 0.7%) under positive selection, with
2 = 1.43. This model fits the data significantly better than M0 (one-ratio), as mentioned above, or M1 (neutral). Similarly, model M8 (beta&
) suggests a small proportion of sites (p1 = 0.6%) under diversifying selection with
1 = 1.46. The LRT statistic for comparing M7 (beta) and M8 (beta&
) is 2
= 2 x [-29690.71 - (-29699.38)] = 18.54. The P value for this comparison is 0.9 x 10-4, in comparison with the
2 distribution with d.f. = 2. M7 is thus rejected in favor of M8. However, the selection model (M2) does not detect positive selection in this data set. This is apparently due to the fact that the strict neutral model (M1) on which it is based is unrealistic, and the extra category added in M2 optimally accounts for deleterious mutations (with
2 = 0.11). Models M10 (beta&gamma+1) and M11 (beta&normal>1) have the same likelihood value and the estimates of parameters under those models appear to suggest presence (at ~0.5%) of sites under positive selection. However, the discretized distributions do not have any category with
> 1 and do not suggest positive selection; neither is the fit of those two models significantly better than the beta model (M7); 2
= 4.98 with P = 0.17 with d.f. = 3. The other continuous distributions, such as M5 (gamma), M6 (2gamma), M12 (0&2normal>1), and M13 (3normal>1), do not have site classes with
> 1 either. However, this failure is due to those models' insistence on having 10% of sites in a category with
> 1. The small proportion of positively selected sites is detected only when the proportion is a free parameter estimated from the data, that is, under models M3 (discrete) and M8 (beta&
). In sum, the models provide consistent evidence for the presence of a small proportion of positively selected sites in the mitochondrial genes.
The Bayes approach (Equation 5) can be used to identify sites potentially under diversifying selection. Model M3 (discrete) suggested 14 positively selected sites with P(
> 1) > 0.5, while model M8 (beta&
) located only 12, all included in the 14 found by model M3. If the stricter 95% threshold is used, only 2 sites are identified: one in ATP6 (with the posterior probabilities 0.96 under M3 and 0.94 under M8) and another in ND5 (with the probabilities 0.98 under M3 and 0.96 under M8). Overall, although the two models are constructed very differently, they produced very similar results concerning the inference of the positively selected sites.
D2: ß-globin genes from vertebrates (Table 4):
The ß-globin genes are moderately conserved, with an average
ratio between 0.31 and 0.36 among all but the worst-fitting models. Parameter estimates from models such as M5 (gamma), M7 (beta), and M9 (beta&gamma) suggest that the distribution of
over sites is L-shaped, with most sites highly conserved. Although better than M0, the strictly neutral model (M1) does not fit the data well. Neither did model M4 (freqs). The continuous mixture distributions (M8M13) fit the data almost equally well.
Parameter estimates and LRTs suggest presence of sites under diversifying selection (Table 4). For example, the discrete model (M3) indicates ~7.9% of sites are under positive selection with
2 = 1.69. M3 fits the data significantly better than any of models M0 (one-ratio), M1 (neutral), and M2 (selection). Similarly to the case of the mitochondrial data set, the selection model (M2) does not detect positive selection, as the strict neutral model it is based on does not allow for sites with 0 <
< 1. The beta distribution (M7) is rejected when compared with any of M8 (beta&
), M9 (beta&gamma), M10 (beta&gamma+1), or M11 (beta&normal>1). The LRT statistic is 2
= 22.18 between M7 and M8 (with P = 0.15 x 10-4, d.f. = 2) and 20.48 between M7 and M10 or M11 (with P = 0.13 x 10-3, d.f. = 3). The discretized versions of models M9, M10, and M11 have one category (10%) with an
ratio of ~1.7. Except for model M2 (selection), all other models that allow for positively selected sites (M3M6 and M8M13) do suggest existence of such sites.
Positively selected sites are identified using the Bayes approach (Equation 5). The different models give very similar lists of positively selected sites, although the posterior probabilities vary somewhat among models. The discrete model (M3) identified more sites under positive selection than other models. At the 99% level, eight sites are identified: 7, 42, 48, 50, 54, 67, 85, and 123. Those sites are also the top eight under model M8 (beta&
); but at the 99% level, M8 located only sites 7, 50, and 123. To test whether the choice of tree topology has any effect, we also used the star phylogeny to analyze the data under model M3 (discrete). That analysis identified the following sites at the 99% level: 7, 42, 48, 50, 54, 67, 74, 85, 114, 123, 124, and 128. This list is very similar to those obtained using the best tree.
D3: Drosophila adh gene (Table 5):
The average estimate of
is ~0.11 for all the best-fitting models. Parameter estimates from models such as M5 (gamma) and M7 (beta) suggest that the
distribution has a highly skewed L shape, with most sites highly conserved with very small
ratios. Model M3 (discrete) fits the data significantly better than models M0 (one-ratio), M1 (neutral), or M2 (selection), but none of the models suggest presence of positively selected sites. The beta model (M7) provides a good fit to the data, although slightly worse than M3 (discrete), and adding an extra component to the beta (as in models M811) leads to no significant improvement in the log-likelihood score. Other continuous mixture models (M5, M6, M8, and M12M13) do not suggest positive selection either.
We note that previous studies (e.g., ![]()
D4: Flavivirus E-glycoprotein gene (Table 6):
The average estimate of
is ~0.06 for all the best-fitting models. The
ratios are highly variable among sites and the
distribution appears to have a highly skewed L shape, with most sites highly conserved with very small
ratios. The discrete model (M3) fits the data much better than M0 (one-ratio), M1 (neutral), or M2 (selection), but none of the models suggest the existence of sites under diversifying selection. The beta distribution (M7) fits the data better than M3 (discrete). Since M7 also has three fewer parameters than M3, it is the preferred model for this data set. Although models M12 and M13 have marginally higher likelihood values than M7, this improvement is not more than expected given their greater numbers of parameters. Adding an additional component to the beta model to allow for positively selected sites (models M8M11) provides no significant improvement to the model's fit to data.
D5: Human influenza A virus HA gene HA1 domain (Table 7):
The average
ranges from 0.39 to 0.41 among all models except for M1 (neutral) and M7 (beta), which do not allow for positively selected sites, fit the data badly, and give smaller estimates of
. The average acceptance rate is <1, indicating that, on average, purifying selection dominates the evolution of the gene. The one-ratio model (M0) is easily rejected when compared with all other models, which allow the
ratio to vary among sites. Distributions of
among sites estimated under M5 (gamma) and M6 (2gamma) are L-shaped, with heavy tails. The beta distribution (M7) has an interesting U shape, possibly because the model is forced to attempt to account for sites with
> 1. The discrete model (M3) fits the data as well as or better than all other models considered; M8 has nearly as high a likelihood value, and uses one fewer parameter.
Models that allow for positively selected sites, that is, M2 (selection), M3 (discrete), M5 (gamma), M6 (2gamma), and M8M13, all suggest presence of positively selected sites. For example, the selection model (M2) suggests ~1.1% of sites under positive selection with
2 = 5.8. Model M3 (discrete) suggests a large proportion of sites (25.1%) under weak diversifying selection with
1 = 1.28 and a small proportion of sites (0.8%) under strong diversifying selection with
2 = 6.90. Note that the selection model (M2) suggests a large proportion of neutral sites with
1 = 1, and the estimates from M2 and M3 are not very different. Both models have significantly higher likelihood values than models M0 and M1, providing strong evidence for adaptive evolution. Similarly, M8 (beta&
) suggests ~1.3% of sites under positive selection with
2 = 5.2. The LRT statistic for comparing M7 (beta) and M8 (beta&
) is 2
= 2 x 5.43 = 10.86, >
21% = 9.21 with d.f. = 2. Models M10 (beta&gamma+1) and M11 (beta&normal>1) have the same likelihood value, and both models fit the data significantly better than the beta model (M7); the test statistic is 2
= 2 x [-3079.25 - (-3083.63)] = 8.76, with P = 0.033 with d.f. = 3. In addition to models M9M11, which have components specifically designed to allow for positively selected sites, models M5 (gamma) and M6 (2gamma) also have categories with
> 1 when discretized.
Amino acid sites 135 and 226 are identified to be under positive selection at the 95% level by all models that allow for positive selection. Model M3 (discrete) suggested many more sites because the parameter estimates under this model suggest a large proportion of weakly selected sites. At the 50% level, models M5, M6, and M9M13 suggested the same 23 positively selected sites: 15, 94, 121, 124, 133, 135, 137, 138, 155, 156, 157, 159, 163, 174, 186, 189, 196, 197, 219, 226, 246, 262, and 273.
D6: HIV-1 vif gene (Table 8):
The pattern for this data set is rather similar to that of the influenza virus HA gene (D5; see Table 7). The average
ratio over all sites ranges from 0.6 to 0.9 among models. The one-ratio model (M0) is easily rejected when compared with any of the more-general models that allow for variable
ratios among sites. Model M1 (neutral) also fits the data set much better than M0. The gamma (M5) and double gamma (M6) models suggested L-shaped distributions for
, with heavy tails. The beta model (M7) suggests a U shape, possibly because the underlying
ratio at some sites is >1. Model M3 (discrete) fits the data as well as or better than all other models considered.
All models that allow for positively selected sites do suggest existence of such sites in this gene. For example, the selection model (M2) suggests ~8.5% of sites under positive selection with
2 = 4.2. Model M3 (discrete) suggests a large proportion of sites (32.5%) under weak diversifying selection with
1 = 1.21 and a small proportion of sites (7%) under strong selective pressure with
2 = 4.0. Both models M2 and M3 have significantly higher likelihood values than models M0 (one-ratio) or M1 (neutral). Similarly, M8 (beta&
) suggests that ~9% of sites are under positive selection with
= 3.4. The LRT statistic for comparing M7 (beta) and M8 (beta&
) is 2
= 2 x [(-3370.66) - (-3400.45)] = 59.58, >>
21% = 9.21 with d.f. = 2. LRTs comparing the beta model (M7) with any of models M9 (beta&gamma), M10 (beta&gamma+1), and M11 (beta&normal>1) give similar results. In sum, all the models provide consistent and statistically significant evidence for the existence of positively selected sites in this gene.
We plotted the posterior probability distributions, calculated using Equation 5, for sites along the HIV-1 vif gene. This was done for models M2 (selection), M3 (discrete), M8 (beta&
), and M9 (beta&gamma), with results for M3 and M9 shown in Fig 2. Posterior probabilities under M2 and M3 are very similar, but as one may expect, many sites included in the neutral class under M2 are included in the weakly selected class under M3 (results not shown). Plots of M8 and M9 are virtually identical (results not shown). At the 99% level, model M9 identified 10 positively selected sites (31, 33, 39, 63, 92, 101, 109, 122, 127, and 167). At the same level, model M2 identified 5 of the sites only: 31, 39, 122, 127, and 167, while sites 33, 63, 92, 101, and 109 are included at the 95% level. The two models used in Fig 2, M3 and M9, are constructed very differently, and yet the posterior distributions under the two models are highly similar.
|
D7: HIV-1 pol genes (Table 9):
The average
ratio over all sites ranges from 0.20 to 0.27 among models except for M1 (neutral) and M7 (beta), indicating relatively strong purifying selection. The strict neutral model (M1) and the beta model (M7) give lower estimates of the average
ratio, as they do not allow for sites with
> 1. M1 fits the data much better than the one-ratio model (M0), but is much worse than M7 (beta). There are clearly sites with 0 <
< 1. Parameter estimates under models such as M5 (gamma) and M6 (2gamma) suggest L-shaped distributions for
, with heavy tails. The beta model (M7) suggests a U shape, and the peak at
= 1 is probably caused by sites with
> 1 in the data. The discrete model (M3) fits the data as well as or better than all other models considered. Simpler nested models (M0, M1, and M2) are all rejected when compared with M3.
The selection model (M2) does not suggest presence of positively selected sites. Similar to the cases of the mitochondrial (D1) and ß-globin (D2) genes, this failure appears to be due to the unrealistic nature of the neutral null model (M1), which does not account for sites with 0 <
< 1. Model M3 (discrete) suggests that a small proportion of sites (1.9%) are under strong diversifying selection with
2 = 4.7. This model fits the data significantly better than models M0, M1, and M2. For example, the LRT statistic for the comparison between M0 and M3 is 2
= 2 x [(-9363.57) - (-9619.30)] = 2 x 255.73 = 511.46, >>
21% = 13.27 with d.f. = 4. Similarly, model M8 (beta&
) suggests that ~2.5% of sites are under strong positive selection with
= 4.1. The LRT statistic for comparing M7 (beta) and M8 (beta&
) is 2
= 2 x [(-9365.88) - (-9405.74)] = 2 x 39.86 = 79.72, >>
21% = 9.21 with d.f. = 2. Models M9M11 all have the same log-likelihood value, substantially lower than that for M8 (beta&
). Nevertheless, those models are significantly better than M7 (beta). The discretized versions of models M9M11 all have a category of sites (10%) with
1.8. Apart from M2, which fits the data relatively poorly, all the models that allow for positively selected sites gave significant evidence for the existence of such sites. These results provide strong support for adaptive evolution in the HIV-1 pol gene.
All models that allow for positive selection (M3M6 and M8M13) pinpoint similar sets of amino acid sites as targets of diversifying selection. For example, at the 99% level, M8 (beta&
) identified the following sites: 67, 347, 478, 568, 761, and 779, while at the 95% level, four additional sites are identified by that model (sites 3, 14, 41, and 379).
![]()
![]()
![]()
![]()
D8: Japanese encephalitis env gene (Table 10):
This gene is under strong purifying selection, as the average
ratio over sites is ~0.05 for all models except for M1 (neutral) and M4 (freqs). Models M1 and M4 gave much larger estimates (0.17 and 0.07, respectively), but these do not appear reliable because the two models do not fit the data well. M1 is even poorer than the one-ratio model (M0). The discrete model (M3) suggests a small proportion (0.3%) of nearly neutral sites with
2 = 1.04, and there is no evidence for positive selection. Models such as M5 (gamma), M6 (2gamma), and M7 (beta) suggest highly skewed L-shaped distributions for
, with most sites under strong selective pressure (with
close to 0) and with very little probability density for the region
> 1. The discrete model (M3) fits the data as well as or better than all other models considered. Model 8 (beta&
) fits the data better (although not significantly) than M7 (beta). However, the estimated
for the additional category of sites is <1. Furthermore, the discretized versions of models M9M13 do not have any category with
> 1.
D9: Tick-borne flavivirus NS-5 gene (Table 11):
This gene is the most conserved among the 10 genes analyzed in this article. The average
ratio over all sites ranges from 0.02 to 0.03 among models, except for M1 (neutral) and M4 (freqs). The latter two models produced larger but unreliable estimates as they fit the data very poorly. The strict neutral model (M1) fits the data even more poorly than the one-ratio model (M0). Although the two models have the same number of parameters and are not nested, the log-likelihood difference (
= 582.48) is huge. Neither the selection (M2) nor discrete (M3) model indicates presence of positively selected sites. The beta mixture models (M8M11) do not fit the data any better than the simple beta model (M7). Furthermore, none of the 10 categories in the discrete distributions used to approximate the continuous beta mixture models (M9M11) have an
ratio >1. Models such as M5 (gamma), M6 (2gamma), and beta (M7) all suggest highly skewed L-shaped distributions for
. The discrete model (M3) fits the data as well as the best of other models.
D10: HIV-1 env gene V3 region (Table 12):
The
ratio averaged over all sites ranges from 0.9 to 1.2 among models, except for models M1 (neutral) and M7 (beta). The latter two models give lower and unreliable estimates as they do not account for positively selected sites. This gene region has the highest overall
ratios among the 10 data sets analyzed in this article. The HIV-1 env gene and in particular the V3 region are well known to be under diversifying selection. The strict neutral model (M1) fits the data much better than the model of one
for all sites (M0). The discrete model (M3) fits the data about as well as the best of other models considered.
All models that allow for positive selection do suggest a substantial proportion (1870%) of positively selected sites. Models such as M5 (gamma) and M6 (2gamma) suggest L-shaped distributions for
with heavy tails. The beta model (M7) suggests a U shape, as the model uses the density at
~ 1 to account for sites with
> 1. Many amino acids are clearly under diversifying selection. For example, the discrete model (M3) suggests ~42% of sites are under relatively weak positive selection with
1 = 1.8 and a further 4.6% of sites are under strong positive selection with
2 = 7.1. Similarly, M8 (beta&
) suggests that ~20% of sites are under positive selection with
1 = 3.5. Note that the differences between the models may not be as large as they may seem, as we expect it to be difficult to distinguish a small proportion of strongly selected sites from a large proportion of weakly selected sites. The beta distribution in M8 (beta&
) is U-shaped with a high proportion of sites with
~ 1. The LRT statistic for comparing M7 (beta) and M8 (beta&
) is 2
= 2 x [(-1106.39) - (-1115.40)] = 2 x 9.01 = 18.02, with P = 0.00012 compared with the
2 distribution with d.f. = 2. The beta mixture models (M9M11) fit the data slightly better than M8 (beta&
), and significantly better than the beta model (M7), again suggesting the operation of diversifying selection at some sites.
All models that allow for sites under positive selection identified sites 28, 66, and 87 with high posterior probability supports (
99%). Model M12 (0&2normal>1) included sites 26 and 51 as well at the 99% level. At the 95% level, M12 suggested six additional sites: 22, 24, 68, 69, 76, and 83.
| DISCUSSION |
|---|
Effects of tree topology:
Maximum-likelihood estimation under models in this article relies on the phylogenetic relationship among the sequences. To examine the effect of the phylogeny on the analysis, we used six candidate trees to fit all 14 models (M0M13) to the vertebrate ß-globin genes (D2). The six trees were either inferred from the ß-globin data or based on conventional wisdom. The best tree according to the one-ratio model (M0) was used to obtain results of Table 4. Results under another tree (the worst of the six trees under model M0) are presented for four models (M0, M3, M7, and M8) in Table 13. The estimates under this tree are highly similar to estimates presented in Table 4. LRTs of positive selection lead to the same conclusions no matter which of the two (or six) trees is used. The inference of sites under positive selection does not seem to be sensitive to the assumed tree topology either. As mentioned before, use of even the star tree generated lists of positively selected sites for the vertebrate ß-globin genes that are very similar to those obtained under the best tree. A similar analysis was performed on the HIV vif data set using two candidate trees. Parameter estimates, LRTs, and posterior probability calculations are all highly similar between the candidate trees (results not shown). While the correct tree should obviously be used if it is known, those results suggest that a reasonably good phylogeny may be sufficient for estimating parameters in the
distribution and for performing LRTs of positive selection.
|
Computational and theoretical problems:
Models implemented in this article appear very useful in testing the existence of positively selected amino acid sites and in identifying such sites when they exist. The different models also appear to produce consistent and convincing results. However, we encountered numerous practical problems. The continuous mixture models M9M13, and especially the parameter-rich model M13 (3normal>0), were found to converge to ML estimates very slowly during the iteration. In some data-model combinations, the likelihood value was also found to be sensitive to the number of categories (K) used in the discrete approximation. While the likelihood values reported in Table 3 Table 4 Table 5 Table 6 Table 7 Table 8 Table 9 Table 10 Table 11 Table 12 are reliable, parameter estimates under some models may not be. In some cases, different values of the parameters gave virtually the same likelihood values, and the likelihood surface appears to be nearly flat. Those computational difficulties appear to a large extent to be caused by our use of the discrete distributions to approximate the continuous ones. Two different continuous distributions (or two sets of parameters for the same continuous distribution) may look very similar after they are discretized. We note that distinguishing statistical distributions is almost always a difficult task, but the discretization appears to have further reduced the power of the method.
We also implemented a different discretization scheme, in which K1 = 6 categories are used for the region
< 1 and K2 = 4 categories are used for the region
1. This scheme appears to perform better for data sets with a small proportion of positively selected sites (such as in D7), but worse when the data do not contain positively selected sites (such as in D9). When a large proportion of sites are under diversifying selection (such as in D10), both schemes appear to perform well. Since the new K1-K2 scheme is not consistently better than the old scheme of K = 10 equal-probability categories, we have not used it for analysis in this article. Future work to devise more efficient approximations to the integral of Equation 6 is highly desirable and may restore some power to distinguish those continuous distributions we implemented.
Models implemented in this article assume that the selective pressure indicated by the
ratio is identical among evolutionary lineages. They also assume that the nonsynonymous substitution rate is independent of the amino acids being interchanged; that is, at a positively selected site, all amino acids are assumed to be acceptable and all amino acid changes are assumed to be advantageous. By the criterion we use, a site will be considered to be under positive selection only if the nonsynonymous rate, averaged over all lineages in the phylogeny and over all possible amino acid-replacement mutations at the site, is higher than the synonymous rate. Thus LRTs of positive selection implemented in this article are conservative. It appears desirable to develop models that allow the selective pressure to vary both among lineages and among sites; such models may be much more powerful for detecting adaptive molecular evolution than those implemented in this article. In this regard, it should be noted that our models, conservative as they are, identified positive selection in several genes not previously suspected of being under positive selection.
Another assumption made in this article is a constant mutation (synonymous) rate among sites. This assumption may not always be realistic. However, we suggest that synonymous rate variation is unlikely to lead to false conclusions of adaptive evolution by the methods of this article. For example, at mutational hot spots, both synonymous and nonsynonymous rates will be elevated, and if nonsynonymous mutations do not offer a selective advantage, the underlying
ratio at such sites will not be >1. Similarly, in our formulation, selection at silent sites for translational efficiency, which has been nicely demonstrated in Drosophila by