Detecting Selection in Noncoding Regions of Nucleotide Sequences
Wendy S. W. Wong, Rasmus Nielsen

Abstract

We present a maximum-likelihood method for examining the selection pressure and detecting positive selection in noncoding regions using multiple aligned DNA sequences. The rate of substitution in noncoding regions relative to the rate of synonymous substitution in coding regions is modeled by a parameter ζ. When a site in a noncoding region is evolving neutrally ζ = 1, while ζ > 1 indicates the action of positive selection, and ζ < 1 suggests negative selection. Using a combined model for the evolution of noncoding and coding regions, we develop two likelihood-ratio tests for the detection of selection in noncoding regions. Data analysis of both simulated and real viral data is presented. Using the new method we show that positive selection in viruses is acting primarily in protein-coding regions and is rare or absent in noncoding regions.

MUCH attention has recently been given to positive selection at the molecular level because of its functional importance. Positive selection has been identified in the coding region in the human immunodeficiency virus (HIV)-1 envelope gene (Bonhoeffer et al. 1995), the major histocompatibility complex (MHC; Hughes and Nei 1988), the tumor suppressor gene BRCA1 (Huttley et al. 2000), female reproductive proteins in mammals (Swanson et al. 2001), and many other proteins (Yang and Bielawski 2000).

Several methods can be used to detect selection acting on protein-coding regions. One of the common approaches is to estimate the nonsynonymous rate (dN) to synonymous rate (dS) ratio (the most frequently used symbols in this article are given in Table 1) . The dN/dS ratio is sometimes referred to as ω (e.g., Goldman and Yang 1994). When a codon site undergoes negative selection, synonymous substitutions occur at a faster rate than nonsynonymous substitutions, and therefore ω < 1. However, when there is no selection (neutrality), the rate of synonymous substitutions is equal to the rate of nonsynonymous substitutions, i.e., ω = 1. Alternatively, if the site undergoes positive selection, new mutations are beneficial and ω > 1.

View this table:
TABLE 1

Definitions of symbols used

Inferences regarding ω have been used to demonstrate the presence of selection in many viral systems. Viruses may escape an existing immune response due to mutations in the proteins involved in interactions with the immune system. As a result several viral proteins have been observed to evolve under strong positive selection. In the HIV-1 envelope gene, positive selection has been found at sites that code for the surface positions in the protein (Bonhoeffer et al. 1995; Mindell 1996; Nielsen and Yang 1997; Yamaguchi and Gojobori 1998; Yamaguchi-Kabata and Gojobori 2000). In human influenza A (H3N2), positive selection has been found in the hemagglutinin (HA) gene, which encodes for a molecule that triggers the humoral immune response in humans (Fitch et al. 1997). Endo et al. (1996) and Yang and Bielawski (2000) gave more comprehensive lists of genes that are undergoing positive selection.

While positive selection has been found in the viral genes that code for proteins that interact with the host immune system, very little is known regarding selection in noncoding regions. A variety of research has shown that viral noncoding regions play an important role in gene regulation and function (Shiroki et al. 1995; Walker et al. 1995; Takayoshi et al. 1998). Furthermore, Carter and Roizman (1996) showed that viral introns are involved in alternative splicing and regulation of their own gene expression. The functional importance of the noncoding regions suggests that selection may be acting on them. However, since most interaction between the host immune system and viruses is at the level of proteins and peptides, very little positive selection is expected in noncoding regions compared to that in coding regions. Unfortunately, this and other hypotheses regarding selection in the noncoding region have not been subject to scientific evaluation because no appropriate statistical method has been available for this purpose.

We here extend the Nielsen and Yang (1998) and Yang et al. (2000) methods for detecting positive selection in coding regions to noncoding regions. We model the evolution in coding regions and the evolution in noncoding regions jointly and assume that the neutral (synonymous) nucleotide substitution rate is constant in both the coding and noncoding regions of the same gene. Under this assumption, we introduce a new parameter ζ to model the evolution in the noncoding region. ζ is the nucleotide substitution rate in the noncoding region, normalized by the synonymous nucleotide substitution rate in the coding region. Therefore, when a site is subject to neutral selection, ζ = 1. Similarly, ζ > 1 indicates positive selection, while ζ < 1 suggests the presence of negative selection. The interpretation of ζ is, therefore, similar to the interpretation of ω in models of evolution in coding regions. Using such a combined model we are able to develop a test to detect selection that is applicable to noncoding regions.

Three different simulated data sets are used to test the validity of the models. As an illustration of the method, we also compile 13 viral data sets from publications and GenBank to examine the hypothesis that positive selection mainly targets coding regions of viral genomes.

MODEL OF THE CODING REGION

We model the evolution of coding regions using a continuous-time Markov chain model of codon evolution. This model was proposed by Goldman and Yang (1994) and Muse and Gaut (1994). Like their models, the state space is given by the 61 sense codons, since the 3 stop codons are not allowed in the model. The rate matrix is 61 × 61, Math, where qMathij Math is the rate of transition from codon i to codon j. The transition rate from codon i to codon j is assumed to be proportional to the stationary distribution of the substituted nucleotide in codon j. If we represent codon i as a triplet i1i2i3 and codon j as j1j2j3 (i1, i2, i3, j1, j2, j3 ∈ {T, C, A, G}), and if codons i and j differ by exactly one nucleotide in position k, then Math1

Additionally, Math if codons i and j differ in more than one position. The diagonal entries are defined as Math to fulfill the mathematical requirement that the row sums must equal 0. The parameter κ is the transition/transversion rate ratio. ω, as defined before, is the nonsynonymous/synonymous (dN/dS) rate ratio. μjk is the stationary distribution of jk, the nucleotide in the kth position of the codon. For instance, Math, since it is a nonsynonymous transversion (from histidine to glutamine and C → G is a transversion). The stationary frequency of codon i (i1i2i3) is assumed to be the product of the stationary frequencies of nucleotides i1, i2, and i3, divided by the sum of the stationary frequencies of the sense codons: Math2

Here we assume a universal genetic code, but with slight modification the method can be applied to other genetic codes.

To reduce the number of free parameters, we restrict our model to be time reversible, i.e., Math for any i, j. It is easy to prove that this is indeed the case using standard methods. Furthermore, we take advantage of Felsenstein's (1981) pruning algorithm to save computational time.

Note that this model is different from the Goldman and Yang (1994) model. In our model the codon transition substitution rates depend on the stationary nucleotide frequencies, whereas in the Goldman and Yang (1994) model, the codon transition rates depend on the stationary codon frequencies. κ and the stationary nucleotide substitution rates (μT, μC, μA, μG) govern the neutral mutation rates while ω models the effect of selection.

MODEL OF THE NONCODING REGION

The Hasegawa-Kishino-Yano (HKY)85 model (Hasegawa et al. 1985) is used in the noncoding region, with a parameterization that allows comparisons of substitution rates between coding and noncoding regions. Here, the unit of evolution is one nucleotide. Substitutions between nucleotides are described by a continuous-time Markov process. The instantaneous substitution rate from nucleotide i to j (ij) is given by Math3where μi is the equilibrium frequency of nucleotide i, where i ∈ {A, C, T, G}. ξ is the rate of substitution relative to the neutral selection rate. As in the codon model, the diagonal entries are Math.

Under this model, the estimate of ζ at a site indicates the type of selection acting on it. As in the models by Gaut and Weir (1994), Muse (1995), Nielsen and Yang (1998), and Yang et al. (2000) and in similar work by other authors, we assume that the synonymous substitution rate in the coding region reflects the neutral nucleotide substitution rate. Then Math4

When ζ = 1, the rate of nucleotide substitution at a site is equal to the synonymous codon substitution rate in the coding region, which in turn equals the synonymous nucleotide substitution rate. To illustrate this, consider C → G changes in both the noncoding and coding regions, and assume that the C → G change in the coding region is a change of codon TCC (Serine) to TCG (Serine). In the noncoding region, Math since the C → G change is a transversion, whereas in the coding region, Math since the change is a synonymous transversion. Likewise, ζ > 1 implies that the substitution rate is greater than the neutral nucleotide substitution rate, whereas ζ < 1 implies that the nucleotide substitution rate is lower than the neutral rate.

In our study we consider three rate classes of sites (negative, neutral, and positively evolving) in the noncoding region. We have implemented three models, namely the neutral model, the two-category model, and the three-category model (see Table 2) . In the neutral model, we assume that there is no positive selection. Therefore, ζ can only be ≤1. In the two-category model, ζ can be <1, and it can be ≥1. In the three-category model, ζ can take on values in three categories: ζ < 1, ζ = 1, and ζ > 1; that is, Mathwhere 0 < ζ0 < 1, ζ1 = 1, 1 < ζ2 < ∞, and p1 + p2 + p3 = 1.

View this table:
TABLE 2

Models of variable ζ among sites

The neutral model is a special case of the three-category model in which p2 = 0 and a special case of the two-category model in which ζ2 = 1. Since the neutral model is nested within the two other models, a likelihood-ratio test of the hypothesis of no positive selection can be performed by comparing the maximum-likelihood value obtained under the two- and three-category models to the maximum-likelihood value obtained under the neutral model.

COMBINING BOTH THE NONCODING AND CODING REGIONS

We assume that the DNA sequences are related through a single phylogenetic tree with a known topology, and we assume no recombination within each region analyzed. The true topology of the tree is rarely known; however, the codon-based likelihood methods for detecting positive selection have been shown to be highly robust to the assumptions regarding the topology of the phylogenetic tree (Yang et al. 2000). In practice, a good estimate of the topology of the tree will suffice. Joint optimization of the continuous parameters and the tree topology is currently not computationally feasible for the codon-based models.

In our joint model of coding and noncoding regions, the individual sites (codon sites in the case of the coding region and nucleotide sites in the case of the noncoding region) are assumed to be independent. Therefore, given the model and a particular phylogenetic tree, the log likelihood can be calculated as the sum of the log likelihoods among sites, Math5where NN is the number of nucleotides in the noncoding region, NC is the number of codons in the coding region, and t1, t2 … are the branch lengths of the tree.

The log likelihood defined in Equation 5 can be optimized using standard optimization techniques. The popular local unconstrained optimization procedure [Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm] adopted from Numerical Recipes in C (Press et al. 1992) is used in our program EvoNC. However, since the parameters in our models are constrained (all parameters are constrained to be >0, the proportions of ζ in each category have to add 1 to 1, and some ζ's have to be <1 and some >1), it is a constrained optimization problem. We replaced the original objective function by a quadratic penalty function. The quadratic penalty function consists of the original objective function and a penalty term for each constraint. The penalty term is a multiple of the square of the constraint violation when the current parameter vector violates the constraint and is 0 otherwise (Nocedal and Wright 1999). A barrier term was also incorporated for each parameter to ensure that the vector of parameters lies in the interior of the parameter space.

In Nielsen and Yang (1998) and Yang et al. (2000) ω was allowed to vary among sites. In this study we do not pursue such models as our primary interest is in the noncoding region. Allowing ω to vary among sites in the codon region is likely to increase the likelihood of the model. It might affect the parameter estimates of ζ since ζ and ω may be correlated. However, implementing a model that allows variable ω will make the optimization procedure even more computationally intensive.

LIKELIHOOD-RATIO TESTS AND POSTERIOR DECODING

In general, if we have a model with k categories of ζ, let the corresponding proportions of nucleotide sites in the three categories be p1pk, under the constraint that p1 + … + pk = 1. The probability of observing data (xh) at site h is then Math6

The conditional probability can be calculated using an empirical Bayes approach as shown in Nielsen and Yang (1998). The posterior probability that the nucleotide site h belongs to category k is given by Math7

We assign sites to categories by choosing the category k that maximizes prob(ζk|xh).

The maximum-likelihood framework allows us to test the null hypothesis of no positive selection using likelihood-ratio tests. Two different likelihood-ratio tests can be performed by comparing the neutral model against the three-category model and the neutral model against the two-category model. The three-category model has two more parameters (ζ3, the category that allows ζ > 1, and p3, the proportion of ζ3) than the neutral model. Comparing twice the log-likelihood difference between these two models with the χ2 distribution with 2 d.f. may be used to approximate P values of this test. However, because one of the parameters is on the boundary of the parameter space and another parameter is not estimable under the null hypothesis, the true asymptotic distribution of the likelihood-ratio test statistic is not known under the null hypothesis. The resulting test will therefore be conservative. A better test, similar to test II in Swanson et al. (2003) for the coding region, is to compare the neutral model against the two-category model. In this test twice the difference in log-likelihood is asymptotically distributed as a 50:50 mixture of a point mass at 0 and a χ2 distribution with 1 d.f. (Chernoff 1954; Self and Liang 1987). The benefits of this test are that the reduction in the degrees of freedom may in some cases increase the power of the test and that the true asymptotic distribution of the test statistic is actually known.

SIMULATION

Nucleotide sequence data were simulated using Evolver in the PAML package (Yang 1997) to verify EvoNC. Each simulated data set consisted of 15 sequences of 1400 nucleotides. Each sequence was composed of 200 noncoding nucleotide sites and 400 codon sites (1200 nucleotides). In the noncoding region, positive, neutral, and negative selected data were simulated by varying the branch lengths. In Evolver, the tree length specifies the expected number of substitutions per site along the branches in the tree. Therefore, if we let the neutral sites have a total tree length equal to 1, then the sites that evolve five times faster would have the total tree length equal to 5. Similarly, the sites that evolve five times slower can be simulated by letting the total tree length be 0.2. In the coding region, ω was set to be 0.4. Both regions shared the same κ = 5. The proportions of the three categories of ζ were chosen with an intention to reflect the distribution in real data sets. The distribution of ζ in the four data sets is shown in Table 3 .

View this table:
TABLE 3

Percentage of ζ of the three categories in the four simulation data sets

DATA ANALYSIS

The viral data sets were assembled by searching GenBank for previously published data sets. We used Clustal W version 1.8 (Thompson et al. 1994) for multiple alignments of the data sets. To prevent false positive results because of misalignments, data with poor alignments and/or with detectable recombination were discarded. For the same reason, because interspecific viral sequences are often too diverged to give a reasonably good alignment, only intraspecific viral data were used. We also made sure that the data did not consist of published results of overlapping reading frames.

Table 4 gives the summarized source and the details of the 13 data sets and the GenBank accession numbers are available from the authors upon request. NEIGHBOR in the PHYLIP package version 3.6 (Felsenstein 2002) was used for estimating phylogenetic trees. The alignment gaps were removed by our program EvoNC.

View this table:
TABLE 4

Details of the 13 viral data sets

The coding region of each of the data sets was analyzed by codeml in the PAML package version 3.12 (Yang 1997). Models M7, M8 (Yang et al. 2000), and M8a (Swanson et al. 2003) were used in the study to determine if positive selection exists in the coding region.

For each of the data sets, two tests were performed for the coding region. Test 1 compares twice the log-likelihood ratio difference between M7 and M8 to a χ2 distribution with 2 d.f.. As noted in Yang et al. (2000), this test is conservative. Test 2 compares twice the log-likelihood difference between M8 and M8a to a Mathχ20 + Mathχ21 distribution, as suggested in Swanson et al. (2003).

The selection in the noncoding region was investigated using EvoNC. Both the coding and the noncoding regions were used in the program. Similar to the coding region, two tests were performed for each of the data sets. Test 1 compares twice the log-likelihood ratios between the neutral and the three-category model to a χ22 distribution, whereas test 2 compares twice the log-likelihood ratios between the neutral and two-category model to a Mathχ20 + Mathχ21 distribution. The critical value of the likelihood-ratio test statistic, for a test performed at the 5% significance level, is then 2.71.

RESULTS

Simulation data:

The simulated data were analyzed using the three models (the neutral model, the two-category model, and the three-category model, as mentioned above), and the log-likelihood of each model was used to perform likelihood-ratio tests. Estimates of all parameters of the model were obtained from the three-category model. Each site was categorized according to the posterior probabilities calculated using Equation 8. The estimates of parameters are summarized in Table 5 , and the classification of sites in conserved and positive sets is shown in Figure 1 .

View this table:
TABLE 5

True and estimated values of parameters of the simulated data under the three-category model

Figure 1.—

Predicted vs. actual values of ζ in two of the simulated data sets. The three-category model was used for prediction. Predicted parameter values are the MAP (maximum a priori) estimates.

The estimated values of ζ match quite well with the actual values. In the neutral data set, estimates of Math were obtained for all sites for all models. In this boundary of the parameter space, the proportion of sites in the different categories is not identifiable. In the conserved data set, maximum-likelihood estimates of Math and Math were obtained for the three-category model, which were virtually identical to the true values of 0.2 and 1.0. Again, the proportion of sites in category 1 and 2 is not identifiable. The estimate of the proportion of conserved sites was 0 = 0.51, which was slightly larger than the true proportion 0.50. In the positive data set 1, estimates of Math and Math were obtained for the three-category model, which differed slightly from the true values of 0.2 and 5.0, respectively. The estimated proportion of neutral sites was less than expected (0.15 vs. 0.25). On the other hand, 0 = 0.79 and 2 = 0.06, which were both larger than the true values of 0.725 and 0.025. In the positive data set 2, the estimates from the three-category model are Math and Math, and the corresponding proportions are 0 = 0.79 and 2 = 0.10. Again, the estimated proportions were slightly larger than the true values.

In the neutral and conserved data sets, all three models gave approximately the same maximum-log-likelihood value in these two sets. In these two cases, the neutral null hypothesis was true and was not rejected by the likelihood-ratio test.

In the positive data set 1, the maximum-log-likelihood values for the three models were Math

Both tests reject the false null hypothesis of no positive selection. However, from the maximum-log-likelihood shown below, only test 2 (neutral model vs. two-category model) was significant in positive data set 2. This is probably due to the fact that only 5 of 200 sites were slightly positively selected.

Math

The Bayesian classification of sites performed quite well for the first three simulated data sets. In the neutral data set, EvoNC classified all 200 sites correctly. In the conserved set, 14 of the actual neutral sites were miscategorized as conserved sites while 3 of the actual conserved sites were miscategorized as neutral sites. A total of 183 sites were classified correctly. In positive data set 1, the 5 positively selected sites were all included in the estimated set of positively selected sites. Three neutral sites were also falsely included in this set.

In positive data set 2, 1 out of the 5 positively selected sites was classified as being neutral and 16 out of the 195 negatively selected and neutral sites were classified as being positively selected. The main reason the method performs worse when there are only a few positively selected sites is that the estimates of Math2 are more unreliable. When the maximum-likelihood estimates of parameters have large variance, the empirical Bayes estimates of posterior probabilities may have reduced accuracy. Nevertheless, a close examination revealed that the falsely classified sites have low posterior probabilities. Therefore, by adjusting the cutoff probability the accuracy of the method can be controlled.

These results suggest that our method is capable of picking up strong positive selection, even though as few as 2.5% of the sites are positively selected. On the other hand, when only a small portion of the sites are undergoing weak positive selection, the classification of sites does not have great accuracy.

The viral data sets:

The results of the PAML analysis are summarized in Table 6 . Five out of the 13 viral data sets have significant evidence for positive selection, using test 1 for the coding region (M7 vs. M8): glycoprotein of Ebola virus, foot-and-mouth disease polyprotein, hepatitis C polyprotein, New Castle disease virus nucleocapsid protein, and poliovirus polyprotein. Among these 5 data sets, 2 were not significant using test 2 for the coding region (M8a vs. M8). This may be because the true value of ω was only slightly >1 in the positively selected sites or, possibly, because the beta distribution assumed in model M7 did not fit the true distribution of ω well. For instance, in the foot-and-mouth disease data set, ω was equal to 1.01 in the positively selected sites. Clearly, this cannot be interpreted as good evidence for positive selection in this virus. On the other hand, for example, 11% of the sites in Ebola glycoprotein have ω = 5.92, which is probably a good indication that this protein is undergoing positive selection.

View this table:
TABLE 6

Log-likelihood values under models M7, M8 (Yang et al. 2000), and M8a (Swanson et al. 2003) and parameter estimates under model M8 by PAML in the coding region

In the noncoding regions there was little or no evidence of positive selection (Table 6). For most data sets, the log-likelihoods of all three models were almost exactly the same (Table 7) . One exception was the Japanese encephalitis virus. The test of the two-category model vs. the neutral model in the 3′-untranslated region of Japanese encephalitis virus was marginally significant (P = 0.05), but the test of the three-category model vs. the neutral model (P = 0.22) was not. Nam et al. (2002) described this region as the variable region since it showed a high degree of sequence variation and deletion. However, they also pointed out that despite the fact that the region was highly variable, the predicted RNA structures all had a similar type loop at the 5′ terminus.

View this table:
TABLE 7

Log-likelihood values and parameter estimates of the 13 viral data sets under the three models

After correcting for multiple testing by the Bonferroni procedure, we conclude that the likelihood-ratio tests provide no evidence for positive selection in the viral data that we examined. This result confirms our expectation that positive selection occurs primarily in the coding regions of viruses.

DISCUSSION

Because positive selection is expected to occur in regions where viruses interact with the host immune system, this type of selection is expected to be much more predominant in coding than in noncoding regions. Our study confirmed this belief. However, it should be noted that most of the data sets in the study had a low number of sequences and perhaps had low sequence diversity as well. Anisimova et al. (2001) showed that such data sets would reduce the power of the tests. Therefore, data sets with a low number of sequences should be reexamined when more data become available.

The results of our analysis of simulated data suggest that the new method provides accurate parameter estimates. The likelihood-ratio tests also performed well and detected selection from 15 sequences when only 2.5% of the sites were undergoing positive selection. When more than one category of ζ was present, the program miscategorized ∼10% of the sites. In small data sets the classification of sites may not have the highest accuracy. A similar conclusion has been reached regarding classification of sites in coding regions (Anisimova et al. 2001, 2002). In the analysis of real data, it is advisable to confirm the presence of positive selection in particular sites by employing additional structural, functional, or evolutionary information.

Our model is based on the assumption that there is no selection acting on the synonymous sites and the rate of substitution is constant among sites. This assumption would be violated if there is codon usage bias in the coding region. A lot of the viruses may have codon usage bias due to the overlapping reading frames and/or RNA structural constraints in coding regions. We do not know how much the bias would have affected the parameter estimates but this is a question that could be addressed using simulations.

The models presented here do not allow rate variation among the lineages in the phylogeny. This may result in a loss of statistical power of the likelihood-ratio tests. Our method will not be able to identify positively selected sites if positive selection occurs in only a few lineages, while the negative selection dominates the rest. In the future, we would like to incorporate rate variation among the lineages into the program as well.

Since the optimization procedure (BFGS algorithm) used here is a local optimization procedure, it is possible that the likelihood is trapped at a local minimum. While we do not have a rigorous solution to the problem, it is advisable to run EvoNC with different sets of initial values. If they all result in the same log-likelihood it is very likely that one has found the global minimum.

EvoNC is currently under development for incorporating new models and functionalities. The beta version for the models tested in this article is available from the authors upon request.

Acknowledgments

We are grateful to Charles F. Aquadro and David B. Shmoys for their valuable comments. This work was supported by National Science Foundation (NSF) grant DEB-0089487, NSF/National Institutes of Health grant DMS/NIGMS-0201037, and Human Frontier in Science Program grant RGY0055/2001-M.

Footnotes

  • Communicating editor: J. J. Hein

  • Received October 30, 2002.
  • Accepted December 31, 2003.

References

View Abstract