- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Supplemental data
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Barrier, M.
- Articles by Purugganan, M. D.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Barrier, M.
- Articles by Purugganan, M. D.
Selection on Rapidly Evolving Proteins in the Arabidopsis Genome
Marianne Barriera, Carlos D. Bustamantec, Jiaye Yub, and Michael D. Puruggananaa Department of Genetics, North Carolina State University, Raleigh, North Carolina 27695
b Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina 27695
c Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom
Corresponding author: Michael D. Purugganan, North Carolina State University, 3513 Gardner Hall, Box 7614, Raleigh, NC 27695., michaelp{at}unity.ncsu.edu (E-mail)
Communicating editor: M. K. UYENOYAMA
| ABSTRACT |
|---|
Genes that have undergone positive or diversifying selection are likely to be associated with adaptive divergence between species. One indicator of adaptive selection at the molecular level is an excess of amino acid replacement fixed differences per replacement site relative to the number of synonymous fixed differences per synonymous site (
= Ka/Ks). We used an evolutionary expressed sequence tag (EST) approach to estimate the distribution of
among 304 orthologous loci between Arabidopsis thaliana and A. lyrata to identify genes potentially involved in the adaptive divergence between these two Brassicaceae species. We find that 14 of 304 genes (
5%) have an estimated
> 1 and are candidates for genes with increased selection intensities. Molecular population genetic analyses of 6 of these rapidly evolving protein loci indicate that, despite their high levels of between-species nonsynonymous divergence, these genes do not have elevated levels of intraspecific replacement polymorphisms compared to previously studied genes. A hierarchical Bayesian analysis of protein-coding region evolution within and between species also indicates that the selection intensities of these genes are elevated compared to previously studied A. thaliana nuclear loci.
THE genetic architecture of species differences has been the subject of intense study in the last few years (![]()
![]()
![]()
![]()
![]()
![]()
One indicator of adaptive selection at the molecular level is an excess of amino acid replacement fixed differences per replacement site relative to the number of synonymous fixed differences per synonymous site (
= Ka/Ks; ![]()
![]()
![]()
< 1. If most amino acid variation is neutral, such as in pseudogenes,
1 (![]()
> 1 (![]()
![]()
varies radically among different classes of genes: in plant Brassicaceae species, the mean
is
0.14 (![]()
has been shown to cluster around 1.0 (![]()
> 1 have been observed in gamete recognition protein-coding genes (![]()
![]()
![]()
in these latter genes, which encode rapidly evolving proteins, are believed to arise from selection for divergence in protein structure and function.
Identifying genes with increased values of
will be facilitated by evolutionary genomic approaches that permit investigators to sample and compare large numbers of genes between species genomes to search for those loci characterized by rapid evolution (![]()
![]()
![]()
![]()
Our objective is to examine whether an evolutionary EST approach can identify genes with increased selection intensities in the Arabidopsis genome. Expressed sequence tags from developing inflorescences of Arabidopsis lyrata were compared to the whole-genome sequence of the model plant A. thaliana to estimate the distribution of nonsynonymous and synonymous substitution differences between these two Brassicaceae species. Fourteen genes with
> 1, which encode rapidly evolving proteins, were identified between these two plant taxa. These genes have accelerated rates of nonsynonymous substitutions that may be associated with adaptive evolution since the divergence of these two species
5.2 MYA (![]()
| MATERIALS AND METHODS |
|---|
Isolation and sequencing of expressed sequence tags:
Seeds from individuals of a population of A. lyrata in Karhumaki, Russia were obtained from Outi Savolainen and Helmi Kuittinen. Total RNA was extracted from the A. lyrata inflorescences using the RNeasy plant mini kit (QIAGEN, Valencia, CA) and a cDNA library constructed in the plasmid vector pCMV-PCR using the PCR cDNA library construction kit (Stratagene, La Jolla, CA). Plasmid DNA from cDNA clones was isolated using the REAL Prep96 BioRobot kit (QIAGEN) with the BioRobot 9600 (QIAGEN) and sequenced from the 5' end using an ABI Prism 3700 96-capillary automated sequencer (Perkin-Elmer, Norwalk, CT). Sequences were edited on the basis of Phred (![]()
Analyses of ESTs:
A. thaliana sequences homologous to the high-quality A. lyrata EST sequences were identified by BLAST analysis against the A. thaliana whole-genome coding database found at The Arabidopsis Information Resource database (http://www.arabidopsis.org), using a maximum expected value (E) of e-5. The GenBank nonredundant nucleotide sequence database was also searched to find the closest matching A. thaliana genomic bacterial artificial chromosome (BAC) clone sequence. The top matches from each database were visually aligned with their matching A. lyrata EST sequence. Calculations were made on the basis of pairwise comparisons between the A. lyrata EST and A. thaliana coding region genomic DNA sequence. EST-based nonsynonymous and synonymous distances were calculated using the modified ![]()
![]()
![]()
and
, where
is an error term that reflects the empirical error in sequence determination. Sequence comparisons using duplicate ESTs suggest that
0.2% (M. BARRIER, unpublished results), although this may be an overestimate since some of these differences may arise from allelic variation.
Isolation and sequencing of alleles:
Six genes with K*a/ K*s values >1 were chosen for molecular population genetic analysis. Primers were designed to amplify 1- to 2-kb regions of these genes on the basis of the A. thaliana sequences in the selected comparisons (see Table S1 at http://www.genetics.org/supplemental/). Leaf tissue from 1013 A. thaliana ecotypes was obtained from single-seed propagated material provided by the Arabidopsis Biological Resource Center (see Table S2 at http://www.genetics.org/supplemental/). DNA was isolated from these A. thaliana ecotypes as well as 25 A. lyrata individuals (see above), using the DNeasy plant mini kit (QIAGEN). PCR of A. thaliana samples was performed with Taq DNA polymerase (Eppendorf, Madison, WI), using standard protocols. A. thaliana samples were sequenced directly via cycle sequencing with primers in both directions. PCR of A. lyrata samples was performed with the error-correcting Tgo polymerase (Roche, Indianapolis), using the manufacturer's amplification protocol. The error rate of error-correcting polymerases is <1 in 7000 bp (![]()
Molecular population genetic data analysis:
Sequences of A. thaliana and A. lyrata populations were aligned and visually corrected. All polymorphisms were visually checked against chromatographs or via resequencing. Analysis of polymorphism and divergence was carried out using DnaSP 3.5 (![]()
(![]()
(![]()
![]()
![]()
![]()
| RESULTS AND DISCUSSION |
|---|
Expressed sequence tags in A. lyrata:
A small collection of ESTs were isolated and sequenced to obtain genes expressed in developing inflorescences in A. lyrata. From a cDNA library of
2800 colonies, 768 clones were sequenced. From this sequence collection, 561 good-quality sequences of at least 200 bp in length were subjected to further analysis. A. thaliana orthologs to these A. lyrata ESTs were identified by BLAST searches of the whole-genome A. thaliana coding sequence database; 78 of the sequences did not match a clear A. thaliana ortholog and were not considered further. Ninety-five duplicate EST matches for A. thaliana coding sequences were also eliminated. The GenBank nonredundant nucleotide database was also searched for A. thaliana BAC clone sequences containing genes homologous to the A. lyrata EST sequences. By aligning the genomic BAC clone sequence with the A. lyrata and A. thaliana coding sequences, the boundaries of noncoding regions were located. After eliminating 84 sequence alignments with <150 bp of coding sequence, 304 unique ESTs remained for further analysis (see Table S3 at http://www.genetics.org/supplemental/). Although these ESTs represent coding region fragments, we subsequently refer to these as genes.
The A. lyrata EST sequences were assigned to different functional categories using the classifications of the orthologous A. thaliana sequences from the TAIR database (ARABIDOPSIS GENOME INITIATIVE 2000). Unclassified genes and those whose classifications were ambiguous were not included in this comparison. Of the >25,000 sequences in the A. thaliana coding sequence database, only slightly >4000 have thus far been unambiguously classified (ARABIDOPSIS GENOME INITIATIVE 2000). Eighty-seven of the 304 unique A. lyrata ESTs matched an A. thaliana coding sequence that has yet to be classified. In determining the count for each functional category, multiple categories listed for a single sequence were each counted as an equal fraction of the sample count. On the basis of this analysis, the range of functional categories represented by the 304 unique ESTs appears to be representative of those observed for the entire A. thaliana gene set (see Fig 1).
|
Distribution of nucleotide substitution rates between A. thaliana and A. lyrata:
Comparisons of the coding sequences of the 304 ESTs from A. lyrata with the whole genome sequence from A. thaliana allow us to compare the distribution of the rates of nucleotide substitution between these two species. The distributions of both synonymous and nonsynonymous substitutions are shown in Fig 2. The mean length of aligned coding sequences is 111 ± 2.19 codons. The distribution of the number of synonymous nucleotide substitutions ranges from 0.000 to 0.552 synonymous substitutions per synonymous site. The distribution of K*s has one mode between 0.10 and 0.15 and has a long tail. The mean synonymous substitution distance is 0.119 ± 0.004 (see Fig 2A), which is comparable to estimates observed in previous comparisons of A. thaliana/A. lyrata loci. If we assume a divergence time of 5.2 MYA for the two species (![]()
1.1 x 10-8 substitutions/site/year.
|
The distribution of the number of nonsynonymous substitutions between the two species differs from that observed for synonymous substitutions. The nonsynonymous distance distribution has a mode of <0.050 nonsynonymous substitutions/nonsynonymous site, and the frequency decreases with increasing nonsynonymous substitution distances until
0.150 (see Fig 2B). The range of K*a is 0.0000.159 nonsynonymous substitutions/nonsynonymous site, with a mean of 0.025 ± 0.002. As expected, the mean K*a < K*s, which reflects that action of purifying selection that prevents many nonsynonymous mutations from reaching fixation between species. The mean rate of nonsynonymous substitutions in nuclear genes between the two species is 0.24 x 10-8 substitutions/site/year, which is approximately fivefold lower than the average synonymous mutation rate.
The distribution in evolutionary rates between species assumes that the comparisons are for orthologs between the two species. Identifying the correct orthologs between species will be confounded by gene duplications immediately at or prior to the most recent common ancestor of A. thaliana and A. lyrata, followed by deletion of one of the duplicate gene copies in the A. thaliana genome. It is unclear what the rate of gene duplication/deletion is within these species genomes.
Variation in selective constraints among Arabidopsis genes:
The historical action of selection on a gene can be inferred from the relative ratio of nonsynonymous to synonymous substitutions,
= Ka/Ks (![]()
![]()
![]()
![]()
* between A. thaliana and A. lyrata. The value of
is ascertained for each of the 304 orthologous pairs between the two species, corrected for sequencing errors. The estimates of
* range from 0.00 to 2.59; the distribution is similar to that of K*a in that the most genes have a low
* value, and the distribution decreases with increasing
* (see Fig 2C). The mean value of
*, obtained by bootstrap resampling of K*a and K*s pair values 100,000 times (![]()
*
0.241]. This is higher than previous estimates of
0.14 for a set of comparisons between A. thaliana and Brassica rapa (![]()
0.18 for four gene comparisons between A. thaliana and A. lyrata (![]()
Of the 304 orthologous gene pairs in this evolutionary EST study, 37 genes (12%) have
* = 0. These represent genes whose encoded proteins are under very strong selective constraint. At the other extreme, 14 genes (
5%) have
* values >1, suggesting that these genes may have accelerated rates of nonsynonymous substitutions associated with higher selection intensities on amino acid replacement changes. These loci include genes that encode RNA and zinc-finger helicases, extensin-like proteins, and zinc-finger transcription factors. More than half of these rapidly evolving protein loci (8 out of 14 genes) encode hypothetical or putative proteins of unknown function. Genes with
> 1 have a higher mean K*a (0.075 ± 0.011) and a lower mean K*s (0.042 ± 0.009) than genes that comprise the entire EST data set. The increased
estimates for these genes thus stem from having both elevated absolute levels of nonsynonymous substitutions and lower levels of synonymous substitutions.
It is possible that identifying 14 loci with
* > 1 may not represent genes subject to gene-specific selection mechanisms, but may simply be due to stochastic sampling from a set of 304 loci. To test this possibility, we approximated the distribution of K*a/K*s under the null hypothesis that all genes evolved according to some common evolutionary process such as selection. To approximate this null distribution, 1000 data sets were simulated, each of which consisted of 304 simulated gene pairs. The 304 simulated gene pairs were generated by sampling the aligned A. thaliana and A. lyrata codons from the EST data set without replacement. The lengths of the 304 simulated genes matched the lengths of the 304 genes in the actual EST data set. For every pseudodata set, the
* ratio was separately estimated for all 304 simulated genes, and the number of these 304
* estimates that exceeded 1 was then counted. None of the 1000 simulated data sets yielded as many as 14 genes with
* > 1 (P < 0.001; see Fig 3). In fact, 10 was the highest number of genes with K*a/K*s estimates >1 and this occurred only once. The mean number of genes with
* > 1 under this null hypothesis was 2.121 with a sample standard error of 0.047. This indicates that finding 14 genes with
* > 1 by chance in a set of 304 loci is improbable under the null hypothesis that the action of evolutionary forces is homogenous across all loci.
|
Molecular population genetics of Arabidopsis loci that encode rapidly evolving proteins:
One other approach to confirm the roles of positive or diversifying selection in the evolution of rapidly evolving protein genes is to examine the levels and patterns of nucleotide variation at these loci within and between species (![]()
![]()
![]()
Analysis of within-species nucleotide diversity in A. thaliana was undertaken for six genes identified from the evolutionary EST analysis as having
* > 1 (see Table 1). These genes have a range of
* from 1.28 to 2.03 and were chosen to encompass the range of high
* values. These sampled genes include the Arabidopsis NAC2 transcriptional activator (![]()
![]()
|
Alleles of each of these genes were isolated and sequenced from 1013 ecotypes in A. thaliana and 25 individuals from a Russian population of A lyrata. The latter sample was included to provide an interspecific comparison; the sample sizes from the A. lyrata population were too small to permit a meaningful assessment of diversity for these genes in this species. The sequenced portions range in size from
0.4 to 1.6 kb and include exon sequences that encompass the protein-coding regions that display the high
* values observed in the evolutionary EST analysis. The number of codons analyzed in these molecular population genetic data sets ranges from 84 to 341 codons. For four of the six genes, the amount of coding sequence assayed in the molecular population genetic analysis was
1.56 times greater than the size of the sequenced ESTs. The other two had coding sequence lengths nearly equal to the length in the evolutionary EST analysis. Levels of within-species nucleotide diversity at silent sites,
, for these six rapidly evolving protein genes ranged from 0.0003 to 0.0140 in A. thaliana, with a mean of 0.0050 ± 0.0022 (see Table 1). This is slightly lower but comparable to the mean of
0.007 observed for other previously studied A. thaliana nuclear genes (![]()
![]()
![]()
![]()
is 0.0057 ± 0.0023.
The mean
estimates for the regions sequenced in this population genetic survey are all, except for the unknown gene AT2G04410, <1 (see Table 1). This reduction in
values compared to those obtained in the EST study may arise from the longer lengths of coding sequences analyzed for most of the genes in the molecular population genetic study and underlying heterogeneities in selective constraint across these loci. All of the
values, however, are greater than the mean for A. thaliana genes. Moreover, the mean Ks from this expanded sequence region is 0.09 ± 0.02 substitutions/site, which is close to the mean for the entire data set (mean
). Permutation analysis indicates that the mean Ks of the genes in the molecular population genetic study is not significantly different from the mean Ks of the EST data set (P < 0.22). Thus, the molecular population genetic analysis is based on sequence information whose synonymous substitution rate is comparable to the mean for A. thaliana genes.
Elevated levels of fixed replacement differences among rapidly evolving Arabidopsis protein genes:
The relative levels of within- to between-species polymorphisms in nucleotide sites that encode a gene's products provide information on the selective forces that act in protein-coding regions (![]()
![]()
![]()
values ranging from 0.03 to 0.30.
|
The posterior distributions of the interspecies divergence time, t, between A. thaliana and A. lyrata are comparable between the two gene classes (for the method, see the Appendix). For the previously studied gene class, the mean of the posterior distribution for t1 equaled 8.6 in multiples of twice the effective population size with 95% highest posterior credibility interval (CI) of [6.9
t1
11.2]. Using data for the rapidly evolving protein genes only, the posterior distribution of t2 has mean 9.5 and 95% highest posterior (HP) CI of [7.0
t2
13.9]. Using all of the data, we get 95% HPCI of [7.41
t
11.22] for t with the posterior distribution having a mean of 9.16. The similarity in interspecies divergence time estimates suggests that the data from both gene classes compare loci of similar divergence times and are thus likely orthologous, and not paralogous, between A. thaliana and A. lyrata. This also indicates that the levels of synonymous substitutions between the two gene classes give comparable estimates of divergence time, indicating that the levels of interspecific synonymous divergence are comparable between both gene classes.
As expected, there is a significant elevation in the levels of fixed replacement differences between A. thaliana and A. lyrata in these six rapidly evolving protein genes. Among these loci, a total of 89 of the 168 coding region differences between these two species (
53%) result in amino acid replacements in the encoded proteins. In contrast, only 123 of 373 fixed differences (
33%) in previously studied Arabidopsis nuclear genes are replacement differences. The contrast in relative levels of replacement to synonymous fixed differences between these two gene classes is significant (Fisher's exact test, P < 2 x 10-5).
By comparison, the relative levels of within-species replacement to synonymous nucleotide polymorphisms within A. thaliana do not differ significantly between the rapidly evolving protein loci and previously studied nuclear genes. Among the genes in this study, 24 of the 38 intraspecific coding region polymorphisms (
63%) are replacement polymorphisms. Among 12 previously studied A. thaliana nuclear genes, 108 of 212 polymorphisms (
51%) are replacement changes. The relative levels of within-species replacement to synonymous site polymorphisms are not significantly different between the two gene classes (Fisher's exact test, P < 0.22). These results indicate that while the rapidly evolving protein loci have increased levels of fixed replacement differences this is not accompanied by a significant increase in relative levels of intraspecific replacement polymorphisms.
Rapidly evolving protein genes display elevated selection intensities in protein-coding regions:
Selection in a specific protein-coding gene is conventionally detected in a test of homogeneity (the McDonald-Kreitman test) that examines within- and between-species replacement to synonymous nucleotide changes (![]()
Although none of these individual genes show evidence of positive selection, each gene contains information regarding the selective forces that act on amino acid replacements (![]()
S for synonymous sites,
R for replacement sites, t for interspecies divergence time, and
for replacement sites selection coefficient; ![]()
, for each individual Arabidopsis gene (![]()
2, for both the rapidly evolving protein and the previously studied gene classes.
The Markov chain Monte Carlo (MCMC) sampling scheme described in the Appendix was used to draw from the joint posterior probability distribution of several parameters in the model given the data in the McDonald-Kreitman tables for all 18 genes. These include the mean and variance parameters for previously studied (µ1,
1) and rapidly evolving protein loci (µ2,
2), the scaled species divergence parameter (t), the vectors of mutation rates at synonymous sites (
S) and replacement sites (
R), and the vector of selection coefficients (
). At completion of the sampling scheme, we have 10,000 quasi-independent vectors for each parameter in the model drawn from the joint probability distribution of the parameters given the data.
The distribution of the sampled values of
for each locus [
i(1),
i(2), ... ,
i(10,000) for 1
i
18] within and between the two classes of Arabidopsis genes provides several striking results. The means of the
draws for each gene in the class of previously studied Arabidopsis loci ranged from -2.285 to +0.941. Six of these loci have 95% HPCIs that are entirely <0 (see Fig 4). These negative selection intensities suggest that most amino acid replacements are slightly deleterious and persist due to the inbreeding associated with the predominant selfing observed in this species (![]()
samples for the rapidly evolving protein loci ranged from -0.823 to 0.856. Three of the six rapidly evolving protein genes (50%) have
< 0. Only one of these genes, which encodes a protein of unknown function, has a 95% HPCI <0. Three genes have
> 0, although the 95% HPCIs for all of these also encompass 0.
|
The posterior distribution of µ, the average selective effect of amino acid replacement changes for rapidly evolving protein genes, shows a shift in the positive direction (see Fig 5). The previously studied Arabidopsis loci have a posterior mean for the average selective effect of amino acid replacements (µ1) of -0.9622. We find that the posterior probability that µ1 > 0, P(µ1
0), is 0.02. In contrast, the posterior distribution for µ2 (the average selective effect of amino acid replacement in the rapidly evolving protein genes) has a mean of +0.1201 and P(µ2
0)
0.56. The posterior mean of the difference between the average selective effects (µ1 - µ2) is -1.0823. This analysis indicates that amino acid replacement changes in these rapidly evolving protein genes are more beneficial (or less deleterious) than those found in previously studied nuclear loci.
|
Evolution of rapidly evolving protein genes in the genome:
Approximately 5% of the inflorescence-expressed genes examined in this evolutionary study have values of
* > 1 between A. thaliana and A. lyrata orthologs and are potential candidates for genes associated with adaptive divergence between these two species. The high proportion of genes with
* > 1 suggests that rapidly evolving protein-coding loci may represent a significant portion of genes in eukaryotic genomes. This proportion, however, is higher than that observed in a similar analysis between A. thaliana and B. rapa, two species that last shared a common ancestor
35 MYA (![]()
> 1. A larger study using 3595 gene sequences across all species represented in DNA databases also indicates that the number of genes that have
> 1 is <0.5%, suggesting the number of genes in this class may be very low (![]()
> 1 may be facilitated by using more closely related species, where the signature for accelerated nonsynonymous substitution, possibly arising from positive selection, may be more readily apparent. Indeed, a study of male accessory gland ESTs from closely related Drosophila species has identified 11% of genes with
> 1 (![]()
Molecular population genetic analysis confirms the increased selection intensities associated with genes that display accelerated rates of nonsynonymous evolution. In previously studied A. thaliana nuclear genes, most replacement changes are slightly deleterious and their estimated selection intensities are generally negative (![]()
![]()
Positive selection associated with the fixation of protein sequence variants may explain the increased selection intensities on these genes. The increase in selection intensities for rapidly evolving protein genes may also arise, however, from neutral evolutionary forces on replacement polymorphisms, leading to increased fixation of amino acid changes. This is underscored by the distribution of selection coefficients, µ2, for the rapidly evolving protein gene class, which while shifted to the positive direction is nevertheless centered near µ = 0 (see Fig 5). All the rapidly evolving protein genes used in the molecular population genetic study, however, are expressed in both species and there are no premature stop codons in these loci. This suggests that if neutral evolution underlies the increased selection intensities in these rapidly evolving protein loci, they do not appear to be associated with pseudogene formation.
It is likely that both neutral evolution and positive selection may be responsible for the rapidly evolving protein genes identified in this evolutionary EST study. Nevertheless, evolutionary ESTs do appear to provide a general genomic approach to identify loci associated with increased selection intensities on protein sequence, some of which may underlie adaptive evolution between these species. It should be noted that while some of the loci with
* > 1 identified in this evolutionary EST study may be associated with adaptive divergence, this may underestimate the role of positive or diversifying selection in shaping gene structure and function in the genome. The criteria of
* > 1 as an indicator of selection can be overly stringent, as it requires that amino acid fixations occur throughout the gene and does not recognize adaptive fixation of small numbers of replacement changes. Moreover, it does not identify genes in which the selective force acts on regulatory regions of the gene, which is believed to be a major factor in adaptive divergence between species (![]()
![]()
The function of the genes that encode rapidly evolving proteins remains largely unknown. Of the three genes in the molecular population genetic analysis that have selection intensities >0, one encodes a previously unknown protein while the other two are homologous to known genes in A. thaliana or other eukaryotic organisms. One gene is NAC2, which belongs to a family of transcription factors required in Arabidopsis development (![]()
![]()
| FOOTNOTES |
|---|
Sequence data from this article have been deposited with the EMBL/GenBank Data Libraries under accession nos. BQ834040BQ834596, BQ839827BQ839830, AY140430AY140446, and AY140459 AY140531. ![]()
| ACKNOWLEDGMENTS |
|---|
The authors thank the NCSU Genome Research Laboratory for use of its facilities and the Purugganan laboratory and three anonymous reviewers for helpful suggestions that improved the article. We are also grateful to Jeff Thorne for suggesting the permutation analysis. M.B. was funded by a National Institutes of Health training grant on the genetics of complex traits and C.D.B. by a Marshall-Sherfield Fellowship from the Marshall Commemoration Commission. This work was funded in part by grants from the National Science Foundation Integrated Research Challenges in Environmental Biology program to M.D.P, J. I. Schmitt, and T. F. C. Mackay; and from the Alfred P. Sloan Foundation to M.D.P.
Manuscript received June 10, 2002; Accepted for publication November 11, 2002.
| APPENDIX |
|---|
It has been shown (![]()
![]() |
(A1) |
![]() |
(A2) |
where
is the scaled selection coefficient for new mutations (2Nes),
is the scaled mutation rate (4Ne·µ), t is the scaled time since species divergence (number of generations since divergence/2Ne), Ne is the effective population size, and n and m are the sample sizes from the two species. The functions F(
, n) and G(
, n) are as previously described (![]()
![]()
Using the cell entries from a conventional McDonald-Kreitman table (![]()
S (mutation rate for silent sites),
R (mutation rate for replacement sites), t, and
(selection intensity for replacement sites) assuming silent sites are neutral (i.e.,
= 0 for all silent sites). For a set of such tables from the same species pairs, it is also possible to model variation in selection among genes by specifying a distribution for
and estimating the parameters of this hyperdistribution given the data in all of the tables. A convenient form to use is the normal distribution since selection coefficients can be either positive or negative. It should also be noted that the species divergence time is a shared parameter across all the tables in such an analysis.
Hierarchical model:
The analysis we present is based on a description of the joint and marginal posterior probability distributions of the following model, described in three parts.
- Part 1: Let
be the vector of selection coefficients, with
1, ... ,
12 being the set of previously studied loci (BUSTAMANTE et al. 2002B ) and
13, ... ,
18 representing rapidly evolving protein loci.
R and
S are the corresponding vectors of mutation rates at replacement and silent sites. Denote the mean and variance of the distribution of
among previously studied genes as µ1 and
21 and use µ2 and
22 for the analogous quantities for the rapidly evolving protein genes. - Part 2: Set a truncated uniform prior distribution for t on (0, T), where T is chosen on the basis of prior information on the upper bound for the species divergence time. We used T = 100, corresponding to an upper bound of between 20 and 200 million years ago.
- Part 3: Assume a normal conjugate prior probability distribution for the mean and variance parameters for each of the two classes of genes so that

(A3)
![]() |
(A4) |
where µ0,
0,
0, and
20 are parameters of the prior distributions for µi and
2i, Inv -
2 refers to an inverse
2 distribution, and i
{1, 2} indexes the two sets of hyperparameters for both classes of genes. The notation is borrowed from ![]()
0 and
0 are chosen to be small and
20 to be large, the prior distribution will be uninformative. In our runs we used
.
Results for conditional posterior distributions:
The joint posterior distribution of interest p(
,
R,
S, t, µ1,
21, µ2,
22|data) can be approximated using a Markov chain Monte Carlo sampling scheme similar to that implemented in ![]()
- Result 1: The conditional posterior distribution of µ1|
21,
depends only on the entries in
that are members of the class i and can be shown to be normally distributed as 
(A5) where
i is the arithmetic average of the entries in
for class i and Ji is the number of genes in the class. - Result 2: The marginal posterior distribution of
2i conditional on
, which depends only on the sample variance of the entries in
that are members of the class i and the parameters of the prior distribution, has an inverse
2 distribution with parameters
Ji and
2Ji as given in GELMAN et al. 1997 .
- Result 3: Using independent Gamma prior distributions with parameters
0 and ß0 for each of the mutation rates yields independent Gamma posterior distributions conditional on t and
with parameters
0 + K + S and 
The mean of this distribution is
/ß and the variance is
/ß2. As such, if
0 and ß0 are chosen to be small, the prior will be uninformative. For all our analysis, we used
= ß = 0.001. - Result 4: The posterior distribution p(t|
R,
S,
, data) is proportional to the likelihood function at the point (t,
R,
S,
) if t < T and 0 otherwise. - Result 5: The joint conditional posterior distribution p(
|
R,
S, t, µ,
2, data) factors into the product of the individuals' conditional distributions p(
j|
Rj,
Sj, t, µi,
2i, KRj, SRj, data). Furthermore, the conditional posterior distribution p(
j|
Rj,
Sj, t, µi,
2i, KRj, SRj) for a given gene j in class i is proportional to the product of the likelihood for the gene given
Rj,
Sj,
j, and t, and the probability density of a normal distribution with mean µi and variance
2i, at the point
j.
Markov chain Monte Carlo algorithm:
Given the model and results outlined above, it is then possible to sample from p(
,
R,
S, t, µ1,
21, µ2,
22|data) using the following algorithm (![]()
- Step 1: Initialize
by drawing a value for
j for 1
j
J1 + J2 independently from a normal distribution with mean near 0 and a reasonably large variance. We used several starting values for the mean in the range [-5, 5] and the variance in the range [1, 100]. - Step 2: Using the values in
, update
2i for i
{1, 2} by sampling from the conditional distribution of
2i|
, which is inverse-
2 distributed as detailed above. - Step 3: Using the values in
and
2i for i
{1, 2}, update µi by sampling a new value from a normal distribution with the updated parameters in Result 1 above. - Step 4: Update t by using Metropolis sampling.
- a. Sample a proposal value t' from a U(t -
t, t +
t) distribution. - b. If p(t'|
R,
S,
data) > p(t|
R,
S,
|data), set t = t'. Otherwise, set t = t' with probability proportional to the ratio of these two quantities. - Step 5: Update each
j in
by using J1 + J2 independent Metropolis steps as follows. - a. Sample a proposal value
'j from a U(
j, -
,
j +
). - b. If p(
'j|
Rj,
Sj, t, µi,
2i, SRj, KRj) > p(
j|
Rj,
Sj, t, µi,
2i, SRj, KRj), set
. Otherwise, set
with probability proportional to their ratio. - Step 6: For each gene, draw a value for
Rj and
Sj using the result that the posterior distribution for
j|
j, t is a Gamma distribution with parameters as described in Result 3 above. - Step 7: Repeat steps 26.
We used the above algorithm to approximate the joint posterior distributions using 10 different starting points (i.e., 10 different chains) run for 10,000 steps each after an initial 2000-step burn-in and retention of draws every 10 steps (for a total of 10,000 draws for each parameter in the model). For the Metropolis step for updating t, we used a proposal distribution with
t = 1.0, which gave a rejection rate of
26.36% for the 100,000 draws retained after the initial burn-in. To measure convergence we used a
statistic that was below 1.01 for all parameters in the model before samples were retained (conventionally one retains after 1.2 or less), illustrating that the 10 chains had converged well before we retained our samples.
| LITERATURE CITED |
|---|
AGUADE, M., 2001 Nucleotide sequence variation at two genes of the phenylpropanoid pathway, the FAH1 and F3H genes, in Arabidopsis thaliana. Mol. Biol. Evol. 18:1-9.
ANISIMOVA, M., J. P. BIELAWSI, and Z. H. YANG, 2001 Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution. Mol. Biol. Evol. 18:1585-1591.
Analysis of the genome sequence of the flowering plant Arabidopsis thaliana.. (2000) Nature 408:796-815.[Medline]
BOLDIN, M. P., I. L. METT, and D. WALLACH, 1995 A protein related to a proteasomal subunit binds to the intracellular domain of the p55 TNF receptor upstream to its death domain. FEBS Lett. 367:39-44.[Medline]
BUSTAMANTE, C. D., R. NIELSEN, and D. L. HARTL, 2002a A maximum likelihood method for analyzing pseudogene evolution: implications for silent site evolution in humans and rodents. Mol. Biol. Evol. 19:110-117.
BUSTAMANTE, C. D., R. NIELSEN, S. SAWYER, K. M. OLSEN, and M. D. PURUGGANAN et al., 2002b The cost of inbreeding in Arabidopsis. Nature 416:531-534.[Medline]
DOEBLEY, J. and L. LUKENS, 1998 Transcriptional regulators and the evolution of plant form. Plant Cell 10:1075-1082.
ENDO, T., K. IKEO, and T. GOJOBORI, 1996 Large-scale selection for genes on which positive selection may operate. Mol. Biol. Evol. 13:685-690.[Abstract]
EWING, B., L. HILLIER, M. C. WENDL, and P. GREEN, 1998 Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8:175-185.
GELMAN, A., J. S. CARLIN, H. S. STERN and D. B. RUBIN, 1997 Bayesian Data Analysis. Chapman & Hall, Boca Raton, FL.
HAAG, E. S. and J. R. TRUE, 2001 Perspective: From mutants to mechanisms? Assessing the candidate gene paradigm in evolutionary biology. Evolution 55:1077-1084.[Medline]
HUGHES, A. L., 1991 Circumsporozoite protein genes of malaria parasites (Plasmodium spp.): evidence for positive selection on immunogenic regions. Genetics 127:345-353.[Abstract]
HUGHES, A. L. and M. YEAGER, 1998 Natural selection at major histocompatibility complex loci of vertebrates. Annu. Rev. Genet. 32:415-435.[Medline]
KOCH, M., B. HAUBOLD, and T. MITCHELL-OLDS, 2000 Comparative evolutionary analysis of the chalcone synthase and alcohol dehydrogenase loci in Arabidopsis, Arabis and related genera. Mol. Biol. Evol. 17:1483-1498.
KUMAR, S., K. TAMURA and M. NEI, 1993 MEGA: Molecular Evolutionary Genetics Analysis, Version 1.0. Pennsylvania State University,







