help button home button Genetics Journal Watch
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS

Originally published as Genetics Published Articles Ahead of Print on March 21, 2005.

Genetics, Vol. 170, 419-431, May 2005, Copyright © 2005
doi:10.1534/genetics.103.025692

This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow All Versions of this Article:
genetics.103.025692v1
170/1/419    most recent
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Suchard, M. A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Suchard, M. A.

Stochastic Models for Horizontal Gene Transfer

Taking a Random Walk Through Tree Space

Marc A. Suchard1

Department of Biomathematics, David Geffen School of Medicine, University of California, Los Angeles, California 90095-1766

1 Address for correspondence: Department of Biomathematics, David Geffen School of Medicine, UCLA, 650 Charles Young Dr., Box 951766, Los Angeles, CA 90095-1766.
E-mail: msuchard{at}ucla.edu

Manuscript received December 11, 2003. Accepted for publication February 1, 2005.


    ABSTRACT
 TOP
 ABSTRACT
 MODEL
 STATISTICAL FRAMEWORK
 EXAMPLE
 REMARKS
 APPENDIX
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
Horizontal gene transfer (HGT) plays a critical role in evolution across all domains of life with important biological and medical implications. I propose a simple class of stochastic models to examine HGT using multiple orthologous gene alignments. The models function in a hierarchical phylogenetic framework. The top level of the hierarchy is based on a random walk process in "tree space" that allows for the development of a joint probabilistic distribution over multiple gene trees and an unknown, but estimable species tree. I consider two general forms of random walks. The first form is derived from the subtree prune and regraft (SPR) operator that mirrors the observed effects that HGT has on inferred trees. The second form is based on walks over complete graphs and offers numerically tractable solutions for an increasing number of taxa. The bottom level of the hierarchy utilizes standard phylogenetic models to reconstruct gene trees given multiple gene alignments conditional on the random walk process. I develop a well-mixing Markov chain Monte Carlo algorithm to fit the models in a Bayesian framework. I demonstrate the flexibility of these stochastic models to test competing ideas about HGT by examining the complexity hypothesis. Using 144 orthologous gene alignments from six prokaryotes previously collected and analyzed, Bayesian model selection finds support for (1) the SPR model over the alternative form, (2) the 16S rRNA reconstruction as the most likely species tree, and (3) increased HGT of operational genes compared to informational genes.


TRADITIONAL views of molecular evolution hold that genetic material mutates slowly over time as it is passed in a vertical fashion from parent to progeny. Molecular phylogenetics then aims to reconstruct this history of inheritance of genetic sequence data from contemporary organisms into a tree-like structure. However, belief in a single tree, mandated by vertical transmission, for all genetic material is changing. Evolutionary biologists increasingly recognize the horizontal transmission of genetic material between distantly related organisms as an important mechanism of evolution (SYVANEN 1994; LAWRENCE 1999; JAIN et al. 2002).

The process of horizontal (or lateral) gene transfer (HGT) plays a critical role across all domains of life and in particular among prokaryotes (JAIN et al. 1999; KOONIN et al. 2001). For example, many prokaryotes are agile at quickly adapting to new environments. Often, this ability stems from the acquisition of new genes through HGT rather than through random mutation (LAWRENCE 1999). At least three mechanisms promote HGT in prokaryotes (JAIN et al. 2002). These include: (1) transformation in which free DNA sequences are absorbed from the environment, (2) conjugation between two different prokaryotic species, and (3) transduction of genetic material through viruses. Finally, HGT also has medical importance (BROWN 2003). In the field of infectious diseases, HGT among bacterial pathogens of antibiotic resistance genes has greatly contributed to the emergence of multidrug-resistant bacteria in clinical settings (LEVERSTEIN-VAN HALL et al. 2002). In the field of oncology, HGT may also affect tumor progression; BERGSMEDH et al. (2001) show that eukaryotic cells can transfer active oncogenes.

Three general methods have been employed to examine HGT. The first focuses on single genomes and identifies genes suspected to have been imported through HGT by examining variation in nucleotide base composition and codon usage patterns (LAWRENCE and OCHMAN, 1997). The latter two methods are comparative studies across species. One uses similarity approaches based on gene content to identify HGT (RAGAN 2001) and to propose average genome or species-level trees (SNEL et al. 1999), while the alternative method endorses phylogenetic reconstruction using orthologous genes (JAIN et al. 1999). Base composition and codon bias studies may perform poorly when compared to phylogenetic methods (KOSKI et al. 2001). Further, phylogenetic methods offer at least one advantage over similarity-based approaches. The reconstructed phylogenies have direct biological interpretability as descriptions of the underlying evolutionary histories of the different genes (DOOLITTLE 1999). If a reconstructed gene tree differs from the assumed phylogeny of the species being studied, then HGT is offered as a possible explanation (SYVANEN 1994). One intrinsic difficulty is that the true species tree is often itself unknown. Therefore, it is necessary to either fix the species tree to equal the inferred gene tree for a specially chosen gene, e.g., the 16S rRNA tree (WOESE 2000), or simultaneously estimate the species tree and gene trees given a biologically plausible model relating them. As a first step, several research groups have attacked the inverse problem of reconstructing a species tree given gene trees subject to HGT. Most notable are the parsimony-based reconciled tree work by Page and colleagues (e.g., PAGE 2000) and the algorithmic work of MIRKIN et al. (2003).

I propose a simple class of stochastic models for HGT that enable the simultaneous estimation of the underlying species tree relating a group of organisms and the gene trees subject to HGT for a set of orthologous gene alignments. These HGT models function in a hierarchical manner (SUCHARD et al. 2003a) in which standard Bayesian phylogenetic approaches (e.g., SINSHEIMER et al. 1996; YANG and RANNALA 1997; MAU et al. 1999; LI et al. 2000; HUELSENBECK et al. 2001) are used to reconstruct each gene tree from its corresponding gene alignment. Simultaneous to the reconstructions, the HGT models impose a second probabilistic distribution over the gene trees (MADDISON 1997). This hierarchical distribution describes the gene trees likelihoods given an unknown species tree and an unknown number of HGT events leading from that species tree to each gene tree. The model is fit in a Bayesian framework that naturally handles uncertainty in discrete parameters such as all the trees and the number of HGT events and compares various models using Bayes factors (SUCHARD et al. 2001). Stochastic models fit in statistical frameworks offer several advantages over parsimony approaches. First, parsimony may underestimate the number of HGT events linking the species tree to the gene trees. This consequence is similarly seen in parsimonious reconstructions of the tree themselves, in which the number of nucleotide substitutions is underestimated. Second, it is easier in a statistical framework to include measures of uncertainty and these levels may be high in the inferred gene trees given the sparse data from which they are reconstructed.

One additional advantage of building stochastic models for HGT is the ability to compare competing models and to incorporate possible differences in the stochastic processes across genes, while assessing the significance of these differences in a formal statistical framework. As one example of possible differences across genes, JAIN et al. (1999) propose the complexity hypothesis. Under this hypothesis, genes are divided into one of two classes, informational or operational genes. Between classes, the rates of HGT differ. It is suspected that rates are higher for operational genes than for informational genes. This hypothesis and others can be tested by integrating over all possible species trees and gene trees weighed by their posterior probabilities. This Bayesian model-averaging approach reduces the possible bias inherent in selecting a specific species tree, minimizes underestimation of the uncertainty associated with the hypotheses (TAYLOR et al. 1996), and eliminates the need for ad hoc analyses. Formal comparison of different models for HGT will help gather further insight into the underlying biological processes.


    MODEL
 TOP
 ABSTRACT
 MODEL
 STATISTICAL FRAMEWORK
 EXAMPLE
 REMARKS
 APPENDIX
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
Within-gene reconstruction model:
I begin with a hierarchical framework for phylogenetic reconstruction using molecular sequence data Y (SUCHARD et al. 2003a). Data Y = (Y1, ... , YK) consist of K naturally disjoint partitions. Partition data Yk for k = 1, ... , K represent the aligned DNA sequences of length Lk from one specific gene per partition, sequenced from the same N taxa across all partitions. A hierarchical phylogenetic model enables the pooling of information across gene partitions to improve estimate precision in individual partitions, while permitting estimation and testing of tendencies in across-partition quantities. For HGT, such across-partition quantities include: (1) an overall species tree, (2) appropriate stochastic models from which to construct a probability distribution over individual gene trees given the species tree, and (3) the stochastic model parameters that may vary between different classes of genes.

To utilize standard Bayesian models for phylogenetic reconstruction (e.g., SINSHEIMER et al. 1996; YANG and RANNALA 1997; MAU et al. 1999; LI et al. 2000) within a gene partition, data Yk further divide into ordered homologous sites Ykl for l = 1, ... , Lk. Site data Ykl = (Ykl1, ... , YklN)t contain one nucleotide from each taxon, such that Ykln isin (A, G, C, T) or their ambiguous wildcards for n = 1, ... , N. I assume that sites within a partition are independent and identically distributed, and the likelihood of observing Ykl is given by a multinomial distribution over the 4N possible outcomes with ambiguous nucleotides being integrated over their possible realizations. The multinomial outcome probabilities become functions of an unknown tree {tau}k that describes the relatedness of the N taxa, branch lengths tk = (tk1, ... , tkB), and a model to describe nucleotide mutation along these branches, all within partition k.

I elect for a reversible, continuous-time Markov chain (CTMC) model for nucleotide substitution (FELSENSTEIN 1981) popularized by TAMURA and NEI (1993) (TN93). The TN93 model is further parameterized by two transition:transversion rate ratios, {alpha}k between purines A and G and {gamma}k between pyrimidines C and T, and the stationary distribution of the underlying Markov chain {pi}k = ({pi}kA, {pi}kG, {pi}kC, {pi}kT). The final scale parameter in the TN93 model is fixed such that branch lengths measure the expected number of nucleotide substitutions between the nodes in {tau}k that the branch connects. Because I assume a reversible model for nucleotide substitution and make no clock-like restrictions on branch lengths, the root of each tree is unidentifiable (FELSENSTEIN 1981). As a consequence, the descriptions of all trees to follow are unrooted with N – 2 internal nodes and B = 2N – 3 branches.

Across-gene hierarchical model:
Following the hierarchical framework of SUCHARD et al. (2003a), I take branch lengths tk as exponentially distributed with unknown expected divergence µk within partition k and model

Formula
and

Formula 1(1)
where V = (A, G, M)t and {Pi} = ({Pi}A, {Pi}G, {Pi}C, {Pi}T) are unknown across-partition-level expectations, variance-covariance matrix Formula 1 has diagonal form, and {sigma}–2{alpha}, {sigma}–2{gamma}, {sigma}–2µ, and N{Pi} are unknown across-partition-level measures of precision. Leaving V, {Sigma}, {Pi}, and N{Pi} as unknowns specified only by hyperprior distributions and estimating these parameters simultaneously with the within-partition-level continuous parameters, {alpha}k, {gamma}k, and µk for all k, enables the borrowing of strength of information from one partition by another, producing more precise within-partition-level estimates. I assume conjugate (when possible) and flat or noninformative hyperpriors on these across-partition-level parameters, as discussed in SUCHARD et al. (2003a). While the development of hierarchical priors over the continuous within-partition-level parameters has been straightforward, constructing a hierarchical prior over gene trees {tau}k that incorporates the stochastic nature of HGT is more involved. This is illustrated in the next section.

Horizontal gene transfer models:
To build a stochastic model for HGT, I first present a formal description of the set of all possible N-taxon trees, commonly referred to as "tree space" (BILLERA et al. 2001), as a mathematical graph and then discuss several possible random walks (D. ALDOUS and J. FILL, unpublished results) on this graph that mirror the observed effects of HGT.

There exist M = (2N – 5)!/2N–3(N – 3)! possible trees relating N extant taxa (FELSENSTEIN 1981). On the basis of these M trees, I construct a graph Formula 1 = ({nu}, {epsilon}) with vertex set {nu} and edge set {epsilon}. Each tree represents a different vertex, or node, in the graph, such that the size of the vertex set |{nu}| = M. An edge uv isin {epsilon} of a graph describes a direct connection between two of the graph's vertices u, v isin {nu}. The number of edges emanating from a single vertex v defines its degree d(v). Two vertices that are joined together by a single edge are called adjacent. Restricting attention to simple graphs in which pairs of vertices may be connected to each other only by a single edge and no vertex is connected to itself by a looping edge, a single vertex v from graph Formula 1 may be adjacent from as few as zero to as many as M – 1 other vertices. The set of all vertices adjacent to v are its neighborhood {Gamma}(v) and the size of this neighborhood |{Gamma}(v)| = d(v). The specification of a neighborhood for each vertex completes the description of Formula 1, and many choices are available.

Subtree-prune-regraft-based model:
One approach to defining neighborhoods for each possible tree stems from subtree transfer operations (ALLEN and STEEL 2001). Subtree transfer operators act on trees producing local rearrangements. Applying a subtree transfer operator to one tree {tau} results in the creation of one of several possible new topologies that differs from {tau} by an extent dependent on the operator. The collection of all trees one operation away from {tau} = v becomes its neighborhood {Gamma}(v) under that operator. Nearest-neighbor interchange (ROBINSON 1971), tree bisection and reconnection (SWOFFORD et al. 1996), and subtree prune and regraft (SPR) (HEIN 1990, 1993) are three examples. In light of the goals of this article, SPR offers an advantage over the former two operators because of its potential biological interpretation. Applying the SPR operator to {tau} = v with its resultant drawn from {Gamma}SPR(v) mirrors the differences observed between a species tree and an individual gene tree affected by one HGT (or recombination) event (HEIN 1990, 1993; JAIN et al. 1999; ALLEN and STEEL 2001).

Figure 1 illustrates one realization of the SPR operator applied to a six-taxon tree. The operator works in two steps. The first step selects and cuts any branch in the initial tree, {tau}initial. Cutting the branch prunes away a subtree, {tau}subtree. This subtree then regrafts itself using the same cut branch to a new internal node obtained by subdividing a preexisting branch in {tau}initial {tau}subtree.


Figure 1
View larger version (11K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 1.— Subtree prune and regraft operator applied to a six-taxon tree. (1) Operator selects and cuts any branch in the initial tree, pruning away a subtree. (2) Operator regrafts subtree by selecting and subdividing a preexisting branch in the remaining tree. (3) Resultant tree for this realization.

 
Several important properties about the graph Formula 1SPR induced by the SPR operator have been previously studied. First, Formula 1SPR is regular, implying that every vertex v isin {nu}SPR possesses the same degree d(v) = 2(N – 3)(2N – 7) and, hence, neighborhood size (ALLEN and STEEL 2001). Also, Formula 1SPR is connected, meaning that a sequence of consecutive edges (a path) exists, connecting every pair of vertices in Formula 1SPR (ROBINSON 1971; ALLEN and STEEL 2001).

One straightforward stochastic process on any simple graph Formula 1 is an unweighted random walk. A random walk on Formula 1 proceeds from vertex to vertex along existing edges of the graph, generating a discrete-time Markov chain (DTMC), where the states of the chain are the visited vertices. As unweighted, the chain uniform randomly chooses its next vertex to visit from all neighbors of its current vertex. For this DTMC, the one-event transition probability matrix A has entries

Formula 2(2)
defining the probability of u changing into v as a result of one random event. It should be noted that A is just the adjacency matrix of Formula 2 rescaled to be a stochastic matrix [i.e., {sum}v(A)uv = 1].

On the basis of K random walks on the graph Formula 2SPR induced by the SPR operator, I construct a hierarchical prior over the joint distribution of all gene trees {tau}k. To accomplish this task, I assume:

Figure 2 depicts one set of the possible paths of K = 4 Markov chains starting at species tree {Upsilon} and ending at gene trees {tau}k on a small portion of a representative graph. The lengths of paths Ek shown range from one to three. I illustrate no paths of length zero, but these realizations should be most likely. A parsimony-like analysis considering beginning and end points of the chains in Figure 2 would, for example, underestimate E4 as zero instead of three.


Figure 2
View larger version (26K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 2.— One possible Markov chain realization on a simplified graph for the species tree {Upsilon} and four gene trees {tau}1, ... , {tau}4. All chains begin at the same vertex representing the species tree {Upsilon} (in white). The chain producing gene tree {tau}1 has length E1 = 3 (blue), the chain for {tau}2 has length E2 = 1 (red), the chain for {tau}3 has length E3 = 2 (yellow), and the chain for {tau}4 has length E4 = 3 (green). Note that this latter chain returns to its starting state; a parsimony-like analysis would estimate E4 = 0. Not depicted are chains with actual length zero; these are most probable a priori.

 
Given the assumptions listed above, the probability of species tree {Upsilon} giving rise to gene tree {tau}k after Ek HGT events is

Formula 3(3)

To complete the hierarchical specification, I assign a prior distribution over {Upsilon} by letting

Formula 4(4)
where z = (z1, ... , zM) are constants, the prior probabilities of the M possible N-taxon trees. When little or no information is available about {Upsilon}, one reasonable choice is z1 = ... = zM = 1/M; alternately, one may choose z such that the prior odds of competing hypotheses regarding {Upsilon} are one in a hypothesis-testing setting (SUCHARD et al. 2003a). A further choice is discussed later. I further assume a conditionally independent prior on all Ek,

Formula 5(5)
where {Lambda}k is the expected number of HGT events for gene k and is a deterministic function of across-gene-level parameters. This prior is conjugate to (3), allowing all Ek to be integrated out of the model, improving sampling efficiency (LIU 1994),

Formula 6(6)
Letting q({tau}k = v|{Upsilon} = u, {Lambda}k) = (P)uv, the multistep transition probability matrix,

Formula 7(7)
where I is the M x M identity matrix and Q = PI is the CTMC infinitesimal rate matrix representation of the HGT process. In this parameterization, {Lambda}k are scaled as the expected number of HGT events per gene. Let {Lambda} = ({Lambda}1, ... , {Lambda}K). Then, recalling the conditional independence assumption between Markov chains, the joint distribution over all gene trees {tau}k becomes

Formula 8(8)

Calculating the probabilities in (8) requires numerical methods to determine the matrix exponential involving PSPR. These methods involve calculating the complete set of eigenvalues and eigenvectors of PSPR, requiring Formula 8(M3) operations. Such procedures become quickly computationally prohibitive as N, and hence M, increases. As a consequence, numerical approximations may be necessary to develop weighted graph extensions to Formula 8SPR directly. The weights in these extended graphs would be functions of unknown parameters and sampling these parameters would necessitate repetitive diagonalization.

Random walks with analytic solutions:
An alternative to this computational barrier involves using random walks on graphs for which analytic solutions are known for any size M. To help find such solutions, Equation 7 demonstrates the close connection between a DTMC with a Poisson-distributed number of events and a CTMC. In fact, any such DTMC can be expressed as a unique CTMC, called the "continuized" version (D. ALDOUS and J. FILL, unpublished results). Analytic solutions for several weighted and unweighted CTMC processes on a complete graph are commonly used in phylogenetics. In a complete graph, all vertices are adjacent to all others. The most notable examples are the CTMC models for nucleotide substitution. The simplest model by JUKES and CANTOR (1969) is unweighted. In the APPENDIX, I present the multistep transition probability matrix PGJC for a generalized Jukes-Cantor (GJC) model involving an arbitrary number of vertices M. Proposed by KIMURA (1980), the next most sophisticated model for a complete graph is weighted. This model presupposes that the vertices are divided into two disjoint sets, {nu}1 {cup} {nu}2 = {nu}, and that transitions within and between {nu}1 and {nu}2 occur at varying rates. In terms of HGT, such a weighted random walk may prove useful to model varying rates of HGT between different groups of taxa. Letting M1 = |{nu}1|, M2 = |{nu}2|, and R equal the ratio of within- to between-transition rates, I present the multistep transition probability matrix PGK given M1, M2, and R for a generalized Kimura (GK) model in the APPENDIX.

Modeling differences across gene classes:
I incorporate potential differences across genes in the expected number of HGT events {Lambda}k by employing a generalized linear model (GLM) approach (MCCULLAGH and NELDER 1983). GLMs link the mean response, in this case {Lambda}k, to a set of linear predictors. First, I divide all K genes into one of C possible classes, where the definition of the classes depends on the specific research question at hand. To identify gene-class membership in the GLM, I construct a K x C design matrix D = (Dkc), where matrix elements Dk1 = 1 for all k, representing the baseline multiplier for the reference class, and

Formula 9(9)
for c = 2, ... , C, representing the offset multipliers for the remaining classes. Such a design matrix is standard in regression problems involving categorical dependent variables. I model

Formula 10(10)
where linear combinations of predictors {lambda} = ({lambda}1, ... , {lambda}C) specify, on the log-scale, the expected number of HGT events for all classes. I complete the hierarchical prior specification by assuming

Formula 11(11)
I set L = (–2, 0, ... , 0) and {Psi} = diag(10, ... , 10). This provides a quite diffuse prior on {lambda}, with the median expected number of HGT events per gene {approx} 0.14 (GARCIA-VALLVE et al. 2000) for all classes.

As an example of how this GLM construction functions, consider the C = 2 classes case. Then,

Formula 12(12)
When {lambda}2 = 0, no difference across classes exists. Likewise, when {lambda}2 < 0, the expected number of HGT events per gene is smaller in class 2 than in class 1, and when {lambda}2 > 0, the expected number is larger.


    STATISTICAL FRAMEWORK
 TOP
 ABSTRACT
 MODEL
 STATISTICAL FRAMEWORK
 EXAMPLE
 REMARKS
 APPENDIX
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
Comparing the relative appropriateness of the various stochastic models for HGT proposed in preceding sections and testing for significant differences in the expected number of HGT events across genes can be accomplished using Bayesian model selection via Bayes factors. Bayes factors are the Bayesian analog of the likelihood-ratio test (LRT), but suffer from fewer difficulties than LRTs in discrete spaces, when comparing non-nested models and with sparse data (SUCHARD et al. 2001). A Bayes factor B10 measures the relative change in the support of the data Y in favor of one statistical model M1 over another model M0 and equals the ratio of the marginal likelihood m(Y|M1) of M1 over the marginal likelihood m(Y|M0) of M0 (KASS and RAFTERY 1995). To calculate Bayes factors, frequently more efficient methods than estimating the multidimensional integrals hidden in the marginal likelihoods directly are available.

When models are nested, a relatively simple Bayes factor calculation is available via the Savage-Dickey ratio (VERDINELLI and WASSERMAN 1995) and involves generating a posterior sample from the larger model only (SUCHARD et al. 2003b). For example, to assess the significance of differences across gene classes in the expected number of HGT events, let M1 represent the unrestricted model proposed above. Nested within M1 exists M0, the equal-rates model, where {lambda}c = 0 for c = 2, ... , C. Further, the GJC model is nested within the GK model, as both are equal when R = 1.

On the other hand, the GJC and SPR models are non-nested, but both possess zero free parameters in their respective P matrices. For two arbitrary models M0 and M1 in situations like this, it is possible to estimate the posterior probabilities p(M0|Y) and p(M1|Y) by constructing a mixture model over the joint space of M0 and M1. By applying the Bayes theorem,

Formula 13(13)
where q(M0) and q(M1) are the prior probabilities of models M0 and M1 in the mixture. Generally, I assume equal prior probabilities, q(M0) = q(M1) = 1/2, when reporting posterior estimates. However, improved efficiency in estimating B10 can be garnered by adjusting these prior probabilities such that p(M0|Y) {approx} p(M1|Y) (CARLIN and CHIB 1995; SUCHARD et al. 2002).

Models SPR and GK neither are nested nor contain the same number of free parameters. One might entertain constructing a reversible-jump Markov chain Monte Carlo (MCMC) sampler (GREEN 1995) over the joint space of these models to compute the Bayes factor in support of SPR over GK. However, a simpler algebraic solution exists given the two preceding Bayes factor calculations,

Formula 14(14)

To estimate all model parameters and Bayes factors, I employ MCMC. I further develop this MCMC algorithm and discuss its performance in the APPENDIX.


    EXAMPLE
 TOP
 ABSTRACT
 MODEL
 STATISTICAL FRAMEWORK
 EXAMPLE
 REMARKS
 APPENDIX
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
To illustrate these stochastic models for HGT and methods to test hypotheses about them, I examine a large set of orthologous, prokaryotic genes collected by JAIN et al. (1999). The data consist of K = 144 separate gene alignments. Each alignment contains orthologous copies of a single gene from six prokaryotes. These prokaryotes are: Aquifex aeolicus (Aa), an early branching thermophilic eubacterium; Escherichia coli (Ec), a proteobacterium; Synechocystis 6803 (S6), a cyanobacterium; Bacillus subtilis (Bs), a gram-positive bacterium; Methanococcus jannaschii (Mj), a methanogen; and Archaeoglobus fulgidus (Af), a thermophilic sulfate-reducing methanogen relative. The first four organisms are Eubacteria, while the last two are Archaea. JAIN et al. (1999) construct the gene alignments on the basis of amino acid translations, assuming a star tree to reduce alignment bias (LAKE 1991), and classify each gene into one of two distinct classes, informational and operational genes (RIVERA et al. 1998). Table 1 lists the functional characteristics of the genes that fall into each class. As a generalization, informational gene products interact in large complex systems; this is especially true of the translational and transcriptional apparatuses. On the other hand, most operational gene products function independently or in small protein assemblies. In total, JAIN et al. (1999) assign 56 genes as informational and 88 as operational, employ these genes to examine the complexity hypothesis, and find support for higher levels of HGT among the operational genes as compared to the informational genes.


View this table:
[in this window]
[in a new window]

 
TABLE 1 Functional definitions of two distinct gene classes, adopted from RIVERA et al. (1998)

 
I parallel the above analysis by assuming that the number of the different gene classes C = 2. I let class c = 1 represent the informational genes and class c = 2 represent the operational genes. To further maintain consistency with JAIN et al. (1999), I exclude third codon position nucleotides from all alignments and assume that first and second codon position nucleotides are evolving independently under the same process for each gene.

Selection of stochastic model:
I begin by comparing the relative likelihoods of the three different stochastic models, SPR, GJC, and GK. For the GK model, I define my two disjoint sets of trees as (1) those that support a split between the four Eubacteria and the two Archaea, {nu}1, and (2) those that do not, {nu}2. These definitions offer a first approximation to modeling differing rates of HGT within life domains and across domains in this example. HGT events that start and end in set {nu}1 are within domain transfers, while events that start in {nu}1 and end in {nu}2, or vice versa, are across domain transfers.

The log10 Bayes factor in favor of SPR over GJC and the log10 Bayes factor in favor of GK over GJC are

Formula 15(15)
Figure 4a illustrates the scaled regeneration quantile (SRQ) plot for estimating the relative posterior probabilities used to calculate log10 BSPR,GJC. No substantial deviation from the slope = 1 line implies the MCMC chain is mixing sufficiently to generate this estimate. Combining the results in (15), I calculate the log10 Bayes factor in favor of SPR over GK as

Formula 16(16)


Figure 4
View larger version (24K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 4.— Scaled regeneration quantile (SRQ) plots to assess MCMC sampler performance when estimating four relative posterior probabilities. Plot a was generated when comparing the SPR and GJC stochastic models. Plots b–d were generated when comparing the three most probable species trees. No substantial deviation in the slopes from 1 (dashed lines) implies that the chains are mixing well enough to produce stable estimates.

 
Considering these Bayes factor estimates, the data strongly reject (KASS and RAFTERY 1995) the two complete graph models with analytic solutions in favor of the more biologically plausible process based on the SPR operator. However, the GJC and GK models should not be discounted completely; their computational complexity does not increase with increasing number of taxa N and they can offer some insight into the underlying biological processes. For example, the Bayes factor in favor of GK over GJC offers some indirect support for differing HGT rates within domains rather than across domains. One caveat should be kept in mind to keep from drawing too strong a conclusion from this finding—the unbalanced study design with only two Archaea precludes identifying HGT events within that domain. All further results in this article are based on the SPR model.

Estimating the species tree:
Figure 3 displays the currently accepted species tree relating the six prokaryotes studied here. The four Eubacteria and two Archaea form two distinct clades (FENG et al. 1997) and Aa is the earliest branching species of the Eubacteria studied (DECKERT et al. 1998). The branching order of the remaining three Eubacteria Ec, S6, and Bs is more ambiguous (GIOVANNONI et al. 1996). The three possible resolutions of this trifurcation are depicted on the right side of Figure 3. Much of the debate surrounding the trifurcation depends on data choice and reconstruction methodology. For example, the top resolution produces species tree {Upsilon}Ec-S6 that places Ec and S6 as nearest neighbors. Protein synthesis elongation factor (EF) Tu gene reconstructions support this tree (LAKE and RIVERA 1996) and JAIN et al. (1999) fix {Upsilon}Ec-S6 as their reference tree in their analysis. Reconstructions of 16S rRNA phylogeny support the middle resolution of species tree {Upsilon}Bs-S6 (COLE et al. 2003) with Bs and S6 as nearest neighbors. The final resolution of species tree {Upsilon}Ec-Bs gains support from reconstructions of phenylalanyl-tRNA synthetase (TEICHMANN and MITCHISON 1999). However, even these three critical genes are subject to HGT (WOLF et al. 1999; ZAP et al. 1999; KE et al. 2000) and their reconstructed phylogenies may inaccurately represent the true species tree.


Figure 3
View larger version (19K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 3.— Species tree relating six prokaryotes. Species are: Aquifex aeolicus (Aa), Escherichia coli (Ec), Synechocystis 6803 (S6), Bacillus subtilis (Bs), Methanococcus jannaschii (Mj), and Archaeoglobus fulgidus (Af). Branch order of three Eubacteria Ec, S6, and Bs is under debate, leading to three possible subtrees (shown on right).

 
On the basis of the SPR model for HGT, I infer {Upsilon}Bs-S6 as the most likely species tree with >0.999 posterior probability. The two other resolutions, {Upsilon}Ec-Bs and {Upsilon}Ec-S6, are the second and third most likely species trees, respectively. To estimate the Bayes factors in favor of {Upsilon}Bs-S6 against {Upsilon}Ec-Bs and {Upsilon}Ec-S6, I judiciously reweight my prior probabilities on trees z and calculate

Formula 17(17)
Similar to the back calculation completed in previous section, I estimate

Formula 18(18)
while direct calculation of log10 BBs-S6,Ec-S6 using the sampler yields approximately the same result. Figure 4, b–d, depicts the SRQ plots relevant to these Bayes factor calculations. Again, the MCMC chain appears well mixing. Although the posterior support for {Upsilon}Ec-Bs and {Upsilon}Ec-S6 initially appears quite small, on a relative scale it is not; probabilities for the remaining 102 trees are >15 orders of magnitude smaller.

Data sets as large as the K = 144 gene alignments from JAIN et al. (1999) are currently rare. Consequentially, I examine via simulation the number of alignments necessary to identify the species tree under the SPR model. Under this simulation, I randomly sample without replacement a fixed number of gene alignments K and then estimate the posterior support for {Upsilon}Bs-S6, assuming this is the true species-tree. I repeat this simulation 20 times for each value of K. For K = 2, the expected posterior probability of {Upsilon}Bs-S6 = 0.14. This estimate is approaching its prior value, signifying appropriate MCMC sampling with limited data. Approximately K = 50 gene alignments are required to achieve an expected posterior probability ≥0.80 and K = 70 are required for ≥0.90.

Hierarchical estimates of evolutionary pressures:
Table 2 presents the posterior estimates of the across-gene-level parameters used to pool information about ({alpha}k, {gamma}k, µk, {pi}k). The table also lists posterior estimates of

Formula 19(19)
These transformed variables report the across-gene-level averages of the two transition:transversion ratios and expected divergence on their usual, instead of log, scale. As seen from Table 2, the average transition:transversion ratio for purines A' is significantly different from the ratio for pyrimidines G', as the ratios' 95% Bayesian credible intervals (BCIs) do not overlap, and both ratios are greater than one. This supports the use of the TN93 model for nucleotide substitution over a more restricted model. Estimates of A, G, and {Pi} are consistent with a previous study using a subset of the data in a hierarchical framework (SUCHARD et al. 2003a). Also in comparison to this previous study, differences in estimates of M, {sigma}2A, {sigma}2G, {sigma}2M, and N{Pi} all trend in the correct directions given the increase in the number of taxa and genes fit here.


View this table:
[in this window]
[in a new window]

 
TABLE 2 Hierarchical across-gene-level estimates of evolutionary pressures

 
Varying rates of HGT across gene classes:
Figure 5 displays model estimates for the linear predictors {lambda}1 and {lambda}2 and for the expected number of HGT events per gene, {Lambda}k, for the informational and operational gene classes. The two top plots display histograms of the posterior samples of {lambda}1 (left) and {lambda}2 (right). These plots also include normal approximations to the posterior (solid lines) and prior densities (dashed lines). Examining the plot on the right, the prior density at {lambda}2 = 0 (dotted vertical line) is considerably higher than the normal approximation to the posterior density. Further, the 95% BCI of {lambda}2 = (0.27–1.15) and does not cover zero. Both observations support the hypothesis that {lambda}2 != 0 and, hence, that rates of HGT differ between informational and operational genes. Formally, the Bayes factor in favor of differing rates is given by the Savage-Dickey ratio. The log10 Bayes factor,

Formula 20(20)
offers substantial support (KASS and RAFTERY 1995) in favor of differing rates.


Figure 5
View larger version (27K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 5.— Analysis of the complexity hypothesis. The top two plots depict the posterior distributions of linear predictors {lambda}1 and {lambda}2 using histograms and normal approximations (solid lines). Also shown are the predictors' prior densities (dashed lines). Greater prior than posterior density at {lambda}2 = 0 (dotted line) supports a difference in HGT rates between gene classes. The bottom plot depicts the posterior distributions of the expected number of HGT events per gene for informational genes (light shading) and operational genes (dark shading).

 
The bottom plot in Figure 5 transforms {lambda}1 and {lambda}2 into the expected number of HGT events per gene and displays histograms of the posterior samples of these quantities. Depicted in dark shading is {Lambda}k for the operational genes and depicted in light shading is {Lambda}k for the informational genes. Although {Lambda}k for operational genes is significantly greater than {Lambda}k for informational genes from the argument above, a small amount of overlap is observed (solid shading) between these marginal histograms. This overlap results from the high negative correlation between {lambda}1 and {lambda}2 (data not shown) and illustrates the need for caution in making inference on the basis of marginal posterior summaries alone.


    REMARKS
 TOP
 ABSTRACT
 MODEL
 STATISTICAL FRAMEWORK
 EXAMPLE
 REMARKS
 APPENDIX
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
In this article, I proposed a simple class of stochastic models for HGT. The models are based on a random walk process in tree space and allow for the development of a joint distribution over multiple gene trees given an unknown species tree. I consider two general forms of random walks. The first stems from subtree transfer operations, in particular the SPR operator that mirrors the observed effects that HGT has on an inferred tree. The second form is based on walks over complete graphs and offers numerically tractable solutions for increasing number of taxa. I fit these models using a Bayesian framework to data from six prokaryotes. I find strongest support for the species tree that places Bs and S6 as nearest neighbors. This tree is supported by 16S rRNA reconstructions, but differs from the EF-Tu tree assumed by JAIN et al. (1999). I demonstrate the flexibility of these stochastic models to test competing ideas about HGT by examining the complexity hypothesis and find support for increased HGT of operational genes compared to informational genes. This latter finding remains unchanged if I fix the species tree to equal the EF-Tu tree (data not shown).

The specific stochastic models for HGT developed in this article have important limitations. First and foremost, the random walks explore only the discrete, topological portion of tree space and do not consider changes in branch lengths between trees as part of the underlying HGT process. As a result, HGT between nearest neighbors in a tree remains unidentified as this process does not result in a change in the topological configuration of the tree. Model extensions that consider a continuous random drift process on the joint space of ({tau}, t) (BILLERA et al. 2001) may circumvent this shortfall. For a related problem involving coalescence, YANG (2002) shows that including branch lengths t into the probabilistic model across loci improves power. Additionally, I assume that the K DTMCs representing the random walks of the gene trees {tau}k away from the species tree {Upsilon} are conditionally independent given {Upsilon}. This assumption implies that the evolutionary histories of all genes are unlinked, while evidence for the HGT of, at a minimum, complete operons abounds in prokaryotes (KOONIN et al. 2001). Possible modeling aspects include allowing for linked or partially linked genes.

HGT is not the only process that may cause incongruence between gene trees. Although the effects of lineage sorting should be minor given the extensive divergence between the species studied here, the inclusion of paralogous genes copies within the orthologous alignments may mislead inference. Also important, stochastic error due to sparse phylogenetic data, evolutionary model misspecification, and parallel/convergent evolution can falsely produce incongruence between trees (CAO et al. 1998). These effects should upwardly bias the inferred number of HGT events. However, I suspect this bias is less than one HGT event per gene as only a modest percentage of genes should be affected and the error should produce just minor changes in the inferred tree. There is no a priori reason to suspect that this bias differs between the informational and operational gene classes; so the bias does not affect the relative difference between classes in HGT rates and inference regarding the complexity hypothesis.

For the SPR model, numerical approximations to the matrix exponentials involving the multistep transition probability matrix PSPR may offer promise in handling research problems with larger numbers of taxa N (MOLER and VAN LOAN 2003). As N increases, the square dimensions of PSPR grow superexponential, while the size of the neighborhood of each vertex grows only as Formula 20(N2). As a consequence, PSPR becomes increasingly sparse. In this situation, the number of unique eigenvalues increases substantially slower than the matrix's dimension. Krylov subspace techniques (SIDJE and STEWART 1999) may stretch computational limits upward to N = 8 or more.

In spite of these limitations, these stochastic models for HGT offer several advantages over previous approaches to studying HGT using multiple orthologous gene alignments. Under these stochastic models, the species tree is an unknown parameter that may be either integrated out of the analysis as a nuisance parameter or estimated jointly with the multiple gene trees. Joint analysis decreases the possibility of bias introduced through fixing the species tree when knowledge about it is uncertain. A stochastic approach also overcomes the bias inherent in parsimony-like estimation. Further, the hierarchical framework in which the stochastic model sits enables the borrowing of strength in the estimation of all gene-partition-level estimates including the gene trees themselves. Finally, and most importantly, stochastic models lend themselves well to formal statistical testing, with no need for ad hoc procedures. The ability to compare differing models for HGT will continue to shed further insight into the underlying biological process.


    APPENDIX
 TOP
 ABSTRACT
 MODEL
 STATISTICAL FRAMEWORK
 EXAMPLE
 REMARKS
 APPENDIX
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
Complete models:
To determine the multistep transition probability matrix PGJC for the GJC model with M ≥ 2 states, I first recall that

Formula 21(A1)
is generated from an unweighted complete graph. As a complete graph, it is trivially connected and, therefore, has a unique stationary distribution. This distribution is (1/M, ... , 1/M).

To determine the eigenvalues of QGJC, I write

Formula 22(A2)
where QGJC is scaled such that {Lambda}k is expressed in terms of the expected number of HGT events per gene, J is the M x M matrix of all ones, and I is the M x M identity matrix. Matrix J has a rank of one and, therefore, one nonzero eigenvalue that equals M/(M – 1). Given the eigenvalues of J and expression (A2), the M eigenvalues of QGJC become

Formula 23(A3)
Like the standard Jukes-Cantor model, where M = 4, the GJC model for any M ≥ 2 continues to have only two distinct eigenvalues. Conceptually this results because the qualitative behavior of the underlying Markov chain does not change as the size of the state-space increases.

By letting {Lambda}k -> {infty}, I see that the stationary distribution is the eigenvector corresponding to the 0 eigenvalue. By examining the other limiting case where {Lambda}k = 0 and considering the initial conditions, algebraic rearrangement yields

Formula 24(A4)

The state-space of the GK model is partitioned into two disjoint sets {nu}1 and {nu}2. Let M1 = |{nu}1| and M2 = |{nu}2|, where M1 + M2 = M, and let R be the ratio of rates for transitions within a structural set to transitions between sets. Then, following arguments similar to those above, one can find the multistep transition probability matrix PGK for the GK model.

If u isin {nu}1, then

Formula 25(A5)
where

Formula 26(A6)
By symmetry, if u isin {nu}2, then

Formula 27(A7)
where {phi}3 = M2R + M1. For R != 1, note that there are four unique eigenvalues when M1 != M2 and three unique eigenvalues otherwise. This is consistent with the standard Kimura model, in which M1 = M2 = 2 with three unique eigenvalues.

Sampling algorithm:
For each gene-partition k, let {theta}k = ({tau}k, tk, {alpha}k, {gamma}k, µk, {pi}k) and, then, assemble {theta} = ({theta}1, ... , {theta}K) to be the collection of all gene-level parameters. To specify the hierarchical prior parameters, let {phi} = (V, {Sigma}, {Pi}, N{Pi}, {Upsilon}, {lambda}). Across-gene-level parameters {phi} also include R when considering the GK model and mixing parameter {psi} isin {0, 1} when comparing models SPR and GJC. I employ a MCMC approach to sample from each model's joint posterior distribution, p({theta}, {phi}|Y). I generate samples from these posteriors using two nested Metropolis-within-Gibbs cycles, as laid out in SUCHARD et al. (2003a) for hierarchical phylogenetic models. The outer cycle first iterates over gene partitions k and then over the parameters in {phi}. Within each gene partition k, the inner cycle proceeds over the parameters in {theta}k. With the exception of proposals for {Upsilon}, {lambda}, R, and {psi}, all parameter proposals follow those in SUCHARD et al. (2003a).

The multinomial prior placed on {Upsilon} is conjugate to its sampling density. As a result, it is possible to Gibbs sample {Upsilon} from its full conditional distribution for moderately small M. This full conditional distribution remains multinomial with M state probabilities given by

Formula 28(A8)
where {tau}k = vk for all k and {Omega}{Upsilon} is the vector of all model parameters ({theta}, {phi}) excluding {Upsilon}. Similar to the reweighted prior approach to estimate {psi}, varying z can improve sampling efficiency when estimating the relative posterior probabilities of specific species trees {Upsilon}.

I draw the transition ratio R and linear predictors {lambda} via separate Metropolis-Hastings proposals. For R, I propose new parameter values by generating a normal random variate centered at the current value of R with a tunable variance s2R. Given the high degree of correlation between column vectors in the design matrix D, I expect the posterior distribution of {lambda} to also exhibit strong correlation. This expectation stems from a normal linear regression approximation to p(exp({lambda})|{Lambda}) that has a variance-covariance structure proportional to (D'D)–1. As a consequence, component-by-component updating of {lambda}c in {lambda} should lead to a slowly mixing MCMC chain (ROBERTS and SAHU 1997). To help ensure adequate mixing, I propose all {lambda}c simultaneously using a multivariate normal random variate centered at the current value of {lambda} with a tunable variance-covariance matrix diagFormula 28{Xi}. I adjust the tunable variances such that proposals have acceptance rates of 30–40% (GELMAN et al. 1996) and fix the correlation matrix {Xi} approximately equal to the posterior correlation of {lambda} determined by a trial chain.

When comparing HGT models using a mixture approach, I sample the mixing parameter {psi} directly from its full conditional distribution in a Gibbs step,

Formula 29(A9)
where

Formula 30(A10)
for i = 0, 1, {Upsilon} = u, and {tau}k = vk. Values a may be saved at each iteration and used to construct a Rao-Blackwellized estimator for p(M1|Y) (SUCHARD et al. 2003a).

Finally, the inferred number of HGT events Ek for the SPR model can be recovered after posterior simulation. The full conditional distribution

Formula 31(A11)
where {Upsilon} = u and {tau}k = vk. Since Formula 31uvk ≤ 1,

Formula 32(A12)
where

Formula 33(A13)
As the full conditional distribution of Ek is bounded above, I can generate random draws from it using rejection sampling. Starting with a posterior sample Formula 33, I draw one replicate EFormula 33k for each p = 1, ... , P. For each p, I first generate E* from a PoissonFormula 33 distribution and U from the uniform distribution. Then, if U ≤ (AE*)uv/(P)uv, where {Upsilon}(p) = u and Formula 33, I set Ek(p) = E*. Otherwise, I reject the current proposal and begin again by regenerating (E*, U).

MCMC performance:
I run my MCMC chains for 1.1 x 105 outer Metropolis-within-Gibbs cycles, discard the first 104 cycles as burn-in, and subsample every 10 cycles. This process retains P = 104 posterior samples with decreased autocorrelation. The total chain length and burn-in time appear moderately longer than required by examining time-series plots of the model log-likelihood during simulation.

To assess the performance of the MCMC sampler, I employ scaled SRQ plots (MYKLAND et al. 1995; LI et al. 2000; SUCHARD et al. 2002). SRQ plots are useful to demonstrate adequate sampler mixing within discrete model parameters. For the primary measures in this study, two important discrete parameters are the species tree {Upsilon} and the model mixture parameter {psi}. In particular, I use SRQ plots to assess mixing when comparing the relative probabilities of two possible species trees and of differing stochastic models for HGT. In these SRQ plots, the local slope around a given point depicts the ratio of the relative posterior probability estimate based on the entire MCMC chain to an estimate based on a short segment of the chain around that point. Substantial deviation of the slope from one implies that the sampler is slowly mixing and, as a result, the chain is not sufficiently long to generate stable estimates. For continuous model parameters and Bayes factors based on the Savage-Dickey ratio, I assess convergence by comparing posterior estimates obtained from simulations of at least five independent chains with starting values drawn directly from the model priors.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 MODEL
 STATISTICAL FRAMEWORK
 EXAMPLE
 REMARKS
 APPENDIX
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
I thank the Lake lab, in particular Jon Moore and Jim Lake, for stimulating my interest in HGT, for many provocative discussions, and for providing the SPR adjacency matrices and prokaryote sequences used in this study. I also thank John Huelsenbeck for his insights into HGT and Janet Sinsheimer and Vladimir Minin for commenting on this manuscript. Fred Fox and the National Science Foundation grant 9987641-sponsored University of California, Los Angeles, Training Program in Bioinformatics graciously made possible the computing facilities to fit all 144 alignments simultaneously. The complete data set is available to interested readers at http://www.biomath.medsch.ucla.edu/msuchard/datasets.html. I am supported in part by National Institutes of Health grants GM08042 and GM068955 and U.S. Public Health Service grant CA16042.


    LITERATURE CITED
 TOP
 ABSTRACT
 MODEL
 STATISTICAL FRAMEWORK
 EXAMPLE
 REMARKS
 APPENDIX
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 

ALLEN, B., and M. STEEL, 2001 Subtree transfer operations and their induced matrices on evolutionary trees. Ann. Combinatorics 5: 1–15.

BERGSMEDH, A., A. SZELES, M. HENRIKSSON, A. BRATT, M. FOLKMAN et al., 2001 Horizontal transfer of oncogenes by uptake of apoptotic bodies. Proc. Natl. Acad. Sci. USA 98: 6407–6411.[Abstract/Free Full Text]

BILLERA, L., S. HOLMES and K. VOGTMANN, 2001 Geometry of the space of phylogenetic trees. Adv. Appl. Math. 27: 733–767.[CrossRef]

BROWN, J., 2003 Ancient horizontal gene transfer. Nat. Rev. Genet. 4: 121–132.[CrossRef][Medline]

CAO, Y., A. JANKE, P. WADDELL, M. WESTERMAN, O. TAKENAKA et al., 1998 Conflict among individual mitochondrial proteins in resolving the phylogeny of Eutherian ord