## Abstract

Horizontal gene transfer (HGT) plays a critical role in evolution across all domains of life with important biological and medical implications. I propose a simple class of stochastic models to examine HGT using multiple orthologous gene alignments. The models function in a hierarchical phylogenetic framework. The top level of the hierarchy is based on a random walk process in “tree space” that allows for the development of a joint probabilistic distribution over multiple gene trees and an unknown, but estimable species tree. I consider two general forms of random walks. The first form is derived from the subtree prune and regraft (SPR) operator that mirrors the observed effects that HGT has on inferred trees. The second form is based on walks over complete graphs and offers numerically tractable solutions for an increasing number of taxa. The bottom level of the hierarchy utilizes standard phylogenetic models to reconstruct gene trees given multiple gene alignments conditional on the random walk process. I develop a well-mixing Markov chain Monte Carlo algorithm to fit the models in a Bayesian framework. I demonstrate the flexibility of these stochastic models to test competing ideas about HGT by examining the complexity hypothesis. Using 144 orthologous gene alignments from six prokaryotes previously collected and analyzed, Bayesian model selection finds support for (1) the SPR model over the alternative form, (2) the 16S rRNA reconstruction as the most likely species tree, and (3) increased HGT of operational genes compared to informational genes.

TRADITIONAL views of molecular evolution hold that genetic material mutates slowly over time as it is passed in a vertical fashion from parent to progeny. Molecular phylogenetics then aims to reconstruct this history of inheritance of genetic sequence data from contemporary organisms into a tree-like structure. However, belief in a single tree, mandated by vertical transmission, for all genetic material is changing. Evolutionary biologists increasingly recognize the horizontal transmission of genetic material between distantly related organisms as an important mechanism of evolution (Syvanen 1994; Lawrence 1999; Jain *et al.* 2002).

The process of horizontal (or lateral) gene transfer (HGT) plays a critical role across all domains of life and in particular among prokaryotes (Jain *et al.* 1999; Koonin *et al.* 2001). For example, many prokaryotes are agile at quickly adapting to new environments. Often, this ability stems from the acquisition of new genes through HGT rather than through random mutation (Lawrence 1999). At least three mechanisms promote HGT in prokaryotes (Jain *et al.* 2002). These include: (1) transformation in which free DNA sequences are absorbed from the environment, (2) conjugation between two different prokaryotic species, and (3) transduction of genetic material through viruses. Finally, HGT also has medical importance (Brown 2003). In the field of infectious diseases, HGT among bacterial pathogens of antibiotic resistance genes has greatly contributed to the emergence of multidrug-resistant bacteria in clinical settings (Leverstein-van Hall *et al.* 2002). In the field of oncology, HGT may also affect tumor progression; Bergsmedh *et al.* (2001) show that eukaryotic cells can transfer active oncogenes.

Three general methods have been employed to examine HGT. The first focuses on single genomes and identifies genes suspected to have been imported through HGT by examining variation in nucleotide base composition and codon usage patterns (Lawrence and Ochman, 1997). The latter two methods are comparative studies across species. One uses similarity approaches based on gene content to identify HGT (Ragan 2001) and to propose average genome or species-level trees (Snel *et al.* 1999), while the alternative method endorses phylogenetic reconstruction using orthologous genes (Jain *et al.* 1999). Base composition and codon bias studies may perform poorly when compared to phylogenetic methods (Koski *et al.* 2001). Further, phylogenetic methods offer at least one advantage over similarity-based approaches. The reconstructed phylogenies have direct biological interpretability as descriptions of the underlying evolutionary histories of the different genes (Doolittle 1999). If a reconstructed gene tree differs from the assumed phylogeny of the species being studied, then HGT is offered as a possible explanation (Syvanen 1994). One intrinsic difficulty is that the true species tree is often itself unknown. Therefore, it is necessary to either fix the species tree to equal the inferred gene tree for a specially chosen gene, *e.g.*, the 16S rRNA tree (Woese 2000), or simultaneously estimate the species tree and gene trees given a biologically plausible model relating them. As a first step, several research groups have attacked the inverse problem of reconstructing a species tree given gene trees subject to HGT. Most notable are the parsimony-based reconciled tree work by Page and colleagues (*e.g.*, Page 2000) and the algorithmic work of Mirkin *et al.* (2003).

I propose a simple class of stochastic models for HGT that enable the simultaneous estimation of the underlying species tree relating a group of organisms and the gene trees subject to HGT for a set of orthologous gene alignments. These HGT models function in a hierarchical manner (Suchard *et al.* 2003a) in which standard Bayesian phylogenetic approaches (*e.g.*, Sinsheimer *et al.* 1996; Yang and Rannala 1997; Mau *et al.* 1999; Li *et al.* 2000; Huelsenbeck *et al.* 2001) are used to reconstruct each gene tree from its corresponding gene alignment. Simultaneous to the reconstructions, the HGT models impose a second probabilistic distribution over the gene trees (Maddison 1997). This hierarchical distribution describes the gene trees likelihoods given an unknown species tree and an unknown number of HGT events leading from that species tree to each gene tree. The model is fit in a Bayesian framework that naturally handles uncertainty in discrete parameters such as all the trees and the number of HGT events and compares various models using Bayes factors (Suchard *et al.* 2001). Stochastic models fit in statistical frameworks offer several advantages over parsimony approaches. First, parsimony may underestimate the number of HGT events linking the species tree to the gene trees. This consequence is similarly seen in parsimonious reconstructions of the tree themselves, in which the number of nucleotide substitutions is underestimated. Second, it is easier in a statistical framework to include measures of uncertainty and these levels may be high in the inferred gene trees given the sparse data from which they are reconstructed.

One additional advantage of building stochastic models for HGT is the ability to compare competing models and to incorporate possible differences in the stochastic processes across genes, while assessing the significance of these differences in a formal statistical framework. As one example of possible differences across genes, Jain *et al.* (1999) propose the complexity hypothesis. Under this hypothesis, genes are divided into one of two classes, informational or operational genes. Between classes, the rates of HGT differ. It is suspected that rates are higher for operational genes than for informational genes. This hypothesis and others can be tested by integrating over all possible species trees and gene trees weighed by their posterior probabilities. This Bayesian model-averaging approach reduces the possible bias inherent in selecting a specific species tree, minimizes underestimation of the uncertainty associated with the hypotheses (Taylor *et al.* 1996), and eliminates the need for *ad hoc* analyses. Formal comparison of different models for HGT will help gather further insight into the underlying biological processes.

## MODEL

### Within-gene reconstruction model:

I begin with a hierarchical framework for phylogenetic reconstruction using molecular sequence data **Y** (Suchard *et al.* 2003a). Data **Y** = (**Y**_{1}, … , **Y*** _{K}*) consist of

*K*naturally disjoint partitions. Partition data

**Y**

*for*

_{k}*k*= 1, … ,

*K*represent the aligned DNA sequences of length

*L*from one specific gene per partition, sequenced from the same

_{k}*N*taxa across all partitions. A hierarchical phylogenetic model enables the pooling of information across gene partitions to improve estimate precision in individual partitions, while permitting estimation and testing of tendencies in across-partition quantities. For HGT, such across-partition quantities include: (1) an overall species tree, (2) appropriate stochastic models from which to construct a probability distribution over individual gene trees given the species tree, and (3) the stochastic model parameters that may vary between different classes of genes.

To utilize standard Bayesian models for phylogenetic reconstruction (*e.g.*, Sinsheimer *et al.* 1996; Yang and Rannala 1997; Mau *et al.* 1999; Li *et al.* 2000) within a gene partition, data **Y*** _{k}* further divide into ordered homologous sites

**Y**

*for*

_{kl}*l*= 1, … ,

*L*. Site data

_{k}**Y**

*= (*

_{kl}*Y*

_{kl}_{1}, … ,

*Y*)

_{klN}*contain one nucleotide from each taxon, such that*

^{t}*Y*∈ (A, G, C, T) or their ambiguous wildcards for

_{kln}*n*= 1, … ,

*N*. I assume that sites within a partition are independent and identically distributed, and the likelihood of observing

**Y**

*is given by a multinomial distribution over the 4*

_{kl}*possible outcomes with ambiguous nucleotides being integrated over their possible realizations. The multinomial outcome probabilities become functions of an unknown tree τ*

^{N}*that describes the relatedness of the*

_{k}*N*taxa, branch lengths

**t**

*= (*

_{k}*t*

_{k}_{1}, … ,

*t*), and a model to describe nucleotide mutation along these branches, all within partition

_{kB}*k*.

I elect for a reversible, continuous-time Markov chain (CTMC) model for nucleotide substitution (Felsenstein 1981) popularized by Tamura and Nei (1993) (TN93). The TN93 model is further parameterized by two transition:transversion rate ratios, α* _{k}* between purines A and G and γ

*between pyrimidines C and T, and the stationary distribution of the underlying Markov chain*

_{k}**π**

*= (π*

_{k}

_{k}_{A}, π

_{k}_{G}, π

_{k}_{C}, π

_{k}_{T}). The final scale parameter in the TN93 model is fixed such that branch lengths measure the expected number of nucleotide substitutions between the nodes in τ

*that the branch connects. Because I assume a reversible model for nucleotide substitution and make no clock-like restrictions on branch lengths, the root of each tree is unidentifiable (Felsenstein 1981). As a consequence, the descriptions of all trees to follow are unrooted with*

_{k}*N*− 2 internal nodes and

*B*= 2

*N*− 3 branches.

### Across-gene hierarchical model:

Following the hierarchical framework of Suchard *et al.* (2003a), I take branch lengths **t*** _{k}* as exponentially distributed with unknown expected divergence μ

*within partition*

_{k}*k*and model and 1where

**V**= (

*A*,

*G*,

*M*)

*and*

^{t}**Π**= (Π

_{A}, Π

_{G}, Π

_{C}, Π

_{T}) are unknown across-partition-level expectations, variance-covariance matrix has diagonal form, and σ

^{−2}

_{α}, σ

^{−2}

_{γ}, σ

^{−2}

_{μ}, and

*N*

_{Π}are unknown across-partition-level measures of precision. Leaving

**V**,

**Σ**,

**Π**, and

*N*

_{Π}as unknowns specified only by hyperprior distributions and estimating these parameters simultaneously with the within-partition-level continuous parameters, α

*, γ*

_{k}*, and μ*

_{k}*for all*

_{k}*k*, enables the borrowing of strength of information from one partition by another, producing more precise within-partition-level estimates. I assume conjugate (when possible) and flat or noninformative hyperpriors on these across-partition-level parameters, as discussed in Suchard

*et al.*(2003a). While the development of hierarchical priors over the continuous within-partition-level parameters has been straightforward, constructing a hierarchical prior over gene trees τ

*that incorporates the stochastic nature of HGT is more involved. This is illustrated in the next section.*

_{k}### Horizontal gene transfer models:

To build a stochastic model for HGT, I first present a formal description of the set of all possible *N*-taxon trees, commonly referred to as “tree space” (Billera *et al.* 2001), as a mathematical graph and then discuss several possible random walks (D. Aldous and J. Fill, unpublished results) on this graph that mirror the observed effects of HGT.

There exist *M* = (2*N* − 5)!/2^{N}^{−3}(*N* − 3)! possible trees relating *N* extant taxa (Felsenstein 1981). On the basis of these *M* trees, I construct a graph 𝒢 = (ν, ε) with vertex set ν and edge set ε. Each tree represents a different vertex, or node, in the graph, such that the size of the vertex set |ν| = *M*. An edge *uv* ∈ ε of a graph describes a direct connection between two of the graph's vertices *u*, *v* ∈ ν. The number of edges emanating from a single vertex *v* defines its degree *d*(*v*). Two vertices that are joined together by a single edge are called adjacent. Restricting attention to simple graphs in which pairs of vertices may be connected to each other only by a single edge and no vertex is connected to itself by a looping edge, a single vertex *v* from graph 𝒢 may be adjacent from as few as zero to as many as *M* − 1 other vertices. The set of all vertices adjacent to *v* are its neighborhood Γ(*v*) and the size of this neighborhood |Γ(*v*)| = *d*(*v*). The specification of a neighborhood for each vertex completes the description of 𝒢, and many choices are available.

#### Subtree-prune-regraft-based model:

One approach to defining neighborhoods for each possible tree stems from subtree transfer operations (Allen and Steel 2001). Subtree transfer operators act on trees producing local rearrangements. Applying a subtree transfer operator to one tree τ results in the creation of one of several possible new topologies that differs from τ by an extent dependent on the operator. The collection of all trees one operation away from τ = *v* becomes its neighborhood Γ(*v*) under that operator. Nearest-neighbor interchange (Robinson 1971), tree bisection and reconnection (Swofford *et al.* 1996), and subtree prune and regraft (SPR) (Hein 1990, 1993) are three examples. In light of the goals of this article, SPR offers an advantage over the former two operators because of its potential biological interpretation. Applying the SPR operator to τ = *v* with its resultant drawn from Γ_{SPR}(*v*) mirrors the differences observed between a species tree and an individual gene tree affected by one HGT (or recombination) event (Hein 1990, 1993; Jain *et al.* 1999; Allen and Steel 2001).

Figure 1 illustrates one realization of the SPR operator applied to a six-taxon tree. The operator works in two steps. The first step selects and cuts any branch in the initial tree, τ_{initial}. Cutting the branch prunes away a subtree, τ_{subtree}. This subtree then regrafts itself using the same cut branch to a new internal node obtained by subdividing a preexisting branch in τ_{initial} − τ_{subtree}.

Several important properties about the graph 𝒢_{SPR} induced by the SPR operator have been previously studied. First, 𝒢_{SPR} is regular, implying that every vertex *v* ∈ ν_{SPR} possesses the same degree *d*(*v*) = 2(*N* − 3)(2*N* − 7) and, hence, neighborhood size (Allen and Steel 2001). Also, 𝒢_{SPR} is connected, meaning that a sequence of consecutive edges (a path) exists, connecting every pair of vertices in 𝒢_{SPR} (Robinson 1971; Allen and Steel 2001).

One straightforward stochastic process on any simple graph 𝒢 is an unweighted random walk. A random walk on 𝒢 proceeds from vertex to vertex along existing edges of the graph, generating a discrete-time Markov chain (DTMC), where the states of the chain are the visited vertices. As unweighted, the chain uniform randomly chooses its next vertex to visit from all neighbors of its current vertex. For this DTMC, the one-event transition probability matrix **A** has entries 2defining the probability of *u* changing into *v* as a result of one random event. It should be noted that **A** is just the adjacency matrix of 𝒢 rescaled to be a stochastic matrix [*i.e.*, ∑* _{v}*(

**A**)

*= 1].*

_{uv}On the basis of *K* random walks on the graph 𝒢_{SPR} induced by the SPR operator, I construct a hierarchical prior over the joint distribution of all gene trees τ* _{k}*. To accomplish this task, I assume:

An unknown species tree Υ exists.

The vertex representing Υ is the initial state of

*K*Markov chains.The Markov chains are conditionally independent given Υ and

**A**.The vertex representing τ

is the final state of the_{k}*k*th chain.And each chain is of unknown length 0 ≤

*E*< ∞._{k}

Figure 2 depicts one set of the possible paths of *K* = 4 Markov chains starting at species tree Υ and ending at gene trees τ* _{k}* on a small portion of a representative graph. The lengths of paths

*E*shown range from one to three. I illustrate no paths of length zero, but these realizations should be most likely. A parsimony-like analysis considering beginning and end points of the chains in Figure 2 would, for example, underestimate

_{k}*E*

_{4}as zero instead of three.

Given the assumptions listed above, the probability of species tree Υ giving rise to gene tree τ* _{k}* after

*E*HGT events is 3

_{k}To complete the hierarchical specification, I assign a prior distribution over Υ by letting 4where **z** = (*z*_{1}, … , *z _{M}*) are constants, the prior probabilities of the

*M*possible

*N*-taxon trees. When little or no information is available about Υ, one reasonable choice is

*z*

_{1}= … =

*z*= 1/

_{M}*M*; alternately, one may choose

**z**such that the prior odds of competing hypotheses regarding Υ are one in a hypothesis-testing setting (Suchard

*et al.*2003a). A further choice is discussed later. I further assume a conditionally independent prior on all

*E*, 5where Λ

_{k}*is the expected number of HGT events for gene*

_{k}*k*and is a deterministic function of across-gene-level parameters. This prior is conjugate to (3), allowing all

*E*to be integrated out of the model, improving sampling efficiency (Liu 1994), 6Letting

_{k}*q*(τ

*=*

_{k}*v*|Υ =

*u*, Λ

*) = (*

_{k}**P**)

*, the multistep transition probability matrix, 7where*

_{uv}**I**is the

*M*×

*M*identity matrix and

**Q**=

**P**−

**I**is the CTMC infinitesimal rate matrix representation of the HGT process. In this parameterization, Λ

*are scaled as the expected number of HGT events per gene. Let*

_{k}**Λ**= (Λ

_{1}, … , Λ

*). Then, recalling the conditional independence assumption between Markov chains, the joint distribution over all gene trees τ*

_{K}*becomes 8*

_{k}Calculating the probabilities in (8) requires numerical methods to determine the matrix exponential involving **P**_{SPR}. These methods involve calculating the complete set of eigenvalues and eigenvectors of **P**_{SPR}, requiring 𝒪(*M*^{3}) operations. Such procedures become quickly computationally prohibitive as *N*, and hence *M*, increases. As a consequence, numerical approximations may be necessary to develop weighted graph extensions to 𝒢_{SPR} directly. The weights in these extended graphs would be functions of unknown parameters and sampling these parameters would necessitate repetitive diagonalization.

#### Random walks with analytic solutions:

An alternative to this computational barrier involves using random walks on graphs for which analytic solutions are known for any size *M*. To help find such solutions, Equation 7 demonstrates the close connection between a DTMC with a Poisson-distributed number of events and a CTMC. In fact, any such DTMC can be expressed as a unique CTMC, called the “continuized” version (D. Aldous and J. Fill, unpublished results). Analytic solutions for several weighted and unweighted CTMC processes on a complete graph are commonly used in phylogenetics. In a complete graph, all vertices are adjacent to all others. The most notable examples are the CTMC models for nucleotide substitution. The simplest model by Jukes and Cantor (1969) is unweighted. In the appendix, I present the multistep transition probability matrix **P**_{GJC} for a generalized Jukes-Cantor (GJC) model involving an arbitrary number of vertices *M*. Proposed by Kimura (1980), the next most sophisticated model for a complete graph is weighted. This model presupposes that the vertices are divided into two disjoint sets, ν_{1} ∪ ν_{2} = ν, and that transitions within and between ν_{1} and ν_{2} occur at varying rates. In terms of HGT, such a weighted random walk may prove useful to model varying rates of HGT between different groups of taxa. Letting *M*_{1} = |ν_{1}|, *M*_{2} = |ν_{2}|, and *R* equal the ratio of within- to between-transition rates, I present the multistep transition probability matrix **P**_{GK} given *M*_{1}, *M*_{2}, and *R* for a generalized Kimura (GK) model in the appendix.

### Modeling differences across gene classes:

I incorporate potential differences across genes in the expected number of HGT events Λ* _{k}* by employing a generalized linear model (GLM) approach (McCullagh and Nelder 1983). GLMs link the mean response, in this case Λ

*, to a set of linear predictors. First, I divide all*

_{k}*K*genes into one of

*C*possible classes, where the definition of the classes depends on the specific research question at hand. To identify gene-class membership in the GLM, I construct a

*K*×

*C*design matrix

**D**= (

*D*), where matrix elements

_{kc}*D*

_{k}_{1}= 1 for all

*k*, representing the baseline multiplier for the reference class, and 9for

*c*= 2, … ,

*C*, representing the offset multipliers for the remaining classes. Such a design matrix is standard in regression problems involving categorical dependent variables. I model 10where linear combinations of predictors

**λ**= (λ

_{1}, … , λ

*) specify, on the log-scale, the expected number of HGT events for all classes. I complete the hierarchical prior specification by assuming 11I set*

_{C}**L**= (−2, 0, … , 0) and

**Ψ**= diag(10, … , 10). This provides a quite diffuse prior on

**λ**, with the median expected number of HGT events per gene ≈ 0.14 (Garcia-Vallve

*et al.*2000) for all classes.

As an example of how this GLM construction functions, consider the *C* = 2 classes case. Then, 12When λ_{2} = 0, no difference across classes exists. Likewise, when λ_{2} < 0, the expected number of HGT events per gene is smaller in class 2 than in class 1, and when λ_{2} > 0, the expected number is larger.

## STATISTICAL FRAMEWORK

Comparing the relative appropriateness of the various stochastic models for HGT proposed in preceding sections and testing for significant differences in the expected number of HGT events across genes can be accomplished using Bayesian model selection via Bayes factors. Bayes factors are the Bayesian analog of the likelihood-ratio test (LRT), but suffer from fewer difficulties than LRTs in discrete spaces, when comparing non-nested models and with sparse data (Suchard *et al.* 2001). A Bayes factor *B*_{10} measures the relative change in the support of the data **Y** in favor of one statistical model *M*_{1} over another model *M*_{0} and equals the ratio of the marginal likelihood *m*(**Y**|*M*_{1}) of *M*_{1} over the marginal likelihood *m*(**Y**|*M*_{0}) of *M*_{0} (Kass and Raftery 1995). To calculate Bayes factors, frequently more efficient methods than estimating the multidimensional integrals hidden in the marginal likelihoods directly are available.

When models are nested, a relatively simple Bayes factor calculation is available via the Savage-Dickey ratio (Verdinelli and Wasserman 1995) and involves generating a posterior sample from the larger model only (Suchard *et al.* 2003b). For example, to assess the significance of differences across gene classes in the expected number of HGT events, let *M*_{1} represent the unrestricted model proposed above. Nested within *M*_{1} exists *M*_{0}, the equal-rates model, where λ* _{c}* = 0 for

*c*= 2, … ,

*C*. Further, the GJC model is nested within the GK model, as both are equal when

*R*= 1.

On the other hand, the GJC and SPR models are non-nested, but both possess zero free parameters in their respective **P** matrices. For two arbitrary models *M*_{0} and *M*_{1} in situations like this, it is possible to estimate the posterior probabilities *p*(*M*_{0}|**Y**) and *p*(*M*_{1}|**Y**) by constructing a mixture model over the joint space of *M*_{0} and *M*_{1}. By applying the Bayes theorem, 13where *q*(*M*_{0}) and *q*(*M*_{1}) are the prior probabilities of models *M*_{0} and *M*_{1} in the mixture. Generally, I assume equal prior probabilities, *q*(*M*_{0}) = *q*(*M*_{1}) = ^{1}/_{2}, when reporting posterior estimates. However, improved efficiency in estimating *B*_{10} can be garnered by adjusting these prior probabilities such that *p*(*M*_{0}|**Y**) ≈ *p*(*M*_{1}|**Y**) (Carlin and Chib 1995; Suchard *et al.* 2002).

Models SPR and GK neither are nested nor contain the same number of free parameters. One might entertain constructing a reversible-jump Markov chain Monte Carlo (MCMC) sampler (Green 1995) over the joint space of these models to compute the Bayes factor in support of SPR over GK. However, a simpler algebraic solution exists given the two preceding Bayes factor calculations, 14

To estimate all model parameters and Bayes factors, I employ MCMC. I further develop this MCMC algorithm and discuss its performance in the appendix.

## EXAMPLE

To illustrate these stochastic models for HGT and methods to test hypotheses about them, I examine a large set of orthologous, prokaryotic genes collected by Jain *et al.* (1999). The data consist of *K* = 144 separate gene alignments. Each alignment contains orthologous copies of a single gene from six prokaryotes. These prokaryotes are: *Aquifex aeolicus* (Aa), an early branching thermophilic eubacterium; *Escherichia coli* (Ec), a proteobacterium; *Synechocystis 6803* (S6), a cyanobacterium; *Bacillus subtilis* (Bs), a gram-positive bacterium; *Methanococcus jannaschii* (Mj), a methanogen; and *Archaeoglobus fulgidus* (Af), a thermophilic sulfate-reducing methanogen relative. The first four organisms are Eubacteria, while the last two are Archaea. Jain *et al.* (1999) construct the gene alignments on the basis of amino acid translations, assuming a star tree to reduce alignment bias (Lake 1991), and classify each gene into one of two distinct classes, informational and operational genes (Rivera *et al.* 1998). Table 1 lists the functional characteristics of the genes that fall into each class. As a generalization, informational gene products interact in large complex systems; this is especially true of the translational and transcriptional apparatuses. On the other hand, most operational gene products function independently or in small protein assemblies. In total, Jain *et al.* (1999) assign 56 genes as informational and 88 as operational, employ these genes to examine the complexity hypothesis, and find support for higher levels of HGT among the operational genes as compared to the informational genes.

I parallel the above analysis by assuming that the number of the different gene classes *C* = 2. I let class *c* = 1 represent the informational genes and class *c* = 2 represent the operational genes. To further maintain consistency with Jain *et al.* (1999), I exclude third codon position nucleotides from all alignments and assume that first and second codon position nucleotides are evolving independently under the same process for each gene.

### Selection of stochastic model:

I begin by comparing the relative likelihoods of the three different stochastic models, SPR, GJC, and GK. For the GK model, I define my two disjoint sets of trees as (1) those that support a split between the four Eubacteria and the two Archaea, ν_{1}, and (2) those that do not, ν_{2}. These definitions offer a first approximation to modeling differing rates of HGT within life domains and across domains in this example. HGT events that start and end in set ν_{1} are within domain transfers, while events that start in ν_{1} and end in ν_{2}, or vice versa, are across domain transfers.

The log_{10} Bayes factor in favor of SPR over GJC and the log_{10} Bayes factor in favor of GK over GJC are 15Figure 4a illustrates the scaled regeneration quantile (SRQ) plot for estimating the relative posterior probabilities used to calculate log_{10} *B*_{SPR,GJC}. No substantial deviation from the slope = 1 line implies the MCMC chain is mixing sufficiently to generate this estimate. Combining the results in (15), I calculate the log_{10} Bayes factor in favor of SPR over GK as 16

Considering these Bayes factor estimates, the data strongly reject (Kass and Raftery 1995) the two complete graph models with analytic solutions in favor of the more biologically plausible process based on the SPR operator. However, the GJC and GK models should not be discounted completely; their computational complexity does not increase with increasing number of taxa *N* and they can offer some insight into the underlying biological processes. For example, the Bayes factor in favor of GK over GJC offers some indirect support for differing HGT rates within domains rather than across domains. One caveat should be kept in mind to keep from drawing too strong a conclusion from this finding—the unbalanced study design with only two Archaea precludes identifying HGT events within that domain. All further results in this article are based on the SPR model.

### Estimating the species tree:

Figure 3 displays the currently accepted species tree relating the six prokaryotes studied here. The four Eubacteria and two Archaea form two distinct clades (Feng *et al.* 1997) and Aa is the earliest branching species of the Eubacteria studied (Deckert *et al.* 1998). The branching order of the remaining three Eubacteria Ec, S6, and Bs is more ambiguous (Giovannoni *et al.* 1996). The three possible resolutions of this trifurcation are depicted on the right side of Figure 3. Much of the debate surrounding the trifurcation depends on data choice and reconstruction methodology. For example, the top resolution produces species tree Υ_{Ec-S6} that places Ec and S6 as nearest neighbors. Protein synthesis elongation factor (EF) Tu gene reconstructions support this tree (Lake and Rivera 1996) and Jain *et al.* (1999) fix Υ_{Ec-S6} as their reference tree in their analysis. Reconstructions of 16S rRNA phylogeny support the middle resolution of species tree Υ_{Bs-S6} (Cole *et al.* 2003) with Bs and S6 as nearest neighbors. The final resolution of species tree Υ_{Ec-Bs} gains support from reconstructions of phenylalanyl-tRNA synthetase (Teichmann and Mitchison 1999). However, even these three critical genes are subject to HGT (Wolf *et al.* 1999; Zap *et al.* 1999; Ke *et al.* 2000) and their reconstructed phylogenies may inaccurately represent the true species tree.

On the basis of the SPR model for HGT, I infer Υ_{Bs-S6} as the most likely species tree with >0.999 posterior probability. The two other resolutions, Υ_{Ec-Bs} and Υ_{Ec-S6}, are the second and third most likely species trees, respectively. To estimate the Bayes factors in favor of Υ_{Bs-S6} against Υ_{Ec-Bs} and Υ_{Ec-S6}, I judiciously reweight my prior probabilities on trees **z** and calculate 17Similar to the back calculation completed in previous section, I estimate 18while direct calculation of log_{10} *B*_{Bs-S6,Ec-S6} using the sampler yields approximately the same result. Figure 4, b–d, depicts the SRQ plots relevant to these Bayes factor calculations. Again, the MCMC chain appears well mixing. Although the posterior support for Υ_{Ec-Bs} and Υ_{Ec-S6} initially appears quite small, on a relative scale it is not; probabilities for the remaining 102 trees are >15 orders of magnitude smaller.

Data sets as large as the *K* = 144 gene alignments from Jain *et al.* (1999) are currently rare. Consequentially, I examine via simulation the number of alignments necessary to identify the species tree under the SPR model. Under this simulation, I randomly sample without replacement a fixed number of gene alignments *K* and then estimate the posterior support for Υ_{Bs-S6}, assuming this is the true species-tree. I repeat this simulation 20 times for each value of *K*. For *K* = 2, the expected posterior probability of Υ_{Bs-S6} = 0.14. This estimate is approaching its prior value, signifying appropriate MCMC sampling with limited data. Approximately *K* = 50 gene alignments are required to achieve an expected posterior probability ≥0.80 and *K* = 70 are required for ≥0.90.

### Hierarchical estimates of evolutionary pressures:

Table 2 presents the posterior estimates of the across-gene-level parameters used to pool information about (α* _{k}*, γ

*, μ*

_{k}*,*

_{k}**π**

*). The table also lists posterior estimates of 19These transformed variables report the across-gene-level averages of the two transition:transversion ratios and expected divergence on their usual, instead of log, scale. As seen from Table 2, the average transition:transversion ratio for purines*

_{k}*A*′ is significantly different from the ratio for pyrimidines

*G*′, as the ratios' 95% Bayesian credible intervals (BCIs) do not overlap, and both ratios are greater than one. This supports the use of the TN93 model for nucleotide substitution over a more restricted model. Estimates of

*A*,

*G*, and

**Π**are consistent with a previous study using a subset of the data in a hierarchical framework (Suchard

*et al.*2003a). Also in comparison to this previous study, differences in estimates of

*M*, σ

^{2}

_{A}, σ

^{2}

_{G}, σ

^{2}

_{M}, and

*N*

_{Π}all trend in the correct directions given the increase in the number of taxa and genes fit here.

### Varying rates of HGT across gene classes:

Figure 5 displays model estimates for the linear predictors λ_{1} and λ_{2} and for the expected number of HGT events per gene, Λ* _{k}*, for the informational and operational gene classes. The two top plots display histograms of the posterior samples of λ

_{1}(left) and λ

_{2}(right). These plots also include normal approximations to the posterior (solid lines) and prior densities (dashed lines). Examining the plot on the right, the prior density at λ

_{2}= 0 (dotted vertical line) is considerably higher than the normal approximation to the posterior density. Further, the 95% BCI of λ

_{2}= (0.27–1.15) and does not cover zero. Both observations support the hypothesis that λ

_{2}≠ 0 and, hence, that rates of HGT differ between informational and operational genes. Formally, the Bayes factor in favor of differing rates is given by the Savage-Dickey ratio. The log

_{10}Bayes factor, 20offers substantial support (Kass and Raftery 1995) in favor of differing rates.

The bottom plot in Figure 5 transforms λ_{1} and λ_{2} into the expected number of HGT events per gene and displays histograms of the posterior samples of these quantities. Depicted in dark shading is Λ* _{k}* for the operational genes and depicted in light shading is Λ

*for the informational genes. Although Λ*

_{k}*for operational genes is significantly greater than Λ*

_{k}*for informational genes from the argument above, a small amount of overlap is observed (solid shading) between these marginal histograms. This overlap results from the high negative correlation between λ*

_{k}_{1}and λ

_{2}(data not shown) and illustrates the need for caution in making inference on the basis of marginal posterior summaries alone.

## REMARKS

In this article, I proposed a simple class of stochastic models for HGT. The models are based on a random walk process in tree space and allow for the development of a joint distribution over multiple gene trees given an unknown species tree. I consider two general forms of random walks. The first stems from subtree transfer operations, in particular the SPR operator that mirrors the observed effects that HGT has on an inferred tree. The second form is based on walks over complete graphs and offers numerically tractable solutions for increasing number of taxa. I fit these models using a Bayesian framework to data from six prokaryotes. I find strongest support for the species tree that places Bs and S6 as nearest neighbors. This tree is supported by 16S rRNA reconstructions, but differs from the EF-Tu tree assumed by Jain *et al.* (1999). I demonstrate the flexibility of these stochastic models to test competing ideas about HGT by examining the complexity hypothesis and find support for increased HGT of operational genes compared to informational genes. This latter finding remains unchanged if I fix the species tree to equal the EF-Tu tree (data not shown).

The specific stochastic models for HGT developed in this article have important limitations. First and foremost, the random walks explore only the discrete, topological portion of tree space and do not consider changes in branch lengths between trees as part of the underlying HGT process. As a result, HGT between nearest neighbors in a tree remains unidentified as this process does not result in a change in the topological configuration of the tree. Model extensions that consider a continuous random drift process on the joint space of (τ, **t**) (Billera *et al.* 2001) may circumvent this shortfall. For a related problem involving coalescence, Yang (2002) shows that including branch lengths **t** into the probabilistic model across loci improves power. Additionally, I assume that the *K* DTMCs representing the random walks of the gene trees τ* _{k}* away from the species tree Υ are conditionally independent given Υ. This assumption implies that the evolutionary histories of all genes are unlinked, while evidence for the HGT of, at a minimum, complete operons abounds in prokaryotes (Koonin

*et al.*2001). Possible modeling aspects include allowing for linked or partially linked genes.

HGT is not the only process that may cause incongruence between gene trees. Although the effects of lineage sorting should be minor given the extensive divergence between the species studied here, the inclusion of paralogous genes copies within the orthologous alignments may mislead inference. Also important, stochastic error due to sparse phylogenetic data, evolutionary model misspecification, and parallel/convergent evolution can falsely produce incongruence between trees (Cao *et al.* 1998). These effects should upwardly bias the inferred number of HGT events. However, I suspect this bias is less than one HGT event per gene as only a modest percentage of genes should be affected and the error should produce just minor changes in the inferred tree. There is no *a priori* reason to suspect that this bias differs between the informational and operational gene classes; so the bias does not affect the relative difference between classes in HGT rates and inference regarding the complexity hypothesis.

For the SPR model, numerical approximations to the matrix exponentials involving the multistep transition probability matrix **P**_{SPR} may offer promise in handling research problems with larger numbers of taxa *N* (Moler and Van Loan 2003). As *N* increases, the square dimensions of **P**_{SPR} grow superexponential, while the size of the neighborhood of each vertex grows only as 𝒪(*N*^{2}). As a consequence, **P**_{SPR} becomes increasingly sparse. In this situation, the number of unique eigenvalues increases substantially slower than the matrix's dimension. Krylov subspace techniques (Sidje and Stewart 1999) may stretch computational limits upward to *N* = 8 or more.

In spite of these limitations, these stochastic models for HGT offer several advantages over previous approaches to studying HGT using multiple orthologous gene alignments. Under these stochastic models, the species tree is an unknown parameter that may be either integrated out of the analysis as a nuisance parameter or estimated jointly with the multiple gene trees. Joint analysis decreases the possibility of bias introduced through fixing the species tree when knowledge about it is uncertain. A stochastic approach also overcomes the bias inherent in parsimony-like estimation. Further, the hierarchical framework in which the stochastic model sits enables the borrowing of strength in the estimation of all gene-partition-level estimates including the gene trees themselves. Finally, and most importantly, stochastic models lend themselves well to formal statistical testing, with no need for *ad hoc* procedures. The ability to compare differing models for HGT will continue to shed further insight into the underlying biological process.

## APPENDIX

### Complete models:

To determine the multistep transition probability matrix **P**_{GJC} for the GJC model with *M* ≥ 2 states, I first recall that A1is generated from an unweighted complete graph. As a complete graph, it is trivially connected and, therefore, has a unique stationary distribution. This distribution is (1/*M*, … , 1/*M*).

To determine the eigenvalues of **Q**_{GJC}, I write A2where **Q**_{GJC} is scaled such that Λ* _{k}* is expressed in terms of the expected number of HGT events per gene,

**J**is the

*M*×

*M*matrix of all ones, and

**I**is the

*M*×

*M*identity matrix. Matrix

**J**has a rank of one and, therefore, one nonzero eigenvalue that equals

*M*/(

*M*− 1). Given the eigenvalues of

**J**and expression (A2), the

*M*eigenvalues of

**Q**

_{GJC}become A3Like the standard Jukes-Cantor model, where

*M*= 4, the GJC model for any

*M*≥ 2 continues to have only two distinct eigenvalues. Conceptually this results because the qualitative behavior of the underlying Markov chain does not change as the size of the state-space increases.

By letting Λ* _{k}* → ∞, I see that the stationary distribution is the eigenvector corresponding to the 0 eigenvalue. By examining the other limiting case where Λ

*= 0 and considering the initial conditions, algebraic rearrangement yields A4*

_{k}The state-space of the GK model is partitioned into two disjoint sets ν_{1} and ν_{2}. Let *M*_{1} = |ν_{1}| and *M*_{2} = |ν_{2}|, where *M*_{1} + *M*_{2} = *M*, and let *R* be the ratio of rates for transitions within a structural set to transitions between sets. Then, following arguments similar to those above, one can find the multistep transition probability matrix **P**_{GK} for the GK model.

If *u* ∈ ν_{1}, then A5where A6By symmetry, if *u* ∈ ν_{2}, then A7where φ_{3} = *M*_{2}*R* + *M*_{1}. For *R* ≠ 1, note that there are four unique eigenvalues when *M*_{1} ≠ *M*_{2} and three unique eigenvalues otherwise. This is consistent with the standard Kimura model, in which *M*_{1} = *M*_{2} = 2 with three unique eigenvalues.

### Sampling algorithm:

For each gene-partition *k*, let **θ*** _{k}* = (τ

*,*

_{k}**t**

*, α*

_{k}*, γ*

_{k}*, μ*

_{k}*,*

_{k}**π**

*) and, then, assemble*

_{k}**θ**= (

**θ**

_{1}, … ,

**θ**

*) to be the collection of all gene-level parameters. To specify the hierarchical prior parameters, let*

_{K}**φ**= (

**V**,

**Σ**,

**Π**,

*N*

_{Π}, Υ,

**λ**). Across-gene-level parameters

**φ**also include

*R*when considering the GK model and mixing parameter ψ ∈ {0, 1} when comparing models SPR and GJC. I employ a MCMC approach to sample from each model's joint posterior distribution,

*p*(

**θ**,

**φ**|

**Y**). I generate samples from these posteriors using two nested Metropolis-within-Gibbs cycles, as laid out in Suchard

*et al.*(2003a) for hierarchical phylogenetic models. The outer cycle first iterates over gene partitions

*k*and then over the parameters in

**φ**. Within each gene partition

*k*, the inner cycle proceeds over the parameters in

**θ**

*. With the exception of proposals for Υ,*

_{k}**λ**,

*R*, and ψ, all parameter proposals follow those in Suchard

*et al*. (2003a).

The multinomial prior placed on Υ is conjugate to its sampling density. As a result, it is possible to Gibbs sample Υ from its full conditional distribution for moderately small *M*. This full conditional distribution remains multinomial with *M* state probabilities given by A8where τ* _{k}* =

*v*for all

_{k}*k*and

**Ω**

_{−Υ}is the vector of all model parameters (

**θ**,

**φ**) excluding Υ. Similar to the reweighted prior approach to estimate ψ, varying

**z**can improve sampling efficiency when estimating the relative posterior probabilities of specific species trees Υ.

I draw the transition ratio *R* and linear predictors **λ** via separate Metropolis-Hastings proposals. For *R*, I propose new parameter values by generating a normal random variate centered at the current value of *R* with a tunable variance *s*^{2}_{R}. Given the high degree of correlation between column vectors in the design matrix **D**, I expect the posterior distribution of **λ** to also exhibit strong correlation. This expectation stems from a normal linear regression approximation to *p*(exp(**λ**)|**Λ**) that has a variance-covariance structure proportional to (**D′D**)^{−1}. As a consequence, component-by-component updating of λ_{c} in **λ** should lead to a slowly mixing MCMC chain (Roberts and Sahu 1997). To help ensure adequate mixing, I propose all λ_{c} simultaneously using a multivariate normal random variate centered at the current value of **λ** with a tunable variance-covariance matrix diagΞ. I adjust the tunable variances such that proposals have acceptance rates of 30–40% (Gelman *et al.* 1996) and fix the correlation matrix **Ξ** approximately equal to the posterior correlation of **λ** determined by a trial chain.

When comparing HGT models using a mixture approach, I sample the mixing parameter ψ directly from its full conditional distribution in a Gibbs step, A9where A10for *i* = 0, 1, Υ = *u*, and τ* _{k}* =

*v*. Values

_{k}*a*may be saved at each iteration and used to construct a Rao-Blackwellized estimator for

*p*(

*M*

_{1}|

**Y**) (Suchard

*et al.*2003a).

Finally, the inferred number of HGT events *E _{k}* for the SPR model can be recovered after posterior simulation. The full conditional distribution A11where Υ =

*u*and τ

*=*

_{k}*v*. Since

_{k}_{uvk}≤ 1, A12where A13As the full conditional distribution of

*E*is bounded above, I can generate random draws from it using rejection sampling. Starting with a posterior sample , I draw one replicate

_{k}*E*

^{}

_{k}for each

*p*= 1, … ,

*P*. For each

*p*, I first generate

*E** from a Poisson distribution and

*U*from the uniform distribution. Then, if

*U*≤ (

**A**

^{E}^{*})

*/(*

_{uv}**P**)

*, where Υ*

_{uv}^{(}

^{p}^{)}=

*u*and , I set

*E*

_{k}^{(}

^{p}^{)}=

*E**. Otherwise, I reject the current proposal and begin again by regenerating (

*E**,

*U*).

### MCMC performance:

I run my MCMC chains for 1.1 × 10^{5} outer Metropolis-within-Gibbs cycles, discard the first 10^{4} cycles as burn-in, and subsample every 10 cycles. This process retains *P* = 10^{4} posterior samples with decreased autocorrelation. The total chain length and burn-in time appear moderately longer than required by examining time-series plots of the model log-likelihood during simulation.

To assess the performance of the MCMC sampler, I employ scaled SRQ plots (Mykland *et al.* 1995; Li *et al.* 2000; Suchard *et al.* 2002). SRQ plots are useful to demonstrate adequate sampler mixing within discrete model parameters. For the primary measures in this study, two important discrete parameters are the species tree Υ and the model mixture parameter ψ. In particular, I use SRQ plots to assess mixing when comparing the relative probabilities of two possible species trees and of differing stochastic models for HGT. In these SRQ plots, the local slope around a given point depicts the ratio of the relative posterior probability estimate based on the entire MCMC chain to an estimate based on a short segment of the chain around that point. Substantial deviation of the slope from one implies that the sampler is slowly mixing and, as a result, the chain is not sufficiently long to generate stable estimates. For continuous model parameters and Bayes factors based on the Savage-Dickey ratio, I assess convergence by comparing posterior estimates obtained from simulations of at least five independent chains with starting values drawn directly from the model priors.

## Acknowledgments

I thank the Lake lab, in particular Jon Moore and Jim Lake, for stimulating my interest in HGT, for many provocative discussions, and for providing the SPR adjacency matrices and prokaryote sequences used in this study. I also thank John Huelsenbeck for his insights into HGT and Janet Sinsheimer and Vladimir Minin for commenting on this manuscript. Fred Fox and the National Science Foundation grant 9987641-sponsored University of California, Los Angeles, Training Program in Bioinformatics graciously made possible the computing facilities to fit all 144 alignments simultaneously. The complete data set is available to interested readers at http://www.biomath.medsch.ucla.edu/msuchard/datasets.html. I am supported in part by National Institutes of Health grants GM08042 and GM068955 and U.S. Public Health Service grant CA16042.

## Footnotes

Communicating editor: J. Hein

- Received December 11, 2003.
- Accepted February 1, 2005.

- Genetics Society of America