| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
Genetics, Vol. 175, 1773-1785, April 2007, Copyright © 2007
doi:10.1534/genetics.106.066258
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
,
,


,1
* Department of Biomathematics and ** Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, California 90095, 
Department of Biostatistics, UCLA School of Public Health, Los Angeles, California 90095 and
Bioinformatics and Computational Biology Program,
Department of Statistics and
Department of Genetics, Cell and Development Biology, Iowa State University, Ames, Iowa 50011
1 Corresponding author: Departments of Biomathematics and Human Genetics, David Geffen School of Medicine, University of California, 695 Charles E. Young Dr., Box 951766, S. Los Angeles, CA 90095-1766.
E-mail: msuchard{at}ucla.edu
| ABSTRACT |
|---|
|
|
|---|
Rapid HIV mutation rates and infrequent recombination between genetically distinct viral genomes allow for recombination detection from evolutionary histories (phylogenies) of a recombinant and its putative parental sequences (AWADALLA 2003). Such phylogenetic-based recombination detection (HEIN 1990; SALMINEN et al. 1995; GRASSLY and HOLMES 1997; MCGUIRE et al. 1997; SUCHARD et al. 2002; HUSMEIER 2005) relies on the observation that genomic sequences experiencing infrequent recombination can be decomposed into breakpoint delimited blocks with distinct evolutionary histories (LI et al. 1988). We illustrate the idea behind all phylogenetic recombination detection methods with a simple example. Figure 1 shows a short multiple sequence alignment divided by recombination into two parts, such that a different phylogeny summarizes the sequence relationships in each part. The presence of alignment sites informative for phylogenetic reconstruction (shown in boldface type) is necessary for successful phylogenetic recombination detection.
|
Given the myriad phylogenetic methods for inferring recombination events in individual HIV sequences, mapping recombination hotspots appears to be a straightforward task. However, recent attempts of phylogenetic mapping of recombination hotspots in the HIV genome (MAGIORKINIS et al. 2003; ZHANG et al. 2005) run into major difficulties. First, phylogenetically informative sites are sparsely distributed, making estimation of recombination locations somewhat imprecise. Ignoring uncertainty about the number of recombination events and their locations within each recombinant leads to loss of power due to inefficient use of sequence data. Finally, the modest number of recombination events relative to the number of sites in individual alignments results in a sparse breakpoint distribution that prohibits direct estimation of site-specific recombination frequencies.
To address these issues, we propose a Bayesian hierarchical model that allows integration over breakpoint locations and stochastically interpolates site-specific recombination probabilities with the help of a smoothing prior shared by all recombinants. To specify the distribution of breakpoint locations, conditional on sequence data, we begin with a dual multiple change-point (DMCP) model (MININ et al. 2005). The DMCP model operates on a multiple sequence alignment of a putative recombinant and its "parental" strains and models recombination as a change-point process. We achieve information sharing among recombinants by assuming that homologous sites of all alignments have the same prior probability of being a recombination breakpoint. Estimation of such site-specific recombination probabilities is the key to identifying recombination hot-/coldspots. To handle the sparse breakpoint information, we recruit Gaussian Markov random fields (GMRFs), a popular class of distributions used to model temporal or spatial dependence (BESAG 1974; BESAG et al. 1991; RUE and HELD 2005). Normally distributed vector
is called a GMRF with respect to a graph
with nodes
and edges
, provided that Qij
0 if and only if
or i = j. To impose a biologically relevant correlation structure on site-specific recombination log odds (transformed probabilities), we use a GMRF prior on a linear graph
connecting adjacent sites in a multiple sequence alignment. Such spatial smoothing allows sites where recombination is not observed to borrow information from adjacent sites where recombination is observed.
We approximate the posterior distribution of all model parameters via Markov chain Monte Carlo (MCMC) simulation. Since the number of change points in individual DMCP models is random, we use reversible-jump MCMC sampling to move between spaces with different dimensions (GREEN 1995). On the population level, we explore a high-dimensional (of the order 103–104) space of recombination log odds via a block updating scheme using Metropolis–Hastings transition kernels with multivariate Gaussian proposals as implemented in the freely distributed GMRFLib library (RUE 2001; RUE et al. 2004). In contrast to typical spatial applications of GMRFs (ELLIOTT et al. 2000), we apply smoothing to probabilities of recombination breakpoints that themselves are random rather than directly observed as data. To our knowledge, this is the first use of GMRF priors in a random environment. We demonstrate the need for a nonlinear constraint on the GMRF to control the total number of breakpoints and provide a computationally efficient implementation of such a constraint.
We test our model through a simulation study, where recombination events are generated by permuting sequences in an alignment of primate mitochondrial DNA genes. The ability of the model to reconstruct several "true" recombination probability profiles is examined under different simulation conditions. Next, we apply our hierarchical model to 42 publicly available putative recombinants between HIV subtypes A and G that span the gag coding region of the viral genome. We find strong evidence for an
300-nucleotide recombination hotspot in the Capsid gene. In the DISCUSSION, we summarize our findings and propose further extensions to the smoothing prior on recombination locations.
| METHODS |
|---|
|
|
|---|
,
, the alignment is composed of nucleotide base names (A, adenine; G, guanine; T, thymine; C, cytosine) and gap characters (-). To eliminate unnecessary information in Y, we consider only columns where at least one of the recombinants possesses a nucleotide base. For each
, we create individual, recombinant-specific alignments Y(k) by preserving the rows of Y that correspond to recombinant k and its N(k) – 1 candidate parental sequences (possibly different for each recombinant) and removing the other rows. Sites where the recombinant sequence has a gap are not informative for recombination detection via the DMCP model and are removed from the individual alignments Y(k). Such gap removal establishes an identity between the lengths of putative recombinants and the number of sites in the individual alignments,
, and simplifies information sharing among individual data sets. We map individual alignments onto the master alignment with functions
![]() | (1) |
fk(j) for any i
j, the inverse fk–1 is defined on the range of fk that represents the set of sites in the master alignment where the kth recombinant has no deletions.
Dual multiple change-point model:
We assume that conditional on model parameters
(k), each alignment Y(k) is drawn independently from a DMCP model, i.e.,
![]() | (2) |
Columns
of each individual alignment Y(k) are assumed to evolve independently as a continuous-time Markov chain on the state space {A, G, C, T} (FELSENSTEIN 2004). For each site
, we parameterize the infinitesimal rate matrix
of the Markovian substitution process in terms of its stationary distribution
and a transition/transversion rate ratio
following HASEGAWA et al. (1985). To reduce the number of nuisance parameters in the model, we fix all
to the overall observed nucleotide frequencies in Y(k) (LI et al. 2000). This leaves us with one free parameter
defining the substitution matrix
. To complete the phylogenetic model specification, we need a bifurcating tree topology
describing the historical relationships among nucleotides, with branch lengths
representing the expected number of substitutions between the bifurcation events. We further reduce the number of free parameters in the model by integrating
out of the likelihood through assuming an exponential prior on each branch length
for all
. Therefore, the likelihood of site s in recombinant k is a function of three phylogenetic parameters
.
To model variation of the phylogenetic parameters along the columns of Y(k), we assume that the parameters are piecewise constant in s with jumps occurring at unknown change points. We first introduce a set of topology breakpoints
, where M(k) is the unknown number of recombination breakpoints for recombinant k, and
, for all
. Since topologies can attain only a finite set of values we require that
, for all
. Similarly we introduce a set of change points
for substitution process parameters and assume that
and
are constant between change points. In summary, our DMCP model for each recombinant k is defined by a set of parameters
(k) = (
(k),
(k), µ(k),
(k),
(k)), where
,
,
,
, and
, and the varying dimensionality of the parameter space is determined by M(k) and J(k).
Priors for nuisance parameters:
Since our interest in this article is the recombination breakpoints
(k), we collect all other parameters for each recombinant into a vector
(k) = (
(k), µ(k),
(k),
(k)) and refer to them as nuisance parameters. We define a prior distribution for nuisance parameters by assuming substantial prior independence, specifically
. We assume a noninformative prior for
over E(k) possible tree topologies, relating recombinant k with its potential "parents." The space of topologies permissible under the DMCP model is formed as described in MININ et al. (2005). Constraints on adjacent topologies are incorporated using a simple Markovian structure
![]() | (3) |
(k) is specified by first assuming that J(k) follows a truncated Poisson distribution with a predefined, constant intensity
and then giving equal prior probabilities to all possible draws of J(k) integers from the set
,
![]() | (4) |
for all individual alignments as putative recombinant sequences are derived from the same genomic region and therefore should have an approximately equal number of changes in evolutionary pressure. Substitution parameters are a priori log-normally distributed,
,
, where 
, 
,
µ, and
µ are either estimated in a hierarchical framework or fixed according to our prior knowledge about sequence variability in the genomic region under study. For more details on specifying the prior distribution for nuisance parameters
(k) see MININ et al. (2005).
Spatially smoothed prior for recombination locations:
To specify prior probabilities for recombination breakpoint locations, we first switch from their point-process representation to site-specific recombination indicators
, where
,
,
, and 1{·} is the indicator function. For clarity of presentation we ignore the fact that the first site of an alignment cannot be a topology breakpoint according to our definition. Such reparameterization allows us to introduce recombination probabilities
on the master alignment and then map them onto individual recombinants using functions (1) to define a prior distribution for breakpoint locations,
![]() | (5) |
![]() | (6) |
and define the total number of recombination breakpoints at site s,
, for
, then Equation 6 simplifies to
![]() | (7) |
Because in practice the total number of observed breakpoints is smaller than the number of sites S by one to two orders of magnitude, estimation of the common recombination probabilities p is unrealistic without further assumptions about their prior distribution. Since HIV recombination is mediated by the enzyme reverse transcriptase that processes nucleotides sequentially (NEGRONI and BUC 2001), we argue that recombination probabilities should have similar values at adjacent locations. To model such spatial dependency among components of p, we first obtain recombination log odds
, where
![]() | (8) |
![]() | (9) |
, where the precision matrix
![]() | (10) |
, where
, I is the S x S identity matrix, and
is a small positive constant. Note that the addition of a positive constant to the diagonal elements of Q preserves the precision matrix sparseness, but forces
to be diagonally dominant and therefore positive definite. The proper approximation introduces an additional term,
, to the exponent of density (9). In all examples, we use
= 10–6 such that this term
0.05, assuming
for all s.
In addition to providing spatial preferences for breakpoint locations, the vector of recombination probabilities p defines the prior distribution for the total number of breakpoints
for each alignment k. It is important to put more prior mass on small values of M(k) to avoid inferring spurious breakpoints from noisy sequence data. The original DMCP model assumes that M(k) is truncated-Poisson distributed with a rate chosen in such a way that Pr(M(k) > 0) is equal to a predefined constant, usually 0.5. Similarly, in our hierarchical formulation, we want to control the overall probability of at least one recombination breakpoint in all individual alignments by imposing certain constraints on p. We first note that our site-specific prior on R(k) imposes a Poisson-binomial distribution for M(k) with small probabilities of success, usually on the order
. Therefore, le Cam's theorem implies that the distribution of M(k) is approximately Poisson with rate
(LE CAM 1960). For some constant c, we can set
k = –ln(1 – c), so that
. Because restricting recombination probabilities for each recombinant individually is impractical, we impose our constraint on the population-level recombination probabilities,
. Since
, this population-level restriction implies a more conservative prior distribution for the number of breakpoints in each individual data set k with Pr(M(k) > 0)
c.
We complete our model specification by assuming a priori that
(
, β). Following BERNARDINELLI et al. (1995) we express our prior belief about
in terms of a ratio of recombination probabilities
. On the basis of in vitro HIV recombination detection experiments (MOUMEN et al. 2001; DYKES et al. 2004; GALETTO et al. 2004) we expect that site recombination probabilities should not vary more than sevenfold or equivalently that recombination log odds should not differ by >2. Since our smoothing prior implies that
, setting
= S – 1 ensures that even the most physically distant log odds do not deviate from each other by >2 with probability 0.95. Therefore, we fix the prior mean
/β = S – 1 and choose β to be a small constant (0.01 in the simulation study and 0.02 in the analysis of HIV recombinants).
Inference via MCMC simulation:
To approximate the analytically intractable posterior distribution of all model parameters
![]() | (11) |
In the first block, we simulate from the full conditional distribution of all individual alignment parameters. The hierarchical structure of our model immediately implies the conditional independence of
(k)s,
![]() | (12) |
![]() | (13) |
The second block of parameters consists of the recombination log-odds vector
and the GMRF precision
. Conditioning on the parameters of the individual alignments yields
![]() | (14) |
and trials
are as defined in Equation 7, and
![]() | (15) |
![]() | (16) |
To sample from distribution (14), we rely on the strategy introduced by RUE (2001) and KNORR-HELD and RUE (2002) and update (
,
) simultaneously. Following their scheme, we first propose a new value for the precision parameter
* =
u, where
is the current precision and u is a random variable with density Pr(u)
1 + 1/u, defined on the interval [1/U, U], U > 1. This proposal is symmetric and can be tuned by the constant U that controls the "length" of proposal jumps. Given a new value of the precision, we then generate a proposal for the vector of log-odds
* from a multivariate Gaussian distribution that approximates Pr(
| C, T,
*) near its mode, where
![]() | (17) |
'. Log concavity of density (17) guarantees at most one mode. Then, a second-order Taylor approximation of ln Pr(
| C, T,
*) around
' generates the proposal mean and precision matrix and concludes the Gaussian proposal construction. The proposed values (
*,
*) are accepted or rejected jointly with probability given by the Metropolis–Hastings acceptance ratio. The computational efficiency of this multivariate proposal follows from the special shape of density (17). Note that, during construction of the Gaussian approximation, it is sufficient to apply the Taylor approximation only to the function
. Since all mixed derivatives of this function are zero, the off-diagonal elements of
are equal to the off-diagonal entries of the Gaussian proposal precision matrix. Therefore, the multivariate normal proposals retain the same sparseness of Q and can be efficiently realized using fast methods of Cholesky decomposition for sparse matrices. For more details on approximating densities of the form similar to (17), see RUE (2001) and RUE et al. (2004).
Implementing prior constraints:
We now turn to the problem of incorporating the imposed restrictions on recombination probabilities into our MCMC algorithm. Implementing a proposal that approximates (17) well while satisfying nonlinear constraint (16) is difficult. However, if we can replace constraint (16) with a linearized form
for some vector
and scalar e, then we can use unconstrained Gaussian proposals as before to generate a candidate state
and recenter the proposal to satisfy the linear constraint via
![]() | (18) |
and
are obtained via a Taylor expansion of ln Pr(
| C, T,
*). Such recentering comes at minimal computational cost as the Cholesky factorization of
, needed in the unconditional proposal, can be reused to perform the algebraic operations in Equation 18 (RUE and HELD 2005).
To arrive at specific values of a and e, we linearize the function
around an arbitrary point
. Plugging this linearization into Equation 16 yields
![]() | (19) |
![]() | (20) |
. To make an intelligent guess about posterior support of recombination log odds, we generate a short "training chain" prior to running our MCMC sampler. During these training iterations, we alternate between sampling from the full conditional distributions defined by (12) and (14) with one heuristic modification that allows us to control the overall recombination probability implicitly. To understand the motivation behind this heuristic, we first point out that
, where the latter approximation holds when the number of gaps in the master alignment is small. Therefore, if a new state of
(i) is accepted at the ith iteration, then a new value of
should be close to
. To heuristically control the overall recombination probability via the binomial pseudolikelihood (15), we update the vector of trials such that
for all s at iteration i, where
x
denotes the largest integer that does not exceed x. After training runs are complete we set the approximation point v to an arithmetic average of simulated components of
.
In all following analyses we set the prior probability of at least one recombination c = 0.5, and therefore we aim at preserving the condition
. Table 1 shows posterior medians and 95% Bayesian credible intervals (BCIs) of the overall recombination probability,
, for all analyzed data sets. Although in each case the posterior distribution of the overall recombination is slightly shifted to the right from 0.693, this shift and the spread of the distribution are quite small. Therefore, we conclude that our linear approximation performs well.
|
| RESULTS |
|---|
|
|
|---|
, of this random variable. For each Di,
, we create a new sequence alignment by permuting the nucleotides of H and L in sites Di through 888. In these newly formed alignments, sites from 1 to Di – 1 should support the phylogeny (H, O, (S, L)), while the other portion of the alignment should favor the phylogeny (L, O, (S, H)), obtained by exchanging H and L in the original tree. We generate only 30 recombinants since this quantity represents well the number of recombinant sequences typically available for analysis. When sample size is small relative to the number of sites covered by putative recombinants, the strength of a hotspot plays a critical role in our ability to recover the region. Therefore, we examine performance of our model for different hotspot probability mass values A = 0.9, 0.7, 0.5, and 0.3. The top left plot in Figure 2 shows the artificially generated probabilities used to simulate recombination events. The remaining plots in the left column of Figure 2 depict posterior medians (solid lines) and 95% BCIs (shaded areas) of recombination probabilities estimated under different true values of hotspot strength A. Solid dots mark true recombination sites where sequences H and L begin their permutation. We see that our model successfully identifies hotspots in the presence of a strong signal. On the other hand, when A = 0.3, simulated breakpoint locations are hardly distinguishable from a random sample from all 888 sites, and our method aptly detects no hotspots. This conservative behavior of our estimation procedure is adequate and, moreover, desirable to avoid erroneous detection of recombination hotspots.
|
for each value of A. Note that when significant clustering of breakpoints is observed, the posterior mass of
concentrates closer to zero. We expect such behavior since greater variability in recombination log odds, supported by data, leads to a decrease of smoothness. Additionally, the bottom plot shows that the prior of
dominates the posterior when true breakpoints are distributed nearly uniformly.
Finally, we test the ability of our model to recover the strength of a hotspot A. For each value of A, Table 2 reports the true proportion of simulated breakpoints contained in the interval [401, 600], the posterior median and 95% BCI of the normalized probability masses,
and
. The normalization is necessary for comparison of the simulated and estimated recombination probabilities as the former sum to one by construction and the latter sum to
ln 2 due to the enforced constraint. Mass
consistently underestimates the strength of the hotspot with a 95% BCI covering the true value of A only when A = 0.7. However, if we expand the region by 50 sites in both directions, the posterior of
more accurately reflects the true strength of the hotspot. This indicates that uncertainty in estimated breakpoint locations leads to an overestimation of the size of the simulated hotspot region.
|
In the top plot of Figure 3 we show the locations of gene products Matrix and Capsid in the master alignment of gag and indicate the position of one of the HIV instability elements (INSs). INSs are RNA sequence motifs involved in post-transcriptional regulation of the HIV gene expression (MIKAÉLIAN et al. 1996). It is possible that INS primary or secondary structure promotes recombination. Below the gene map we depict the posterior medians and 95% BCIs of the population-level recombination probabilities. This recombination profile strongly suggests an
300-nucleotide-long hotspot near the beginning of the Capsid coding region. The posterior median of the GMRF
precision amounts to 418, and the 95% BCI of
is (225, 721).
|
A cluster of several breakpoints at the end of Capsid does not substantially elevate the corresponding population-level recombination probabilities. Recombination signal in this region comes only from six recombinants. All of these breakpoints are located at the very end of individual alignments and some of them represent noise, associated with topological uncertainty, rather than recombination events. Figure 3 demonstrates that the posterior support of all breakpoints in this cluster decreases during the joint analysis, resulting in either a shift of their estimates or an elimination of weakly supported breakpoints.
We also compare the joint and independent analyses with respect to their estimates of the total number of breakpoints in each recombinant. We plot the posterior mean number of breakpoints, obtained using both approaches, in Figure 4. Great variability in M(k) among individual recombinants highlights the importance of allowing flexibility in the number of breakpoints. Most data sets exhibit a slight increase in a posteriori supported number of breakpoints during the joint analysis, but the overall pattern remains unchanged between the two types of analysis. To investigate the cause of the increased support, we compare recombination profiles (data not shown) of all individual recombinants, obtained via the joint and independent approaches. We find that in all cases the increase occurs due to higher values of population-level recombination probabilities in the "hot" portion of gag boosting the posterior confidence in breakpoints located in this region that are weakly or moderately supported under the independent analysis with a flat recombination prior. Therefore, the informative recombination prior does not introduce false breakpoints, but rather amplifies the existing signal inside recombination hotspots. This amplification can be clearly seen by comparing the "skylines" in the bottom two plots of Figure 3.
|
, the time steps at which the Markov chain visits a predefined state (or a set of states), where n is the random number of total visits observed during an MCMC run of fixed length. MYKLAND et al. (1995) note that the behavior of a renewal process defined by regeneration times of a Markov chain may be used to test the performance of an MCMC sampler. The authors suggest plotting ti/tn against i/n. According to the law of large numbers for renewal processes, this scaled regeneration quantile (SRQ) plot should be close to a line passing through points (0, 0) and (1, 1). Since the total number of breakpoints is only a marginalization of the complete Markov chain state, regeneration times of M(k) are not independent and identically distributed (i.i.d.). However, LI et al. (2000) show that the same interpretation of SRQ plots remains useful even when regeneration times are not strictly i.i.d.
We evaluate the performance of our sampler on the HIV data set with 42 recombinants. For each
, we choose the posterior median of M(k) to be the renewal state for defining regeneration times ti. We show 42 superimposed SRQ plots in Figure 5, left. Since all SRQ plots in Figure 5 are concentrated around the line y = x, we conclude that our MCMC chains are running long enough to sufficiently sample the posterior distributions of the individual-level parameters.
|
We calculate PSRFs for the recombination log odds from five chains, each started with different values of
and
. We generate initial values
(0) from a uniform distribution over the interval (0, 10,000). We then sample a value for
from a normal distribution with mean
and variance of 2. Conditional on
(0) and
, we initialize the remaining recombination log odds through a random-walk realization,
, for
. Such a distribution of starting values for (
,
) should be overdispersed with respect to the posterior, as recommended by GELMAN and RUBIN (1992). Figure 5, right, depicts the PSRFs (solid line) with their 97.5% quantiles for the HIV example recombination log odds. The PSRF and its 95% quantile for ln
are 1.04078 and 1.09845, respectively. Close proximity of all estimated PSRFs to 1 suggests that all chains reach stationarity.
| DISCUSSION |
|---|
|
|
|---|
Breakpoints in the DMCP model require special attention as their total number needs to be individually controlled in the presence of noisy sequence information. The common prior distribution provides such oversight. We constrain the sum of recombination probabilities to impose an approximately Poisson distribution with a fixed rate on the total number of recombination events in each data set. Such a seemingly trivial modification considerably complicates our MCMC implementation as the modification changes the a priori correlation structure of the recombination log odds, which without constraints is dictated solely by the GMRF hyperprior. To overcome this difficulty, we introduce a linearized constraint for the recombination log odds that approximates our original restrictions on the recombination probabilities. The advantage of such a linear approximation is the ease and computational efficiency of incorporating it into our MCMC transition kernels. We demonstrate that our linear constraint achieves desirable behavior both for the recombination probabilities and for the total number of breakpoints in individual alignments.
The analysis of the HIV gag genomic region strongly suggests a recombination hotspot near the beginning of the Capsid coding region. Since local sequence motifs have been long suspected to promote HIV recombination (BALAKRISHNAN et al. 2001; MOUMEN et al. 2001, 2003; NEGRONI and BUC 2001; GALETTO et al. 2004), we examine this part of the HIV genome for the presence of known motifs. One of the HIV instability elements, denoted as INS2-M6 by SCHNEIDER et al. (1997), covers sites [564, 609] of our master alignment (see Figure 3). We hypothesize that either primary or secondary structure of this RNA segment promotes formation of a recombination hotspot in the Capsid coding region. This hypothesis grows even more promising in light of preliminary experimental results confirming an increased rate of in vitro reverse transcriptase strand transfer in the Capsid hotspot (S. CARPENTER, personal communication).
Selection of recombinant sequences for hotspot mapping can bias results and therefore should be performed with caution. For example, several sequences may be descendants of the same ancestral recombinant. Including such recombinants into the analysis would violate our assumption of independence among recombination events, leading to overcounting of breakpoints in some regions of the master alignment. Researchers should pay particular attention to circulating recombinant forms (CRFs) since by definition they may be overrepresented in a population sample. To check for this possibility, we examined CRFs with recombination between A and G subtypes in the gag coding region and found that no known CRF contributes breakpoint signal at the hotspot that we identified from the 42 HIV gag recombinants (data not shown). Another danger comes from the fact that individual recombinants usually cover different portions of the master alignment. Although site-specific trials T account for such uneven coverage, the breakpoint noise, often seen at the boundaries of individual alignments, may be amplified if many recombinants start or end in close proximity to each other.
Since the factors promoting HIV recombination in vivo are largely unknown, it is natural to capitalize on the flexible GMRF structure and incorporate covariates into the prior of recombination log odds, using a generalized linear model framework. Such an extension will not only improve estimation of hotspot locations by injecting additional information into the model, but also enable the testing of the role of specific sequence features in producing a nonuniform distribution of breakpoint locations along the HIV genome. Our model augmented with covariates should be superior to previous approaches that use phylogenetic recombination detection to test spatial association of recombination hotspots with local genomic RNA properties (MAGIORKINIS et al. 2003; ZHANG et al. 2005), as the hierarchical approach allows for integration over all breakpoint locations supported by molecular sequence data.
Finally, we outline future opportunities for bridging phylogenetic and coalescent-based methods for studying recombination. These two approaches are often considered competitors (AWADALLA 2003). In our opinion, phylogenetic and coalescent-based methods for studying recombination do not compete, but rather complement each other. Both frameworks provide sensible tools for analyzing recombination among sequences, but differ in the recombination/mutation rate ratio most appropriate for the chosen method. Moreover, it is not hard to envision a Bayesian model with a phylogenetic change-point likelihood controlling breakpoint locations and a coalescent-based prior forcing phylogenies to obey the laws of population genetics. This unified framework is particularly promising for studying recombination during HIV intrahost evolution as both phylogenetic and coalescent-based approaches have advantages to contribute when analyzing such sequence data.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
| LITERATURE CITED |
|---|
|
|
|---|
AWADALLA, P., 2003 The evolutionary genomics of pathogen recombination. Nat. Rev. Genet. 4: 50–60.[CrossRef][Medline]
BALAKRISHNAN, M., P. FAY and R. A. BAMBARA, 2001 The kissing hairpin sequence promotes recombination within the HIV-I 5' leader region. J. Biol. Chem. 276: 36482–36492.
BARLOW, K., I. TATT, P. CANE, D. PILLAY and J. CLEWLEY, 2001 Recombinant strains of HIV type 1 in the United Kingdom. AIDS Res. Hum. Retroviruses 17: 467–474.[CrossRef][Medline]
BERNARDINELLI, L., D. CLAYTON and C. MONTOMOLI, 1995 Bayesian estimates of disease maps: How important are priors? Stat. Med. 14: 2411–2431.[Medline]
BESAG, J., 1974 Spatial interaction and the statistical analysis of lattice systems (with discussion). J. R. Stat. Soc. Ser. B 36: 192–236.
BESAG, J., J. YORK and A. MOLLIÉ, 1991 Bayesian image restoration, with two applications in spatial statistics (with discussion). Ann. Inst. Stat. Math. 43: 1–59.[CrossRef]
CHIN, M., T. RHODES, J. CHEN, W. FU and W. HU, 2005 Identification of a major restriction in HIV-1 intersubtype recombination. Proc. Natl. Acad. Sci. USA 102: 9002–9007.
CHOISY, M., C. WOELK, J. GUEGAN and D. ROBERTSON, 2004 Comparative study of adaptive molecular evolution in different human immunodeficiency virus groups and subtypes. J. Virol. 78: 1962–1970.
DURALI, D., J. MORVAN, F. LETOURNEUR, D. SCHMITT, N. GUEGAN et al., 1998 Cross-reactions between the cytotoxic T-lymphocyte responses of human immunodeficiency virus-infected African and European patients. J. Virol. 72: 3547–3553.
DYKES, C., M. BALAKRISHNAN, V. PLANELLES, Y. ZHU, R. BAMBARA et al., 2004 Identification of a preferred region for recombination and mutation in HIV-1 gag. Virology 326: 262–279.[CrossRef][Medline]
ELLIOTT, P., J. WAKEFIELD, N. BEST and D. BRIGGS (Editors), 2000 Spatial Epidemiology: Methods and Applications. Oxford University Press, London/New York/Oxford.
FEARNHEAD, P., R. HARDING, J. SCHNEIDER, S. MYERS and P. DONNELLY, 2004 Application of coalescent methods to reveal fine-scale rate variation and recombination hotspots. Genetics 167: 2067–2081.
FELSENSTEIN, J., 2004 Inferring Phylogenies. Sinauer Associates, Sunderland, MA.
GALETTO, R., A. MOUMEN, V. GIACOMONI, M. VERON, P. CHARNEAU et al., 2004 The structure of HIV-1 genomic RNA in the gp120 gene determines a recombination hot spot in vivo. J. Biol. Chem. 279: 36625–36632.
GELMAN, A., and D. RUBIN, 1992 Inference from iterative simulation using multiple sequences. Stat. Sci. 7: 457–511.[CrossRef]
GHOSH, M., K. NATARAJAN, T. STROUD and B. CARLIN, 1998 Generalized linear models for small-area estimation. J. Am. Stat. Assoc. 93: 273–282.[CrossRef]
GRASSLY, N., and E. HOLMES, 1997 A likelihood method for the detection of selection and recombination using nucleotide sequences. Mol. Biol. Evol. 14: 239–247.[Abstract]
GREEN, P., 1995 Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82: 711–732.
GUO, H. G., M. S. REITZ, R. C. GALLO, Y. C. KO and K. S. CHANG, 1993 A new subtype of HIV-1 gag sequence detected in Taiwan. AIDS Res. Hum. Retroviruses 9: 925–927.[Medline]
HASEGAWA, M., H. KISHINO and T. YANO, 1985 Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22: 160–174.[CrossRef][Medline]
HEIN, J., 1990 Reconstructing evolution of sequences subject to recombination using parsimony. Math. Biosci. 98: 185–200.[CrossRef][Medline]
HUSMEIER, D., 2005 Discriminating between rate heterogeneity and interspecific recombination in DNA sequence alignments with phylogenetic factorial hidden Markov models. Bioinformatics 21: ii166–ii172.[Abstract]
KALISH, M., K. ROBBINS, D. PIENIAZEK, A. SCHAEFER, N. NZILAMBI et al., 2004 Recombinant viruses and early global HIV-1 epidemic. Emerg. Infect. Dis. 10: 1227–1234.[Medline]
KAUPPI, L., A. JEFFREYS and S. KEENEY, 2004 Where the crossovers are: recombination distributions in mammals. Nat. Rev. Genet. 5: 413–424.[CrossRef][Medline]
KNORR-HELD, L., and H. RUE, 2002 On block updating in Markov random field models for desease mapping. Scand. J. Stat. 29: 597–614.[CrossRef]
LARGET, B., and D. SIMON, 1999 Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. 16: 750–759.
LE CAM, L., 1960 An approximation theorem for the Poisson binomial distribution. Pac. J. Math. 10: 1181–1197.
LI, W., M. TANIMURA and P. SHARP, 1988 Rates and dates of divergence between AIDS virus nucleotide sequences. Mol. Biol. Evol. 5: 313–330.[Abstract]
LI, S., D. PEARL and H. DOSS, 2000 Phylogenetic tree construction using Markov chain Monte Carlo. J. Am. Stat. Assoc. 95: 493–508.[CrossRef]
MAGIORKINIS, G., D. PARASKEVIS, A. VANDAMME, E. MAGIORKINIS, V. SYPSA et al., 2003 In vivo characteristics of human immunodeficiency virus type 1 intersubtype recombination: determination of hot spots and correlation with sequence similarity. J. Gen. Virol. 84: 2715–2722.
MCGUIRE, G., F. WRIGHT and M. PRENTICE, 1997 A graphical method for detecting recombination in phylogenetic data sets. Mol. Biol. Evol. 14: 1125–1131.[Abstract]
MCVEAN, G., P. AWADALLA and P. FEARNHEAD, 2002 A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics 160: 1231–1241.
MCVEAN, G., S. MYERS, S. HUNT, P. DELOUKAS, D. BENTLEY et al., 2004 The fine-scale structure of recombination rate variation in the human genome. Science 304: 581–584.
MIKAÉLIAN, I., M. KRIEG, M. GAIT and J. KARN, 1996 Interactions of INS (CRS) elements and the splicing machinery regulate the production of Rev-responsive mRNAs. J. Mol. Biol. 257: 246–264.[CrossRef][Medline]
MININ, V., K. DORMAN, F. FANG and M. SUCHARD, 2005 Dual multiple change-point model leads to more accurate recombination detection. Bioinformatics 21: 3034–3042.
MOUMEN, A., L. POLOMACK, B. ROQUES, H. BUC and M. NEGRONI, 2001 The HIV-1 repeated sequence R as a robust hot-spot for copy-choice recombination. Nucleic Acids Res. 29: 3814–3821.
MOUMEN, A., L. POLOMACK, T. UNGE, M. VERON, H. BUC et al., 2003 Evidence for a mechanism of recombination during reverse transcription dependent on the structure of the acceptor RNA. J. Biol. Chem. 278: 15973–15978.
MYERS, S., L. BOTTOLO, C. FREEMAN, G. MCVEAN