Effect of DNA Sequence Divergence on Homologous Recombination as Analyzed by a Random-Walk Model
Youhei Fujitani (a) and Ichizo Kobayashi (b)
a Department of Applied Physics and Physico-Informatics, Faculty of Science and Technology, Keio University, Yokohama 223-8522, Japan
b Department of Molecular Biology, Institute of Medical Science, University of Tokyo, Tokyo 108-8639, Japan
Corresponding author: Youhei Fujitani, Department of Applied Physics and Physico-Informatics, Faculty of Science and Technology, Keio University, Yokohama 223-8522, Japan. E-mail: youhei{at}appi.keio.ac.jp
Communicating editor: N. TAKAHATA
| ABSTRACT |
|---|
A point connecting a pair of homologous regions of DNA duplexes moves along the homology in a reaction intermediate of the homologous recombination. Formulating this movement as a random walk, we were previously successful at explaining the dependence of the recombination frequency on the homology length. Recently, the dependence of the recombination frequency on the DNA sequence divergence in the homologous region was investigated experimentally; if the methyl-directed mismatch repair (MMR) system is active, the logarithm of the recombination frequency decreases very rapidly with an increase of the divergence in a low-divergence regime. Beyond this regime, the logarithm decreases slowly and linearly with the divergence. This "very rapid drop-off" is not observed when the MMR system is defective. In this article, we show that our random-walk model can explain these data in a straightforward way. When a connecting point encounters a diverged base pair, it is assumed to be destroyed with a probability that depends on the level of MMR activity.
MANY experimental studies have analyzed the relationship between the frequency of homologous recombination and the homology length, which ranges from some hundreds of base pairs up to ~20 kbp. In the minimal efficient processing segment (MEPS) theory, the recombination frequency for a homologous region of N bp without sequence divergence is a linear function of the homology length,

$$\Pi^{(M)}(D = 0, N) = c\,(N - M_{\mathrm{eps}} + 1), \tag{1}$$

where M_eps is the MEPS length and c is the constant of proportionality. The linear function thus obtained, however, was later found to disagree with the nonlinear dependence of the frequency on the homology length observed in a mammalian gene targeting system.
In contrast with the MEPS theory, our "random-walk model" was shown to explain the data from both systems.
The recombination frequency has been found to decrease as sequence differences are introduced into the homologous region; its logarithm appears to be reduced linearly with an increase of the divergence (the ratio of the number of diverged base pairs to the number of all base pairs in a region of homology between two DNA duplexes) for very long homologous regions (10⁶ to 10⁷ bp) in bacterial systems.
As described in the next section, these effects of the MMR system have been explained in terms of the MEPS theory, which has already failed to explain the nonlinear dependence of the recombination frequency on the homology length. Here we present an alternative explanation in terms of the random-walk model after a brief review of the original version of the random-walk model. Symbols we use frequently are listed in Table 1.
|
| PREVIOUS MODELS |
|---|
Assuming that a base pair at a particular position in a homologous region will be diverged with a probability equal to the divergence (D, 0 ≤ D ≤ 1), one can calculate the average recombination frequency to compare it with experimental data. We express this average over positions of diverged base pairs by putting the recombination frequency, denoted by Π, between the angle brackets, ⟨ and ⟩, in the following equations. The recombination frequency at D = 0 need not be averaged.
In the MEPS theory, initial enzymes are supposed to work only when they cling to a MEPS devoid of diverged base pairs; the recombination frequency is proportional to the number of ways of picking up a MEPS devoid of diverged base pairs from the homologous region (N bp in total). Thus,

$$\langle \Pi^{(M)}(D, N) \rangle = c\,(N - M_{\mathrm{eps}} + 1)\,(1 - D)^{M_{\mathrm{eps}}} = \Pi^{(M)}(D = 0, N)\,(1 - D)^{M_{\mathrm{eps}}}, \tag{2}$$

where the superscript (M) indicates a result in the framework of the MEPS theory, c is the constant used in Equation 1, and Π^(M)(D = 0, N) is the recombination frequency at D = 0 given by Equation 1. When D << 1, because e^(−D) ≈ 1 − D, we have

$$\langle \Pi^{(M)}(D, N) \rangle \approx \Pi^{(M)}(D = 0, N)\, e^{-M_{\mathrm{eps}} D}. \tag{3}$$
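As a rough numerical illustration of Equations 2 and 3, the sketch below evaluates the MEPS-theory average and its small-divergence exponential form. It assumes the reading in which the frequency is proportional to (N − M_eps + 1)(1 − D)^M_eps; the function names and the sample values of c and of the MEPS length are hypothetical choices made for this example, not values taken from the data.

```python
import math


def meps_frequency(D, N, meps, c=1.0):
    """MEPS-theory average: c * (number of MEPS windows) * P(window has no diverged bp)."""
    windows = N - meps + 1            # ways of placing a MEPS inside N bp
    return c * windows * (1.0 - D) ** meps


def meps_frequency_exponential(D, N, meps, c=1.0):
    """Small-divergence form obtained by replacing 1 - D with exp(-D)."""
    return c * (N - meps + 1) * math.exp(-meps * D)


if __name__ == "__main__":
    N, meps = 350, 30                 # hypothetical lengths in bp, for illustration only
    for D in (0.0, 0.01, 0.05):
        exact = meps_frequency(D, N, meps)
        approx = meps_frequency_exponential(D, N, meps)
        print(f"D={D:.2f}  product form={exact:.3f}  exponential form={approx:.3f}")
```

On a logarithmic scale both forms are close to a straight line in D with slope −M_eps, which is the behavior the MEPS theory predicts when the MMR system plays no role.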
The reaction, thus initiated, may be aborted by the MMR system. The MMR system would attack a mismatch, which is produced at a diverged base pair as the heteroduplex elongates. In this framework, the averaged recombination frequency is written as

$$\langle \Pi^{(M)}(D, N) \rangle \approx \Pi^{(M)}(D = 0, N)\, e^{-M'_{\mathrm{eps}} D}, \tag{4}$$

where the modified MEPS length, M′_eps, depends on the level of MMR activity. Equation 4 implies that the logarithm is a linear function of D with the slope dependent on the level of MMR activity. When the probability with which the MMR system is not triggered is written as R0 e^(−βD), the probability with which the MMR system is triggered is given by 1 − R0 e^(−βD). The authors of that model introduced a factor f denoting the probability with which the reaction is aborted after the MMR system is triggered and expressed the averaged recombination frequency as a function of D, N, and f:

$$\langle \Pi^{(M)}(D, N, f) \rangle = \Pi^{(M)}(D = 0, N, f = 0)\, e^{-M_{\mathrm{eps}} D}\,\{1 - f\,(1 - R_0 e^{-\beta D})\}. \tag{5}$$
They fitted Equation 5 to their experimental data (N = 350) for the wild-type strains showing the very rapid drop-off to obtain f = 0.97. When f = 0, Equation 5 is equivalent to Equation 3, which can explain the data for the Mmr⁻ strains showing no very rapid drop-off. Equation 5 gives different values to the recombination frequency between identical substrates in the wild-type strains, ⟨Π^(M)(D = 0, N = 350, f = 0.97)⟩, and to that in the Mmr⁻ strains, ⟨Π^(M)(D = 0, N = 350, f = 0)⟩, which agrees with their data.
We feel that ![]()
| THE RANDOM-WALK MODEL |
|---|
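Here we review the original version of the random-walk model. We consider an identical region of n sites, suppose that a connecting point is produced with a small probability per site, and neglect cases where more than one connecting point is produced in a relatively short identical region (n times this probability << 1). A "randomly walking" connecting point is assumed to be processed somewhere within the region. Here, "being processed" includes "being resolved to a recombinant" and "being destroyed" (i.e., "disappearing without yielding a recombinant"). We write k (0 < k ≤ 1) for the conditional probability of resolution given that a connecting point is processed. A connecting point is assumed to be destroyed whenever it encounters either end of the homology. This is the condition of a totally absorbing boundary. The probability distribution of the connecting point then obeys

$$
\begin{aligned}
\frac{dp_j(t)}{dt} &= g\,\{p_{j+1}(t) + p_{j-1}(t) - (2 + h)\,p_j(t)\} \quad (2 \le j \le n - 1),\\
\frac{dp_1(t)}{dt} &= g\,\{p_2(t) - (2 + h)\,p_1(t)\},\\
\frac{dp_n(t)}{dt} &= g\,\{p_{n-1}(t) - (2 + h)\,p_n(t)\},\\
\frac{dp_*(t)}{dt} &= k g h \sum_{j=1}^{n} p_j(t),
\end{aligned} \tag{6}
$$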
where p_j(t) denotes the probability distribution of a connecting point at a (real) site j (1 ≤ j ≤ n) at time t, and p_*(t) is this probability distribution at an imaginary site * (Figure 3A). This site represents the state at which a homologous recombinant has been formed. The parameter g is the transition probability per unit time (or transition rate) of the random walk; h is the ratio of the probability with which a random walker (a connecting point) is processed per site per unit time to g. The assumption adopted here that g, h, and k are site-independent is appropriate when the homologous region is devoid of sequence divergence. We assume that the recombination frequency is measured after a long enough time in the experiments.
Suppose first that a connecting point is produced at a real site m, and the initial condition is given by p_j(0) = 0 for j ≠ m and p_m(0) = 1. The solution p_j(t) of Equation 6 depends on m and the number of the sites n; we use a superscript (m;n) to express this dependence. As derived in Appendix A, the recombination frequency after a long enough time is given by
![]() |
(7) |
![]() |
(8) |
where
![]() |
(9) |
Here, sinh and cosh, as well as tanh and coth appearing below, are the hyperbolic functions. Because a connecting point is actually produced with probability
per site, the recombination frequency is given by
![]() |
(10) |
When h << 1, we have
![]() |
(11) |
![]() |
(12) |
as described in Appendix A. The expression in the lower line of Equation 12 apparently coincides with the linear function given by Equation 1. One can see that the parameter h, named "relative probability of intermediate processing," is a key parameter here, instead of the MEPS length in the MEPS theory.
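The behavior reviewed in this section can be checked numerically. The sketch below assumes the rate equations as described in the text (hopping rate g to each neighbor, processing rate gh, resolution fraction k, totally absorbing ends). The closed form it uses was derived independently from those assumptions and is offered as a plausible reading of Equations 8 and 9, not a transcription of them; all function names are hypothetical.

```python
import math


def xi_of(h):
    """Decay constant of the lattice walk: cosh(xi) = 1 + h/2, i.e. sinh(xi/2) = sqrt(h)/2."""
    return 2.0 * math.asinh(math.sqrt(h) / 2.0)


def resolution_closed_form(m, n, h, k=1.0):
    """Probability that a walker produced at site m of n sites is resolved to a recombinant."""
    xi = xi_of(h)
    return k * (1.0 - math.cosh(xi * (n + 1 - 2 * m) / 2.0) / math.cosh(xi * (n + 1) / 2.0))


def resolution_numeric(m, n, h, k=1.0):
    """Same quantity from the time-integrated rate equations (tridiagonal solve).

    With T_j the mean time spent at site j in units of 1/g:
        (2 + h) T_j - T_{j-1} - T_{j+1} = delta_{jm},  T_0 = T_{n+1} = 0,
    and the resolved fraction is k * h * sum_j T_j, so g drops out.
    """
    diag = [2.0 + h] * n
    rhs = [0.0] * n
    rhs[m - 1] = 1.0
    for j in range(1, n):                  # forward elimination; off-diagonals are -1
        diag[j] -= 1.0 / diag[j - 1]
        rhs[j] += rhs[j - 1] / diag[j - 1]
    t = [0.0] * n
    t[n - 1] = rhs[n - 1] / diag[n - 1]
    for j in range(n - 2, -1, -1):         # back substitution
        t[j] = (rhs[j] + t[j + 1]) / diag[j]
    return k * h * sum(t)


def summed_resolution(n, h, k=1.0):
    """Resolution probability summed over all starting sites m = 1..n."""
    return sum(resolution_closed_form(m, n, h, k) for m in range(1, n + 1))


if __name__ == "__main__":
    h = 3.0e-5                             # illustrative value of h
    for n in (10, 50, 200, 1000, 3000):
        s = summed_resolution(n, h)
        print(f"n={n:5d}  sum={s:.4e}  cubic h*n^3/12={h * n**3 / 12:.4e}  linear n={n}")
```

For n much smaller than h^(−1/2) the sum follows the third power of n, and for n much larger it approaches n itself, which is the crossover between the two lines of Equation 12.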
| THEORY FOR THE VERY RAPID DROP-OFF |
|---|
Here we explain why the very rapid drop-off was observed. In curve fitting, we use χ² = Σ_i (y − y_i)² as a measure of the goodness of fit, where y_i is the data value (the natural logarithm of the recombination frequency) for the ith data point and y is the value of a theoretical curve at the point. The results are summarized in Table 2.
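The goodness-of-fit measure defined above is an unweighted sum of squared residuals on the logarithmic scale. A minimal sketch follows; the function name and the sample numbers are hypothetical and are not data from the experiments.

```python
def chi_square(log_freq_data, log_freq_model):
    """Sum over data points of (y - y_i)^2, with y the model value and y_i the data value."""
    return sum((y - yi) ** 2 for y, yi in zip(log_freq_model, log_freq_data))


if __name__ == "__main__":
    data = [-14.0, -15.2, -16.1]      # hypothetical ln(frequency) values
    model = [-14.1, -15.0, -16.3]     # hypothetical curve values at the same divergences
    print(chi_square(data, model))
```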
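As in the previous models, we suppose that the MMR system, if active enough, aborts the recombination at a mismatch; a connecting point is assumed to be always destroyed when it encounters a diverged site. In the range where the frequency depends on the third power of the length (Equation 12), an identical subregion of half the original length yields (1/2)³ = 1/8 of the frequency for zero divergence, and one of a third of the original length yields (1/3)³ = 1/27. Because 1/27 > (1/8)², the frequency-drop from no diverged base pairs to one diverged base pair is more "rapid" than that from one diverged base pair to two diverged base pairs. It is probable that the random-walk model thus explains the very rapid drop-off.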
Let us examine this scenario. Suppose that one connecting point is produced initially at the lth site (say, from the left end) of a homologous region with N sites. This region may be divided into some identical subregions by diverged sites, each of which plays the role of a totally absorbing boundary. Suppose that this lth site is an identical site (i.e., a site of an identical base pair), and we define F_l(m, n) (1 ≤ l ≤ N, 1 ≤ m ≤ n, 1 ≤ n ≤ N) as the probability with which the connecting point is produced at the mth site of an identical subregion with n sites. The identical subregion lies between diverged sites (Figure 4A), lies between a diverged site and either end of the homologous region (Figure 4B), or coincides with the entire homologous region. In the first case, we have F_l(m, n) = D²(1 − D)ⁿ because n bp are identical with probability (1 − D)ⁿ and 2 bp at both ends are diverged with probability D². In the second case, we have F_l(m, n) = D(1 − D)ⁿ because 1 bp at an end need not be diverged. Which case we have is determined by the relationship among l, m, n, and N as shown in Appendix B.
Noting that Equation 8 gives the probability of resolution of the connecting point considered above, we can express the averaged recombination frequency in the homologous region by
![]() |
(13) |
where we added the superscript + to indicate that this expression is valid when the MMR system is active enough. Note that
, defined by Equation 9, depends on only h. By setting D = 0 in Equation 13, we recover Equation A12 with n replaced by N.
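The averaging over positions of diverged sites can be sketched numerically as follows. This is an illustration under stated assumptions, not a transcription of Equation 13: diverged sites are treated as totally absorbing, each maximal identical subregion of n sites is weighted by D²(1 − D)ⁿ or D(1 − D)ⁿ as described above, and the per-subregion resolution probability uses a closed form derived from the rate equations as described in the text. The result is expressed per unit of the product of k and the production probability per site; the function names are hypothetical.

```python
import math


def summed_resolution(n, h):
    """Resolution probability (per unit k) summed over the n starting sites of an
    identical subregion whose two neighbors are totally absorbing."""
    xi = 2.0 * math.asinh(math.sqrt(h) / 2.0)          # cosh(xi) = 1 + h/2
    half = xi * (n + 1) / 2.0
    return sum(1.0 - math.cosh(half - xi * m) / math.cosh(half) for m in range(1, n + 1))


def averaged_frequency(D, N, h):
    """Average over positions of diverged sites, per unit (k * production probability)."""
    total = (1.0 - D) ** N * summed_resolution(N, h)   # the whole region is identical
    for n in range(1, N):
        s = summed_resolution(n, h)
        same = (1.0 - D) ** n
        total += 2.0 * D * same * s                    # subregion touching one end of the homology
        total += (N - n - 1) * D * D * same * s        # subregion between two diverged sites
    return total


if __name__ == "__main__":
    N, h = 350, 3.0e-5                                 # illustrative values
    base = averaged_frequency(0.0, N, h)
    for D in (0.0, 0.005, 0.01, 0.05, 0.1, 0.2):
        ratio = averaged_frequency(D, N, h) / base
        print(f"D={D:.3f}  ln(ratio to D=0)={math.log(ratio):.3f}")
```

Plotting the printed logarithm against D shows a steep initial fall followed by a gentler decline, the qualitative shape discussed in this section; changing h alters the shape, whereas the overall prefactor only shifts the curve.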
The value of ⟨Π⁺(D, N)⟩ divided by the product of k and the production probability per site is independent of the value of this product. Thus, when we plot ln⟨Π⁺(D, N)⟩ against D, we can only shift the curve upward or downward by increasing or decreasing this product, respectively, with the curve shape remaining the same. The parameter h also influences the overall position of the curve because the intercept, i.e., the logarithm at D = 0, is given by the logarithm of Equation 12 with n replaced by N. The curve shape depends not on this product but on h.
We have two fitting parameters in Equation 13: h and the product of k and the production probability per site. Curve fitting gives the value 3.4 × 10⁻⁹ for this product (χ² = 7.3). The fitted curve can follow the very rapid drop-off shown by the data (Figure 5). The previous model has five parameters, Π^(M)(D = 0, N, f = 0), M_eps, f, R0, and β, of which the last four parameters are responsible for the curve shape. Their fit (χ² = 1.8) is better than ours.
The homology length (350 bp) is found to be comparable to the characteristic length determined by h, 1.8 × 10², around which the shift in the dependence should occur as shown by Equation 12. Even so, the calculated ratio of the frequency for one diverged base pair to that for zero divergence, ⟨Π⁺(D = 1/350, N = 350)⟩/Π⁺(D = 0, N = 350) = 0.71, appears to be large as compared with the one-eighth mentioned in the second paragraph of this section. The reason is as follows. The one-eighth corresponds with the case where the diverged base pair is at the center of the homologous region in the third-power dependence range. The average ⟨Π⁺(D = 1/350, N = 350)⟩ is influenced not only by this case but also by the case where a diverged base pair is introduced near either end of the homologous region to give almost the same recombination frequency as Π⁺(D = 0, N = 350).
Thus, the random-walk model can offer a very straightforward explanation for the presence of the very rapid drop-off in the wild-type strains (Mmr⁺). The same mechanism can explain the map expansion phenomenon, R_ac > R_ab + R_bc, where each term implies the recombination frequency between two markers indicated by the letters of the subscript and loci of the markers a, b, and c are arranged in this order.
| THEORY FOR MMR-DEFECTIVE STRAINS |
|---|
Assuming that a connecting point is always destroyed at a diverged site unlike at an identical site, in the preceding section we were successful at explaining the very rapid drop-off. What we assumed is a kind of site dependence in the transition rates. Thus, we expect to explain the absence of the very rapid drop-off in the MMR-defective strains (msh2 msh3; solid circles in Figure 5) by similarly assuming site dependence in the transition rates. We assume that, when the MMR system is defective, a connecting point is a little more likely to be processed and destroyed at a diverged site than at an identical site; the resolution step could be affected by mismatches themselves.
As illustrated in Figure 6A, this model supposes that the potential felt by a random walker has the same "height" at the "hilltops." We assume that there are two kinds of heights of the valley bottoms: one for an identical site and the other for a diverged site (Figure 6A). The latter should be higher than the former because a connecting point is assumed to be a little more unstable at a diverged site. A random walker can reach a neighboring site after "climbing up" a lower "hill," i.e., with larger transition rate, when it starts from a diverged site than when it starts from an identical site. The probability distribution then obeys

$$
\begin{aligned}
\frac{dp_j(t)}{dt} &= g_{j+1}\,p_{j+1}(t) + g_{j-1}\,p_{j-1}(t) - (2 + h_j)\,g_j\,p_j(t) \quad (2 \le j \le N - 1),\\
\frac{dp_1(t)}{dt} &= g_2\,p_2(t) - (2 + h_1)\,g_1\,p_1(t),\\
\frac{dp_N(t)}{dt} &= g_{N-1}\,p_{N-1}(t) - (2 + h_N)\,g_N\,p_N(t),\\
\frac{dp_*(t)}{dt} &= \sum_{j=1}^{N} k_j g_j h_j\,p_j(t),
\end{aligned} \tag{14}
$$

where g_j, h_j, and k_j take the values g, h, and k, respectively, at an identical site, and take g′, h′, and k′, respectively, at a diverged site (Figure 6B). Without diverged sites, Equation 14 is reduced to Equation 6 with n replaced by N.
As in Equation 7 and Equation 10, the recombination frequency is given by
![]() |
(15) |
where the superscript (RT) indicates the recombination frequency for a set of transition rates of the random-trap type, and p_*^(m;N)(∞) is given by

$$p_*^{(m;N)}(\infty) = \sum_{j=1}^{N} k_j g_j h_j \int_0^{\infty} p_j^{(m;N)}(t)\,dt, \tag{16}$$

where p_j^(m;N)(t) is the solution of Equation 14 under the initial condition p_j(0) = 0 for j ≠ m and p_m(0) = 1. We have, from Equation 15 and Equation 16,
![]() |
(17) |
As shown later, Π^(RT)(N) is independent of g and g′. Because p_j^(m;N)(t) is a solution of the first three equations of Equation 14 and is independent of the production probability per site and of k, Π^(RT)(N) is invariant for any set of values of the production probability, k, and k′ as long as the product of k and the production probability and the ratio k′/k remain fixed. This is also the case with its average ⟨Π^(RT)(D, N)⟩; we can therefore regard h, this product, h′, and k′/k as the parameters of ⟨Π^(RT)(D, N)⟩. The shape of the curve of ln⟨Π^(RT)(D, N)⟩ depends not on this product but on h, h′, and k′/k, as the shape of the curve of ln⟨Π⁺(D, N)⟩ depended not on this product but on h.
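The independence from g and g′ can be made concrete with a small exact calculation. The sketch below assumes rate equations of the random-trap type as described in the text (each site j is left toward either neighbor at rate g_j and processed at rate g_j h_j, with totally absorbing ends). Integrating them over time gives a tridiagonal system for Q_j = g_j ∫ p_j dt in which the rates g_j no longer appear. Summing over all starting sites, the function names, and the argument layout are assumptions of this example.

```python
def solve_tridiagonal(diag, rhs):
    """Solve diag[j]*x[j] - x[j-1] - x[j+1] = rhs[j] by the Thomas algorithm."""
    n = len(diag)
    d = list(diag)
    r = list(rhs)
    for j in range(1, n):                    # forward elimination; off-diagonals are -1
        d[j] -= 1.0 / d[j - 1]
        r[j] += r[j - 1] / d[j - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for j in range(n - 2, -1, -1):           # back substitution
        x[j] = (r[j] + x[j + 1]) / d[j]
    return x


def random_trap_frequency(diverged, h, h_prime, k_ratio):
    """Resolution probability summed over all starting sites, per unit k.

    diverged: list of booleans, True at diverged sites.
    k_ratio:  k'/k, the relative resolution probability at a diverged site.
    """
    n = len(diverged)
    hs = [h_prime if d else h for d in diverged]
    weights = [k_ratio * h_prime if d else h for d in diverged]   # k_j h_j / k
    diag = [2.0 + hj for hj in hs]
    total = 0.0
    for m in range(n):                       # walker produced at site m (0-based)
        rhs = [0.0] * n
        rhs[m] = 1.0
        q = solve_tridiagonal(diag, rhs)
        total += sum(w * qj for w, qj in zip(weights, q))
    return total


if __name__ == "__main__":
    sites = [False] * 20 + [True] + [False] * 20   # one diverged site in the middle
    for h_prime in (1e-3, 1e-1, 1e1, 1e3):
        print(h_prime, random_trap_frequency(sites, 1e-3, h_prime, 0.0))
```

As h′ grows with k′/k = 0, the diverged site behaves more and more like a totally absorbing boundary, which is the limit considered for the Mmr⁺ strains.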
We simulate the dynamics described by Equation 14 with a computer (VT-Alpha 433S8/3N, 433 MHz cpu; Visual Technology, Tokyo). Suppose that a random walker is now at an identical site. According to Equation 14, the probability of its jump to either of the neighboring sites in a short time Δt is given by 2gΔt, and the probability of its being processed in this short time is given by ghΔt. Thus, on average, some action (i.e., jump to a next site or being processed) of the random walker at an identical site occurs in a short time Δt = 1/{g(2 + h)}. Similarly, a random walker at a diverged site takes some action in a short time Δt′ = 1/{g′(2 + h′)} on average. One time step (Monte Carlo step) in our simulation is made to correspond with this time interval Δt or Δt′ when the random walker is at an identical site or a diverged site, respectively. Thus, some action occurs at each time step in our simulation. A random walker jumps to one neighboring site with probability g/{g(2 + h)}, jumps to the other with probability g/{g(2 + h)}, and is processed with probability gh/{g(2 + h)} at each time step if it is at an identical site. If it is at a diverged site, the probabilities are g′/{g′(2 + h′)}, g′/{g′(2 + h′)}, and g′h′/{g′(2 + h′)}, respectively. This rule is modified at either end of the homology. Because these probabilities are independent of g and g′, we need not specify values of g and g′ to calculate the recombination frequency. This point is shown analytically in Appendix C.
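The Monte Carlo rule described above can be sketched as follows. This is a minimal illustration, not the original simulation code: the function names are hypothetical, the starting site is drawn uniformly, and the modification at either end of the homology is taken here to mean that a step beyond the first or last site destroys the connecting point (an assumption consistent with a totally absorbing boundary).

```python
import random


def simulate_walk(diverged, start, h, h_prime, k, k_prime, rng):
    """Follow one connecting point; return True if it ends as a recombinant."""
    n = len(diverged)
    j = start                                   # 0-based site index
    while True:
        hj = h_prime if diverged[j] else h
        u = rng.random() * (2.0 + hj)           # weights 1 (left), 1 (right), h_j (processed)
        if u < 1.0:
            j -= 1                              # jump to the left neighbor
        elif u < 2.0:
            j += 1                              # jump to the right neighbor
        else:
            kj = k_prime if diverged[j] else k  # processed: resolved with probability k_j
            return rng.random() < kj
        if j < 0 or j >= n:                     # stepped beyond either end: destroyed
            return False


def estimate_frequency(diverged, h, h_prime, k, k_prime, trials, seed=0):
    """Estimate the resolution probability summed over starting sites."""
    rng = random.Random(seed)
    n = len(diverged)
    hits = 0
    for _ in range(trials):
        start = rng.randrange(n)                # uniform starting site
        if simulate_walk(diverged, start, h, h_prime, k, k_prime, rng):
            hits += 1
    return n * hits / trials


if __name__ == "__main__":
    sites = [False] * 30 + [True] + [False] * 30
    print(estimate_frequency(sites, 1e-2, 1.0, 1.0, 0.0, trials=20000))
```

Only the ratios h and h′ enter the step probabilities, so the values of g and g′ never need to be specified, as stated in the text.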
We have introduced a set of transition rates of the random-trap type to analyze the data for the Mmr⁻ strains, but we should also be able to analyze data for the Mmr⁺ strains with Equations 14 through 17. We first analyze the data for the wild-type strains; we expect Equation 13 to be recovered as h′/h → ∞ and k′/k → 0. This expectation is verified in Figure 5; the cross symbols, which are obtained numerically from Equation 14 with large h′/h and k′ = 0, agree with the bottom solid curve obtained in the preceding section. This point is also discussed in the next section.
Let us now analyze ![]()
![]() |
(18) |
where
is defined by
(1 - D)h + Dh' and
is
of Equation 9 with h replaced by
.
Curve fitting to the data for the Mmr⁻ strains results in the fitted values h = 2.2 × 10⁻³, 8.4 × 10⁻⁹ for the product of k and the production probability per site, and h′ = 8.1 × 10⁻² with χ² = 1.2 × 10 (Figure 5). The fitted k′/k value varies from 10⁻⁷ to 10⁻⁴ depending on the initial condition of curve fitting; the curve shape is insensitive to k′/k so long as it is not too large. This is expected because k′/k appears only in the first term in the first braces of Equation 18, which is negligible as compared with the second term when k′/k is not too large. We also obtained simulation results with the same parameter values (Figure 5); the agreement between them and the fitted curve shows the validity of our decoupling approximation.
The line fit of the previous model gives χ² = 7.1. Their fit is better than ours, judging from the χ² value over the divergence range examined (0 ≤ D ≤ 0.26). Our curve is convex (i.e., its second derivative is positive) although the data appear to be concave as a whole; our curve deviates considerably from the data point at D = 0.26. Except for this data point, however, our curve can be fit to the data (χ² = 3.8) better than their line (χ² = 7.1).
| FOR LONGER SUBSTRATES |
|---|
We plot the logarithm of ⟨Π^(RT)(D, N = 350)⟩, changing the h′ value or changing the k′/k value (Figure 7, A and B). Using the same sets of parameter values, we plot the logarithm for N = 3500 in Figure 7, C and D.

We find that the curves, which the decoupling approximation yields for h′ = 2.0 × 10⁻³ and h′ = 2.0 × 10⁻² (i.e., the top two dashed curves in Figure 7, A and C), agree well with the corresponding simulation results. This is expected because we then have h′ − h << 1 (h = 3.0 × 10⁻⁵). We again find that the simulation results tend to Equation 13 as h′/h → ∞ and k′/k → 0 in each of Figure 7, A to D; the very rapid drop-off appears then.
We find that the corresponding curves for N = 350 and N = 3500 share almost the same shape. The curve shape is thus insensitive to N, probably because the horizontal axis represents the divergence. At the same divergence, the average interval between two neighboring diverged sites does not depend on the homology length. This average interval would mainly determine how frequently the connecting point encounters a diverged site and thus would mainly determine how the recombination frequency is reduced from that in the case of zero divergence.
Curve fitting of Equation 18 to the data for the Mmr⁻ strains gives, among the fitted values, 3.1 × 10⁻⁹ for the product of k and the production probability per site and h′ = 1.9 × 10⁻³ (χ² = 6.0 × 10⁻¹). The fitted k′/k value varies from 10⁻⁷ to 10⁻³ depending on the initial condition of curve fitting as in the preceding section. Line fitting to the data for the Mmr⁻ strains gives the fitted intercept −3.6 and the fitted slope −1.7 × 10 (χ² = 3.8 × 10⁻¹). These comparable χ² values demonstrate that our fit is as good as the line fit.
The fitted h value gives a characteristic length of 3.5 × 10², which is much smaller than N = 10⁷. Unless h changes drastically enough to make this length comparable to or much larger than N, the intercept is still determined approximately by N times the product of k and the production probability per site, as shown by the bottom line of Equation 12 with n replaced by N. The intercepts appear to be the same among the Mmr⁻ strains, the wild-type strains, and the Mmr⁺⁺ strains in Figure 8. We assume that the same value of this product is shared among the three types of strains; we expect that their h values are not drastically different.
Judging from our analysis of the data of ![]()
![]()
values as obtained for the Mmr⁻ strains (Figure 8). We find that the data point at D = 0.17 is not so far from the curve, but the overall agreement of the curve with the data is poor (χ² = 2.3 × 10). A line fit gives χ² = 4.7 × 10⁻¹ (Figure 8). This fit is much better than ours.
Let us fit Equation 13 to the data for the Mmr⁺⁺ strains with h being the only fitting parameter. Using the 433 MHz machine to perform the summation over N = 10⁷ in Equation 13, we obtain the fitted value h = 1.0 × 10⁻⁶ with χ² = 2.5 × 10 (Figure 8). The data for the Mmr⁺⁺ strains appear to show the very rapid drop-off, which is followed by our curve. In the previous analysis, this tendency was attributed to saturation of the MMR proteins without its formulation, and a line fit that excludes the extreme data point gives χ² = 3.0. In passing, if the extreme data point is included, the fitted intercept and slope are −5.9 and −7.1 × 10, respectively, with χ² = 2.9 × 10.
Our curves for the Mmr⁻ strains and for the Mmr⁺⁺ strains (the top and the bottom solid curves in Figure 8, respectively) appear to have the same intercept regardless of their different h values as expected. Comparing our curve for the Mmr⁺⁺ strains with that for the wild-type strains (the middle curve in Figure 8), we find that the slope near D = 0 is steeper, i.e., the very rapid drop-off becomes more prominent, as h decreases. This can be explained qualitatively as follows. As D increases in Equation 13, the whole homologous region is separated by a greater number of totally absorbing boundaries and the average length of an identical subregion becomes shorter. When the characteristic length determined by h is larger, even if D is small, more identical subregions can be in the third-power dependence range of Equation 12. This dependence causes the very rapid drop-off as discussed in the second paragraph of THEORY FOR THE VERY RAPID DROP-OFF.
Although the substrates are very long (~10⁷ bp), we have used the random-walk model with a single random walker. In other words, we still assumed that N times the production probability per site is much smaller than 1 in this section, as in Equation 6 and Equation 14. This is consistent with the fitted value of 3.1 × 10⁻⁹ obtained above for the product of k and the production probability per site.
| FURTHER DISCUSSION |
|---|
Suppose that the probability with which a connecting point is produced per site increases with the RecA concentration. As discussed, our curve of either ln⟨Π⁺(D, N)⟩ or ln⟨Π^(RT)(D, N)⟩ is then lifted with its shape remaining the same. Thus, the random-walk model can also explain the SOS-induced change of the intercept in a very straightforward way.
Table 2 summarizes the results of the curve fits. The χ² values tell that the curves in our model cannot be fit to the data better than those in the previous models, except for the Mmr⁺⁺ strains. However, this does not mean failure of our model. First, the previous models are based on the MEPS theory, which has failed to explain the nonlinearity between the recombination frequency and the homology length as discussed in the opening section. Second, the previous models cannot explain the very rapid drop-off well. In our model, the very rapid drop-off is explained with the parameters that also determine the dependence of the homologous recombination on the homology length in Equation 11. We have mentioned an agreement between the estimates in Equation 11 and Equation 13 in the paragraph next but one to that containing Equation 13. In particular, how the logarithm drops very rapidly from the intercept is determined by only one parameter h. This parameter, relative probability of intermediate processing, is also the key to the relationship between the recombination frequency and the homology length. This very simple explanation for the very rapid drop-off is our main result. The very rapid drop-off is not observed when the MMR system is defective.
We also assumed site dependence of the transition rates for the Mmr- strains of ![]()
![]()
Although we find that the very rapid drop-off becomes less prominent as a diverged site obstructs the homologous recombination less severely (Figure 7), our curve cannot be fitted to ![]()
![]()
We supposed that the MMR system, if active enough, detects mismatches to abort the homologous recombination as in ![]()
![]()
![]()
![]()























