Size of Donor Chromosome Segments Around Introgressed Loci and Reduction of Linkage Drag in Marker-Assisted Backcross Programs
Frédéric Hospital
Station de Génétique Végétale, INRA/UPS/INAPG, 91190 Gif sur Yvette, France
Corresponding author: Frédéric Hospital, INRA/UPS/INAPG, Ferme du Moulon, 91190 Gif sur Yvette, France. E-mail: fred{at}moulon.inra.fr
Communicating editor: C. HALEY
| ABSTRACT |
|---|
This article investigates the efficiency of marker-assisted selection in reducing the length of the donor chromosome segment retained around a locus held heterozygous by backcrossing. First, the efficiency of marker-assisted selection is evaluated from the length of the donor segment in backcrossed individuals that are (double) recombinants for two markers flanking the introgressed gene on each side. Analytical expressions for the probability density function, the mean, and the variance of this length are given for any number of backcross generations, as well as numerical applications. For a given marker distance, the number of backcross generations performed has little impact on the reduction of donor segment length, except for distant markers. In practical situations, the most important parameter is the distance between the introgressed gene and the flanking markers, which should be chosen to be as closely linked as possible to the introgressed gene. Second, the minimal population sizes required to obtain double recombinants for such closely linked markers are computed and optimized in the context of a multigeneration backcross program. The results indicate that it is generally more profitable to allow for three or more successive backcross generations rather than to favor recombinations in early generations.
IN a backcross breeding program aimed at introgressing a gene from a "donor" line into the genomic background of a "recipient" line, molecular markers could be used to assess the presence of the introgressed gene ("foreground selection").
Gene introgression through recurrent backcrossing can be used in various circumstances: (i) in plant or animal breeding to improve the agronomic value of a commercial strain by introgressing a mono- or oligogenic trait (typically, a resistance trait) from a wild relative or from another, less productive strain; (ii) to transfer a transgenic construction from one (transformed) strain to another (nontransformed) strain; or (iii) to construct near-isogenic or congenic lines, e.g., for the detection of quantitative trait loci (QTL) and/or the validation of candidate genes for such QTL. In examples (i) and (ii), a drastic reduction of the length of the donor segment surrounding the introgressed gene is important if undesirable genes are located close to the introgressed gene, as might be the case if the donor strain is a wild genetic resource. If introgression takes place between two commercial strains, then a drastic reduction of the length of the donor segment is not always important. In example (iii), such a reduction is always important. In cases where the reduction is important, one would like the donor segment remaining in the backcross progenies at the end of the program to be as short as possible. This article examines background selection against the donor segment on the carrier chromosome.
The article is divided into two parts. In the first part, I derive various statistics describing the length of the intact segment between the gene of interest and a flanking marker, as a function of markers' positions and number of backcross generations performed. These are original analytical results relevant not only to marker-assisted backcross programs but also to many studies related to introgression. In the second part, I study the minimal population sizes that are necessary to obtain a selection objective, as a function of markers' positions and number of backcross generations performed. Here, selection is applied on two flanking markers, one on each side of the introgressed gene, and the selection objective is to obtain a backcross progeny that carries the donor allele at the locus of interest and recipient alleles at both flanking markers (such an individual genotype is called "double recombinant" herein, regardless of whether the two recombination events took place at the same or at different generations).
This simple selection scenario was chosen as a case study for several reasons. First, it is of direct practical interest because it is often used in real backcross breeding programs, for example, in plant breeding. Second, it is the simplest case study that permits the investigation of the effects of three main parameters: the positions of the flanking markers, the number of backcross generations performed, and the number of individuals genotyped at each generation. Third, the results derived in this context can be used to evaluate the efficiency of more complex selection scenarios (e.g., selection on several flanking markers on each side of the introgressed gene, which is also addressed here).
Questions relevant to the study of the efficiency of such a breeding scheme are as follows: (i) What is the efficiency of marker-assisted selection for the reduction of the donor segment length, i.e., what is the length of this segment among double recombinants; (ii) what is the best position of the flanking marker to reach a given efficiency; (iii) how many individuals should be genotyped at each generation; and (iv) how many successive backcross generations should be performed?
| DEFINITIONS |
|---|
I consider a backcross breeding program aimed at introgressing a gene from a "donor" line into the genomic background of a "recipient" line. The program starts from an F1 hybrid between two homozygous parental lines (generation t = 0). I assume that the parental lines carry different alleles at each locus. At each backcross (BC) generation (BCt, t ≥ 1), only the genes carried by the chromosome inherited from the backcrossed parent are segregating. Thus, for simplicity I refer only to the haploid genotypes (haplotypes) on that chromosome and state that an individual "is of donor type" at a locus if this individual is in fact heterozygous donor/recipient at that locus or "is of recipient type" if in fact homozygous recipient/recipient at that locus.
I assume here that recombination is without interference and use the Haldane mapping function, giving the relationship between recombination rate r and corresponding map distance l as
$$r = \tfrac{1}{2}\left(1 - e^{-2l}\right) \tag{1}$$
where l is in morgans. For convenience, I use morgans throughout in the analytical derivations and convert into centimorgans in the numerical applications (in tables and figures).
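As a minimal sketch of the conversion in (1), the mapping function and its inverse can be written as below. The function names `haldane_r` and `haldane_l` are illustrative choices, not names from the article; distances are entered in centimorgans and converted to morgans, as in the numerical applications.

```python
import math

def haldane_r(l):
    """Recombination rate r for a map distance l in morgans (no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * l))

def haldane_l(r):
    """Inverse mapping: map distance in morgans for a recombination rate r < 0.5."""
    return -0.5 * math.log(1.0 - 2.0 * r)

# Distances are given in centimorgans and converted to morgans before use.
for cm in (5, 10, 20, 50):
    print(f"{cm} cM -> r = {haldane_r(cm / 100.0):.4f}")
```

For short distances r is close to l, whereas at 50 cM the recombination rate is already well below 0.5.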
Let T be the locus of the introgressed gene (or "target" gene). I assume that T is flanked by two marker loci M1 and M2, one on each side. At each generation individuals carrying the donor allele at locus T can be identified. If the introgressed gene is identified unambiguously by its phenotype, or by its genotype, or by the genotype of an intragenic marker (e.g., in the case of a transgene), then I define l1 (respectively l2) as the real distance from the introgressed locus T to the flanking marker M1 (respectively M2) on one side and L1 (respectively L2) as the real distance from the introgressed locus T to the chromosome end on the same side. If the introgressed gene is identified through markers (viz. "foreground selection markers," different from the "background selection markers" M1 and M2, and closer to the introgressed locus), then l1 (respectively l2) is the distance from the outermost foreground selection marker locus on one side of T to the background selection marker locus M1 (respectively M2) on the same side, and L1 (respectively L2) is the distance from this outermost foreground selection marker locus to the chromosome end on the same side. In any case, the only markers considered hereafter are the background selection markers M1 and M2. The efficiency of foreground selection was investigated elsewhere.
| SIZE OF DONOR CHROMOSOME SEGMENTS AROUND INTROGRESSED LOCI UNDER MARKER-ASSISTED SELECTION |
|---|
In this article, the efficiency of marker-assisted selection is evaluated by its ability to reduce the length of the intact donor chromosome segment dragged along around locus T (donor segment). Assuming no interference, recombination events on each side of the introgressed locus are independent and can be treated separately. For simplicity I consider in this section only the length of the donor segment on one side of the introgressed locus, where marker M (standing for either M1 or M2) is at distance l (standing for either l1 or l2) and the chromosome end at distance L (standing for either L1 or L2).
If the introgressed gene is identified unambiguously (see DEFINITIONS) as is assumed here, then the total length of the donor segment is simply the sum of lengths on both sides. If the introgressed gene is identified through foreground selection markers (see DEFINITIONS), then the donor genome within foreground selection should also be taken into account in the computation of total donor segment length. If foreground selection markers are located close to each other, as is generally recommended to provide a good control of the introgressed gene (![]()
The expected length of the donor segment, without background selection on markers, has been derived in earlier work.
Without background selection on the marker, let X(t) be the random variable corresponding to the length of intact donor segment on one side of the introgressed locus at generation BCt. From these earlier results, the PDF of X(t) for 0 ≤ x < L is
$$f_t(x) = t\,e^{-tx} \tag{2}$$
the mean of X(t) is
$$E_X(t) = \frac{1 - e^{-tL}}{t} \tag{3}$$
where L is the distance to the chromosome end, and the variance of X(t) is
$$\sigma_X^2(t) = \frac{1 - 2tL\,e^{-tL} - e^{-2tL}}{t^2} \tag{4}$$
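A short sketch of the unselected case follows. It assumes that, without background selection, the intact segment is the distance to the closest crossover over t meioses (exponential with rate t) truncated at the chromosome end L. The closed forms coded here follow from that assumption and should be read as a reconstruction, not as the article's exact typography; `mean_x` and `var_x` are illustrative names.

```python
import math

def mean_x(t, L):
    """Expected intact donor segment length (morgans) at BC_t, chromosome end at L."""
    return (1.0 - math.exp(-t * L)) / t

def var_x(t, L):
    """Variance of the intact donor segment length under the same assumption."""
    return (1.0 - 2.0 * t * L * math.exp(-t * L) - math.exp(-2.0 * t * L)) / t ** 2

L = 1.0  # chromosome end 100 cM away from locus T
for t in (1, 3, 5, 10):
    m, sd = mean_x(t, L), math.sqrt(var_x(t, L))
    print(f"BC{t}: mean = {100 * m:.1f} cM, sd = {100 * sd:.1f} cM")
```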
Here, I derive similar results in the case where background selection on the flanking marker M at distance l is applied. Define the following variables:
- t1, generation (BCt1) at which the marker is observed to be of recipient type;
- t2, generation (BCt2) at which the length of the donor segment is observed;
- Y(t1, t2), random variable, length of donor segment at t2 given that the marker is of recipient type at t1;
- gt1,t2(x), PDF of Y(t1, t2) at x;
- Y(t1) = Y(t1, t1);
- gt1(x) = gt1,t1(x);
- Y*(t1, t2), random variable, length of donor segment at t2 given that the marker is of recipient type "for the first time" at t1 (i.e., the marker was of donor type for t < t1);
- g*t1,t2(x), PDF of Y*(t1, t2) at x;
- PM(t1), probability that the marker is of recipient type at t1;
- P*M(t1), probability that the marker is of recipient type for the first time at t1;
- EY(t1, t2) and EY*(t1, t2), means of the random variables Y and Y*, respectively;
- σY(t1, t2) and σY*(t1, t2), standard deviations (i.e., square roots of the variances) of the random variables.
In the following I refer, if need be, either to the probability that a crossover occurs or to the probability that a recombination occurs. Under the assumption of no interference, the positions of crossovers along a chromosome follow a Poisson process. The probability that a crossover occurs in an infinitely small interval of size dx is equal to dx. The probability that no crossover occurs in an interval of size l is equal to $e^{-l}$, etc. The probability that a recombination occurs in an interval (strictly speaking, between two edges of an interval) of size l is given by (1). An odd number of crossovers occurring in an interval provides recombination, and an even number of crossovers provides no recombination.
The intact donor segment is bounded by locus T on one end and by the location of the closest crossover on the other end. If several successive generations are considered, then the latter is the closest crossover among all crossovers that took place at different generations. The generation at which this bounding crossover took place is denoted tco.
Single-generation information:
I first study the distribution of the length of the intact donor segment in the most simple situation where all the information available was obtained at a single generation t1. The corresponding random variable is Y(t1) = Y(t1, t1).
The probability that the marker is of recipient type at t1 is
$$P_M(t_1) = 1 - \left(1 - r_{[l]}\right)^{t_1} \tag{5}$$
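Let x be any chromosomal position between the locus T and the marker (0 < x < l). The probability that the intact donor segment has length x and that the marker is of recipient type at t1 is obtained by considering the intervals ]0, x] and ]x, l] in turn.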
In the interval ]0, x], for the length of the intact segment to be x two conditions are required: (i) At a given generation tco (1 ≤ tco ≤ t1), a crossover must have occurred exactly in the infinitely small interval ]x, x + dx] (probability dx) and no crossover must have occurred in the interval ]0, x] (probability $e^{-x}$); (ii) at any of the remaining generations t (1 ≤ t ≤ t1; t ≠ tco), no crossover must have occurred in the interval ]0, x] (probability $e^{-(t_1-1)x}$). The probability for the interval ]0, x] is then
$$e^{-t_1 x}\,dx \tag{6}$$
In the interval ]x, l], the only possibility for the marker not to be of recipient type at generation t1 is that a recombination occurred in the interval ]x, l] at exactly the same generation tco as above (probability $r_{[l-x]}$), and no recombination occurred in the interval ]x, l] at any of the (t1 - 1) remaining generations (probability $(1 - r_{[l-x]})^{t_1-1}$). For the interval ]x, l], the probability that the marker is of recipient type at generation t1 is then
$$1 - r_{[l-x]}\left(1 - r_{[l-x]}\right)^{t_1-1} \tag{7}$$
Another demonstration of (7) is provided in APPENDIX A.
Assuming no interference, the overall conditional probability is obtained by multiplying (6) by (7), combining for any tco ∈ [1, t1] (i.e., multiplying by t1), and finally dividing by (5). Mathematically speaking, this probability is Pr(x < Y ≤ x + dx). The PDF gt1(x) for Y(t1) is then simply obtained by differentiating with respect to x (i.e., dropping the term in dx):
$$g_{t_1}(x) = \frac{t_1\,e^{-t_1 x}\left[1 - r_{[l-x]}\left(1 - r_{[l-x]}\right)^{t_1-1}\right]}{1 - \left(1 - r_{[l]}\right)^{t_1}} \tag{8}$$
This demonstration is much simpler than those given in earlier work.
An example of the distribution of gt1(x) is given in Fig 1 for a marker located at distance l = 20 cM from locus T, at different BCt1 generations. It is seen that the distribution of gt1 is of decreasing exponential shape and skewed toward low x values. For a given marker position, both skewness (asymmetry) and kurtosis (peakedness) of the distributions increase in advanced backcross generations. For other marker positions (results not shown), the effect of the number of backcross generations on the shape of the distributions is the same, except that for a given backcross generation both skewness and kurtosis are reduced for markers closer to T. Conversely both skewness and kurtosis are increased for markers farther apart from locus T.
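The density (8) can be evaluated numerically as sketched below. The code assumes the form obtained by multiplying (6) by (7), multiplying by t1, and dividing by (5); the names `recomb`, `pdf_single`, and `simpson` are illustrative, and Simpson's rule stands in for whatever integration routine is preferred. Checking that the density integrates to 1 over ]0, l] is a simple sanity test of the derivation.

```python
import math

def recomb(d):
    """Haldane recombination rate for a distance d in morgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def pdf_single(x, l, t1):
    """Density of donor segment length x (0 < x < l), marker of recipient type at BC_t1."""
    r = recomb(l - x)
    joint = t1 * math.exp(-t1 * x) * (1.0 - r * (1.0 - r) ** (t1 - 1))
    return joint / (1.0 - (1.0 - recomb(l)) ** t1)

def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

l = 0.20  # marker at 20 cM from locus T
for t1 in (1, 3, 5, 10):
    area = simpson(lambda x: pdf_single(x, l, t1), 0.0, l)
    print(f"BC{t1}: g(0) = {pdf_single(0.0, l, t1):.3f}, total probability = {area:.6f}")
```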
The mean EY(t1) of Y(t1) is then simply obtained from (8) by integrating for x along the chromosome up to the marker position
$$E_Y(t_1) = \int_0^l x\,g_{t_1}(x)\,dx \tag{9}$$
which gives
![]() |
(10) |
with
![]() |
(11) |
The standard deviation σY(t1) of Y(t1) is obtained from
$$\sigma_Y^2(t_1) = \int_0^l x^2\,g_{t_1}(x)\,dx - E_Y(t_1)^2 \tag{12}$$
where gt1(x) is obtained from (8) and EY(t1) from (10). A closed formula for (12) could be derived as was done above for EY, but the expression would be more complex and barely useful since numerical results can now be obtained directly from (12) with a mathematical software package (for example, many numerical results given in this article were obtained using Mathematica). The standard deviation σY is generally of the same order as the mean EY, corresponding to quite large variances of donor segment length.
However, whereas the mean is always a meaningful parameter for any distribution, the standard deviation might not be the most appropriate parameter in the present case, given the shape of the PDF (Fig 1). With such distributions, parameters like quantiles are more appropriate. For example, I study the ninth decile, defined as the threshold $\delta$ such that
$$\int_0^{\delta} g_{t_1}(x)\,dx = 0.9 \tag{13}$$
In the tables below, ninth-decile values were computed by solving (13) numerically.
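The mean, standard deviation, and ninth decile can be obtained by numerical integration, as in the sketch below. It assumes the density form derived for (8); the function names are illustrative, Simpson's rule replaces the software package used in the article, and the decile is found by bisection on the cumulative distribution.

```python
import math

def recomb(d):
    """Haldane recombination rate for a distance d in morgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def pdf_single(x, l, t1):
    """Assumed form of (8): density of segment length x, marker recipient at BC_t1."""
    r = recomb(l - x)
    joint = t1 * math.exp(-t1 * x) * (1.0 - r * (1.0 - r) ** (t1 - 1))
    return joint / (1.0 - (1.0 - recomb(l)) ** t1)

def simpson(f, a, b, n=1000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def mean_and_sd(l, t1):
    """Mean and standard deviation of the segment length, by integrating the density."""
    mean = simpson(lambda x: x * pdf_single(x, l, t1), 0.0, l)
    second = simpson(lambda x: x * x * pdf_single(x, l, t1), 0.0, l)
    return mean, math.sqrt(second - mean * mean)

def ninth_decile(l, t1, level=0.9):
    """Bisection on the CDF: threshold below which a fraction `level` of lengths fall."""
    lo, hi = 0.0, l
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if simpson(lambda x: pdf_single(x, l, t1), 0.0, mid, 400) < level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for l_cm in (20, 50):
    for t1 in (1, 3):
        m, sd = mean_and_sd(l_cm / 100.0, t1)
        d9 = ninth_decile(l_cm / 100.0, t1)
        print(f"l = {l_cm} cM, BC{t1}: mean {100*m:.1f}, sd {100*sd:.1f}, decile {100*d9:.1f} cM")
```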
I also computed the PDF, mean, and variance of intact segment length in the case where the marker M is of donor type. The corresponding results are given in APPENDIX B. These results are not used in this article because I wish to focus on the optimization of marker-assisted selection, where the objective is that the marker is of recipient type. However, these results were derived for the sake of generality and could be useful in other contexts, e.g., to compute "graphical genotypes".
Numerical values for the expected length of the donor segment on one side, EY(t1), computed using (10), are plotted in Fig 2 as a function of marker position l for different BCt1 generations. Expected lengths of donor segment with no background selection (3) for the same generations are given in the graph for comparison, in the case of a chromosome end at distance 100 cM from locus T. Fig 2 shows that marker-assisted selection (solid lines) is obviously very efficient in reducing the length of the donor segment, compared to its expected value when no marker-assisted background selection is applied (dotted lines), except for markers far apart from locus T in late BC generations. Marker-assisted selection is more efficient as markers are closer to T. Note, however, that the values in Fig 2 are given for an individual that is known to be of recipient type at the marker. The probability of obtaining such an individual then depends on the population size. This is addressed later but obviously the closer the marker is to locus T, the larger the population size has to be.
The donor segment is bounded by the closest crossover position among all successive meioses. Thus, for a given marker position, the expected length of the donor segment should be smaller in advanced BC generations, because the accumulation of meioses can only reduce the segment length compared to its value in BC1. But Fig 2 shows that the number of backcross generations (t1) has a visible effect only for markers relatively far apart from locus T (l ≥ 20 cM). For markers closer to T (<20 or 10 cM), the expected length of donor segment in advanced backcross generations (BC5 or BC10) is not visibly reduced below its value in BC1, i.e., ≈l/2. Hence, it is likely that a relatively large portion of the unwanted donor genome will remain segregating in the backcross progenies, even after several generations of marker-assisted selection.
Another way to evaluate the effect of the number of backcross generations (t1) is to compute the minimum number of backcross generations needed to reduce the expected length of donor segment below a given threshold. This is done in Fig 3 where the threshold is expressed as a fraction c of the distance between T and the marker: minimal t1 values such that EY(t1) ≤ cl are given as a function of l for four values of the fraction c. This reinforces the conclusions drawn from Fig 2: In BC1, the expected segment length is approximately l/2 regardless of marker position. Reducing the expected segment length below l/2 requires unrealistically large numbers of backcross generations, unless the marker is quite far away from locus T.
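The search for the minimal number of backcross generations can be sketched as a simple loop over t1. The code assumes the density form derived for (8) and integrates it numerically; `min_generations`, the cap `t_max`, and the example values of c are illustrative assumptions, not the values plotted in Fig 3.

```python
import math

def recomb(d):
    """Haldane recombination rate for a distance d in morgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def pdf_single(x, l, t1):
    """Assumed form of (8): density of segment length x, marker recipient at BC_t1."""
    r = recomb(l - x)
    joint = t1 * math.exp(-t1 * x) * (1.0 - r * (1.0 - r) ** (t1 - 1))
    return joint / (1.0 - (1.0 - recomb(l)) ** t1)

def expected_length(l, t1, n=1000):
    """Mean segment length by composite Simpson integration of x * g(x) on [0, l]."""
    h = l / n
    f = lambda x: x * pdf_single(x, l, t1)
    total = f(0.0) + f(l)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(i * h)
    return total * h / 3.0

def min_generations(l, c, t_max=200):
    """Smallest t1 with E_Y(t1) <= c * l, or None if not reached within t_max."""
    for t1 in range(1, t_max + 1):
        if expected_length(l, t1) <= c * l:
            return t1
    return None

for l_cm in (10, 20, 50):
    for c in (0.5, 0.4):  # hypothetical thresholds, for illustration only
        print(f"l = {l_cm} cM, c = {c}: minimal t1 = {min_generations(l_cm / 100.0, c)}")
```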
I now investigate in more detail whether the accumulation of meioses could permit a better reduction of the donor segment length. If this has an important effect, it will provide an alternative to using closer markers at the expense of larger population sizes. The results in Fig 2 indicate that a moderate gain may be expected from advanced backcross generations, at least for distant markers. But, the only information considered so far in the calculations is that the marker is of recipient type at generation t1 and the donor segment length is observed at the same generation t1. This is not the most appropriate study of the effect of meiosis accumulation because, among the crossovers that take place between locus T and the marker, it does not permit one to distinguish between the (single) crossover that makes the marker return to recipient type and (possibly) other crossovers that would reduce donor segment length without affecting the genotype at the marker. To investigate this in more detail, slightly more complex situations must be considered.
The first situation considered is that when, after the (recipient) genotype of the marker has been observed at generation t1, the backcross breeding program is nevertheless pursued until generation t2 (t2 ≥ t1), and the length of donor segment is observed at t2. From generation t1 to t2, selection on the marker is no longer necessary, since the marker is then fixed for the recipient type allele. Only foreground selection for the introgressed gene is necessary. Still, additional gain in the reduction of the donor segment could be expected from crossovers taking place in this heterozygous part of the genome during meioses from t1 to t2. I now evaluate the amount of this possible additional gain.
The random variable corresponding to the length of donor segment at generation t2, given that the marker was observed to be of recipient type at t1 (t1 ≤ t2), is Y(t1, t2). The PDF gt1,t2 is calculated following the same rationale as above for gt1. Only the recombination events taking place at BC generations up to t1 affect the marker genotype, since the marker is fixed for recipient type after t1.
The probability that the marker is of recipient type at t1 is given by (5). For the interval ]0, x], the probability is simply $e^{-t_2 x}\,dx$, similar to (6). For the interval ]x, l], two cases must be considered. If the crossover in the infinitely small interval ]x, x + dx] took place before t1 (1 ≤ tco ≤ t1; t1 possibilities), then the probability for the interval ]x, l] is the same as for gt1 in (7). If the crossover in the infinitely small interval ]x, x + dx] took place after t1 (t1 < tco ≤ t2; t2 - t1 possibilities) then, for the marker to be of recipient type at t1, at least one recombination must have occurred in the interval ]x, l] at some BC generation before t1 [probability $1 - (1 - r_{[l-x]})^{t_1}$].
Finally, the PDF gt1,t2 for Y (t1, t2) is
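$$g_{t_1,t_2}(x) = \frac{e^{-t_2 x}\left\{t_1\left[1 - r_{[l-x]}\left(1 - r_{[l-x]}\right)^{t_1-1}\right] + (t_2 - t_1)\left[1 - \left(1 - r_{[l-x]}\right)^{t_1}\right]\right\}}{1 - \left(1 - r_{[l]}\right)^{t_1}} \tag{14}$$
The corresponding mean EY(t1, t2), variance σ²Y(t1, t2), and ninth decile are defined similarly to (9), (12), and (13), respectively. These were computed numerically.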
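A sketch of the two-stage density follows, assuming the form assembled from the two cases described for (14): bounding crossover up to t1, or after t1. Names are illustrative and the integration is a plain Simpson rule. With a marker at 50 cM it gives expected lengths close to the values quoted from Table 1 (about 18.1 cM for t1 = 1, t2 = 3 and 19.4 cM for t1 = t2 = 3).

```python
import math

def recomb(d):
    """Haldane recombination rate for a distance d in morgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def pdf_two_stage(x, l, t1, t2):
    """Assumed form of (14): segment length at BC_t2, marker of recipient type at BC_t1."""
    r = recomb(l - x)
    early = t1 * (1.0 - r * (1.0 - r) ** (t1 - 1))   # bounding crossover at t_co <= t1
    late = (t2 - t1) * (1.0 - (1.0 - r) ** t1)       # bounding crossover at t_co > t1
    return math.exp(-t2 * x) * (early + late) / (1.0 - (1.0 - recomb(l)) ** t1)

def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def mean_two_stage(l, t1, t2):
    """Expected donor segment length (morgans) on one side of locus T."""
    return simpson(lambda x: x * pdf_two_stage(x, l, t1, t2), 0.0, l)

l = 0.50  # marker at 50 cM
for t1, t2 in ((1, 1), (1, 3), (3, 3)):
    print(f"t1 = {t1}, t2 = {t2}: E_Y = {100 * mean_two_stage(l, t1, t2):.1f} cM")
```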
Multiple-generation information:
So far, the information considered in the derivations for Y(t1) or Y(t1, t2) was that a recombination occurred between T and M at some generation t ∈ [1, t1], but the exact generation at which this recombination took place was considered as unknown. However, in practical situations, continuous selection on background markers is applied, as is generally recommended. In such cases, the marker genotypes are observed at every generation t ∈ [1, t1], to pick out recombinant individuals as soon as possible. The exact generation at which the recombination took place between T and M is then known. To evaluate the efficiency of marker-assisted selection in such situations, different events and associated probabilities must be considered. The corresponding variables are indicated by parameters with asterisks.
The probability that the marker is of recipient type for the first time at generation t1 (i.e., the marker is of donor type for any t < t1) is
$$P^*_M(t_1) = r_{[l]}\left(1 - r_{[l]}\right)^{t_1-1} \tag{15}$$
The random variable corresponding to the length of the donor segment at generation t2, given that the marker was of recipient type for the first time at generation t1 (t1 ≤ t2), is Y*(t1, t2). Note that obviously Y and Y* are identical at t1 = 1. Following the same rationale as above, we need to focus only on recombination events in the chromosomal segment ]x, l]. Three cases must be considered, tco being the generation at which the crossover bounding the donor segment has occurred.
If tco < t1 (t1 - 1 possibilities), then at generations t (t < t1; t ≠ tco), for the marker to be of donor type, no recombination must have occurred in the interval ]x, l]. At generation t = tco, the crossover occurred in the infinitely small interval ]x, x + dx] so, for the marker to remain of donor type, a recombination must have occurred in the interval ]x, l]. At generation t = t1, for the marker to be of recipient type, a recombination must have occurred in the interval ]x, l].
If tco = t1 (one possibility), then at generations t (t < t1; t < tco), for the marker to be of donor type, no recombination must have occurred in the interval ]x, l]. At generation t = tco = t1, the crossover occurred in the infinitely small interval ]x, x + dx] so, for the marker to be of recipient type, no recombination must have occurred in the interval ]x, l].
Finally, if tco > t1 (t2 - t1 possibilities), then at generations t (t < t1), for the marker to be of donor type, no recombination must have occurred in the interval ]x, l]. At generation t = t1, for the marker to be of recipient type, a recombination must have occurred in the interval ]x, l].
Combining for all possible tco values, the PDF for Y*(t1, t2) is then
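$$g^*_{t_1,t_2}(x) = \frac{e^{-t_2 x}\left[(t_1 - 1)\,r_{[l-x]}^2\left(1 - r_{[l-x]}\right)^{t_1-2} + \left(1 - r_{[l-x]}\right)^{t_1} + (t_2 - t_1)\,r_{[l-x]}\left(1 - r_{[l-x]}\right)^{t_1-1}\right]}{r_{[l]}\left(1 - r_{[l]}\right)^{t_1-1}} \tag{16}$$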
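The corresponding mean EY*(t1, t2), variance σ²Y*(t1, t2), and ninth decile for Y*(t1, t2) are defined as before and were computed numerically. Note, however, that when the marker genotype and the length of donor segment are observed at the same generation (t1 = t2) the PDF of Y*(t1) = Y*(t1, t1) is
$$g^*_{t_1}(x) = \frac{e^{-t_1 x}\left[(t_1 - 1)\,r_{[l-x]}^2\left(1 - r_{[l-x]}\right)^{t_1-2} + \left(1 - r_{[l-x]}\right)^{t_1}\right]}{r_{[l]}\left(1 - r_{[l]}\right)^{t_1-1}} \tag{17}$$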
and, in that particular case, the mean simplifies to
$$E_{Y^*}(t_1) = \frac{\coth(l)}{t_1}\left[1 - \operatorname{sech}^{t_1}(l)\right] \tag{18}$$
where coth and sech are the hyperbolic cotangent and the hyperbolic secant functions, respectively.
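As a check on the simplification, the sketch below compares a closed form in coth and sech with direct numerical integration of the t1 = t2 density. Both the density and the closed form coded here are reconstructions from the three cases described in the text (they are assumptions in that sense), and the function names are illustrative. For a marker at 50 cM the closed form gives about 24.5 cM at t1 = 1 and 21.8 cM at t1 = 3, in line with the Table 1 values quoted in the numerical applications.

```python
import math

def recomb(d):
    """Haldane recombination rate for a distance d in morgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def pdf_first_time(x, l, t1):
    """Assumed density of segment length x when the marker first turns recipient at BC_t1."""
    r = recomb(l - x)
    q = 1.0 - r
    num = math.exp(-t1 * x) * ((t1 - 1) * r * r * q ** (t1 - 2) + q ** t1)
    rl = recomb(l)
    return num / (rl * (1.0 - rl) ** (t1 - 1))

def mean_first_time_closed(l, t1):
    """Closed form: coth(l) / t1 * (1 - sech(l)^t1), with l in morgans."""
    coth = 1.0 / math.tanh(l)
    sech = 1.0 / math.cosh(l)
    return coth / t1 * (1.0 - sech ** t1)

def mean_first_time_numeric(l, t1, n=2000):
    """Same mean by composite Simpson integration of x * g*(x) on [0, l]."""
    h = l / n
    f = lambda x: x * pdf_first_time(x, l, t1)
    total = f(0.0) + f(l)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(i * h)
    return total * h / 3.0

for t1 in (1, 2, 3, 5, 10):
    closed = 100 * mean_first_time_closed(0.5, t1)
    numeric = 100 * mean_first_time_numeric(0.5, t1)
    print(f"t1 = {t1}: closed form {closed:.2f} cM, numerical {numeric:.2f} cM")
```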
Generally, the following relationship holds between g and g*,
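$$g_{t_1,t_2}(x) = \frac{1}{P_M(t_1)}\sum_{t=1}^{t_1} P^*_M(t)\,g^*_{t,t_2}(x) \tag{19}$$
giving
$$E_Y(t_1,t_2) = \frac{1}{P_M(t_1)}\sum_{t=1}^{t_1} P^*_M(t)\,E_{Y^*}(t,t_2) \tag{20}$$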
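The relationship between g and g* can be checked numerically. The sketch below assumes that g is the mixture of the g* densities over the generation at which the marker first became of recipient type, weighted by the first-time probabilities and normalized by the probability of being of recipient type by t1. The densities are the reconstructed forms of (14) and (16), and all function names are illustrative.

```python
import math

def recomb(d):
    """Haldane recombination rate for a distance d in morgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def p_any(l, t1):
    """Probability that the marker is of recipient type at BC_t1."""
    return 1.0 - (1.0 - recomb(l)) ** t1

def p_first(l, t1):
    """Probability that the marker is of recipient type for the first time at BC_t1."""
    rl = recomb(l)
    return rl * (1.0 - rl) ** (t1 - 1)

def g(x, l, t1, t2):
    """Assumed form of (14)."""
    r = recomb(l - x)
    inner = t1 * (1.0 - r * (1.0 - r) ** (t1 - 1)) + (t2 - t1) * (1.0 - (1.0 - r) ** t1)
    return math.exp(-t2 * x) * inner / p_any(l, t1)

def g_star(x, l, t1, t2):
    """Assumed form of (16)."""
    r = recomb(l - x)
    q = 1.0 - r
    inner = (t1 - 1) * r * r * q ** (t1 - 2) + q ** t1 + (t2 - t1) * r * q ** (t1 - 1)
    return math.exp(-t2 * x) * inner / p_first(l, t1)

def g_from_mixture(x, l, t1, t2):
    """Mixture of g* over the first-recombination generation t = 1..t1."""
    total = sum(p_first(l, t) * g_star(x, l, t, t2) for t in range(1, t1 + 1))
    return total / p_any(l, t1)

l, t1, t2 = 0.20, 3, 5
for x in (0.01, 0.05, 0.10, 0.19):
    print(f"x = {x:.2f}: g = {g(x, l, t1, t2):.6f}, mixture = {g_from_mixture(x, l, t1, t2):.6f}")
```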
Numerical applications:
Numerical applications of the above derivations for the mean (E) and ninth decile of Y*(t1, t2) and Y(t1, t2) are given in Table 1 for a marker distance of 50 cM from the introgressed gene and in Table 2 for a distance of 20 cM.
Setting the marker position at 50 cM (Table 1) is not the most realistic situation. However, it makes it easier to study the effects of the various parameters, because the results are more contrasting than for a shorter marker distance. It is thus given as an illustrative example. Results for a more realistic marker position (20 cM) are also provided (Table 2).
The results in Table 1 for multiple-generation information (Y*(t1, t2)) highlight the effects of the two parameters: t1, the BC generation at which the recombination occurred between the locus T and the marker, and t2, the BC generation at which the donor segment length is observed (or the total duration of the backcross program). Note that t1 and t2 values shown in the tables are not continuous (1, 2, 3, 5, 10).
The results for Y* in Table 1 at t1 = t2 indicate that the expected donor segment length is shorter in the case of a later recombination between the locus T and the marker. For example, if the marker is of recipient type for the first time in BC1 and the donor segment is also observed in BC1 (t1 = t2 = 1), then the expected segment length is 24.5 cM (Table 1), while if the marker is of recipient type for the first time in BC3 only and the segment length is also observed in BC3 (t1 = t2 = 3), then this length is 21.8 cM (Table 1). This is so because in the case of a distant marker (l = 50 cM in Table 1) double crossover events between T and the marker may occur at relatively high frequency before the recombination between T and the marker takes place (such double crossovers reduce donor segment length while the marker remains of donor type). Obviously, the frequency of double crossovers is much lower in the case of a shorter marker distance (e.g., 20 cM, Table 2); thus values for Y* at t1 = t2 in Table 2 are then hardly reduced with increasing t1.
However, the results in Table 1 indicate that the expected donor segment length is better reduced by crossovers occurring after the recombination between locus T and the marker occurred (i.e., after the marker returned to recipient type). For example, if segment length is again observed in BC3, but it is known that the marker was already of recipient type since generation BC1 (i.e., two additional BC generations were performed after the marker returned to recipient type), then the expected length of the donor segment is 18.1 cM (t1 = 1, t2 = 3, Table 1), compared to 21.8 cM for t1 = 3. There is also a gain on the ninth decile (38.7 vs. 43.4 cM). In any case, the gain obtained by forcing early recombination between T and the marker is of moderate importance and would be obtained at the expense of increased population sizes (see next section).
Moreover, results for multiple-generation information (Y* values) are relevant to evaluate a posteriori the efficiency of selection once the BC program is completed and the genotypes of the individuals selected at the various generations are known. It is not relevant to the a priori design of a program before it is started, because it would not make sense to design a program such that the recombination between the locus T and the marker takes place, for example, in BC3 (t1 = 3) and not in BC2 or BC1. What makes sense is to allow a double recombinant to be selected as soon as possible, while keeping population sizes within affordable limits. To do so, one would design a program such that recombination between the locus T and the marker M takes place by a given generation. In other words, to keep with the above example one would require that recombination take place in BC3 or at any previous generation. In this case, single-generation information (Y) is relevant to predict a priori the efficiency of such a program.
The results for single-generation information (Y) in Table 1 indicate that the reduction of donor segment length obtained by forcing early recombination between T and the marker (t1 < t2) is even less important in the context of such an a priori prediction. For example, for a BC program designed to last at most three BC generations, if no other information is available (i.e., the recombination between T and the marker took place in BC3 or in any earlier generation), then the expected donor segment length is 19.4 cM (t1 = t2 = 3, Table 1). If it is known that the recombination took place in BC1, and two additional BC generations were performed afterward, then the expected donor segment length in BC3 is 18.1 cM as before (t1 = 1; t2 = 3, Table 1). Hence, the gain in the reduction of donor segment length provided by forcing early recombination is only 1.3 cM on average for a marker distance of 50 cM.
Additional BC generations also have little impact on the distribution of Y values, as indicated by the ninth-decile values in Table 1: If no additional information is available, at t1 = t2 = 3, 90% of Y values are <40.8 cM, while if it is known that the recombination took place in BC1, at t1 = 1 and t2 = 3, 90% of Y values are <38.7 cM; i.e., the gain is only 2.1 cM.
For a more realistic marker position (l = 20 cM, Table 2), the numerical results indicate that the gain in the reduction of donor segment length, provided by forcing early recombination between T and M, is even smaller than for a distant marker. For example, for a program designed to last at most three BC generations, forcing the recombination between the gene and the marker to take place in BC1 would provide a gain of only 9.2 - 8.8 = 0.4 cM (t2 = 3, Table 2).
For even closer marker positions (l ≤ 10 cM; results not shown), this gain tends to zero and the results for Y values at t1 < t2 are then hardly different from Y values at t1 = t2. Hence, the results for these situations can be simply taken from Fig 2.
As a conclusion to this section, in theory an additional gain in the reduction of donor segment length is always expected from allowing additional backcrosses even after a recombination between the locus T and the marker is obtained. But, the amount of this gain depends on the distance between the locus T and the marker and tends to zero for realistic (short) marker distances (l < 20 cM). For such short distances, the BC generation at which the recombination between the locus T and the marker takes place (t1) has little impact on donor segment length. Moreover, at short marker distances, the total duration of the program (t2) also has little impact on donor segment length. Hence, in such cases, the donor segment length mostly depends on the position of the marker, and not on the number of BC generations performed. Overall, it is generally more efficient to use closer markers (reduce l), than to allow additional BC generations for more distant markers. For short marker distances, reducing l has more impact on the reduction of donor segment length than increasing t2 (as was seen from Fig 2) or increasing t2 - t1 (see Table 2). However, it is important to note that the above results on the length of donor segment were derived conditionally on obtaining a recombinant genotype for either or both markers. Thus, the probability of obtaining such a recombinant was not taken into account in the optimization of the BC scheme. Yet, the number of BC generations does affect genotype probabilities in conjunction with the population size. Hence, the number of BC generations should be optimized with respect to the population sizes needed for obtaining double recombinants. This is addressed in the following section.
| MINIMAL POPULATION SIZES |
|---|
In the previous section I studied the length of donor segment among genotypes that are recombinant for either M1 or M2. Here, I focus on the probability of obtaining an individual that is double recombinant for markers M1 and M2 on both sides of locus T. In that case, even when assuming no interference, recombination on both sides of T cannot be treated separately.
In one single BC generation, the probability of double recombination is easy to calculate from the product of the probabilities of single recombinations on both sides of T. But, as noted by ![]()
![]()
A mathematical solution to this problem was first provided by ![]()
Let r1 and r2 be the recombination rates corresponding to distances l1 and l2, respectively. Without loss of generality, I assume hereafter l1 ≤ l2. The alleles at each locus are noted "0" for donor-type allele, and "1" for recipient type. Since the genotypes carrying a donor allele at locus T are the most interesting, I define only five genotypic classes at loci M1TM2: G1 = 101, G2 = 100, G3 = 001, G4 = 000, and G5 = *1*, the latter referring to the four possible genotypes carrying a recipient allele at locus T, regardless of the markers.
At each generation t a total of nt individuals (backcross progenies) are first screened for the presence of the donor allele at locus T and then possibly for the presence of the recipient alleles at markers M1 and/or M2. If no carrier of the donor allele at locus T is found in the population, then the backcross scheme is interrupted (failure). If one or more carriers are found, then among those a single individual is selected on the basis of its genotype at markers in the following order of priority: (1) G1 (double recombinant); (2) G2 (single recombinant); (3) G3 (single recombinant); or (4) G4 (nonrecombinant). Note that G2 is selected prior to G3 because I assume l1 ≤ l2. The selected individual is then backcrossed to the recipient parental line to provide the next BC generation.
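The screening and selection rule just described can be sketched in a few lines. This is only an illustration: the function names `classify` and `select_individual` and the string encoding of genotypes are hypothetical and are not taken from the article.

```python
# Marker strings are read as M1-T-M2, with '0' = donor allele, '1' = recipient.
CLASSES = {"101": "G1", "100": "G2", "001": "G3", "000": "G4"}

# Order of priority stated in the text: G1 > G2 > G3 > G4.
PRIORITY = ["G1", "G2", "G3", "G4"]


def classify(haplotype):
    """Map an M1-T-M2 string to its genotypic class (hypothetical helper)."""
    # Any string with a recipient allele at T falls into the pooled class G5.
    return CLASSES.get(haplotype, "G5")


def select_individual(classes):
    """Return the class of the individual to backcross, or None on failure.

    `classes` holds one class label per screened progeny.
    """
    carriers = set(classes) - {"G5"}
    for g in PRIORITY:
        if g in carriers:
            return g
    return None  # no carrier of the donor allele at T: scheme interrupted


progeny = [classify(h) for h in ["111", "000", "100", "001", "011"]]
print(select_individual(progeny))
```

In this toy population no G1 is present, so the G2 individual is preferred over the G3 and G4 carriers.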
If the introgressed gene is identified unambiguously (see DEFINITIONS), then the probability of transmission to a backcross progeny of the donor allele at locus T is P = 1/2. This is assumed here in the numerical applications, but, for the sake of generality, I keep the literal P in the theoretical derivations. If the introgressed gene is identified by flanking markers (foreground selection markers; see DEFINITIONS), then the probability of transmission of the donor allele at locus T is <1/2 and must be calculated from the probability of transmission of those foreground selection markers (see an example in ![]()
![]()
![]()
For given markers' positions and total duration of the breeding scheme, the total cost of the experiment, which we want to minimize, depends directly on population sizes at each generation. In this context, optimal population sizes are then simply the minimal population sizes necessary to achieve the experiment successfully. I now calculate such minimal population sizes.
Analytical derivations:
The recursion equations of ![]()
t2 are computed such that this final probability is above a given threshold (99%). These calculations need slight improvements.
The recursions of ![]()
l2). Hence, under strategy A, the recursions should be computed as follows.
Let ht be the column vector of the frequencies at generation t of the five genotypic classes Gi defined above. These frequencies are given by
![]() |
(21) |
with
![]() |
(22) |
and
![]() |
(23) |
Let at[i] be the probability that with strategy A the individual selected at generation t is of genotype Gi. Let a't = {at[i]}, 1 ≤ i ≤ 5, be the vector of these probabilities. We have
![]() |
(24) |
with
![]() |
(25) |
The element At[i, j] of recursion matrix At at line i and column j is obtained by
![]() |
(26) |
where H[k, j] is the element of matrix H at line k and column j.
Finally,
|
(27) |
with the notation s12 = r1 + r2 - r1r2 introduced just to save space.
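A minimal numerical sketch of the strategy A recursion follows. It rests on assumptions that should be checked against (21)-(27) before use: the entries of `H` are derived here from the stated definitions (backcross transmission with probability P, no interference, classes G1 to G5) and are not copied from the article's equations; the selection matrix follows from the priority rule G1 > G2 > G3 > G4 applied to n independent progeny; and the Haldane mapping function is assumed for converting cM to recombination rates. All function names are illustrative.

```python
import math
import numpy as np


def haldane(l_cm):
    """Recombination rate for a map distance in cM (assumed Haldane function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * l_cm / 100.0))


def transition_matrix(r1, r2, P=0.5):
    """H[k, j]: frequency of class G(k+1) among progeny of a G(j+1) parent."""
    s12 = r1 + r2 - r1 * r2  # notation used in the text
    return np.array([
        # parent: G1    G2            G3            G4                 G5
        [P,     P * r2,       P * r1,       P * r1 * r2,       0.0],  # G1
        [0.0,   P * (1 - r2), 0.0,          P * r1 * (1 - r2), 0.0],  # G2
        [0.0,   0.0,          P * (1 - r1), P * (1 - r1) * r2, 0.0],  # G3
        [0.0,   0.0,          0.0,          P * (1 - s12),     0.0],  # G4
        [1 - P, 1 - P,        1 - P,        1 - P,             1.0],  # G5
    ])


def selection_matrix(H, n):
    """A[i, j]: probability that the selected individual is G(i+1), given a
    G(j+1) parent and n progeny screened with priority G1 > G2 > G3 > G4."""
    cum = np.clip(np.cumsum(H, axis=0), 0.0, 1.0)  # sum over k <= i of H[k, j]
    prev = np.vstack([np.zeros(5), cum[:-1]])      # sum over k <  i of H[k, j]
    # "No progeny of higher priority" minus "no progeny of this or higher priority".
    return (1.0 - prev) ** n - (1.0 - cum) ** n


def strategy_a(r1, r2, sizes, P=0.5):
    """Class probabilities of the selected individual after len(sizes) generations."""
    H = transition_matrix(r1, r2, P)
    a = np.array([0.0, 0.0, 0.0, 1.0, 0.0])  # the F1 parent carries 000 (G4)
    for n in sizes:
        a = selection_matrix(H, n) @ a
    return a


r1, r2 = haldane(5.0), haldane(10.0)
print(strategy_a(r1, r2, [50, 50, 50]))
```

Each column of `H` and of the selection matrix sums to one, so the returned vector is a probability distribution over G1 to G5; its first element is the probability that the individual selected in the last generation is a double recombinant.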
In the case of constant population sizes (nt = n, ∀t), the elements of vector at can be obtained directly as a function of n and t after transformation of matrix At to diagonal form (see an example in ![]()
![]()
Besides other possible considerations (e.g., selection on noncarrier chromosomes), the results of the previous section on segment length indicate that, once the marker is of recipient type, little gain on the reduction of the donor segment is expected from performing additional backcross generations, in particular for close markers. Hence, strategy A might not be the most realistic. I then consider a slightly different strategy (strategy B), where, if a double recombinant G1 is found at a given generation t < t2, then the backcross program is interrupted (success) rather than pursued until the initially defined generation t2. Also, the process is interrupted (failure) if no carrier of the introgressed gene is found in the population at any t < t2.
For strategy B, I define a new vector of probabilities b't = {bt[i]}, 1 ≤ i ≤ 5, such that bt[1] = βt is the probability that an individual of genotype G1 is selected at generation t but not at any previous generation; βt is then the probability of success at generation t. Conversely, bt[5] is the probability that no carrier of the introgressed gene is found in the population at generation t, i.e., the probability of failure of the BC scheme at generation t. Finally, bt[i] for 2 ≤ i ≤ 4 is the probability that the individual selected at generation t is of genotype Gi, given that the individual selected at the previous generation is of genotype Gj (2 ≤ j ≤ 4).
Again, we have
![]() |
(28) |
with
![]() |
(29) |
The recursion matrix Bt is identical to At, except that the first and last columns are set to 0:
![]() |
(30) |
Note that the events corresponding to the vector of probabilities bt do not constitute a complete set of events for t2 > 1. Rather,
![]() |
(31) |
For strategy B, the overall probability of success at generation t2 is
![]() |
(32) |
Correspondingly, the mean total number of individuals that need to be genotyped, given that the BC scheme is successful at the latest in generation t2, was computed as
![]() |
(33) |
Following the rationale of ![]()
of individuals genotyped during the BC scheme is minimal:
![]() |
(34) |
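The strategy B recursion and the constant-size search can be sketched as follows. Several points are assumptions rather than the article's own formulas: the transition frequencies are derived from the stated definitions (no interference, transmission probability P), the Haldane function converts cM to recombination rates, `mean_total_genotyped` is one plausible reading of (33) (cumulative numbers genotyped, weighted by the probability of first success at each generation), and the bisection assumes that the overall success probability does not decrease with n. Function names are illustrative.

```python
import math
import numpy as np


def haldane(l_cm):
    """Recombination rate for a map distance in cM (assumed Haldane function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * l_cm / 100.0))


def selection_matrix(r1, r2, n, P=0.5):
    """Probability that the selected individual is G(i+1) given a G(j+1) parent."""
    H = np.array([
        [P,     P * r2,       P * r1,       P * r1 * r2,             0.0],
        [0.0,   P * (1 - r2), 0.0,          P * r1 * (1 - r2),       0.0],
        [0.0,   0.0,          P * (1 - r1), P * (1 - r1) * r2,       0.0],
        [0.0,   0.0,          0.0,          P * (1 - r1) * (1 - r2), 0.0],
        [1 - P, 1 - P,        1 - P,        1 - P,                   1.0],
    ])
    cum = np.clip(np.cumsum(H, axis=0), 0.0, 1.0)
    prev = np.vstack([np.zeros(5), cum[:-1]])
    return (1.0 - prev) ** n - (1.0 - cum) ** n


def strategy_b(r1, r2, sizes, P=0.5):
    """Per-generation probabilities of first success (beta) and of failure."""
    b = np.array([0.0, 0.0, 0.0, 1.0, 0.0])  # the F1 parent carries 000 (G4)
    beta, fail = [], []
    for n in sizes:
        B = selection_matrix(r1, r2, n, P)
        B[:, 0] = 0.0  # program stops after a success: G1 is never a parent
        B[:, 4] = 0.0  # program stops after a failure
        b = B @ b
        beta.append(b[0])
        fail.append(b[4])
    return np.array(beta), np.array(fail)


def mean_total_genotyped(beta, sizes):
    """Assumed reading of (33): mean cumulative number genotyped, given success."""
    return float((beta * np.cumsum(sizes)).sum() / beta.sum())


def min_constant_size(r1, r2, t2, target=0.99, P=0.5):
    """Smallest constant n whose overall success probability reaches target."""
    def ok(n):
        return strategy_b(r1, r2, [n] * t2, P)[0].sum() >= target
    hi = 1
    while not ok(hi):      # doubling phase
        hi *= 2
    lo = hi // 2           # ok(lo) is False, or lo == 0
    while hi - lo > 1:     # bisection, assuming ok() is monotone in n
        mid = (lo + hi) // 2
        if ok(mid):
            hi = mid
        else:
            lo = mid
    return hi


r = haldane(20.0)
for t2 in (1, 2, 3):
    n = min_constant_size(r, r, t2)
    beta, _ = strategy_b(r, r, [n] * t2)
    print(t2, n, round(mean_total_genotyped(beta, [n] * t2), 1))
```

Zeroing the first and last columns of the selection matrix is what makes success and failure absorbing, which is why the elements of the resulting vector no longer sum to one after the first generation.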
Numerical applications:
Optimal population sizes nt for strategy B were computed numerically to fulfill both conditions in (34), i.e., to find the population sizes that minimize the mean total number of individuals genotyped, as defined in (33), while keeping above 99% the probability that a double recombinant is obtained at the latest in generation t2. Such a computation is easy when population sizes are kept constant across BC generations (nt = n, ∀t). Yet, ![]()
is achieved when allowing different population sizes at different BC generations. In that case, finding the set of values {nt}, t ∈ [1, t2], that satisfy (34) is more difficult, in particular for t2 > 2. A computer program (F. HOSPITAL and G. DECOUX, unpublished data) was designed for the numerical optimization of population sizes in this case, using the "simulated annealing" algorithm (![]()
The mean total number of individuals defined in (33) takes into account only the cases where the BC scheme is successful. With the population sizes in Table 3 and Table 4, the probabilities bt[5] of failure of the BC scheme at any generation t (no carrier of the introgressed gene in the population) are close to zero. Hence, the probability of nonsuccess at t2 [1 - St2 ≤ 1% with the conditions of (34)] is mostly the probability of obtaining only a genotype G2, G3, or G4 (single- or nonrecombinant) at t2. In that case, the BC scheme has not failed, but simply needs to be pursued for one or more additional BC generations.
For a single-generation program (t2 = 1, Table 3), population sizes become very large for short marker distances (l < 10 cM), which are the most relevant since using close markers is the best way to reduce donor segment length (see previous section). It is then better to perform at least two BC generations, because the probability of success in BC2 (β2) is always higher than the probability of success in BC1, except for distant markers (l > 20 cM). Performing two BC generations with constant population sizes permits a drastic reduction of the total number of individuals that have to be genotyped, which confirms the intuition of ![]()
![]()
by slightly increasing β2 with respect to β1. In this case, population size in BC2 needs to be higher than in BC1. This is generally the case for any total duration of the BC scheme (t2): Optimal values for variable population sizes should increase in advanced BC generations, except for distant markers and high t2 values (e.g., t2 = 10, not shown).
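As a rough check on why single-generation programs are so costly for close markers, the case t2 = 1 has a simple closed form if one assumes that each of the n progeny is, independently, a double recombinant carrier with probability P r1 r2 (no interference). The numerical value below further assumes Haldane's mapping function and l1 = l2 = 5 cM; it is an order-of-magnitude illustration under these assumptions and is not taken from Table 3.

```latex
\begin{align*}
1 - (1 - P r_1 r_2)^{n} \ge 0.99
  \quad &\Longleftrightarrow \quad
  n \ge \frac{\ln(0.01)}{\ln(1 - P r_1 r_2)} \\
% Assumed example: l_1 = l_2 = 5 cM, Haldane mapping function
r_1 = r_2 = \tfrac{1}{2}\left(1 - e^{-0.1}\right) &\approx 0.0476,
  \qquad P r_1 r_2 \approx 0.5 \times 0.00226 \approx 1.13 \times 10^{-3} \\
n &\gtrsim \frac{4.605}{1.13 \times 10^{-3}} \approx 4.1 \times 10^{3}
\end{align*}
```

Because P r1 r2 scales with the product of the two recombination rates, halving both marker distances roughly quadruples this bound, whereas in a multigeneration program each generation only needs a single recombination on one side.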
Allowing BC schemes to last possibly more than two generations permits an even further reduction of the mean total number of individuals, though this gain is then less important than the gain from t2 = 1 to t2 = 2. The gain is nevertheless economically important, especially for close markers. Increasing total duration of the BC scheme t2 from 2 to 3 with either constant or variable population sizes provides a gain of about 50% on the mean total number of individuals for l ≤ 20 cM, which may correspond to hundreds fewer individuals. It is worth noting that, with constant population sizes, this gain is barely obtained at the expense of a lower probability of success in early generations: the probability of success in at most two generations (β1 + β2) with constant population sizes is close to 90% for t2 = 3 (Table 3), compared with 99% for t2 = 2 with much larger population sizes (about double). Again, using variable population sizes for t2 = 3 permits a further reduction of the mean total number of individuals (about 100 individuals for l ≤ 5 cM). But, in this case, the reduction is obtained at the expense of a reduced probability of success in early generations. A decision must then be made between reduction of costs and reduction of duration of the breeding scheme, which is a matter for economic consideration not taken into account here (see also below). However, (β1 + β2) with variable population sizes is still close to 75% for t2 = 3 (Table 3).
The same tendencies are observed for even longer durations of the BC breeding scheme (e.g., t2 = 5, Table 4). Population sizes are always reduced, and even more reduced for variable than for constant values, except for distant markers. Note that for very distant markers and/or very long durations (e.g., t2 = 10, not shown), optimal population sizes are not reduced below a given threshold, because in those cases, while the probability of double recombinations increases, the probability of failure ("losing" the introgressed gene) becomes the most critical factor. Again, with increased duration of the program, the probability of success in early BC generations with either constant or variable population sizes decreases with respect to the probability of success in advanced generations. But, the important conclusion is that experimental costs are drastically reduced, even for very close markers. For t2 = 5 (Table 4) and variable population sizes, a mean total of fewer than 500 individuals needs to be genotyped for flanking markers as close as 1 cM on each side of the introgressed gene. Moreover, using the optimal population sizes defined in Table 4 for the same marker distance of 1 cM and applying (33) for the first three generations only (BC1 to BC3) shows that the mean total number of individuals is then <210, with a corresponding probability of success of 65%. For l = 5 cM and t2 = 5, the mean total number of individuals is only 100 (Table 4).
The recursion equations of ![]()
![]()
![]()
![]()
Moreover, the present a priori approach provides an objective criterion to determine the number of individuals that have to be genotyped in the sequential approach, when the population size needed to obtain a double recombinant G1 at one given generation is too large. Note that the calculation of optimal population sizes at any intermediate stage of an already started BC scheme is also possible using our computer program (F. HOSPITAL and G. DECOUX, unpublished data).
Other selection scenarios:
The calculations in this section were derived within the framework of a breeding scenario where (i) only one individual is selected at each generation (on the basis of its genotype at flanking markers) and (ii) only one pair of flanking markers is considered.
Condition (i) is a limitation of the results on minimal population sizes, in particular for their application to breeding schemes where several individuals have to be selected at each generation (e.g., animals with low fecundity). Note, however, that minimal population sizes here were computed so that at least one individual with the desired genotype is obtained; thus the expected number of such individuals is always above one. Also, the present results for single selections could be used as a per-individual approximation to the multiple selections case. But, this would provide only a crude approximation. Exact derivation of minimal population sizes in the true multiple selection case is more complex and was not considered here. It could be feasible using the present derivations in conjunction with Equation 10 provided with no application by ![]()
Concerning condition (ii), it was shown previously (![]()
As an example, this was done to produce the results in Table 5. I consider a three-generation BC program (t2 = 3). Population size at each generation was fixed at 49 individuals, i.e., the (constant) minimal population