## Abstract

The expectation of the parental genome contribution to inbred lines derived from biparental crosses or backcrosses is well known, but no theoretical results exist for its variance. Our objective was to derive the variance of the parental genome contribution to inbred lines developed by the single-seed descent or double haploid method from biparental crosses or backcrosses. We derived formulas and tabulated results for the variance of the parental genome contribution depending on the chromosome lengths and the mating scheme used for inbred line development. A normal approximation of the probability distribution function of the parental genome contribution fitted well the exact distribution obtained from computer simulations. We determined upper and lower quantiles of the parental genome contribution for model genomes of sugar beet, maize, and wheat using normal approximations. These can be employed to detect essentially derived varieties in the context of plant variety protection. Furthermore, we outlined the application of our results to predict the response to selection. Our results on the variance of the parental genome contribution can assist breeders and geneticists in the design of experiments or breeding programs by assessing the variation around the mean parental genome contribution for alternative crossing schemes.

The expected contribution of a parental line to the genome of an inbred line derived from a biparental cross is 1/2. For inbred lines derived from a backcross, the expected genome contribution of the nonrecurrent parent is (1/2)^{*t*+1}, where *t* is the number of backcross generations. Experimental studies showed a considerable variation in the parental genome contribution around these mean values (Heckenberger *et al*. 2006), but until now no theoretical concept for describing the variance of the parental genome contribution to homozygous inbred lines existed.

Inbred lines are developed for various purposes in genetic research and applied plant breeding programs, *e.g*., for direct use as line cultivars or as parents of hybrid and synthetic varieties. A theoretical concept for calculating the variance of the parental genome contribution to inbred lines can be used (1) in plant variety protection to test hypotheses on the mating scheme that was employed for inbred line development and (2) to assess and compare the variability in experimental and breeding populations generated with a certain mating scheme depending on the number and length of the chromosomes of the species under consideration.

Hill (1993) derived the variance of the parental genome contribution to heterozygous backcross individuals under the assumption of no interference in crossover formation. Employing his formula for the variance, he found that a normal approximation fitted well the probability distribution of the parental genome contribution obtained from computer simulations. Using the cattle genome as an example, he demonstrated that his results can be employed to determine approximate upper bounds for the parental genome contribution of the nonrecurrent stock.

Our objectives were to (1) derive the variance of the parental genome contribution to inbred lines developed by the single-seed descent (SSD) or double haploid (DH) method from biparental crosses or backcrosses adopting the approach of Hill (1993), (2) investigate with computer simulations the fit of a normal approximation to the probability distribution of the parental genome contribution, and (3) demonstrate the application of the formulas in the context of plant variety protection.

## THEORY

#### Assumptions:

We assume that the offspring are completely homozygous lines, derived without selection from a biparental cross of completely homozygous parents P_{1} and P_{2}. For all derivations, we assume absence of interference (Stam 1979) in crossover formation such that the recombination frequency *r*_{uv} between two loci on a chromosome with map positions *u* and *v* is calculated by Haldane's (1919) mapping function

$$r_{uv} = \frac{1}{2}\left(1 - e^{-2|u - v|}\right). \tag{1}$$
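As a small illustration of Haldane's mapping function, the sketch below converts map positions (in Morgan units) to a recombination frequency and back. The function names `haldane_r` and `haldane_distance` are hypothetical and chosen only for this example.

```python
import math

def haldane_r(u: float, v: float) -> float:
    """Recombination frequency between map positions u and v (in Morgans)
    under Haldane's mapping function, i.e., without crossover interference."""
    return 0.5 * (1.0 - math.exp(-2.0 * abs(u - v)))

def haldane_distance(r: float) -> float:
    """Inverse mapping: map distance in Morgans for 0 <= r < 0.5."""
    return -0.5 * math.log(1.0 - 2.0 * r)

print(haldane_r(0.0, 0.1))  # closely linked loci: r is slightly below 0.1
print(haldane_r(0.0, 5.0))  # distant loci: r approaches 0.5
```

For small distances the recombination frequency is close to the map distance, and for loci far apart it approaches 1/2, the value for unlinked loci.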

#### Variance of the parental genome contribution:

Meiosis on different chromosomes is stochastically independent. Hence, the variance of the genome contribution *Z* of parent P_{1} to the genome of a derived line can be written in terms of the variances Var(*Z*_{i}) for individual chromosomes as

$$\mathrm{Var}(Z) = \frac{1}{l^{2}} \sum_{i=1}^{c} l_{i}^{2}\, \mathrm{Var}(Z_{i}), \tag{2}$$

where *c* is the number of chromosomes, *l*_{i} the length of the *i*th chromosome, and $l = \sum_{i=1}^{c} l_{i}$ the total length of the genome in Morgan units.
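The combination of chromosome-wise variances into a genome-wide variance can be sketched as follows. The sketch assumes that *Z* is the length-weighted mean of the independent chromosome contributions, so that each chromosome enters with weight (*l*_{i}/*l*)^2. The function name and the numerical inputs are hypothetical placeholders and are not values from Table 3.

```python
def genome_variance(lengths, chrom_vars):
    """Variance of the genome-wide contribution Z from per-chromosome variances.

    lengths    : chromosome lengths l_i in Morgans
    chrom_vars : Var(Z_i) for each chromosome, in the same order
    Assumes independent chromosomes and Z = sum_i (l_i / l) * Z_i.
    """
    if len(lengths) != len(chrom_vars):
        raise ValueError("lengths and chrom_vars must have the same size")
    total = sum(lengths)
    return sum((l / total) ** 2 * v for l, v in zip(lengths, chrom_vars))

# Ten equally long chromosomes with a placeholder variance of 0.10 each:
# the genome-wide variance is one-tenth of the chromosome variance.
print(genome_variance([1.6] * 10, [0.10] * 10))

# Unequal lengths: long chromosomes receive a larger weight.
print(genome_variance([2.0, 1.0], [0.08, 0.12]))
```

With *c* chromosomes of equal length and equal variance, the expression reduces to Var(*Z*_{i})/*c*, which is why species with many chromosomes show less variation around the expected contribution.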

Following the approach introduced by Hill (1993) in the context of backcross populations, the variance of the parental genome contribution to a chromosome equals the expected covariance between two randomly sampled loci on the chromosome,

$$\mathrm{Var}(Z_{i}) = E\left[\mathrm{cov}(G_{u}, G_{v})\right] = E(D_{uv}) = \int_{0}^{l_{i}}\int_{0}^{l_{i}} D(u, v)\, f(u, v)\, du\, dv, \tag{3}$$

where *G*_{u} and *G*_{v} are random variables taking the value 1 if the loci at map positions *u* and *v* carry the allele of parent P_{2} and 0 otherwise, and *D*_{uv} is a random variable describing the linkage disequilibrium between two loci on the chromosome with probability density

$$f(u, v) = \frac{1}{l_{i}^{2}}, \qquad 0 \le u, v \le l_{i}. \tag{4}$$

Using the formulas for the quantities in Equation 5, given in Frisch and Melchinger (2006, Table 1 therein), *D*(*u*, *v*) can be calculated according to Equation 6.

We present formulas for *D*(*u*, *v*) for the following four mating systems (Table 1): (1) (F_{2})^{t}-SSD lines, developed by *t* (*t* ≥ 0) generations of random mating of an F_{2} population and subsequent application of the SSD method for line development; (2) (F_{1})^{t}-DH lines, developed by *t* (*t* ≥ 0) generations of random mating of an F_{1} cross and subsequent inbred line development with the DH method; and (3) BC_{t}-SSD and (4) BC_{t}-DH lines, developed from an F_{1} cross backcrossed *t* (*t* ≥ 1) times to parent P_{1}, with subsequent line development by the SSD or DH method.

Inserting *D*(*u*, *v*) (Table 1) into Equation 3 yields Var(*Z*_{i}). Analytical results for Var(*Z*_{i}) are derived in the appendix and summarized in Table 2. Numerical results for Var(*Z*_{i}) are given in Table 3. To check our derivations, we determined the results in Table 3 also with computer simulations using Plabsoft (Maurer *et al*. 2004). The differences between simulated and analytically determined variances were < 0.001 if one million chromosomes were simulated.
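The averaging of *D*(*u*, *v*) over all pairs of positions on a chromosome can also be checked numerically. The sketch below is a rough illustration under stated assumptions, not the derivation of the appendix. Table 1 is not reproduced here, so the two linkage disequilibrium formulas in the code are assumptions based on standard two-locus results for the case *t* = 0: (1 − 2*r*)/4 for F_{1}-DH lines and (1 − 2*r*)/[4(1 + 2*r*)] for F_{2}-SSD lines. They should be checked against Table 1 before use. Because *D* depends only on the distance *d* = |*u* − *v*| under Haldane's mapping function, the double integral over the chromosome reduces to (2/*l*_{i}^2) ∫ (*l*_{i} − *d*) *D*(*d*) d*d*, which is evaluated with Simpson's rule. All function names are hypothetical.

```python
import math

def haldane_r(d):
    """Recombination frequency at map distance d (Morgans), no interference."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

# Assumed linkage disequilibrium as a function of r (t = 0 cases only).
def ld_f1_dh(r):
    return (1.0 - 2.0 * r) / 4.0

def ld_f2_ssd(r):
    return (1.0 - 2.0 * r) / (4.0 * (1.0 + 2.0 * r))

def chromosome_variance(length, ld, n=2000):
    """Simpson's rule for (2 / l^2) * integral_0^l (l - d) * D(r(d)) dd.

    n is the number of subintervals and must be even.
    """
    h = length / n
    total = 0.0
    for k in range(n + 1):
        d = k * h
        weight = 1 if k in (0, n) else (4 if k % 2 else 2)
        total += weight * (length - d) * ld(haldane_r(d))
    return 2.0 / length**2 * total * h / 3.0

for name, ld in (("F1-DH", ld_f1_dh), ("F2-SSD", ld_f2_ssd)):
    print(name, round(chromosome_variance(1.6, ld), 4))
```

Under these assumptions, the variance for a chromosome of 1.6 M is larger for F_{1}-DH lines than for F_{2}-SSD lines, in line with the trend described for Table 3.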

#### Probability distribution of the parental genome contribution:

The probability distribution of the parental genome contribution is determined by the number and location of crossover events occurring during the meioses in inbred line development. We investigated the probability distribution assuming no interference in crossover formation (Stam 1979), employing properties of the Poisson process (*cf*. Karlin 1968).

For an individual chromosome, the probability that exactly *k* crossovers occur during all meioses in inbred line development can be obtained from the probability function of the Poisson distribution. If no crossover occurs (*k* = 0), then the genome contribution of parent P_{1} is either 0 or 1. In consequence, the probabilities *P*(*Z*_{i} = 0) and *P*(*Z*_{i} = 1) do exist and the random variable *Z*_{i} is discrete for *Z*_{i} = 0 and *Z*_{i} = 1. If *k* > 0 crossovers occur, then the length of chromosome segments between crossovers is exponentially distributed and the sum of lengths of chromosome segments is gamma distributed. In consequence, *Z*_{i} is in the interval (0, 1) a mixture of linear transformations of the gamma distributions for different values of *k*. For the entire genome, the distribution of the parental genome contribution is a convolution of the distributions for the individual chromosomes.
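The mixed discrete and continuous nature of *Z*_{i} can be illustrated with a small simulation. The sketch below covers only the simplest case, a DH line derived directly from an F_{1} (a single meiosis), and is not the Plabsoft simulation used in this study. It assumes that crossovers follow a Poisson process with rate 1 per Morgan, generated through exponentially distributed gaps, and that the parental origin at the chromosome start is random. The function name, the seed, and the number of simulated lines are arbitrary choices for this example.

```python
import math
import random

def simulate_dh_chromosome(length, rng):
    """Fraction of one chromosome inherited from P1 in an F1-derived DH line."""
    from_p1 = rng.random() < 0.5      # random parental origin at position 0
    pos, p1_len = 0.0, 0.0
    while True:
        nxt = pos + rng.expovariate(1.0)   # exponential gap to next crossover
        end = min(nxt, length)
        if from_p1:
            p1_len += end - pos
        if nxt >= length:                  # no further crossover on the chromosome
            break
        pos = nxt
        from_p1 = not from_p1              # parental origin switches at a crossover
    return p1_len / length

rng = random.Random(1)
n_lines = 100_000
z = [simulate_dh_chromosome(1.6, rng) for _ in range(n_lines)]

boundary = sum(1 for x in z if x == 0.0 or x == 1.0) / n_lines
mean = sum(z) / n_lines
var = sum((x - mean) ** 2 for x in z) / (n_lines - 1)

print("P(Z_i in {0, 1}):", boundary, "expected exp(-l):", math.exp(-1.6))
print("mean:", mean, "variance:", var)
```

In this single-meiosis case the probability of no crossover is exp(−*l*_{i}), so for a chromosome of 1.6 M roughly one-fifth of the simulated values fall exactly on 0 or 1, while the remainder is spread continuously over the interval (0, 1).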

Analytical results for the exact probability distribution of the parental genome contribution could be derived by employing the above considerations. However, the resulting equations would be rather unwieldy and using them to derive important parameters such as quantiles directly from the density functions would require a heavy use of high quality numerical mathematics. Alternatively, we suggest employing our relatively simple equations for the variance (Table 2) and a normal approximation instead.

## DISCUSSION

#### Genetic model:

For all derivations we used the assumption of no interference (Stam 1979) underlying Haldane's (1919) mapping function. This is a simplified mathematical model and there exist more sophisticated models of crossover formation in meiosis, which fit experimental data better (McPeek and Speed 1995). Briefly, the advantages of the assumption of no interference are (1) mathematical simplicity, yielding equations that can be easily evaluated, and (2) that the results can be applied without knowing the exact amount of interference in the chromosome region under consideration. For a more detailed discussion concerning the use of the assumption of no interference see Frisch and Melchinger (2001).

Equation 3, defining the variance of the parental genome contribution in terms of the linkage disequilibrium *D*(*u*, *v*), and the formulas for *D*(*u*, *v*) in terms of the recombination frequency *r*_{uv} presented in Table 1, hold true irrespective of the amount of interference. These formulas can be used with arbitrary mapping functions to derive the variance of the parental genome contribution under the assumption of interference. Presumably, analytical solutions as presented in the appendix cannot be derived for some mapping functions. In such cases, approximate solutions of Equation 3 can be obtained with numerical integration routines of mathematical software packages.
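Swapping the mapping function in a numerical integration can be sketched as follows. The example compares Haldane's function with Kosambi's (1944) function, *r* = tanh(2*d*)/2, for F_{1}-derived DH lines. The linkage disequilibrium (1 − 2*r*)/4 used here is an assumption based on the standard two-locus result for a single meiosis and should be checked against Table 1. Simpson's rule stands in for the integration routines of mathematical software packages, and the function names are hypothetical.

```python
import math

def haldane_r(d):
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def kosambi_r(d):
    return 0.5 * math.tanh(2.0 * d)

def f1_dh_variance(length, mapping, n=2000):
    """Var(Z_i) for F1-derived DH lines under an arbitrary mapping function.

    Integrates (2 / l^2) * (l - d) * (1 - 2 r(d)) / 4 over 0 <= d <= l with
    Simpson's rule; n is the (even) number of subintervals.
    """
    h = length / n
    total = 0.0
    for k in range(n + 1):
        d = k * h
        weight = 1 if k in (0, n) else (4 if k % 2 else 2)
        total += weight * (length - d) * (1.0 - 2.0 * mapping(d)) / 4.0
    return 2.0 / length**2 * total * h / 3.0

for name, mapping in (("Haldane", haldane_r), ("Kosambi", kosambi_r)):
    print(name, round(f1_dh_variance(1.6, mapping), 4))
```

In this sketch the variance obtained with Kosambi's function is smaller than the one obtained with Haldane's function for the same chromosome length.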

Compared with no interference, positive interference results in a greater number of chromosome segments with intermediate length and a smaller number of very long or short chromosome segments. Therefore, positive interference will result in smaller variances of the parental genome contribution than those presented in our results. The opposite is the case for negative interference.

#### Comparison with previous studies:

Hill (1993) derived the variance of the parental genome contribution to backcross individuals. Each backcross individual receives from the recurrent backcross parent one set of homologous chromosomes, for which the variance of the parental genome contribution is zero. Hence, the variance of the parental genome contribution to backcross individuals is entirely determined by the variance of the parental genome contribution to the homologous chromosome set originating from the nonrecurrent parent. These homologous chromosomes are genetically identical to the chromosomes of DH lines derived from a backcross individual. In consequence, the corresponding variance derived by Hill (1993) for backcross individuals equals Var(*Z*_{i}) for BC_{t}-DH lines.

Wang and Bernardo (2000) derived the variance *V*(_{k}*X*) of marker estimates of parental genome contribution to F_{2}- and BC_{1}-SSD lines. They considered a finite number *k* of marker loci per chromosome and employed Kosambi's (1944) mapping function, *r*_{uv} = tanh(2|*u* − *v*|)/2. The major difference to our approach is that Wang and Bernardo (2000) obtain *V*(_{k}*X*) by summing over a discrete number of marker loci, whereas we obtain Var(*Z*_{i}) by integrating over an infinite number of genomic loci (Equation 3). The results on *V*(_{k}*X*) and Var(*Z*_{i}) can be related as follows. Inserting *D*(*u*, *v*) (Table 1) in Equation 3, but employing Kosambi's instead of Haldane's mapping function, yields the limit of *V*(_{k}*X*) for *k* → ∞. In consequence, *V*(_{k}*X*) of Wang and Bernardo (2000) converges to Var(*Z*_{i}) for large numbers of markers on a chromosome (assuming that the same mapping function is employed).

Heckenberger *et al*. (2006) estimated the parental genome contribution to 102 F_{2}-SSD and 11 BC_{1}-SSD maize lines with 100 SSR and 1017 AFLP markers. They determined the standard deviations of the parental genome contribution (Table 4) and compared their results with computer simulations. The observed standard deviations were not significantly different (χ^{2} test with α = 0.05) from the simulated values. The standard deviations determined with Equation 2 as well as those obtained with the model of Wang and Bernardo (2000) were in good agreement with the experimental and simulated values (Table 4). In conclusion, both theoretical models fit the data set of Heckenberger *et al*. (2006) well.

#### Numerical results:

The variance of the parental genome contribution to a chromosome depends on the expected number of crossovers occurring on the chromosome during inbred line development. A large number of expected crossovers results in many small chromosome segments, whereas few crossovers result in few long chromosome segments. With few long segments, the probability that chromosomes with very large or very small parental genome contributions do occur is greater and, therefore, the variance of the parental genome contribution is greater than for many small segments.

The number of crossovers expected per meiosis on a chromosome equals its length in Morgan units. Therefore, the variance of the parental genome contribution is smaller for long chromosomes than for short chromosomes. This trend can be observed irrespective of the employed breeding scheme for inbred line development (Table 3).

The total number of crossovers occurring on a chromosome during inbred line development depends on the total number of meioses and, hence, the employed breeding scheme. Intermating or backcrossing prior to employing the SSD or DH method results in an increased total number of meioses and, therefore, in a smaller variance of the parental genome contribution (Table 3). Generating DH lines comprises only one meiosis, whereas in the SSD scheme one meiosis occurs in each selfing generation. Therefore, the variance of the parental genome contribution is greater for DH than for SSD lines.

#### Normal approximation:

A normal approximation is not expected to fit the distribution of the parental genome contribution for individual chromosomes well, because *Z*_{i} = 0 and *Z*_{i} = 1 can occur with rather high probabilities, especially for short chromosomes or when inbreds are generated by the DH method. However, the genomes of important crops consist of many chromosomes (9 in sugar beet, 10 in maize, and 21 in wheat). Therefore, the random variable describing the parental genome contribution to the entire genome is a sum of independent random variables for the individual chromosomes. According to the central limit theorem (Shao 1999) the probability distribution of a sum of a large number of random variables converges to a normal distribution, irrespective of the type of distributions of random variables that are summed up. As a consequence, theory suggests that a normal approximation of the probability distribution of the parental genome contribution to the entire genome should fit the true distribution well.

To investigate the fit of the normal approximation, we used the software Plabsoft (Maurer *et al*. 2004) to simulate the parental genome contribution to (a) one chromosome of 1.6 M length and (b) a model of the maize genome consisting of 10 chromosomes each of 1.6 M length for the F_{2}-SSD and BC_{1}-SSD mating schemes. The normal approximations fit the simulated distributions of the parental genome contribution for individual chromosomes only poorly (Figure 1). In contrast, the fit was very good for the simulated distribution of the entire genomes for both F_{2}-SSD and BC_{1}-SSD lines. Hence, our formulas for the variances, together with a normal approximation, provide a good means by which to investigate the distribution of the parental genome contribution in many applications in genetics and breeding.
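A comparison of this kind can be sketched for a simpler mating scheme than the ones simulated in this study. The example below simulates F_{1}-derived DH lines (one meiosis per line, crossovers as a Poisson process without interference) for the maize model genome and compares the empirical 0.95 quantile with the normal approximation. The closed form used for Var(*Z*_{i}) assumes a linkage disequilibrium of (1 − 2*r*)/4 for this scheme, which should be checked against Tables 1 and 2. Names, seed, and sample size are arbitrary choices for this example.

```python
import math
import random
from statistics import NormalDist

def simulate_dh_chromosome(length, rng):
    """Fraction of one chromosome inherited from P1 in an F1-derived DH line."""
    from_p1 = rng.random() < 0.5
    pos, p1_len = 0.0, 0.0
    while True:
        nxt = pos + rng.expovariate(1.0)   # exponential gap between crossovers
        end = min(nxt, length)
        if from_p1:
            p1_len += end - pos
        if nxt >= length:
            break
        pos = nxt
        from_p1 = not from_p1
    return p1_len / length

def simulate_dh_genome(lengths, rng):
    """Length-weighted mean of the chromosome contributions."""
    total = sum(lengths)
    return sum(l * simulate_dh_chromosome(l, rng) for l in lengths) / total

rng = random.Random(7)
lengths = [1.6] * 10                       # maize model genome
z = sorted(simulate_dh_genome(lengths, rng) for _ in range(20_000))
emp_q95 = z[int(0.95 * len(z))]

# Normal approximation with the assumed closed form for F1-DH lines.
var_chr = (2 * 1.6 - 1 + math.exp(-2 * 1.6)) / (8 * 1.6**2)
sigma = math.sqrt(var_chr / len(lengths))
normal_q95 = NormalDist(0.5, sigma).inv_cdf(0.95)

print("empirical 0.95 quantile:", round(emp_q95, 3))
print("normal approximation:   ", round(normal_q95, 3))
```

For ten chromosomes the two quantiles are expected to lie close together, even though each individual chromosome has a clearly non-normal distribution.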

#### Application in plant variety protection:

An essentially derived variety is a cultivar or an inbred line, which is for the most part identical to one of its ancestors. Essentially derived varieties can be detected by comparing predictions of the parental genome contribution to inbred lines with threshold values. The variances of the parental genome contribution derived here can be employed together with the prediction method described in a companion article (Frisch and Melchinger 2006) to establish a test for detecting essentially derived varieties.

The first step of the test is to identify breeding schemes that are generally considered acceptable for inbred line development. For example, in wheat breeding in Europe, it is an accepted breeding scheme to cross a proprietary inbred line with a registered line cultivar of a competitor and to select a new line cultivar from the resulting population of F_{2}-SSD lines. In contrast, deriving inbred lines from backcross populations is not accepted.

Then the null hypothesis, “An inbred line was derived using an accepted breeding scheme,” is tested. The critical value for the test is determined from the quantiles of a normal approximation of the distribution of the parental genome contribution under the null hypothesis. For example, in wheat, the 0.99 quantile of the parental genome contribution to F_{2}-SSD lines is 0.638 (Table 5). As test statistic, the genome contribution of the parental line that is assumed to be plagiarized to the putative essentially derived variety is determined by using the prediction method of Frisch and Melchinger (2006). If the test statistic is greater than the critical value, then the null hypothesis is rejected and plagiarism is assumed. (Of course, the accused breeder always has the possibility to prove that an accepted method was employed, *e.g*., by disclosing the breeding records.)

For use as threshold values, we determined quantiles of the parental genome contribution for model genomes of sugar beet (9 chromosomes of 1.0 M length), maize (10 chromosomes of 1.6 M length), and wheat (21 chromosomes of 1.8 M length) by employing normal approximations (Table 5). The upper quantiles were considerably lower for long genomes than for short ones, *e.g*., the 0.95 quantile for F_{2}-SSD lines was 0.598 for wheat and 0.681 for sugar beet. Breeding schemes with intermating before inbred line development had slightly smaller 0.95 quantiles than the corresponding breeding schemes without intermating.
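The determination of threshold values and the test decision can be sketched as follows for F_{2}-SSD lines. The linkage disequilibrium used in the code, (1 − 2*r*)/[4(1 + 2*r*)], is an assumption based on the standard two-locus result for selfed lines and should be checked against Table 1. The variance per chromosome is obtained by Simpson's rule instead of the analytical results of Table 2. Function names are hypothetical, and the predicted contributions in the last lines are invented inputs for illustration only.

```python
import math
from statistics import NormalDist

def f2_ssd_chromosome_variance(length, n=2000):
    """Numerical Var(Z_i) for F2-SSD lines (Haldane, assumed D, Simpson's rule)."""
    h = length / n
    total = 0.0
    for k in range(n + 1):
        d = k * h
        x = math.exp(-2.0 * d)            # equals 1 - 2r under Haldane's function
        ld = x / (4.0 * (2.0 - x))        # assumed D; note 1 + 2r = 2 - x
        weight = 1 if k in (0, n) else (4 if k % 2 else 2)
        total += weight * (length - d) * ld
    return 2.0 / length**2 * total * h / 3.0

# Model genomes from the text: (number of chromosomes, length in Morgans).
MODEL_GENOMES = {"sugar beet": (9, 1.0), "maize": (10, 1.6), "wheat": (21, 1.8)}

def upper_quantile(n_chrom, length, p):
    """Quantile of Z under the normal approximation, equal chromosome lengths."""
    sigma = math.sqrt(f2_ssd_chromosome_variance(length) / n_chrom)
    return NormalDist(0.5, sigma).inv_cdf(p)

def exceeds_threshold(predicted_contribution, critical_value):
    """True if the null hypothesis of an accepted breeding scheme is rejected."""
    return predicted_contribution > critical_value

for crop, (n_chrom, length) in MODEL_GENOMES.items():
    print(crop, round(upper_quantile(n_chrom, length, 0.95), 3))

critical = upper_quantile(*MODEL_GENOMES["wheat"], 0.99)
print(exceeds_threshold(0.70, critical), exceeds_threshold(0.55, critical))
```

Under these assumptions, the computed quantiles can be compared directly with the values quoted from Table 5 for the three model genomes.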

The upper quantiles for F_{1}-DH lines were considerably greater than those for F_{2}-SSD lines. For example, the 0.95 quantile for F_{2}-SSD lines of maize was 0.648, whereas for F_{1}-DH lines it was 0.672 (Table 5). Typically, the expectation of the parental genome contribution is the criterion that determines acceptance or nonacceptance of a certain breeding scheme for inbred line development. The F_{2}-SSD scheme is often suggested as an accepted breeding method for determining critical threshold values (*cf*. Heckenberger *et al*. 2006). If F_{2}-SSD lines are considered acceptable, then F_{1}-DH lines should also be considered acceptable, because both have an expected parental genome contribution of one-half. However, F_{1}-DH lines have a considerably greater variance of the parental genome contribution (Table 3) and, consequently, greater upper quantiles (Table 5). Therefore, the F_{1}-DH mating scheme seems in general more appropriate than the F_{2}-SSD scheme for determining threshold values.
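The quantile quoted for F_{1}-DH lines of maize can be retraced with a short worked calculation. It is a sketch that assumes the standard single-meiosis result *D*(*u*, *v*) = (1 − 2*r*_{uv})/4 for F_{1}-derived DH lines, which equals e^{−2|*u* − *v*|}/4 under Haldane's mapping function, and it should be checked against the entries of Tables 1 and 2 for *t* = 0.

$$
\begin{aligned}
\mathrm{Var}(Z_{i}) &= \frac{1}{l_{i}^{2}} \int_{0}^{l_{i}}\int_{0}^{l_{i}} \frac{1}{4}\, e^{-2|u - v|}\, du\, dv
  = \frac{2 l_{i} - 1 + e^{-2 l_{i}}}{8\, l_{i}^{2}}, \\
l_{i} = 1.6:\quad \mathrm{Var}(Z_{i}) &\approx \frac{3.2 - 1 + 0.0408}{20.48} \approx 0.1094, \\
\mathrm{Var}(Z) &= \frac{\mathrm{Var}(Z_{i})}{10} \approx 0.01094, \qquad \sqrt{\mathrm{Var}(Z)} \approx 0.1046, \\
q_{0.95} &\approx 0.5 + 1.645 \times 0.1046 \approx 0.672 .
\end{aligned}
$$

The result agrees with the 0.95 quantile of 0.672 given for F_{1}-DH lines of maize, and the larger standard deviation compared with F_{2}-SSD lines explains the greater upper quantile.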

The test described above can be modified by using alternative test statistics and/or alternative methods to determine critical values. Alternative predictors of the parental genome contribution for use as test statistics were discussed by Frisch and Melchinger (2006), and alternative methods to determine critical threshold values were proposed by Smith *et al*. (1995), Wang and Bernardo (2000), and Heckenberger *et al*. (2005).

Smith *et al*. (1995) suggested employing fixed threshold values and proposed a parental genome contribution of 0.9 as threshold for maize lines. Compared with using fixed values as thresholds, our method has the advantage that it is genetically justified. For F_{2} and F_{1} derived lines of maize, the 0.999 quantiles of the parental genome contribution ranged between 0.73 and 0.82 (Table 5). In consequence, employing 0.9 as threshold value results in a low power of detecting backcross-derived inbreds.

Wang and Bernardo (2000) suggested determining threshold values using the variance of marker estimates of the parental genome contribution. Compared with the method of Wang and Bernardo (2000), our method has the advantage that the threshold values (Table 5) are independent of the employed set of molecular markers.

Heckenberger *et al*. (2005) suggested determining threshold values with computer simulations. Our results on the quantiles of the parental genome contribution for F_{2}-SSD lines of maize were in good agreement with the corresponding results of Heckenberger *et al*. (2005). However, our method has the advantage that no computer simulations are required.

#### Application in selection theory:

Selection for parental marker alleles in backcross populations was investigated and a comprehensive selection theory was developed by Frisch and Melchinger (2005). That approach takes into account (a) the exact distribution of the parental genome contribution and (b) that selection for the parental alleles at marker loci is actually an indirect selection for the parental alleles at all loci of the entire genome. However, such theory is not available for inbred lines developed with the mating schemes considered in this study. Using a simpler mathematical model that neglects (a) and (b), the variances of the parental genome contribution can be employed to estimate the response to selection.

We consider a population of inbred lines analyzed for a large number of polymorphic molecular markers that cover the entire genome without larger gaps (*e.g*., one marker per centimorgan). Selection is carried out for the alleles of one parental line and the marker score is regarded as the target trait for selection. Under these assumptions, an approximate pre-test estimate of the response to selection *R* can be obtained from standard selection theory (Falconer and Mackay 1996, p. 189, Equation 11.3),

$$R = i\, h^{2} \sigma_{p}, \tag{7}$$

where *i* is the selection intensity, *h*^{2} the heritability, and σ_{p} the square root of the phenotypic variance. Assigning a heritability of *h*^{2} = 1 for the markers and using the variance of the parental genome contribution as phenotypic variance we obtain

$$R = i \sqrt{\mathrm{Var}(Z)}. \tag{8}$$
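A numerical example of this estimate is sketched below. Two assumptions go beyond the text: the selection intensity is computed with the usual formula for truncation selection on a normally distributed trait, *i* = φ(*x*)/*p*, where *p* is the selected fraction and *x* the corresponding truncation point, and the standard deviation of *Z* is backed out of the 0.95 quantile of 0.648 quoted for F_{2}-SSD lines of maize. The function names and the selected fraction of 10% are hypothetical choices for this example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def sigma_from_quantile(q, p, mean=0.5):
    """Standard deviation implied by a normal quantile q at probability p."""
    return (q - mean) / STD_NORMAL.inv_cdf(p)

def selection_intensity(fraction):
    """Selection intensity for truncation selection of the best fraction."""
    threshold = STD_NORMAL.inv_cdf(1.0 - fraction)
    return STD_NORMAL.pdf(threshold) / fraction

def selection_response(fraction, sigma_z):
    """R = i * sqrt(Var(Z)), i.e., the estimate with h^2 = 1."""
    return selection_intensity(fraction) * sigma_z

sigma_maize = sigma_from_quantile(0.648, 0.95)   # F2-SSD lines, maize model genome
response = selection_response(0.10, sigma_maize)
print("selection intensity:", round(selection_intensity(0.10), 3))
print("expected contribution of selected lines:", round(0.5 + response, 3))
```

The expected parental genome contribution of the selected lines is the population mean of 1/2 plus the response *R*.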

#### Further applications:

In addition to the above applications, the results presented are of general interest for breeders and geneticists because they allow comparison of the distribution of the parental genome contribution for alternative mating schemes.

For example, an important goal in second-cycle breeding is the development of inbred lines that share the general characteristics with one parental line and are improved by specific characteristics of a second crossing partner. Such derived lines are then used as a replacement for parental lines in a breeding program. As a rule of thumb, the breeder may attempt to derive lines with a parental genome contribution of 3/4 from the parental line, which should be replaced by the derived line. The probability distribution of the parental genome contribution can help to assess the suitability of mating schemes to deliver such inbred lines. For sugar beet, the overlap of the probability density functions of the parental genome contribution to F_{2}-SSD and BC_{1}-SSD lines is considerable (Figure 2) and it is possible to select lines with a parental genome contribution of 70–75% from an F_{2}-derived population. In contrast, for wheat, F_{2}-SSD lines with parental genome contributions of 3/4 or more from one crossing partner do occur only with an extremely small probability (Figure 2). Therefore, in wheat a BC_{1}-derived population must be generated to be able to select lines with the desired parental genome contribution.
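The contrast between sugar beet and wheat can be quantified roughly with the normal approximation. The sketch below backs the standard deviation of *Z* out of the 0.95 quantiles quoted above for F_{2}-SSD lines (0.681 for sugar beet and 0.598 for wheat) and computes the probability that a line reaches a parental genome contribution of at least 3/4. The function name is hypothetical, and the tail probabilities are only as reliable as the normal approximation in the extreme tail.

```python
from statistics import NormalDist

def prob_at_least(target, q95, mean=0.5):
    """P(Z >= target) under a normal approximation calibrated by its 0.95 quantile."""
    sigma = (q95 - mean) / NormalDist().inv_cdf(0.95)
    return 1.0 - NormalDist(mean, sigma).cdf(target)

p_sugar_beet = prob_at_least(0.75, 0.681)
p_wheat = prob_at_least(0.75, 0.598)

print("sugar beet:", p_sugar_beet)
print("wheat:     ", p_wheat)
print("ratio:     ", p_sugar_beet / p_wheat)
```

Under this approximation, roughly one in a hundred F_{2}-SSD lines of sugar beet reaches the target, whereas for wheat the probability is several orders of magnitude smaller.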

These examples demonstrate that our results can be used to assess the expected variation of the parental genome contribution in populations derived from planned crosses of parental lines, depending on the number and length of the chromosomes of the species. This information can help breeders and geneticists in the design of breeding programs and experiments.

### APPENDIX

We derive the variance of the parental genome contribution to a chromosome according to Equation 3 for four mating systems.

#### BC_{t}-DH lines:

Inserting *D*(*u*, *v*) for BC_{t}-DH lines (Table 1) into Equation 3 yields (A1).

With (A2) and (A3) we get (A4).

#### BC_{t}-SSD lines:

For BC_{t}-SSD lines we have (Table 1) (A5).

We consider the second indefinite integral in Equation A5 for the case *u* ≤ *v* and set (A6), and with logarithmic integration we get (A7).

Applying the same principle to the case *u* > *v* we get (A8).

We now consider the first indefinite integral in Equation A5. Adding to the numerator (A9) and applying (A10) we get (A11).

Hence, we get for a fixed value of *v*, (A12), where (A13).

For symmetry reasons (A14).

Employing the dilogarithm function (*cf*. Galassi *et al*. 2006) (A15) we get (A16).

Using this and (A17) yields (A18).

#### (F_{1})^{t}-DH lines:

For (F_{1})^{t}-DH lines we have (Table 1) (A19).

Consider *u* ≤ *v* and set (A20), then (A21).

With integration by substitution we get (A22).

In analogy we get for *u* > *v* (A23) and therefrom for a fixed *v* (A24).

We have (A25).

For symmetry reasons (A26), and therefrom we get (A27).

#### (F_{2})^{t}-SSD lines:

Using the definition of *D*(*u*, *v*) from Table 1 and Equation A11 we get for a fixed value of *v* (A28), where ξ_{1}, ξ_{2}, ξ_{3}, ξ_{4}, and ξ_{5} are defined in Equation A13, and therefrom (A29).

## Acknowledgments

We thank Frank M. Gumpert for checking the derivations in the appendix and an anonymous reviewer for helpful comments and suggestions.

## Footnotes

Communicating editor: R. W. Doerge

- Received August 29, 2006.
- Accepted February 26, 2007.

- Copyright © 2007 by the Genetics Society of America