## Abstract

We consider recombinant inbred lines obtained by crossing two given homozygous parents and then applying multiple generations of self-crossings or full-sib matings. The chromosomal content of any such line forms a mosaic of blocks, each alternatively inherited *identically by descent* from one of the parents. Quantifying the statistical properties of such mosaic genomes has remained an open challenge for many years. Here, we solve this problem by taking a continuous chromosome picture and assuming crossovers to be noninterfering. Using a continuous-time random walk framework and Markov chain theory, we determine the statistical properties of these identical-by-descent blocks. We find that successive block lengths are only very slightly correlated. Furthermore, the blocks on the ends of chromosomes are larger on average than the others, a feature understandable from the nonexponential distribution of block lengths.

WITH the advent of dense genomic maps, in particular based on single-nucleotide polymorphism (SNP) data, the study of haplotypes has become central for modern analyses in population genetics (Buckler and Gore 2007; Carlton 2007; Frazer *et al.* 2007; Mott 2007; Jakobsson *et al.* 2008; Bryc *et al.* 2010). Here, the term haplotype refers to the series of alleles that an individual carries on a chromosome pair at a collection of (possibly many) loci and contrasts with single-locus genotypes that were the objects of many past studies. Haplotypic information can be used for association studies (Gold *et al.* 2008), for diversity studies (Lindblad-Toh *et al.* 2005), or for recognizing signals of positive selection using various measures of haplotype homozygosity (Sabeti *et al.* 2002; Zhang *et al.* 2006; Lencz *et al.* 2007; Tang *et al.* 2007; Curtis *et al.* 2008). Many approaches capitalize on the apparent “block” structure of haplotypes (Stumpf 2002; Cardon and Abecasis 2003; Wall and Pritchard 2003; Altshuler *et al.* 2005; Zheng and McPeek 2007). Various causes can be called upon to explain the apparent structuration of genomes in haplotype blocks (Tishkoff and Verrelli 2003; Zondervan and Cardon 2004; Pe’er *et al.* 2006), among which are recombination hotspots (Goldstein 2001; Jeffreys *et al.* 2001) and population structure (Pritchard *et al.* 2000; Grote 2007; Slate and Pemberton 2007). However, the situation is often complicated (Shifman *et al.* 2003; Yalcin *et al.* 2004; Cuppen 2005; Kauppi *et al.* 2005; Greenawalt *et al.* 2006; Moore *et al.* 2008). In particular, the theoretical properties of many of the objects mentioned above, *e.g.*, haplotype block lengths, remain largely unknown. Often, the distribution of blocks is declared “nonrandom” (Curtis *et al.* 2008) although the null hypothesis is not clearly specified.

The task of determining statistical properties of chromosomal block structures has arisen in many different contexts. These can be classified into two types according to the kind of populations considered and lead to different mathematical techniques. In the first class, one asks how the genome of one or more parents in a population gets broken up into blocks at successive generations and how different descendents may share *identical-by-descent* (IBD) blocks. The framework most generally taken allows for random mating between individuals, a stochastic number of offspring for each individual, and possibly population growth; because of this stochasticity, the mathematical theory of branching processes plays a key role. The second class on the contrary assumes complete knowledge of all genealogies and so is relevant only for controlled crosses. But because the corresponding framework is thus constrained, Markov chains can be used to follow the statistics of IBD blocks and even how the genomes of *all* founding parents get shared among descendants.

The mathematical treatments in the first class typically build on Fisher’s “theory of junctions” (Fisher 1949). Fisher defined a *junction* in a chromosome as a boundary point between segments descended by different routes from the founders. Once formed by crossovers, junctions can be inherited, just like point mutations. Fisher (1949, 1954, 1959) and Bennett (1953) investigated the expected number of chromosomal regions separated by junctions for different systems of inbreeding (repeated selfing, repeated sib mating, repeated parent–offspring mating, etc.). Stam (1980) extended Fisher’s theory of junctions to a random mating population of constant size and any number of generations. Furthermore, he was able to derive the “probability distribution of the heterogenic part of the genome” (and not just the expected number of fragments) by assuming that the fragments were exponentially distributed—a critical hypothesis that was justified by numerical simulations. Chapman and Thompson (2003) extended Stam’s work to the case of a subdivided population. They were also able to relax the hypothesis of an exponential distribution and showed that the IBD tracts of chromosomes followed a distribution not quite exponential, having in fact a fat tail. They also determined how these properties were affected by the population size. Analogous work by Baird *et al.* (2003) focused on the case of a formally infinite population; this simplifies the problem because related individuals never mate with one another. Furthermore, they worked in the approximation of allowing only 0 or 1 crossover at each meiosis, a case sometimes referred to as complete interference; then each individual can carry at most just one block from the reference founding parent. Within this framework, they were able to treat the problem exactly, deriving in particular the distribution of the number of descendants containing blocks and the first two moments of these block sizes.

The mathematical treatment of the second class was initiated by Donelly (1983). Numerous studies since have derived exact mathematical results on different kinds of pedigree systems (Slatkin 1972; Franklin 1977; Donelly 1983; Guo 1994; Bickeboller and Thompson 1996a,b; Stefanov 2000; Browning and Browning 2002; Cannings 2003; Dimitropoulou and Cannings 2003; Ball and Stefanov 2005; Walters and Cannings 2005; Rodolphe *et al.* 2008). Such studies map the IBD problem to that of a random walk on a pedigree-dependent graph. It is that Markovian framework which we use here in the context of recombinant inbred lines, a particular kind of pedigree that has the additional complication of allowing for an infinite number of generations. We provide a description that is mathematically rigorous but also of practical use.

Recombinant inbred lines (RILs) can be derived by either self-fertilization (plants) or brother–sister matings (animals). RILs have become a tool of choice for animal and plant studies [genetic maps, QTL detection, and association studies (Churchill *et al.* 2004; Churchill 2007; Crow 2007; Keurentjes *et al.* 2007; Yu *et al.* 2008)]. Moreover, such lines are fixed and provide ever-lasting replicable reference homozygous genomes; these are very useful to dissect complex traits and estimate epistatic effects or genotype × environment interactions (Bergland *et al.* 2008; Maccaferri *et al.* 2008; Alcazar *et al.* 2009). To produce a RIL, one typically starts with F_{1} hybrids derived from the cross of two homozygous parents, say *P _{A}* and

*P*. Offspring are generated from these F

_{a}_{1}and the process is repeated for many generations; this can be done by selfing [

*single-seed descent*(SSD)] or by full-sib mating (hereafter referred to as “SIB”). At each generation mean heterozygosity decreases and in fact the process tends toward homozygosity at all loci. Due to the formation of crossovers during meiosis, the genomes at each generation are mixtures of the two parental ones, in which closer loci have a higher probability of descending from the same parent

*P*or

_{A}*P*. The fixed genomes then form successions of blocks, each block being IBD to one of the two parents. In effect, we have a mosaic genome for the RIL, patching together pieces from each parent

_{a}*P*or

_{A}*P*. What is the mean length of blocks? We shall see that it is 0.5 M in SSD and 0.25 M in SIB if the chromosome genetic length is large. But one may also ask what is the block length distribution, what is the mean number of blocks on a finite chromosome, or even what are the analogous statistics before all loci are fixed. As genome coverage becomes dense or as one approaches a nucleotide-level description of genomes [be it for association studies or genomic selection (Meuwissen

_{a}*et al.*2001)], one is inevitably driven toward a continuous chromosome picture, requiring block-like descriptions. Here, we address the need to work at this level where blocks are the elements of interest. In particular, we show how to calculate block statistics in RILs, using a mix of combinatorial analysis and probability theory.

## Model and Methods

### Junctions

For all our work, we deal with diploid organisms and are concerned with the construction of RILs in both SSD and SIB. Each chromosome pair is subject to independent dynamics, so without loss of generality we can focus on the case of a single chromosome pair. Furthermore, the objects of study are the IBD blocks, hereafter simply referred to as blocks. The F_{1} is considered to be the zeroth generation (*g* = 0). To go from one generation to the next, we produce one offspring in the case of SSD and a brother–sister pair in the case of SIB. An offspring individual is the union of two gametes, each of which is produced by a parent through meiosis, during which there can be crossovers. (In the case of SSD, there is just one parent.) Once a locus is fixed, it stays so forever. In SSD this simply requires the two homologous chromosomes to have the same allele at that locus; in SIB, it requires all four chromosomes to have the same allele.

We measure continuous positions on the chromosome in morgans, with the leftmost end of the chromosome corresponding to the origin of our axis; *i.e.*, *x* = 0. Following the work of Fisher (1949, 1954, 1959), Bennett (1953), and Donelly (1983), a crossover is referred to as a “junction” and is identified with an arbitrarily precise point on the chromosome. We assume that crossovers arise without interference; then the production or not of a junction in the interval [*x*, *x* + *dx*] is independent of occurrences of junctions in any other interval. Here *x* denotes the genetic position, *dx* is infinitesimal, and junctions arise with density 1 along the chromosome. Figure 1 illustrates the use of these junctions when following successive generations under SSD. At one generation, consider the pair (*H*, *H*′) of homologous chromosomes. A meiosis takes place and results in an offspring chromosome (gamete) that is a mosaic of chromosome segments coming from either *H* or *H*′. A junction separates two adjacent segments. We use a binary label (0 or 1) to specify the origin (*H* or *H*′) of each segment. For any position *x*, having the list of its labelings for all generations *g* allows us to determine the IBD content at *x* as shown on the right-hand side of Figure 1. Note that the numbering of the junctions is done from left to right, *not* as a function of its occurrence in generations. The successive steps of the procedure are shown in Supporting Information, Figure S1: first one lays out the junctions and their numbering, then one introduces the binary labels across each junction, and finally one reconstructs the haplotypes (see File S1). As illustrated in Figure S2, the case of SIB mating is analogous, and again at each generation a chromosome consists of a mosaic of the founding parents’ chromosomes (see File S1).

So far, the introduction of junctions can be formulated for any pedigree system. Many previous studies have done this and mapped the IBD problem to that of a continuous-time random walk on a pedigree-dependent graph (Slatkin 1972; Franklin 1977; Donelly 1983; Guo 1994; Bickeboller and Thompson 1996a,b; Stefanov 2000; Browning and Browning 2002; Cannings 2003; Dimitropoulou and Cannings 2003; Ball and Stefanov 2005; Walters and Cannings 2005; Rodolphe *et al.* 2008). Here we consider RILs and then the Markovian framework’s pedigree-dependent graph is a hypercube. The sequence of binary labels at any given locus *x* specifies a unique vertex of the hypercube, and how this vertex changes as one moves along the chromosome determines the block structure. Note that some of this mathematical framework is close in spirit to that used for studying the coalescent in the presence of recombination; there the central object is the so-called ancestral recombination graph, and the problem (Wiuf and Hein 1999; McVean and Cardin 2005) is to describe how this graph changes with position along a continuous chromosome. This is a very difficult problem and so the authors of those studies derived relatively few exact results.

The application of this Markovian framework for SSD and SIB RILs requires considering all possible continuous-time walks on a high-dimensional hypercube. We do this in two steps. First, we enumerate by computer all possible discrete time random walks on that hypercube. Then we tackle the continuous waiting times of the original walks by analytical techniques. Finally, the numerical treatment of these analytical expressions is performed using Mathematica (Wolfram 1991). The C and Mathematica codes for these different tasks are provided in File S2.

### The continuous-time Markov process

When creating the generations 1 to *g*, 2*g* gametes are produced in the SSD mating scheme, and 4*g* gametes are produced in the SIB mating scheme. Denote this number by *N*_{c} as it is also the number of (new) chromosomes produced in the RIL construction (remember we follow only one chromosome pair). Rather than follow each gamete from one generation to the next, it is (more) useful to consider all *N*_{c} gametes *simultaneously*. This can be visualized by stacking the pairs of chromosomes for all generations on top of each other and then scanning the chromosome stack from left to right to see where the junctions appear in order of increasing *x*.

Of great importance is the fact that the junctions on these *N*_{c} gametes are *independent* in all respects: having a junction on one gamete in [*x*, *x* + *dx*] does not affect the probability of having another junction elsewhere, be it on the same gamete or on any other gamete. Because of this independence, one can think of the production of junctions among the whole set of gametes as being a “continuous-time” Markov process (Feller 1950), where *x* plays the role of time. For the interval [*x*, *x* + *dx*], a junction arises with probability *N*_{c}*dx*, and then if such an *event* is realized the junction is assigned randomly to one of the *N*_{c} gametes (each with probability 1/*N*_{c}). The operation is then repeated for the interval [*x* + *dx*, *x* + 2*dx*], and so forth. We thus have a Markov process where interevent intervals are independent and distributed as(1)while junction assignments to chromosomes are done equiprobably.

### Discrete and continuous-time random walks on the hypercube

We initialize the binary labels at *x* = 0 randomly and uniformly because segregation is unbiased. The continuous-time Markov process extends these labels from *x* = 0 toward increasing *x*. At any given *x*, and using a {0, 1} notation for each binary label, we call ℳ the map from the *N*_{c} dimensional hypercube ℋ > to the genotypes at generation *g*; this map can be thought of as a *coloring* of the vertices of the *N*_{c}-dimensional hypercube. There are as many colors as there are one-locus genotypes at generation *g*: 4 for SSD and 16 for SIB. Then the block pattern at generation *g* can be “read off” by examining the succession of vertex colors visited by the Markov walk on the hypercube. Note that having a junction appear at *x* corresponds to hopping to a random neighboring vertex on ℋ at that time, while the residence time on each vertex is exponentially distributed (*cf*. Equation 1).

For the block *statistics*, we want to find the probability that the walk on ℋ leads to a given pattern of successive colors. For this, we sum over all possible walks compatible with the desired pattern. The crucial point is that the continuous variables of the interjunction values affect the lengths of the blocks, but not the pattern of successive blocks. This allows us to decompose the problem of block statistics into two parts. The first comes from the discrete set of possibilities for the sequence of vertices visited on ℋ (the “topology” of the junctions); we use the master equation of the discrete-time random walk on the hypercube to track these sequences. The second is associated with the continuous nature of the junction–junction intervals, which involves summing over known probability distributions derived from Equation 1.

### Extracting block length distributions

Consider the simplest observable: the length of the first block along the chromosome. If the block is heterozygous, then its length distribution in SSD is that of the distance of the first junction and is given by Equation 1. Indeed, the locus *x* = 0 must be heterozygous at generation *g*, but in SSD, at the very first hop of our random walk on the hypercube, the heterozygous block will end, starting a fixed block. The situation is more instructive if the first block is homozygous (fixed). To calculate the length distribution of that first block, we first consider all possibilities for the different walks from the starting vertex (defined from the situation at *x* = 0). A walk will maintain the homozygous structure at generation *g* for perhaps a few hops and then one hop will change that. If *k* is the first hop that ends the block, we can collect together all the discrete time walks that have the same *k*. We thus define *P*^{(1)}(*k*) as the probability to perform *k* – 1 hops while staying at generation *g* in the same fixed state as that of *x* = 0 and to then terminate the block at the *k*th hop. *P*^{(1)}(*k*) is the sum of the probabilities of all discrete time walks on ℋ that are compatible with staying in the first fixed state during exactly *k* – 1 hops. Because the number of such walks grows exponentially with *k*, it is best to determine this quantity by recursions rather than by enumerations. This is precisely what is done when using the associated master equation. Each iteration of that equation updates a vector on the hypercube and generates the successive *P*^{(1)}(*j*). To obtain *P*^{(1)}(*k*) one has to perform *k* iterations of the master equation. File S1 specifies this master equation, the initialization of the vector iterated, and the relation between the iterated vector and *P*^{(1)}(*j*); the C programs for implementing these iterations are also provided (see File S2).

Given the *P*^{(1)}(*k*) probabilities, we can reintroduce the continuous times spent on each vertex of the hypercube to get the distribution of the length of the first bloc. Indeed, for SSD as well as for SIB, for all walks that contribute to this situation, *x* will go from 0 to *x _{k}* with

*x*distributed as a rescaled Gamma distribution,(2)as this the distribution of the sum of

_{k}*k*independent exponentially distributed variables. The distribution of ℓ

_{1}, the length of the first block (assuming it is fixed) is then given by(3)This result holds for an infinite chromosome. For a finite chromosome of length

*L*, we note that if ℓ

_{1}>

*L*, we have “stepped off” the finite chromosome. Thus, if the value of ℓ

_{1}(distributed as in Equation 3) is greater than

*L*, we see that on the finite chromosome the block is actually only

*L*long. Thus to adapt Equation 3 to a finite chromosome, we simply keep the distribution as is when ℓ

_{1}<

*L*while for all those values ℓ

_{1}≥

*L*we set ℓ

_{1}=

*L*. Mathematically, this generates a delta function at that point of weight given by the probability that ℓ

_{1}≥

*L*. This derivation corresponds to a simple truncation of a distribution, and it can be extended to other observables. These include the length of the

*n*th block, which requires calculating the probabilities

*P*

^{(}

^{n}^{)}(

*k*) of stepping off the

*n*th block after

*k*steps, and the joint distribution of lengths for different blocks. Details on such derivations are given in File S1. The Mathematica codes for performing sums like those in Equation 3 are also provided in File S2.

## Results

### Infinite and semi-infinite chromosomes

#### Mean block lengths:

As the number *g* of generations increases, alleles become fixed and since nothing changes thereafter the statistics of the blocks must have a limit at large *g*. In that situation, there is only alternation of IBD blocks homozygous of type *P _{A}* or

*P*. We focus on the statistics of such blocks, either at arbitrary

_{a}*g*or in the

*g*→ ∞ limit, approximated by taking

*g*large enough. However, our computational framework applies to arbitrary blocks, homozygous or heterozygous. From the practical point of view, we are limited computationally by the part of the algorithm that follows occupation probabilities on the hypercube ℋ: executing this master equation on the computer uses operations where

*N*

_{J}is the maximum number of hops of the walks; this restricts our study to 14 generations in SSD and 7 generations in SIB.

A simple statistic of blocks is their mean length 〈ℓ〉. This quantity is related to the density η of block extremities: on a very large chromosome of size *L*, the number of blocks *n* will satisfy *n*/*L* ≈ η while 〈ℓ〉 ≈ *L*/*n*. For SSD RILs the density η is known to be 2 while it is 4 in SIB. Such a result follows by considering a small interval [*x*_{1}, *x*_{2}] and asking that the interval be recombinant. In SSD this occurs with probability *R* = 2*r*/(1 + 2*r*), where *r* is the recombination rate per meiosis between *x*_{1} and *x*_{2} (Haldane and Waddington 1931). Taking *x*_{2} – *x*_{1} to be infinitesimal, we get *r* ≈ (*x*_{2} – *x*_{1}) (Haldane 1919) and so *R* ≈ 2(*x*_{2} – *x*_{1}); noting that recombination then implies the presence of a block extremity in this interval, we see that the density of block extremities is 2. Setting η = 2, one obtains directly 〈ℓ〉 = , valid at large *g* and for large chromosomes. An identical reasoning gives 〈ℓ〉 = for SIB since in this case *R* = 4*r*/(1 + 6*r*).

#### Distribution of block lengths:

Getting the *distribution* of a block length cannot be achieved by such shortcuts. Instead, we generalize Equation 3, again at very large *g* and for a very long (semi-infinite) chromosome. Starting from *x* = 0, the successive block lengths are ℓ_{1}, ℓ_{2}, … ; as the block number *n* increases, the distributions of ℓ* _{n}* tend toward a limiting distribution μ*(ℓ) that has no memory of the state at the chromosome’s origin. The computation of the distribution at any given

*n*, in direct analogy with what was done for ℓ

_{1}, requires calculating the probabilities

*P*

^{(}

^{n}^{)}(

*k*) of staying on the

*n*th block during

*k*– 1 hops and stepping off at the

*k*th one. Again, we use the master equation to compute these quantities iteratively;

*cf*. File S1. Then the distribution of ℓ

*is obtained by replacing the*

_{n}*P*

^{(1)}(

*k*) in Equation 3 by

*P*

^{(}

^{n}^{)}(

*k*). These successive distributions are displayed for SSD in Figure 2. Note that the convergence in

*n*is very rapid; only the first block is visibly different from the others. In File S1, we provide a parameter-free approximation to μ*(ℓ) that works quite well as shown in Figure S3. In the case of SIB RILs, the convergence with

*n*is much slower as shown in Figure S4.

A log–log plot of these distributions shows that they are not exponential, though in the tail they all are well approximated by an exponential; such a form in the tail is a consequence of the spectral decomposition of the Markov process (see File S1). Note that these distributions for ℓ_{1}, ℓ_{2}, … must all decay at the same asymptotic rate. Another important point is that the distribution of ℓ_{2} is slightly different from that of ℓ_{3}, proving that the successive lengths are *not independent*: the block lengths are *not* generated by a stationary renewal process.

Although we were not able to derive the analytic form of μ*, we nevertheless have(4)This can be proved by relating μ*(ℓ → 0) to double-recombinant frequencies as follows. We take two successive intervals *I*_{1,2} and *I*_{2,3}, each of length *dx* (infinitesimal), and ask what the probability is that the intervals are both recombinant. There is a probability 2*dx* in SSD and 4*dx* in SIB that the first interval is recombinant (recall that the densities of block extremities are respectively 2 and 4). Then given this first extremity, the probability that there is another extremity in the second interval is μ*(ℓ → 0)*dx*, leading to a total probability of 2*dx*μ*(ℓ → 0)*dx* in SSD and 4*dx*μ*(ℓ → 0)*dx* in SIB. However, this double-recombinant probability can also be computed (Martin and Hospital 2006) in terms of the three recombination frequencies *R*_{1,2}, *R*_{2,3}, and *R*_{1,3} associated with the locus pairs (1, 2), (2, 3), and (1, 3). In the limit of small *dx*, it is (2*dx*)^{2}3/2 for SSD and (4*dx*)^{2}7/4 for SIB. Identifying the different expressions, we get the claimed results.

Analogous studies can be performed at given values of *g*, including or not heterozygous blocks. Recalling that fixation arises rather rapidly in SSD, it is no surprise that the statistics of homozygous blocks converge quickly to a large *g* limit. For example, if one computes in SSD the length distribution of the first block when it is homozygous, one finds that it does not vary much for *g* ≥ 3 as illustrated in Figure S5. For completeness, we show the analogous result in SIB in Figure S6.

#### The first block is longer than the following ones:

The system does not follow a stationary renewal process. Nevertheless, it turns out that the large difference we see between the first and the remaining blocks is not so much due to the memory from one block to the next but to the nonexponential distribution of block lengths. Even in the presence of memory from block to block, one has the following general relation on a semi-infinite chromosome at large *g*,(5)where ℓ_{1} denotes the length of the first block and ℓ_{∞} denotes that of faraway blocks. The proof boils down to considering the blocks on the infinite line and taking the origin of the (semi-infinite) chromosome at random. It falls inside a block of length ℓ_{∞} with probability density proportional to ℓ_{∞} itself. Denoting as before by μ^{(1)}(ℓ_{1}) the probability density of the length of the first block, we have(6)(The reader can check that this is a normalized probability density.) Using this density, the computation of the first moment of ℓ_{1} leads directly to Equation 5. To interpret this result, note that Equation 5 implies that the relative difference (〈ℓ_{1}〉 – 〈ℓ_{∞}〉)/〈ℓ_{∞}〉 is equal to the [relative variance of μ*(ℓ_{∞}) . When μ*(ℓ_{∞}) is a pure exponential, this quantity vanishes and then 〈ℓ_{1}〉 = 〈ℓ_{∞}〉. Thus we have 〈ℓ_{1}〉 > 〈ℓ_{∞}〉 if and only if the relative variance of μ*(ℓ_{∞}) > 1, which is what we find to happen in this system. For instance in the SSD case, we have 〈ℓ_{1}〉 = 0.595, to be compared with 〈ℓ_{∞}〉 = . The first block is thus on average substantially larger than the others.

From Figure 2 one can see that μ^{(1)} is more spread out than μ*; μ* gives μ^{(1)} from Equation 6 so that μ^{(1)}(0) = 2 in SSD and 4 in SIB by direct computation using 〈ℓ_{∞}〉. Note that μ^{(1)}(0)*dx* can also be interpreted as the probability of having the first block end between *x* = 0 and *x* = *dx*; since that is the same as the density of junctions times *dx*, *i.e.*, 2 or 4 × *dx*, we indeed recover the result μ^{(1)}(0) = 2 for SSD and μ^{(1)}(0) = 4 for SIB.

#### The lengths of successive blocks are slightly correlated:

Even though junctions are independent, each junction affects the IBD property in its neighborhood. Two positions will have nearly independent IBD only when they are distant along the chromosome because only in that case will there be many crossover events separating them. It thus seems natural to expect that the successive block lengths will not be independent in contrast to the underlying Δ*x* separating junctions. In fact this must be the case given that we found earlier that on a semi-infinite chromosome the distribution of lengths is different for the second and the third block.

Our framework allows one to compute joint distributions and thus the linear correlation coefficients(7)where (resp. ) is the standard deviation of ℓ* _{n}* (resp. ℓ

_{n}_{+1}). The joint distribution of ℓ

*and ℓ*

_{n}

_{n}_{+1}is given bywhere

*P*

^{(}

^{n}^{,}

^{n}^{+1)}(

*k*,

_{n}*k*

_{n}_{+1}) is the probability that the

*n*th block ends after

*k*hops and the

_{n}*n*+ 1th after

*k*

_{n}_{+1}. From this distribution, the mean of the product ℓ

*ℓ*

_{n}

_{n}_{+1}is the sum of the probabilities

*P*

^{(}

^{n}^{,}

^{n}^{+1)}(

*k*,

_{n}*k*

_{n}_{+1}) times the average of ℓ

*times the average of ℓ*

_{n}

_{n}_{+1}(factorization), each of which is obtained from Equation 2. The linear correlation coefficient then reduces to(8)where (resp. ) is the standard deviation of

*k*(resp.

_{n}*k*

_{n}_{+1}). These quantities are directly obtainable from the probability

*P*

^{(}

^{n}^{,}

^{n}^{+1)}(

*k*,

_{n}*k*

_{n}_{+1}), as long as the number of generations is not too large. We have computed these quantities in SSD for a semi-infinite chromosome, for different

*g*’s and choices of block numbers. For instance, if we consider the first and second blocks, assumed to be fixed, the value of

*C*(ℓ

_{1}, ℓ

_{2}) is –0.0197 at

*g*= 2, –0.0125 at

*g*= 3, –0.00841 at

*g*= 4, … , with a trend that is compatible with a vanishing limit at large

*g*. We find the same trend for the following blocks too. Furthermore, at given

*g*,

*C*(ℓ

*, ℓ*

_{n}

_{n}_{+1}) rapidly converges to a limiting value as

*n*increases. This is illustrated in Table S1. The computer programs for obtaining these correlation coefficients are also provided (see File S2).

### Case of finite chromosomes

#### Length distributions:

Clearly on a finite chromosome of length *L* the distributions of block lengths are modified. The computations are more complicated, but remain feasible. As an illustration, consider the case where the chromosome has just two blocks. We use the *P*^{(1,2)}(*k*_{1}, *k*_{2}) probabilities, and for each (*k*_{1}, *k*_{2}) we impose the filter that ℓ_{1} < *L* while ℓ_{1} + ℓ_{2} > *L*. Then we have for the probability densities μ^{(1)}(ℓ_{1}) = μ^{(1)}(ℓ_{2}) and thus the distribution must be symmetric about *L*/2, and we also have 〈ℓ_{1}〉 = 〈ℓ_{2}〉 = *L*/2. The explicit expression for this density is(9)From this it turns out that μ^{(1)}(ℓ_{1}) has a minimum at ℓ_{1} = *L*/2: it is more likely to have one rather short and one rather long block than to have two blocks of approximately the same size.

#### A comparison with experimental RIL data:

It is appropriate to compare our theoretical computations with block statistics measured in experimental RILs. Since the block sizes are random variables, it is best to work with RIL data sets where (i) the block structure has been determined precisely, requiring high-density genotyping, and (ii) there are many lines, an easier task for SSD than for SIB crosses. Such a data set has been produced within the species *Arabidopsis thaliana* by Singer *et al.* (2006). These authors genotyped several hundred thousand loci via hybridization arrays, from which they determined block extremities in 100 SSD recombinant inbred lines derived from the crossing of Columbia and Landsberg homozygotes. Singer *et al.* provide the genetic map of their cross and the physical positions of the block extremities. From these data we determined the block lengths for each RIL and each of the five chromosomes. We display in Figure S7 the distributions of the first block length, for all five chromosomes. The solid line is the theoretical curve, corresponding to the infinite chromosome case but truncated to the genetic length of each chromosome. As was explained previously, if the (infinite chromosome) random variable ℓ_{1} is larger than the length *L* of the chromosome, one sees in practice a block of length ℓ_{1} = *L*; this happens with a finite probability that is represented in Figure S7, using a solid dot. The histograms are for the experimental data, and we have included the 95% confidence intervals for each bin. We see that the theoretical predictions agree well with the experimental values except for the last bin of chromosome 1 and chromosome 4. Interestingly, for both of these the segregation data exhibit significant distortion; such distortion, typically caused by loci under selection pressures, can affect recombination rates. It is thus satisfying that the agreement between theory and experiment is as good as it is in spite of this distortion.

#### Number of blocks:

Clearly the typical number of blocks will grow with the total length *L* of the chromosome. Furthermore, for large *g*, the density of block extremities is 2 in SSD (4 in SIB); thus at large *L* the mean number of blocks should grow as 2*L* in SSD (as 4*L* in SIB). It is also of interest to determine the *distribution* of the number of blocks.

Consider first the probability that the whole chromosome at generation *g* is in one single homozygous block. Starting at the left end of the chromosome, we must be in a fixed state: we choose it to be, for instance, the *P _{A}* genotype. This constraint is used to set the occupation probabilities of the random walks on ℋ before the first hop. Explicitly, at

*x*= 0 we introduce

*V*

_{i}^{(0)}= 0 on vertex

*i*if its color is incompatible with the

*P*genotype; otherwise

_{A}*V*

_{i}^{(0)}is a site-independent constant such that the total probabilities sum to 1. At each junction (hop), the master equation is used to update the vector of probabilities on the hypercube, and so for

*K*hops we have the vector . We iterate for up to a given total of

*N*

_{J}junctions. In practice we cannot take

*N*

_{J}= ∞ because of the numerical nature of the algorithm; so instead we take

*N*

_{J}sufficiently large so that only negligible probabilities are dropped in the truncation. As a ballpark estimate,

*N*

_{J}must be large enough to have

*N*

_{J}/

*N*

_{c}≫

*L*, and then each gamete can have many junctions per morgan. During the application of the master equation to the vector, hops terminating the block are stored and we keep in a file the probabilities Π

^{(1)}(

*K*) that the first block ends after

*K*hops, 1 ≤

*K*≤

*N*

_{J}. Then the probability

*p*

_{1}that there will be just one block in 0 ≤

*x*≤

*L*is given by(10)times the probability of fixing at

*x*= 0 (in SSD this is 1 – 1/2

*; in SIB it is given by a recurrence relation). Note that these integrals correspond to incomplete Gamma functions, allowing for a relatively efficient computation (see File S2). As mentioned before, in practice the sum over*

^{g}*K*is truncated to

*K*≤

*N*

_{J}and one must check that the error induced by this truncation is small enough. We find that the probability to have a single block decreases exponentially when

*L*grows. For instance in SSD, at large

*g*, the probability is 0.417 for

*L*= 0.5 M, 0.189 for

*L*= 1.0 M, and 0.088 for

*L*= 1.5 M.

The previous approach can be extended to compute the probability that the chromosome consists of *n* blocks as follows. For simplicity, consider directly the large *g* limit so we can ignore hops on ℋ leading to heterozygous genotypes at *g*. We use the master equation framework to keep track of the joint probabilities of being on a vertex of ℋ and of being in the *m*th block, for the number of hops *K* going from 0 to *N*_{J}. At each hop, there are transitions from vertex to vertex and potentially from block to block. If one hops to a vertex incompatible with the desired block pattern, the probability is set to 0. We keep track of the probabilities Π^{(}^{n}^{)}(*K*) that the *n*th block terminates at the *K*th hop. The probability of having *at most n* blocks when 0 ≤ *x* ≤ *L* is then given by the same formula as for one block (Equation 10) but replacing Π^{(1)}(*K*) by Π^{(}^{n}^{)}(*K*). (Note that *K* is the sum of the number of hops for blocks 1, 2, … , *n*.) Repeating the computation for *n* – 1, we obtain the probability of having at most *n* – 1 blocks in the region 0 ≤ *x* ≤ *L*, and taking the difference of the two results we obtain the probability of having exactly *n* blocks (see File S2). As an illustrative example, Figure 3 gives the probability distribution of *n* at large *g* in the case of SSD for a chromosome of length *L* = 1.5 M.

## Discussion

In the dense marker or continuous chromosome picture, genotypes appearing in a RIL form IBD blocks. The block structure is nontrivial in part because the consequence of a crossover at generation *g* depends on crossovers arising at previous generations. To study the statistics of such blocks, we used a labeling procedure that allows for a mapping onto a Markov process. Such a process reduces our problem of blocks to linear operations (associated with a master equation that we implemented in a C code), followed by relatively standard analysis involving sums and integrals (that we treated via Mathematica). The sources for these computations are provided with this work (see File S2).

We illustrated a number of properties of block statistics, highlighting in particular several closed-form results. Although the joint statistics of blocks are quite complex, we found that block-to-block correlations were very weak and thus genotype frequencies are well approximated by a stationary renewal process. In addition, the distributions of block lengths are not too far from exponential. Thus approximating the RIL case using exponential distributions will lead to qualitatively correct results with an accuracy of ∼20%. This level of accuracy should hold for other systems of crosses such as randomly mating populations, justifying the use of the exponential approximation in several previous studies (Stam 1980; Chapman and Thompson 2003).

We mainly stressed cases with complete fixation because the construction of RILs aims to have homozygous genotypes, but our formalism is applicable to both homozygous and heterozygous blocks. Note that within SSD RILs, as shown in *Results*, heterozygous blocks have an exponential distribution for their lengths; furthermore, heterozygous blocks *interrupt* the memory of the process in SSD; that is, two blocks separated by a heterozygous segment are independent. The reason is that there is only one way to be heterozygous in SSD (up to irrelevant exchanges of gametes at the same generation).

Clearly it is possible to extend our formalism to cases involving more than two parents; this may be of use when dealing with multiparental RILs that are being developed currently to have greater power in association studies (Churchill *et al.* 2004). We hope our results will stimulate work in this direction.

## Acknowledgments

We thank F. Rodolphe for discussing his results with us. This work has been supported by grants from the Agence Nationale de la Recherche: ANR-09-GENM-022-003 SingleMeiosis (to O.C.M.) and ANR-O6-BLANC-0128 (to F.H.).

## Footnotes

*Communicating editor: I. Hoeschele*

- Received December 24, 2010.
- Accepted July 12, 2011.

- Copyright © 2011 by the Genetics Society of America