## Abstract

Identity-by-descent probabilities are important for many applications in genetics. Here we propose a method for modeling the transmission of the haplotypes from the closest genotyped relatives along an entire chromosome. The method relies on a hidden Markov model where hidden states correspond to the set of all possible origins of a haplotype within a given pedigree. Initial state probabilities are estimated from average genetic contribution of each origin to the modeled haplotype while transition probabilities are computed from recombination probabilities and pedigree relationships between the modeled haplotype and the various possible origins. The method was tested on three simulated scenarios based on real data sets from dairy cattle, *Arabidopsis thaliana,* and maize. The mean identity-by-descent probabilities estimated for the truly inherited parental chromosome ranged from 0.94 to 0.98 according to the design and the marker density. The lowest values were observed in regions close to crossing over or where the method was not able to discriminate between several origins due to their similarity. It is shown that the estimated probabilities were correctly calibrated. For marker imputation (or QTL allele prediction for fine mapping or genomic selection), the method was efficient, with 3.75% allelic imputation error rates on a dairy cattle data set with a low marker density map (1 SNP/Mb). The method should prove useful for situations we are facing now in experimental designs and in plant and animal breeding, where founders are genotyped with relatively high markers densities and last generation(s) genotyped with a lower-density panel.

TWO alleles at a single locus are identical-by-descent (IBD) if they derive from a common ancestor without mutation. The probability for two alleles to be IBD (IBDp) is an information of high importance as it is used in several applications such as gene or QTL mapping (*e.g.*, Meuwissen and Goddard 2004), marker-assisted (Fernando and Grossman 1989), or genomic selection (De Roos*et al.* 2007; Calus*et al.* 2008), haplotype reconstruction, or imputation (Kong*et al.* 2008).

Estimation of IBDp may rely on linkage or linkage disequilibrium information, or on both. With linkage, the estimation of IBDp is based on Mendelian segregation rules, recombination, pedigree, and genotypes. With such information, IBDp are most often estimated between parents and offspring with the sole informative flanking markers (*e.g.*, Fernando and Grossman 1989; Wang*et al.* 1995; Pong-Wong*et al.* 2001). IBDp between more distant relatives are obtained by transmitting the parent–offspring information along the path between the relatives. If some individuals on this path are not genotyped, transmission of chromosomes can no longer be traced unambiguously. In addition, since the information relies only on flanking markers, it is rapidly eroded along the pedigree path.

Other methods estimate IBDp probabilities using multipoint linkage and segregation analysis and by considering all the pedigree links. Lander and Green (1987) proposed such a method considering each alternative gene flow pattern in a pedigree separately. However, it results in high computational costs when the pedigree increases. Abecasis*et al.* (2002) described an efficient algorithm for the analysis of dense genetic maps in pedigree. Their program, called Merlin, proved faster (and used less memory) than softwares based on the Lander–Green algorithm. Although very efficient, Merlin nevertheless becomes also impractical in large pedigrees with complex structures such as those used in animal and plant breeding.

Methods such as Loki (Heath 1997) or Morgan (Wijsman*et al.* 2006) consider multipoint linkage and segregation analysis using MCMC techniques to jointly estimate IBDp or haplotypes probabilities. Although fairly general and potentially using all the available information, these methods are also very demanding from a computational point of view—especially in problems facing large pedigrees and in contexts of high-density genotyping. Furthermore, these approaches are always faced with the problem of getting stuck in a local maximum region.

A common feature to all the methods based on linkage is that the definition of the IBDp refers to a common ancestor within the pedigree: regions will be declared to be IBD only if they can be shown to descend from one common ancestral region within the pedigree. The situation is different with linkage disequilibrium methods (*e.g.,* Meuwissen and Goddard 2001; Hernandez-Sanchez*et al.* 2006), where IBDp are computed with reference to unobserved ancestors of the studied population. The assumptions are that these common ancestors were present *T* generations before the current population and that the studied mutations are old and were already present within these ancestors. The size of the haplotypes shared IBD by two gametes is relatively small and small haplotypes are used to estimate IBDp. As a consequence, such methods require higher marker densities than for linkage approaches, focus on relatively small segments surrounding the region-of-interest, and basically ignore pedigree relationships. With current marker maps (*e.g.*, 50K SNP chips for cattle or maize), these methods also have a high computational burden.

In plant and animal breeding, as in experimental designs, new more cost-effective genotyping strategies are being implemented. Indeed, important founder individuals are genotyped at high density whereas other individuals are genotyped at lower density. For example, in the nested association mapping (NAM) design (Yu*et al.* 2008) or in multiparent advanced generation inter-crosses (MAGIC) (*e.g.*, Threadgill*et al.* 2002; Kover*et al.* 2009), the recombinant inbred lines are phenotyped and genotyped at low density whereas founder lines are genotyped at high density. In dairy cattle, important sires could be genotyped using dense panels (or even sequenced), while younger animals (*e.g.*, those without phenotype) can be genotyped at lower density (*e.g.*, Habier*et al.* 2009). In such designs, estimation of IBDp between founder and nonfounder individuals might be particularly useful for applications such as QTL mapping, genomic selection, or marker imputation. The IBDp estimation method should then rely on pedigree information and linkage (because some animals are genotyped at lower density), be computationally efficient, and handle complex pedigrees with overlapping generations where genotyped founders are spread over several generations.

In this article, we propose an alternative and efficient method with which to describe the IBD process for a whole chromosome between target individuals and their genotyped ancestors, taking benefit of the marker density and using recombination probabilities and known pedigree.

## MATERIALS AND METHODS

#### Estimation of IBD probabilities:

The approach proposed here is a linkage method using information from genotyped ancestors in the pedigree even in situations where some individuals in the pedigree are not genotyped. It does not estimate IBDp with nonancestors.

Any chromosome of today's individuals is a mosaic of ancestral chromosomes. In theory, except for mutation events, any piece of chromosome could be traced back to a corresponding piece of an ancestral chromosome in the individual's pedigree. Inferring the ancestral origin of genotyped chromosomes can be done using probability statements: each locus on a chromosome has a computable probability to descend from ancestral chromosomes taken in a subset *C* of the possible chromosomes set Ω.

##### The set of chromosomes Ω:

This set contains all the chromosomes in a given pedigree. For a pedigree containing *n* (genotyped or not) individuals, the dimension of Ω is 2 × *n*.

##### The subset C(c) of potential parental chromosomes of a given chromosome c:

For each target chromosome (TC) *c* in the pedigree, the potential parental chromosomes (PC) of *c* constitute a subset *C*(*c*) of Ω. The members of *C* can be obtained as follows (an example is provided in supporting information, File S1): starting from *c* and going up the pedigree (*i.e.,* backward in time), we obtain the two possible (*i.e.,* parental) origins of *c* in the previous generation. If this parent is genotyped, these two chromosomes are included in (and make up) *C*(*c*) and the process ends. If not, this process is repeated for each of the two parental chromosomes until:

either one finally reaches genotyped chromosomes, which are then included in

*C*(*c*), oreither one reaches an ungenotyped chromosome for which no genotyped PC exists. This ungenotyped chromosome is then included in

*C*(*c*) as a “phantom” chromosome.

Any locus on the TC is IBD with the corresponding locus on one of these chromosomes. In terms of probabilities, the consequence is that when the sum is taken over all chromosomes making up *C*(*c*), where *P*(*c*(*k*) ≡ *x*(*k*)) denotes the probability for TC *c* to be IBD with *x* at a given locus *k*.

##### Modeling inherited parental chromosomes along the chromosome:

According to the way the ancestral chromosomes set *C* has been defined, the process leading to the formation of a mosaic of pieces of the members of *C* can be seen as follows:

The beginning of the chromosome can correspond to any of the chromosomes

*c*in_{i}*C*with a probability that can be estimated (see*Initial state probabilities*).Considering any pair of positions

*k*and*k*+ 1 along the chromosome, there is a possibility to switch from*c*in position_{i}*k*(noted*c*(_{i}*k*)) to another member*c*of_{j}*C*in position*k*+ 1, and the probability of these switches can also be computed. These “transition probabilities” depend on the couple (*c*,_{i}*c*) and on the distance between the positions_{j}*k*and*k*+ 1.

A Markov process, where PC stands for states, can be used to approximate this process (Thompson 1994). In that case, the PC inherited (*i.e.,* state) at locus *k* + 1 is dependent on the PC inherited at locus k, and is assumed conditionally independent on PC inherited at previous loci:Consequently, the genetic process can be approximated through the stochastic process of a Markov chain: this reduces to first, describing the prior probability to be in any given state (PC) and second, to computing the transition probabilities between PC.

In the case of crossover interference, the Markov property is not valid. In addition, previous studies showed that in some particular pedigrees, the Markov assumption is not appropriate (*e.g.,* Mcpeek and Sun 2000; Broman 2005) but that the Markov approximation may nevertheless be successful to describe the IBD process (Thompson 1994; Leutenegger*et al.* 2003; Broman 2005). In our study, the Markov approximation might be applicable because simplified pedigrees are used: only IBDp with the closest ancestors are estimated and chromosomes are modeled independently.

##### Initial state probabilities:

The prior probability of any given PC *c _{i}* is simply , where

*G*is the number of generations between the PC

_{i}*c*and the TC

_{i}*c*(see examples in File S1).

##### State transition probabilities:

Transition between PC can potentially take place anywhere along the chromosome, but information is available only for discrete positions (*i.e.,* markers), and so we can use these markers as indicators of possible transitions between PC (as later described). The needed transition probabilities are then conditional probabilities:

As already mentioned above, this probability is the probability that the TC *c* is IBD with the PC *c _{j}* in position

*k*+ 1 when the TC is IBD with PC

*c*in position

_{i}*k*.

Two situations exist:

If

*i*=*j*, this means that no recombination occurred between these two positions for the*G*generations between this PC and the TC. The probability is:(1)_{i}

where *r* is the recombination fraction between positions *k* and *k* + 1.

If

*i*≠*j*, we can decompose the situation into three necessary events. The first two events lead to a chromosome where positions*k*and (*k*+ 1) originate from PC*i*and*j*, respectively. The last event is then similar to the previous situation:On the pedigree paths from

*c*and_{i}*c*to_{j}*c*, there must be at least one common ancestor where the two paths merge. We note this ancestor “a” (see examples in File S1). The probability for the PCand PC_{i}to proceed down to this ancestor at positions_{j}*k*and*k*+ 1, respectively, is , where*G*_{a}is the number of generations between the ancestor a and the TC.After reaching this common ancestor, the two chromosomes need to recombine in the next meiosis, leading to the needed hybrid chromosome

*m*in the next generation, with*m*(*k*) =*c*(_{i}*k*) and*m*(*k*+ 1) =*c*(_{j}*k*+ 1). The probability for this recombination is*r*. In addition, this recombinant haplotype is transmitted to the next generation with probability 0.5, resulting in a total probability of 0.5 ×*r*for this event.For the remaining (

*G*_{a}− 1) generations, the hybrid chromosome must be handed down without recombination. The probability to do that is .

As a consequence, we have the following joint probability:

The needed conditional probability *P*[*c*(*k* + 1) ≡ *c _{j}*(

*k*+ 1)|

*c*(

*k*)≡

*c*(

_{i}*k*)] can be obtained by dividing this joint probability by

*P*[

*c*(

*k*) ≡

*c*(

_{i}*k*)], which is simply . To summarize, the transition probability is then:(2)

##### Estimating the probability of one sequence of parental chromosomes:

When modeling the IBD process along the chromosome, *N _{p}* positions of TC are considered simultaneously. Since we have

*N*

_{PC}parental chromosomes, each locus must

*a priori*be IBD with one of these PC, and without further (

*i.e.,*marker) information, we have

*N*

_{PC}possible parental origin for each of the positions, which amounts to a total of

*N*= possible sequences of PC. All these sequences are not equally likely, and the probability for

_{s}*c*to correspond to any particular sequence

*s*= {

*pc*

_{1},

*pc*

_{2}, …, } can be computed using the Markovian property given above,

where *pc _{k}* corresponds to the

*k*th PC in the sequence. Among these

*N*sequences, share the same PC in position

_{s}*k*(say,

*c*(

_{i}*k*)). Consequently, IBDp between TC

*c*and PC

*c*at position

_{i}*k*can be estimated by summing the probabilities over all these sequences where

*pc*(

_{k}*k*) equals

*c*(

_{i}*k*),(3)

where *s ^{j}* is the

*j*th possible sequence, is the

*k*th PC in the

*j*th sequence, and

*I*(

*a*=

*b*) is an indicator function equals to 1 if

*a*equals

*b*and 0 otherwise. The number of sequences increases exponentially with the number of positions and the number of ancestral chromosomes to be considered, which potentially renders this computation very demanding. Fortunately, the use of hidden Markov models (HMM) can provide efficient computation of such probabilities.

##### Conditioning IBD probabilities on observed haplotypes:

Chromosomes are not observed but genotypes are. Using these genotypes, haplotypes can be obtained for TC and genotyped PC (*e.g.*, Windig and Meuwissen 2004; Druet and Georges 2010) and can be used to infer IBDp. The available genotypic information allows conditioning of the sequence probabilities from the previous section on the obtained haplotypes. *P*(*s ^{j}* |

*h*) would be used instead of

^{c}*P*(

*s*) in Equation 3, where

^{j}*h*stands for a haplotype (on chromosome

^{c}*c*):(4)

Probabilities such as *P*(*h ^{c}* |

*s*) are obvious to compute: if, for all the PC making up the sequence

^{j}*s*, the corresponding alleles are identical to alleles of the TC (

_{j}*h*), then

^{c}*P*(

*h*|

^{c}*s*) = 1; otherwise

^{j}*P*(

*h*|

^{c}*s*) = 0 (computation of

^{j}*P*(

*h*|

^{c}*s*) is based on nonmissing and phased markers only). To allow for genotyping errors and/or mutations, these probabilities can be replaced by where ε is a small value (typically, 0.001) and

^{j}*N*

_{d}is the number of differences between alleles of

*h*and alleles of the PC in the sequence

^{c}*j*.

For phantom PC, both genotypes and haplotypes are unknown. Therefore, we replace known or estimated haplotypes by the probability that phantom PC carries allele *i* at marker *m*, taken to be equal to the frequency of allele *i* at marker *m*.

##### Extending to inbreeding:

In case of inbreeding, several paths might lead from the same PC to the TC (see File S1 for examples in cases of inbreeding). To deal with this problem, we first derive IBDp along each path individually, and then sum over all paths belonging to the same PC. The set of parental chromosomes *C*(*c*) is extended to the set of distinct paths *P*(*c*) leading from the PC to the TC. In the absence of inbreeding, *C*(*c*) and *P*(*c*) are identical and PC appear in only one path, but when inbreeding is present some of the PC will be present in several paths. Transition probabilities are computed for all pairs of paths in *P*(*c*) and haplotypes from the PC at the top of the path are used in Equation 4. In addition, paths from PC to TC might share more than one common ancestor in which recombination could occur. Equation 2 considered that paths merge in a single ancestor. To take into account the possibility of several common meioses between paths *p _{i}* and

*p*, Equation 2 can be generalized to estimate transition probabilities between path

_{j}*p*at marker

_{i}*k*and path

*p*at marker

_{j}*k*+ 1,(5)

where *c*(*k*) ←*p*(*k*) indicates that *c* was inherited through path *p* at position *k*, *M _{i}* is the number of meioses in path

*p*that are not on the path

_{j}*p*.

_{i}*M*

_{s}represents the number of common meioses that result in the transmission of the same chromosome (no crossing over) while

*M*

_{o}represents the number of common meioses that result in the transmission of the complementary chromosome (crossing over). The sum of

*M*

_{s}and

*M*

_{0}is equal to the number of common meioses between the two paths and the sum of

*M*,

_{i}*M*

_{s}, and

*M*

_{o}is equal to the number of meioses in path

*p*.

_{j}In absence of inbreeding, there is a unique path from PC to TC (there is a one-to-one relationship between path and PC in that case) and Equations 1 and 2 are identical to Equation 5. Indeed, when *i* = *j* (Equation 1), *M*_{s} = *G _{i}* (

*M*=

_{i}*M*

_{o}= 0) and when

*i*≠

*j*(Equation 2),

*M*=

_{i}*G*−

_{j}*G*

_{a},

*M*

_{o}= 1, and

*M*

_{s}=

*G*

_{a}−1.

##### A hidden Markov Model to model the IBD process along a chromosome:

HMM (*e.g.*, Rabiner 1989) allow efficient computation of sequence probabilities conditionally on haplotypes (Equation 4). This result can be achieved without estimating the probability of all sequences by using the forward–backward algorithm (Baum and Egon 1967). Actually, HMM have already been used to model IBD processes along a chromosome in various applications (*e.g.,* Lander and Green 1987; Guo 1994; Mott*et al.* 2000; Leutenegger*et al.* 2003; Scheet and Stephens 2006; Thompson 2008). Rabiner (1989) described HMM as unobserved Markov chains (hidden Markov chains) and observed sequences of “emitted” symbols. The symbol observed at position *k* of the sequence depends only on the state of the hidden Markov chain at the same position. The Markov chain is fully characterized by a set of states and corresponding sets of initial probabilities (to be in any of the states) and transition probabilities (between any ordered pair of states). As mentioned earlier, our description of the transmission of pieces of chromosomes from a set of PC to a TC closely resembles this definition of Markov chain: the *N*_{PC} parental chromosomes could be used as states. Note also that PC are not directly observed but “emitted” haplotypes can be obtained by genotyping and haplotype reconstruction and that the alleles of the haplotypes depend only on the PC (hidden state) inherited at the corresponding position.

In this study, we thus propose use of a HMM to identify the origin of a chromosome segment among the genotyped PC in the pedigree (see File S2 for more details on the algorithm). The initial states and transition probabilities of our HMM are defined by and Equations 1 and 5, respectively. The emission probability at marker *k* is equal to 1.00 − ε (ε) if the allele in TC is equal (different) to the allele in the PC for that marker, where ε is the probability of error (*e.g.*, genotyping errors, mutations). When the PC is a “phantom” PC, the emission probabilities are equal to the frequency of the observed marker allele.

The model implemented in this study relies on known haplotypes for TC and genotyped PC eventually obtained by other programs (*e.g.*, Druet and Georges 2010). We discuss this assumption below. Paternal and maternal chromosomes are modeled independently.

#### Simulation study:

##### Data sets:

The method was tested with simulations based on haplotypes from real data sets from dairy cattle, maize, and *Arabidopsis thaliana*. For each species, genotypes from chromosome 1 were used.

The dairy cattle set consisted of 4732 animals genotyped on a custom-made 60K Illumina panel described in Charlier*et al.* (2008). Data were provided by CRV (The Netherlands) and are available upon request. The 2691 BTA1 SNP spanned approximately 160 Mb. The pedigree file contained 13,163 individuals. A subset of 3648 target chromosomes was selected to study the estimation of IBDp. These TC correspond to individuals without any genotyped descendent and with male ancestors carrying PC genotyped over three to eight generations (and no genotyped female ancestor). One generation of genotyped male ancestor corresponds to a genotyped sire and an ungenotyped dam. Two generations of genotyped male ancestors carrying PC corresponds to genotyped sire and maternal grandsire, while the dam and maternal grandam are ungenotyped. For each additional generation of genotyped male ancestors carrying PC, the sire of the (ungenotyped) mate of the oldest genotyped sire is added. For instance, in the first example in File S1, male ancestors carrying PC (individuals 2, 6, 9, and 10) are genotyped for four generations, whereas status of individual 4 is irrelevant because individual 2 is genotyped. A subset of 1000 individuals represented the PC: the 516 individuals with at least one genotyped descendent and 484 individuals chosen at random are to be used as phantom PC in the simulation. Haplotypes from all individuals were obtained using Beagle (Browning and Browning 2007) and DAGPHASE (Druet and Georges 2010).

The genotypes from 19 “founder” accessions of the MAGIC in *A. thaliana* (Kover*et al.* 2009) were downloaded from http://gscan.well.ox.ac.uk/arabidopsis. We simulated the MAGIC lines as described in Scarcelli*et al.* (2007) by intermating 19 founders lines for five generations producing 342 F5 outcrossed families. More precisely, in the first generation, a complete diallele cross (the 19 × 18 = 342 possible F_{1} crosses) was simulated. The generations F_{1}–F_{4} were then randomly intercrossed. In each generation, two plants from each family were selected to be paternal and maternal parents. The 342 final lines were obtained by selfing F_{5} families for six generations. Genotypes from 275 markers on the first chromosome (30 Mb) were used.

Data from the NAM design in maize (Yu*et al.* 2008; Mcmullen*et al.* 2009) were used to simulate the last design. In the NAM design, 25 founders are crossed with the B73 line to produce recombinant inbred lines (RILs). Since deducing the origin is relatively easy in that case (based on the SNPs, which are polymorphic across the two founder lines), we changed the design to a complete diallele cross among the 26 founders and further crossed the F1 for three additional generations (as described for the MAGIC data set). Genotypes for the 175 SNP on chromosome 1 (∼200 Mb) were downloaded from http://www.panzea.org/lit/data_sets.html#NAM_map.

For the three populations, data were simulated by using real pedigree for dairy cattle and described pedigree for plants and by transmitting the haplotypes from genotyped ancestors to offspring using Mendelian segregation rules and recombination probabilities (assuming 1 cM = 1 Mb). In dairy cattle data, phantom PC haplotypes were selected randomly from the 484 individuals mentioned above. Furthermore, with this data set, a subset of 164 markers (the SNP with the highest MAF was selected every megabase) was used to test the estimation of IBDp at a lower marker density.

In dairy cattle, haplotypes of really genotyped animals were assumed known while in plants, all haplotypes of founder and last-generation individuals were assumed known. Pedigrees were also considered as known. To test the impact of missing genotypes on the estimation of IBDp, 1 or 5% of marker alleles were randomly erased in the dairy cattle data set. Similarly, we randomly generated 0.01, 0.1, and 1% genotyping errors in the dairy cattle data set to study the effect of genotyping errors (the error rate ε used by the model was equal to 0.001 in all cases) on estimation of IBDp. Finally, estimated haplotypes [using Beagle (Browning and Browning 2007) and DAGPHASE (Druet and Georges 2010)] were also used in the dairy cattle data set to compare estimated IBDp obtained with known and inferred haplotypes. In this situation where haplotypes are inferred, when the paternal or maternal origin of a TIPC was not known because the corresponding individual had no genotyped parent, we identified the TIPC as the chromosome that transmitted the marker alleles to the TC. All studies were repeated 100 times.

##### Estimation of accuracy of IBDp:

The quality of IBDp estimation can be described by the distribution, over all TC and markers, of the estimated IBDp between the TC and the truly inherited parental chromosome (TIPC). This distribution indicates whether the model correctly predicts the origin of the chromosome. Ideally, the IBDp should be one for all positions.

For each haplotype and each marker position, the model estimates N_{PC} IBDp (one for each PC in the set). IBDp with each PC were assigned to percentile classes (100 classes: 0–0.01, 0.01–0.02, … , 0.99–1.00). When the corresponding PC was (not) the TIPC, one (in)correct estimation was added in the class. Then, the percentage of correct estimations was computed within each class. It should be noted that the expected proportion of TIPC in percentile *q* is *q*, meaning that a plot of the observed proportion of TIPC *vs.* the estimated IBDp should be a straight diagonal line (see Figure 3). For the dairy cattle data, results were given separately for genotyped TIPC and phantom TIPC.

#### Application: marker imputation with low density chip in dairy cattle:.

To illustrate a potential application of the method, we mimicked imputation of markers from a dense SNP panel with genotypes from a low-density SNP chip. Using the dairy cattle data described above, we conserved all the genotypes from the 1000 animals representing the PC in the simulation study, while for the remaining 3732 individuals, we conserved only 164 markers per animal (the same as in the simulation study).

Haplotypes of reference and target individuals were reconstructed using Beagle (Browning and Browning 2007) and DAGPHASE (Druet and Georges 2010) as described in Zhang and Druet (2010).

The method described in this study was then used to estimate IBDp between TC and PC. Missing alleles were then predicted by combining IBDp and alleles observed in PC aswhere is the probability that haplotype *x* at marker *k* carries allele *m* (1 if yes, 0 if no, when haplotype is known) and is the estimated IBDp between PC *x* and TC *c* at marker *k*. For ungenotyped ancestors (phantom PC), the frequency of the marker allele was used instead of the observed allele. Allelic probabilities of both chromosomes from an individual were then combined to obtain genotype probabilities. These probabilities were used to compute estimated number of allele “1” per genotype (ranging from 0 to 2). The number of errors per genotype was equal to the absolute value of the difference between the real and estimated numbers of allele 1 (the proportion is obtained by dividing the number of errors by the number of imputed alleles). The efficiency of prediction was then obtained by averaging this statistic over all imputed genotypes and markers. This value indicates whether the model correctly predicts marker alleles.

## RESULTS

#### Estimation of IBDp in the three simulated designs:

Distributions of IBDp for TIPC with the four simulated data sets are presented in Figure 1. The curves indicate that a large majority of the estimated IBDp are above 0.99 and very few at low values for all simulated designs, stressing the ability of the method to identify with high probability the TIPC. The mean IBDp for TIPC was equal to 0.9798 and 0.9415 in the *A. thaliana* and maize designs, respectively. For the *A. thaliana* design, IBDp of TIPC was higher than 0.99 and 0.90 for 91.3 and 95.9% of the situations, respectively. For the maize design, with lower density but closer founders (in terms of generations), these values dropped respectively to 74.0 and 88.1%. For the dairy cattle design, mean IBDp was equal to 0.9800 for genotyped TIPC and only to 0.5511 for phantom TIPC resulting in an overall mean IBDp of 0.9567 for TIPC (5.4% of the genome of TC was inherited from phantom PC). With 1% (5%) missing markers, IBDp were almost identical to the ones obtained with the full markers set: 0.9565 (0.9560) for overall mean IBDp for TIPC and to 0.9799 (0.9796) when considering only genotyped TIPC. With genotypes errors, overall mean IBDp for TIPC and IBDp for genotyped TIPC slightly decreased to 0.9551 and 0.9783 (0.01% errors), 0.9442 and 0.9667 (0.1% errors), and 0.9433 and 0.9657 (1% errors), respectively. Values obtained with estimated haplotypes instead of known haplotypes were in between results obtained in missing genotypes and genotyping errors scenarios: 0.9531 for overall mean IBDp for TIPC and 0.9762 IBDp for genotyped TIPC. Returning to the situation with no missing markers and no genotyping errors, but decreasing the number of markers from 2691 to 164, the average IBDp dropped to 0.9456 (genotyped TIPC), 0.6294 (phantom TIPC), and 0.9418 (all TIPC) (Figure 1). Of genotyped TIPC, 88.1% (94.8%) and 78.6% (90.1%) had IBDp >0.99 (0.90) with the high (low)-density maps. For phantom TIPC, estimated IBDp were low more frequently, clearly indicating a lack of power to detect phantom TIPC. For these TIPC, emission probabilities are equal to allele frequencies while for genotyped TIPC, emission probabilities are equal to 0.001 or 0.999 according to the allele actually observed on the PC. By chance, emission probability associated to a genotyped PC can be higher than the product of allele frequencies, leading to an incorrect assignment. In addition, PC closer (in terms of the number of generations) to TC will have higher transition probabilities than phantom PC, often separated from TC by more generations.

When more genotyped PC are present in the set of PC, the probability of obtaining by chance one PC with emission probability higher than the product of allele frequencies increases. This is illustrated in Table 1 where estimation of IBDp is summarized for individuals with an increasing number of genotyped PC. Indeed, with more genotyped PC, mean IBDp associated to phantom TIPC decreases. In addition, the table indicates that proportion from the genome inherited from genotyped PC increases as expected with a higher number of genotyped PC but that mean IBDp estimated for genotyped TIPC slightly decreases. This is explained by the fact that additional genotyped TIPC are separated from the TC by more generations, resulting in smaller transmitted segments (the expected size of the segment is 1/(1 + *G*) M, where *G* is the number of generations between the TC and PC) with more transitions and fewer markers per segment to identify the origin. However, since the proportion of genotyped TIPC increases, the overall mean IBDp increases too (with the high-density map). With the lower-density map, the overall mean IBDp of TIPC remains relatively constant because IBDp for phantom TIPC have a stronger decrease. The most favorable configuration for IBDp estimation occurs when a sire or a dam is genotyped because 100% of the two TIPC are genotyped and transmitted fragments are large. In these cases, the mean IBDp of TIPC is equal to 0.9949 and 0.9851 with the high- and low-density maps, respectively. For one animal, the mean IBDp of TIPC is the mean of the IBDp estimated for the paternal and maternal chromosomes, resulting in 0.9443 (0.9395) to 0.9656 (0.9432) mean IBDp when male ancestors carrying PC are genotyped from three to eight generations (see material and methods for more details) for the high (low)-density map (see Table 1).

Results of the bovine data set clearly indicate that estimation of IBDp is improved when transmitted chromosome segments are larger (or with less crossing over along the chromosome). Figure 2 shows that mean estimated IBDp are low around crossing over and increases with distance from crossing over. When the exact position of a crossing over cannot be identified, the method estimates intermediate IBDp for the two concerned PC. With increased marker density, the influence of the crossing over is reduced because more markers are available to precisely localize crossing over. These events greatly contribute to the proportion of intermediate (or low) estimated IBDp for TIPC.

Figure 3 shows that percentage of TIPC within each class of estimated IBDp linearly increases from 0 to 1 with IBDp with both plant designs. For each class, the proportion of PC corresponding to the TIPC is equal to the estimated IBDp, indicating that these probabilities are well calibrated. For the dairy cattle data, this is no longer true and we observe some high IBDp associated to noninherited PC. However, when considering only portions of chromosomes inherited from genotyped PC, the concordance between estimated IBDp and proportion of TIPC is correct while for regions inherited from phantom PC only, estimated IBDp are poorly calibrated.

#### Computational requirements:

Computation times and memory requirements were measured on a computer with Intel Xeon “Harpertown” L5420 at 2.50 GHz. The requirements are a function of number of TC (1200 for the maize design, 684 for the *A. thaliana* design, and 1824 for the dairy cattle design), the number of different paths to PC per TC (8 for the maize design set, 1024 for the *A. thaliana* design, and maximum 15 for the dairy cattle design), and the number of markers (175 for the maize design, 275 for the *A. thaliana* design, and 2691 for the dairy cattle design). Computation times to estimate IBDp at all marker positions and for all TC were equal to 41 sec, 7 min 53 sec, and 7 hr 13 min 22 sec, while memory requirements were equal to 129 Mb, 398 Mb, and 140 Mb for the maize, dairy cattle, and *A. thaliana* designs, respectively.

#### Application: marker imputation with low density chip in dairy cattle:

Table S1 summarizes allelic imputation error rates obtained in dairy cattle for TC genotyped with the low-density map. The method was used to predict markers inherited from PC genotyped on the higher-density marker panel. As for the estimation of IBDp probabilities, the efficiency of the method increased when higher proportions of PC were genotyped. When the expected proportion of the genome inherited from genotyped ancestors increased from <0.25 to 1, the error rate decreases from 0.2647 to 0.0056. The majority (73.4%) of the individuals with TC in our data sets had more than 87.5% of their genome inherited from genotyped ancestors (corresponding to three genotyped male ancestors carrying PC or more) and 13.6% had both their parents genotyped. The error rates were below 0.05 for animals with at least three generations of genotyped male ancestors carrying PC and the overall mean imputation error rate per animal was equal to 0.0375: the proportion of correctly imputed alleles (0.9625) was above the mean estimated IBDp for TIPC (0.9418) despite the fact that haplotypes were estimated and that genotyping errors might be present in the data set. Some genotypes might be correctly imputed by chance: although the correct PC was not identified, the incorrect PC with high estimated IBDp carried the same marker allele. Therefore, when the method incorrectly identifies the inherited parental chromosome because another PC is very similar (identity-by-state for a stretch of markers), the error has limited consequence on marker imputation or even QTL detection and genomic selection because the incorrect PC most likely carries the same alleles as the TIPC. Similar results were found by Zhang and Druet (2010) using the same method but for the whole genome.

## DISCUSSION

#### Accuracy of IBD probabilities estimation:

We herein present a method aimed at estimating IBDp between a TC and a set of PC. Since the method relies on concordance of haplotypes from TC and PC, it will be more efficient when more markers per inherited segment are available. The size of these segments is proportional to the inverse of the number of generations from the PC to the TC. Therefore, the method will be more efficient for closer PC and for higher marker densities. On a simulated data set based on real founder haplotypes and pedigree structures, the method assigned high IBDp to the TIPC in a large majority of the cases and low IBDp to the other PC, even for marker densities as low as 1 marker/Mb and for founders separated from the TC by up to 10 generations. However, for some regions, lower IBDp were assigned to the TIPC. Such regions, where both TIPC and incorrect PC have intermediate IBDp, correspond to portions for which the origin is more difficult to determine: close to a crossing over (as illustrated by Figure 2) or for which several ancestors have very similar haplotypes. With lower-density maps, these problems occur more often because it is more likely that two PC have similar haplotypes and because less informative markers are available to precisely determine where crossing over take place. Therefore, the estimated IBDp of TIPC decreased with lower-density marker maps. Figure 3 indicates that estimated IBDp are well calibrated and are equal to the probability that a PC is a TIPC. Therefore, it is possible to identify regions in which the TIPC cannot be determined precisely.

Results also showed that the estimation of IBDp was less efficient for regions inherited from a phantom PC. The method should therefore ideally be applied to designs with limited number of phantom PC, especially when these are close ancestors of the TC. Phantom PC are described on the basis of allele frequencies. A model describing phantom PC with haplotypes (such as HMM of Scheet and Stephens 2006 or Browning and Browning 2007) might turn out to be more efficient although more computationally intensive.

The impact of PC incorrect assignments might be limited for marker allele prediction. Indeed, in regions where the method cannot precisely determine the haplotype origin, the method proposes several origins that are very similar to each other and to the TC. Due to this similarity, the method is not able to determine the correct origin but thanks to this similarity, the correct marker allele is predicted anyway. Therefore, the method proved to be efficient for prediction of marker alleles, even with lower map densities.

#### Consequences of approximations of the model:

In this study, we assumed that haplotypes are known. In inbred species (such as some plants or mice), haplotypes are indeed known. Furthermore, Druet and Georges (2010) showed that with the current marker densities, haplotype reconstruction can be achieved very efficiently in livestock species, especially for important parents with many offspring, like tested sires in artificial insemination contexts. For genotyped (eventually at lower density) target individuals, haplotypes can be partially reconstructed on the basis of homozygous markers and familial information such as a genotyped parent.

When using inferred haplotypes with the dairy cattle design, IBDp of TIPC remained high, showing that the method was still efficient when haplotyping can be performed as accurately as in dairy cattle. In cases where haplotyping reconstruction generates many errors, it is likely that estimation of IBDp will be less accurate, particularly when errors occur in TC. Since the efficiency of the method was less affected by missing genotypes than by genotyping errors, we advise reconstructing portions of haplotypes that can be inferred very accurately (for instance, thanks to homozygous markers or mendelian segregation rules) and consider other markers as missing.

Results showed that the method remained efficient with missing genotypes and genotyping errors under normal conditions. Indeed, we tested call rates as low as 95% whereas, for instance, with genotyping arrays used in cattle, call rates are above 99% on average. Genotyping errors rates were set as high as 1% while use of genotyping arrays generally results in <0.1% error rates.

The effect of missing genotypes is to reduce marker density since we do not have information for some markers. Reduction of marker density by a few percentage has a small impact on estimation of IBDp.

Results, on simulated and real data, proved that the Markov approximation of the IBD process along the chromosome between a TC and its PC was efficient in our designs.

#### Comparison to other methods:

Since our method models the entire chromosome and allows direct estimation of IBDp between haplotypes separated by more than one generation, it should achieve better results than traditional methods based on linkage equilibrium, which rely mostly on genotypes of close relatives and flanking markers such as those of Wang*et al.* (1995) or Pong-Wong*et al.* (2001). However, our method is less suited to very sparse marker maps. For moderate- and high-marker densities our method has reasonable computational costs, which might not be the case for multipoint linkage methods using MCMC techniques (Heath 1997).

Other multipoint linkage methods such as those proposed by Lander and Green (1987) or Abecasis*et al.* (2002) consider all the pedigree relationships. These methods therefore make better use of all the available information. However, Merlin (Abecasis*et al.* 2002) was not computationally efficient on the data set used in this study (Merlin could not cope with our large pedigree). In comparison, our method uses only information from ancestors, ignores other relatives, and models both chromosomes separately. Consequently, it works in subpedigrees including only the TC and their ancestors, which makes it computationally efficient. Our results showed that for designs in which haplotypes of important parents can be reconstructed precisely (on the basis of LD or information from progeny), the IBDp between TC and PC can be efficiently computed using our method and can be useful for transferring information (genotype, QTL, etc.) from important founders to other individuals in the population.

The IBDp estimated in our method are distinct from IBDp estimated through methods based on LD, such as the method developed by Meuwissen and Goddard (2001). Indeed, these methods use small haplotypes of dense markers to identify IBD relationships due to distant common ancestors. Our method identifies IBD relationships within a pedigree and only a few generations separate the haplotype from its ancestors. Since, for closer relationships, animals share longer chromosome segments, our method can work with lower-density maps than methods based on LD can.

#### Applications:

Our method is particularly suited to describing the transmission of information from recent ancestors to current individuals, even with low-density maps (*e.g.*, 1 SNP/Mb). Burdick*et al.* (2006) already used this technique for marker imputation in human pedigrees with Merlin (Abecasis*et al.* 2002). Such applications are becoming more common in animal and plant breeding or in experimental designs. For instance, MAGIC designs (*e.g.*, Threadgill*et al.* 2002; Kover*et al.* 2009) rely on genotyping of founders and last generations and are now developed in several organisms for the genetic dissection of complex traits. In addition, recent strategies aim to genotype founders with high-density marker panels and use this genomic information to infer genotypic information on other individuals genotyped on lower-density panels. This strategy has, for example, been implemented in the NAM in maize (Yu*et al.* 2008; Mcmullen*et al.* 2009). In livestock species, genomic selection (Meuwissen*et al.* 2001) is currently applied by genotyping reference animals on medium-density SNP chips (∼60K SNPs in dairy cattle). Effects of the SNPs are then estimated and can be used to predict the genetic value of other individuals, which must be genotyped for the same markers. Habier*et al.* (2009) proposed genotyping some animals with lower-density marker panels and using statistical methods to transfer information from reference individuals to these animals. In dairy cattle, most breeding companies are studying the possibility of implementing such a strategy to reduce genotyping costs and extend genomic selection to a larger fraction of the population. With the advent of higher-marker density panels (for example, a bovine SNP chip with more than 750,000 SNPs has been released in 2010) and of high-throughput sequencing data, such strategies will become even more important. Our method would be fitted to such new paradigms.

The program CHROMIBD implementing the current method is available from the authors upon request.

## Conclusions

We present an efficient method for estimation of IBDp between a target chromosome and its genotyped ancestors. The method assigned high IBDp to the truly inherited parental chromosome and was still efficient when marker densities were ∼1 Mb/SNP or when founders were separated from target chromosomes by up to 10 generations. The proposed method proved well adapted for situations that will become more common in the near future, with founders genotyped on high-density maps and other individuals genotyped for lower-density marker arrays, as proposed in livestock species and in experimental designs such as the nested association mapping design in maize.

## Acknowledgments

Tom Druet is Research Associate from the Fonds de la Recherche Scientifique (FNRS). The authors thank CRV (http://www.crv4all.com), the Wellcome-Trust Centre for Human Genetics, and PANZEA for access to their data. We acknowledge Laurence Moreau and Mathieu Gautier for their comments and suggestions on this work. This work was funded by grants of the Service Public de Wallonie and from the Communauté Française de Belgique. We acknowledge University of Liège (SEGI and GIGA bioinformatics platform) for the use of NIC3 and GIGA-grid supercomputers.

## Footnotes

Communicating editor: H. Zhao

- Received February 13, 2011.
- Accepted March 3, 2011.

- Copyright © 2011 by the Genetics Society of America