## Abstract

Maximum likelihood methods for the estimation of linkage disequilibrium between biallelic DNA-markers in half-sib families (half-sib method) are developed for single and multifamily situations. Monte Carlo computer simulations were carried out for a variety of scenarios regarding sire genotypes, linkage disequilibrium, recombination fraction, family size, and number of families. A double heterozygote sire was simulated with recombination fraction of 0.00, linkage disequilibrium among dams of δ = 0.10, and alleles at both markers segregating at intermediate frequencies for a family size of 500. The average estimates of δ were 0.17, 0.25, and 0.10 for Excoffier and Slatkin (1995), maternal informative haplotypes, and the half-sib method, respectively. A multifamily EM algorithm was tested at intermediate frequencies by computer simulation. The range of the absolute difference between estimated and simulated δ was between 0.000 and 0.008. A cattle half-sib family was genotyped with the Illumina 50K BeadChip. There were 314,730 SNP pairs for which the sire was a homo-heterozygote with average estimates of *r*^{2} of 0.115, 0.067, and 0.111 for half-sib, Excoffier and Slatkin (1995), and maternal informative haplotypes methods, respectively. There were 208,872 SNP pairs for which the sire was double heterozygote with average estimates of *r*^{2} across the genome of 0.100, 0.267, and 0.925 for half-sib, Excoffier and Slatkin (1995), and maternal informative haplotypes methods, respectively. Genome analyses for all possible sire genotypes with 829,042 tests showed that ignoring half-sib family structure leads to upward biased estimates of linkage disequilibrium. Published inferences on population structure and evolution of cattle should be revisited after accommodating existing half-sib family structure in the estimation of linkage disequilibrium.

TRADITIONAL methods for gene mapping are based on linkage, which requires a family structure because loci are mapped by tracing inheritance of marker alleles in progeny from at least one ancestor. The DNA markers of choice were microsatellites because they were abundant and very informative. Linkage maps of microsatellites were developed for farm animal species with a half-sib structure such as cattle (Da and Lewin, 1995; Ma *et al.* 1996; Kappes *et al.* 1997; Barendse *et al.* 1997; Våge *et al.* 2000). In recent years, a revolution has been initiated in human genetics with the large-scale DNA sequencing of the HAP MAP project (2007), which allowed the discovery of vast amounts of single nucleotide polymorphism (SNP). SNP sequences were used in arrays allowing interrogation of the human genome from thousands to over a million SNPs. The biggest interest in humans is the application of this technology for identification of variants that are associated with genetic diseases in the so-called case-control studies.

The development of SNP arrays in human genetics was followed by animal geneticists. There are commercially available arrays for over 50 or 60 thousands SNPs for the cow, sheep, and swine. The statistical treatment of SNP arrays in animal populations is carried out without consideration for the breeding structure currently present in farm animals but lacking in experiments for case-control studies in human populations. Contrary to human populations, animals at the farm are highly related because of the use of artificial insemination (AI) and intensive breeding. For example, dairy bulls with high estimated breeding values might have over a million daughters (http://www.crv4all.com/eng/halloffame/Sunny_Boy_HOFame.pdf). Analysis of linkage disequilibrium (LD) between pairs of SNPs in cattle populations has been carried out either using the expectation maximization (EM) algorithm for unrelated individuals (*e.g.*, Sargolzaei *et al.* 2008) or using the most likely of phased haplotypes (*e.g.*, McKay *et al.* 2007; Qanbari *et al.* 2010). The first method ignores that the contribution of haplotypes from sires to progeny exceeds its true counts in the population because each offspring receives one haplotype from their sire. The second method ignores that marker informativity might cause a systematic increase or decrease of informative sire haplotype counts (and consequently informative haplotype counts from dams) depending on the genetic distance between markers. Consequently, bias in the estimation of LD using half-sib data might occur.

The objective of this article is to develop maximum likelihood methods for the estimation of linkage disequilibrium between codominant DNA markers in half-sib families. It is shown that severe biased estimation may occur after ignoring half-sib relationships. The methods are tested via Monte Carlo computer simulation. Comparison of alternative methods of estimation of linkage disequilibrium is carried out after genotyping a half-sib family with 36 calves with the Illumina 50K BeadChip.

## Theory and Methods

AI is in widespread use in cattle with the most common situation being a sire having a single progeny from a number of dams. Three situations are possible when estimating second-order linkage disequilibrium (disequilibrium considering two loci) between two DNA markers: (a) the sire is a homozygote at the two loci, (b) the sire is a homozygote at one locus and a heterozygote at the other, and (c) the sire is a heterozygote at the two loci. For the following derivations, assumptions are: (1) recombination fraction is known without error, and (2) the linkage phase (combination of alleles at two loci on the two homologous chromosomes in diploid individuals) in the sire is known. The impact of departures from these assumptions is addressed in the *Discussion*.

### Double homozygote sire

Let the sire have genotype *TTMM* at two SNPs, *T/t*, and *M/m*. Offspring might have genotypes *TTMM*, *TTMm*, *TtMM*, *TtMm* indicating that haplotypes *TM*, *Tm*, *tM*, and *tm* were inherited from dams, respectively. Therefore, the haplotypes in half-sibs are fully informative and linkage disequilibrium can be estimated directly from haplotype counts. Thus, for alleles *T* and *M* at two loci, the disequilibrium can be estimated by substituting haplotype and allelic frequencies into *k*th allele, *k*th and *t*th alleles at the two loci, respectively. In addition to allele frequencies, only one parameter for the linkage disequilibrium, δ, needs to be estimated since *i*th family is derived in *Appendix A*,*L _{i}(*δ

*|nG)*is the maximum likelihood function of the disequilibrium parameter, δ, conditional to the haplotype counts,

*nG*.

The value of the second derivative with respect to the disequilibrium parameter is

### Sire is homozygote at one locus and heterozygote at the other

A full and a reduced model are developed in this section. A full model estimates all unknowns (linkage disequilibrium, and allele frequencies for the marker for which the sire is heterozygote) simultaneously. The reduced model estimate only linkage disequilibrium assuming that allele frequencies are known without error (or estimated in a previous step).

#### Full model for estimating LD in a homo-heterozygote sire:

Let the sire have genotype *TTMm* at two SNPs, *T/t*, and *M/m*. The likelihood equation for the *i*th family is*n _{j,i}* are the genotype counts (

*nG*) from offspring from the

*i*th sire family (

*j*=

*TTMM*,

*TTMm*,

*TTmm*,

*TtMM*,

*TtMm*, and

*Ttmm*), and

*j*th genotype among progeny. These probabilities can be obtained after adding the corresponding frequencies for all possible matings (Table 1):

*N*is the size of the

_{i}*i*th half-sib family. Equations 2 can be solved iteratively after giving a starting value to the haplotype frequencies and by estimating in each iteration

#### Reduced model for estimating LD in a homo-heterozygote sire family:

In a reduced model, allele frequencies are not estimated simultaneously with haplotype frequencies but are assumed to be known. The estimate of linkage disequilibrium is*Appendix B*.

The disequilibrium estimated in the reduced model gives slightly different estimates than the disequilibrium estimated using a full model but has the advantage of faster computation when a large number of SNPs are tested. The approximated sampling variance of the estimates of the disequilibrium parameter for the *i*th family is*Appendix A*.

### Sire is heterozygote at two SNPs

Equations for a full and a reduced model follow. A full model estimates allele and haplotype frequencies simultaneously whereas a reduced model works first estimating allele frequencies and then haplotype frequencies. The full model has better statistical properties but the reduced model has faster computation and, therefore, is practical for large-scale testing of disequilibria among SNPs.

#### Full model for estimating LD in a double-heterozygote sire family:

Let the sire have genotype *TtMm* at two SNPs, *T/t*, and *M/m* and linkage phase (*TM/tm*). As before, *n _{j,i}* are the genotype counts from offspring from the

*i*th sire family (

*j*=

*TTMM*,

*TTMm*,

*TTmm*,

*TtMM*,

*TtMm*,

*Ttmm*,

*ttMM*,

*ttMm*, and

*ttmm*). The recombination fraction is

*c*, which is assumed to be known without error. The likelihood equation for data of the

*i*th half-sib family is

Likelihood Equation 3 can be solved by applying the EM algorithm,*N _{i}* is the size of the

*i*th half-sib family. Using initial values of the haplotype frequencies and iterating over Equation 4 will converge to ML estimates of haplotype frequencies. Linkage disequilibrium is estimated by

If the linkage phase of the sire is *Tm/tM* then the EM equations are

#### Reduced model for estimating LD in a double-heterozygote sire:

A reduced model can be used after assuming that allele frequencies at the two DNA markers are known without error. It makes easier and faster estimation of linkage disequilibrium and its sampling variance. It can be solved by making use of the EM algorithm as described in Equation 4 but using as input parameters estimates of allele frequencies of *M* and *T* (as given by Gomez-Raya 2001):*c* = 0 for the reduced model is a positive root between 0 and 1 of the quadratic: *Appendix B*.

As shown in *Appendix A*, the reduced model provides a simpler approximated sampling variance of the estimates of the disequilibrium parameter for the *i*th family by

### Estimation of LD Across multiple half-sib families

In most instances, genotype information is available for multiple half-sib families (*e.g.*, data from a granddaughter design project). The likelihood equation to estimate LD across half-sib families is*L(δ*, *f _{T}*,

*f*is the likelihood for the

_{M}|nG)*i*th half-sib family conditional to genotype marker information (

*nG*) and

*nf*is the number of families. Note that depending on the sire genotype, allele frequencies for

*T*and

*M*(double homozygote) or

*M*(homo-heterozygote) do not need to be estimated. The EM algorithm can be applied to multiple families by iterating on the four haplotype frequencies:

*TM*if the sire is double homozygote, homo-heterozygote, or double heterozygote, respectively. The frequencies for the other haplotypes are as found in Equations 2 and 4 for homo-heterozygote and double heterozygote sires, respectively. Equation 5 can be solved iteratively after giving a starting value to the haplotype frequencies and by estimating in each iteration

The estimation of the sampling variance for linkage disequilibrium in multiple half-sib families can be carried out by:*Appendix A*.

### Hypothesis testing of LD in multiple half-sib families

Testing if linkage disequilibrium is different from 0 can be carried out by a likelihood-ratio test. For the *i*th half-sib family the likelihood-ratio test is*i*th family under the null hypothesis (δ = 0) and under the alternative hypothesis with

A likelihood-ratio test across families is^{2} with 1 d.f. Here *δ* is estimated across all families by the EM algorithm (Equation 5).

### Bias in estimating LD in half-sibs after ignoring the family structure

In this section, approximate bias for estimating LD in half-sib families using the method of Excoffier and Slatkin (1995) for unrelated individuals and maternal informative haplotypes is derived algebraically. Only sires that are homo-heterozygotes and double heterozygotes might produce progeny in which haplotypes cannot be fully inferred from the genotypes.

#### Sire homo-heterozygote: Method of Excoffier and Slatkin (1995) for unrelated individuals:

Assuming genotype TTMm in the sire, the expected frequency of haplotype TM among half-sib progeny can be approximated by*N _{i}* comes from the contribution of the

*TM*haplotype from the sire and

*N*from the contributions of the dams. The total number of haplotypes in the offspring is

_{i}f_{TM}*2N*. The approximated expected frequencies of alleles

_{i}*T*and

*M*are computed following the same rules:

Consequently, the bias after using this method is approximated by

#### Sire homo-heterozygote: Estimation of LD using informative maternal haplotypes in half-sib families:

Half-sib progeny from heterozygote sires might not be informative. For example, haplotype *TM* inherited from dams will be informative only in progeny with genotypes *TTMM*.

Therefore, the expected frequency of haplotype *TM* among progeny will be estimated by

#### Sire double heterozygote: Method of Excoffier and Slatkin (1995) for unrelated individuals:

Assuming linkage phase *TM/tm* in the sire, the expected frequency of haplotype *TM* among half-sib progeny can be approximated by*N _{i}(1 − c)* and

*N*are the sire and dams contributions of haplotype

_{i}f_{TM}*TM*among the offspring.

Similarly, the expected frequencies of alleles *T* and *M* are approximated by

#### Sire double heterozygote: Estimation of LD using informative maternal haplotypes in half-sib families:

The only informative haplotypes that can be traced up to their mothers are from progeny with genotypes *TTMM*, *TTmm*, *ttMM*, and *ttmm*. It is because markers are biallelic and only homozygote progeny can be used to trace inheritance when the sire is a heterozygote. If allele frequencies in the dam population are known, then determining the haplotype with the highest probability is feasible. Nevertheless, for intermediate allele frequencies the probability of inheriting either allele is 0.5. For the calculations below, only informative progeny is used.

Assuming linkage phase *TM/tm*, the expected frequency of informative *TM* haplotypes among progeny is*T* and *M* are*c* = 0.5, the above expression reduces to *D _{TM}* and the method of informative maternal haplotypes is unbiased.

For 0 < *c* < 0.5, the bias for using only maternal inherited haplotypes is approximated by

### Monte Carlo computer simulation

A Monte Carlo computer simulation was carried out to validate methods for estimating LD proposed in this article as well as to compute power. Three scenarios were simulated corresponding to the three possible situations regarding the genotype of the sire: double homozygote, homo-heterozygote, and double heterozygote. In addition, a multifamily situation was also simulated.

#### Sire double homozygote:

A random generator from the uniform distribution was used to assign progeny with the haplotypes *TM*, *Tm*, *tM*, and *tm* according to their probability (frequency): *f _{TM}*, then the offspring inherited haplotype

*TM*from his dam. If the drawing of the uniform distribution was between

*f*and

_{TM}*f*

_{TM}_{+}

*f*then the offspring inherited haplotype

_{Tm}*Tm*from his dams. Assigning offspring to other haplotypes was done following the same rule.

#### Sire homo-heterozygote:

A random generator from the uniform distribution was used to assign progeny with the genotypes *TTMM*, *TTMm*, *TTmm*, *TtMM*, *TtMm*, and *Ttmm* according to their probability (frequency): *TTMM*. If the drawing of the uniform distribution was between _{+} *TTMm*. Assigning other genotypes to offspring was done following the same rule.

#### Sire double heterozygote:

A random generator from the uniform distribution was used to assign progeny with the genotypes *TTMM*, *TTMm*, *TTmm*, *TtMM*, *TtMm*, *Ttmm*, *ttMM*, *ttMm*, and *ttmm* according to their probability (frequency): *TTMM*. If the drawing of the uniform distribution was between _{+}*TTMm*. Assigning other genotypes to offspring was performed following the same rule.

Subroutines in Fortran 90 were written to estimate linkage disequilibrium with the half-sib methods (HS) described in this article as well as the method of Excoffier and Slatkin (1995) (ES) for unrelated individuals, and by making use of maternal informative haplotypes (MIH). Family sizes of 36 and 500 were used to test the methods in small and large families. Empirical power was computed by sorting within each simulation set according to the likelihood-ratio estimate and finding the percentage of replicates that gave a value higher than the value of the χ^{2} with 1 d.f. at a significance level of 0.01.

#### Multifamily estimation of linkage disequilibrium:

A total of six families with sizes 94, 77, 106, 81, 79, and 100 half-sib progeny resembling the sire Norwegian cattle population were simulated (after pooling selected and culled bulls in Table 1 of Gomez-Raya *et al.* 2002). The allele frequencies were intermediate, recombination fraction was 0, 0. 25, or 0.50, and linkage disequilibrium ranged from 0 to 0.25. The sires were simulated as if they were coming from a population with the same linkage disequilibrium and allele frequencies as used to generate the half-sib progeny. To do so, the two haplotypes at each sire were generated following the same principles as above with probabilities according to the simulated frequencies: _{joint} under the null hypothesis (simulated δ = 0) in the situation for *c* = 0.

### Genome analyses of LD in a beef cattle half-sib family

A half-sib family consisting of 36 calves from commercial beef cattle at the Gund Ranch in Nevada was used to illustrate and to compare alternative methods for estimation of linkage disequilibrium. The first step was to determine paternity of the calves at the ranch. A set of 25 microsatellites (BMS410, BMS499, BMS650, BMS1244, BMS1634, TGLA227, BMS601, BMS1789, BMS2005, ILSTS081, BMS1315, BMS1226, BMS2573, ILSTS058, TGLA126, CSSM66, SPS115, TGLA53, BM1824, BM2113, ETH3R, TGLA122, INRA023, ETH225, ETH10) was used to assign paternity that was carried out using Cervus software. Total DNA from ear notches of calves and sires was purified using the manufacturer’s instructions (Qiagen, CA). The DNA was diluted with AE buffer to 10 ng/μl and stored at −4° prior to genotyping. Primers were diluted to 50 μM and stored at −4°. A primer mix was prepared containing 2 μl of each 50 μM primer set. Each PCR reaction contained a total volume of 15 μl consisting of 1.5 μl of each primer mix, 2 μl water, 4 μl DNA, and 7.5 μl PCR multiplex mix (Qiagen). Gradients were performed to determine the optimal temperature for primer annealing. Amplification was carried out with a TC-512 Thermal Cycler (Techne). The initial denaturation step was performed at 95° for 15 min, followed by 35 cycles of 30 sec at 94°, 1 min and 30 sec at the optimum annealing temperature, and 1 min at 72° with a final extension of 30 min at 60°. Subsequently, 1 μl of PCR product was added to 199 μl water to make a 1:200 dilution. One microliter of this dilution was added to 10 μl of a formamide solution containing 1 ml formamide and 5 μl of ladder and denatured for 5 min at 95°. Genotyping was performed with the Applied Biosystems (ABI) Prism 3730 DNA analyzer.

The Illumina bovine 50K BeadChip was used with bull 302 and his 36 calves to compare methods for estimating LD in half-sib families. The genotyping was carried out at the Core Lab of the University of Colorado, Denver. Only SNPs with a call rate >0.80 in at least 24 calves and MAF of 0.10 or more were used. The data were also filtered for SNPs that were not consistent for inheritance from sire to progeny. If a SNP was not consistent for one progeny then the SNP information was discarded for the entire family. Only pairs of SNPs within the same chromosome and within a distance of 50 Mb or less were used for estimating linkage disequilibrium. For the double heterozygote sire, recombination fraction and linkage phase was estimated using the methods proposed by Gomez-Raya (2001). Only SNPs with a recombination fraction of 0.30 or less were used for SNPs in which the sire was double heterozygote. Estimation of disequilibrium was performed using the half-sib method as well as the method of Excoffier and Slatkin (1995) and by making use of maternal informative haplotypes. For comparison of alternative estimation methods of linkage disequilibrium the statistic *r*^{2} *=* δ^{2}/{(*f _{T} (1 − f_{T})f_{M}* (1

*− f*)} was used. This statistic is widely used and ranges from 0 to 1, which facilitates comparison among methods. The absolute value of the difference between estimates of either ES or MIH and estimates HS were also used to evaluate discrepancies between methods.

_{M}## Results

Table 3 shows simulation results for estimating linkage disequilibrium in a half-sib family from a homo-heterozygote sire with 36 or 500 progeny and dam allele frequencies of 0.5 at both SNPs. For these allele frequencies, the maximum possible linkage disequilibrium, δ, is 0.25. The method proposed in Equation 3 of this article (HS) yields identical estimates to the true (simulated) values of linkage disequilibrium with large family size (500). There was very little bias when the family size is small (36). For the examples in Table 3, the bias using the HS method is <3% . On the other hand, estimates are biased when using the method of Excoffier and Slatkin (1995), which becomes just half of the true disequilibrium at δ = 0.25. The approximation for predicting the expected value for the estimates of linkage disequilibrium using the method of Excoffier and Slatkin (1995) agreed well with the simulation results but tends to underestimate it. On the other hand, the use of maternal informative haplotypes is unbiased, as shown in Table 3 and as proven in the corresponding section of this article.

Table 4 shows simulation results for estimating linkage disequilibrium in a half-sib family from a double heterozygote sire for varying recombination fractions and linkage disequilibrium parameters. The allele frequencies at the two loci were 0.5. Each simulation set was analyzed with the EM algorithm developed in this article (HS) as well as the method of ES and by using MIH from dams. The HS method is asymptotically unbiased with average estimates of disequilibria very close to the simulated (true) parameters for large family sizes (500). For small family sizes (36) the estimates of linkage disequilibrium are slightly downward biased. The method of Excoffier and Slatkin (1995) is severely biased upward at low recombination fractions but becomes biased downward at high recombination fractions. The use of only maternal informative haplotypes to estimate disequilibrium is upward biased at low recombination fractions but becomes unbiased when the markers are unlinked. The approximated expected disequilibrium was very close to what was observed in the simulation for both the method of Excoffier and Slatkin (1995) and when using informative maternal haplotypes from dams. Figures 1 and 2 show expected bias in estimating disequilibrium in a half-sib family from a double heterozygote sire for two scenarios regarding allele frequencies at the DNA markers: *f _{T} =* 0.5,

*f*0.5 and

_{M}=*f*0.4,

_{T}=*f*0.1. For a low recombination fraction, bias is negative but becomes positive as recombination fraction increases. The effect is more pronounced for loci at intermediate allele frequencies than for loci with allele frequencies closer to fixation.

_{M}=In many instances, interest is on the amount of progeny needed for detecting linkage disequilibrium. Empirical power for half-sib families in which the sire was a double homozygote, homo-heterozygote, or double heterozygote is shown in Table 5. The standard deviations among replicates for the same simulation sets are given in Table 6. The simulation results are for varying family sizes and true (simulated) linkage disequilibrium parameters. Disequilibrium (δ) of 0.10 was detected with groups of 100 offspring in most situations. The most powerful situation is when the sire is a homozygote at two loci and all haplotypes are informative. Power in a double heterozygote sire family reduces with genetic distance but it is nearly as powerful as the double homozygote for fully linked loci. Power in a homo-heterozygote sire family is always lower than power in a double homozygote sire family. Standard deviation among replicates follows the same trend as power (Table 6). The double homozygote and the double heterozygote (at *c* = 0) sire families had the lowest variation (Table 6). Variation among replicates increases with increasing recombination fraction in double heterozygote families. The estimates of disequilibrium for homo-heterozygote had more variation than double homozygote families. There was good agreement between the observed standard deviation among replicates and the average of the estimates of sampling standard deviations of δ obtained in each replicate.

The simulation results for estimating LD using multiple sire families are given in Table 7. There was good agreement between simulated and estimated linkage disequilibrium across the range of simulated disequilibrium parameter and recombination fraction. The range of the absolute difference between estimated and simulated δ was between 0.000 and 0.008. A Q-Q plot of the quantiles of the observed distribution of LRT_{joint} under the null hypothesis against quantiles from a γ-distribution with shape = 0.5 and scale = 2 is depicted in Figure 3. This gamma distribution is a χ^{2}-distribution with 1 d.f. The cumulative distribution of LRT_{joint} showed larger variation than a χ^{2}-distribution with 1 d.f.

Linkage disequilibrium was also estimated in a cattle half-sib family using the Illumina 50K BeadChip. There were 0.00189% inconsistencies between genotypes of sire and calves. Table 8 shows the overall estimates of *r*^{2} using HS, ES, and MIH for those situations in which the sire was a homo-heterozygote. There were 314,730 SNP pairs for the entire autosomal genome with average estimates of *r*^{2} of 0.115, 0.067, and 0.111 for HS, ES, and MIH methods. The ES method is downward biased since estimates by this method were around half of their value using the half-sib method. The maternal informative haplotype estimates were slightly lower than those obtained by the half-sib method, which might be due to bias because of reduced family size (noninformative offspring is neglected from these analyses).

Table 9 shows overall estimates of *r*^{2} using HS, ES, and MIH for SNPs for which the sire was a double heterozygote. There were 208,872 SNP pairs. The results using real data support earlier findings showing that the methods of Excoffier and Slatkin (1995) and maternal informative haplotypes were upward biased. The average estimates of *r ^{2}* across the genome were of 0.100, 0.267, and 0.925 for HS, ES, and MIH methods.

Figure 4 shows average estimates of *r*^{2} for the three methods of estimation across the entire genome when the distance between the two SNPs is between 10 and 50 Mb. A total of 829,042 SNPs pairs were tested and the estimates are a pool of all three possible situations regarding sire genotypes: double homozygote, homo-heterozygote, and double heterozygote. This figure shows again that using either ES or MIH methods give estimates upward biased of linkage disequilibria.

## Discussion

Early studies estimating linkage disequilibrium in populations with a half-sib structure were carried out using microsatellites (Farnir *et al.* 2000; Odani *et al.* 2006). These studies used maternal alleles and estimated the most likely haplotype inherited in sons from dams. Although the methods derived in this article are for biallelic loci such as SNPs, a multiallelic marker can always be reduced to a biallelic one after pooling alleles into two groups. As shown by Gomez-Raya (2001) for a double heterozygote sire, the amount of informative progeny in a half-sib family depends on the recombination fraction between the two markers. Thus, the frequency of informative progeny is *c*([1 − *f _{t}*)(1 −

*f*) + (1 −

_{M}*f*)(1 −

_{T}*f*)], and (1 −

_{m}*c*)[(1 −

*f*)(1 −

_{t}*f*) + (1 −

_{m}*f*)(1 −

_{T}*f*)] for recombinants and nonrecombinants, respectively. Genotypes among offspring that are informative for tracing inheritance from sires are also informative for tracing alleles inherited from dams (with unknown genotypes). For example, for allele frequencies

_{M}*f*0.1, the frequency of informative recombinant and nonrecombinant progeny is 0.18

_{T}= f_{M}=*c*and 0.82(1 −

*c*), which means that the closer the markers are, the lower the proportion of informative recombinants among progeny. Therefore, bias in estimating linkage disequilibrium occurs because of the altered proportion of haplotypes that are informative at varying genetic distances. Nevertheless, Farnir

*et al.*(2000) used not just the informative haplotypes but the most likely haplotype. Sires carrying alleles at low frequency would allow identification of haplotypes with a higher probability, which will reduce the magnitude of the bias as shown in this article. In another study, also investigating linkage disequilibrium with microsatellites in cattle, Tenesa

*et al.*(2003) made use of the method of Excoffier and Slatkin (1995) to estimate linkage disequilibrium. As shown in this article, estimates of LD using that method for unrelated individuals might lead to severe biased estimation when applied in animals with a half-sib structure.

Improvement in the sequencing methods in the last years allowed for the discovery of vast amounts of SNPs in the human and animal genomes (*e.g.*, International Hap Map Consortium 2007). A following step has been the construction of LD maps for the human (Maniatis 2002) and animal genomes (*e.g.*, Khatkar *et al.* 2006). LD maps are based on: (a) estimation of a linkage disequilibrium parameter, ρ, which has the same maximum absolute value as the statistics *D′* of Lewontin (1964), and (b) use of a model of decay of disequilibrium leading to equations of Malecot’s model for isolation by distance (Malecot 1964). Thus, the value of *D′* is δ*/D*_{Max} with *D*_{Max} = min{ *f _{T}* (1 −

*f*),

_{M}*f*(1 −

_{M}*f*)}. Construction of LD maps are carried out estimating ρ between adjacent SNPs and by using composite maximum likelihood for all pairs of adjacent SNPs. Inferences of the decay of disequilibrium over time are made by

_{T}*L*is a parameter that reflects the residual association at a long distance (

*d*),

*M*is the association at zero distance, and ε is the exponential decline in LD due to recombination over generations. In human genetics, estimation of ρ is performed using unrelated individuals and the Excoffier and Slatkin (1995) algorithm. The construction of LD maps in species with a half-sib family structure like cattle would require methods for the estimation of disequilibrium (δ) as proposed in this article. If δ is biased then ρ should also be biased. If the bias depends upon the distance between the adjacent SNPs as shown here then inferences on population structure and the evolution of the cattle population may not be fully correct. Khatkar

*et al.*(2006) carried out a LD map of bovine chromosome 6 using bulls from the Australian Holstein–Friesian. They estimated average coancestry by 0.012 using available pedigree information. Assuming that pedigrees were complete, coancestry was rather small but still might lead to bias in the estimation of the disequilibria currently present in the Australian dairy population.

The square of the correlation of alleles at two loci (*r*^{2}) has been widely used in animals with a half-sib structure to estimate linkage disequilibrium (McKay *et al.* 2007; de Roos *et al.* 2008; Hayes *et al.* 2008; Prasad *et al.* 2008; Sargolzaei *et al.* 2008; Bovine Hap Map Consortium 2009; Kim and Kirkpatrick 2009; Qanbari *et al.* 2010). Most of these studies identify phased haplotypes using available information from pedigrees. Haplotypes that could not be phased out were generally ignored. As shown in this article, the proportion of haplotypes that are informative might vary with genetic distance leading to biased estimation of linkage disequilibrium, δ, which would also lead to biased estimates of *r*^{2}. The magnitude of the bias depends on how much information from pedigrees can be used for phasing haplotypes and on the distances between the SNPs in the LD analyses. Many of the above studies made inferences about the population structure based on *r*^{2}. However, estimates of *r*^{2} might be biased to a different extent for different cattle breeds having a different breeding structure (more or fewer half-sibs families of different sizes). Comparison of *r*^{2} estimated without consideration of the breeding structure in different animal populations should be taken with caution. On the other hand, inferences on past population sizes based on Sved’s (1971) equation *E*(*r*^{2}) = (1 + *4N*_{e}*c*)^{−1} (where *N*_{e} is the effective population size) might also be inaccurate if *r*^{2} has been estimated in half-sib families neglecting noninformative haplotypes.

Assumptions of this study were that linkage phase of the sire and recombination fractions were known without error. The linkage phase can be accurately estimated using the same data if progeny groups are not small (>25) and recombination fraction is not too high (<0.30). For other situations, such as those arising by the use of SNP arrays, linkage phase in the sire can be inferred for each of two adjacent SNPs when they are apart at small distances. Reconstruction of haplotypes for all SNPs for each of the two homologous chromosomes of the sire is then feasible. In the same way, the assumption of known recombination fraction will hold for most situations found in practice when SNPs are adjacent, *i.e.*, *c* = 0.

The methods developed in this article are for estimation of linkage disequilibrium present in the dam population and contributing to the half-sib progeny since sire haplotypes are ignored in the computations. In most circumstances, this disequilibrium is the most relevant since sires in dairy and beef cattle are likely related and the number of sire haplotypes is rather small. Nevertheless, if many sire families are available, then haplotype frequencies from sires and dams (estimated among half-sib progeny) can be pooled to obtain a joint estimate of linkage disequilibrium across sexes.

The results of the simulations showed that the proposed method for estimating disequilibrium works well for relatively small family sizes and in multifamily situations. The distribution of likelihood-ratio tests when simulating the null hypothesis showed that it had more variation than a χ^{2} with 1 d.f. This is because likelihood equations for multiple sire families make use of a different number of parameters depending on the sire genotype: double homozygote (δ), homo-heterozygote (δ and *f _{T}*), and double heterozygote (δ,

*f*, and

_{T}*f*). In practical terms, resampling or simulation methods may be needed for hypothesis testing. The power figures for multifamily situations are also affected, being lower than those reported for this article.

_{M}The methods derived in this article were designed for estimating second-order linkage disequilibrium in half-sib families. The same methods and principles used in this article can be applied to the estimation of third- or higher-order linkage disequilibria. These methods may also be incorporated into a more general situation in which pedigrees are incomplete but much information comes from half-sib families. Nevertheless, if genotype information is available only from males (*e.g.*, genotyping information from a granddaughter design), then little information may be gained by incorporating maternal grandsire in the estimation of haplotype frequencies.

The conclusion of this article is that estimation of linkage disequilibrium in populations with a breeding structure of half-sib families must incorporate that structure in their estimation to provide unbiased estimates of the linkage disequilibrium. Inferences on population structure and evolution of cattle or sheep should be based on linkage disequilibria after accommodating the existing half-sib family structure in these populations.

## Acknowledgments

I am very grateful to Dr. David Thain and Jon Wilker for the cattle samples at the Gund ranch and to Veronica Kirchoff for the genotyping of microsatellites. Samples for DNA testing were taken from Nevada Agricultural Experiment Station project no. NEV05339, DNA Paternity Testing and Genetic Improvement of Free Range Beef Cattle. The author is also grateful to Luis Alberto Garcia Cortes and Wendy M. Rauw for criticism of the manuscript. I thank two anonymous reviewers for suggestions that improved the manuscript.

## Appendix A: Sampling Variances of the Estimates of Linkage Disequilibrium

The sampling variance of the estimates of the disequilibrium parameter for the *i*th family is

### Sire Double Homozygote

The genotype of the sire is *TTMM*. To obtain an estimate of the sampling variance it is better to use a full-likelihood equation in which all sources of information are used to estimate δ,*n _{TM,i}*,

*n*,

_{Tm,i}*n*, and

_{tM,i}*n*are the number of offspring inheriting haplotypes

_{tm,i}*TM*,

*Tm*,

*tM*, and

*tm*and

*K*is a constant.

Taking natural logarithms in the above equation after ignoring the constant, *K*, gives

### Sire Homo-heterozygote

Let the sire have genotype *TTMm* at two SNPs, *T/t*, and *M/m*. The likelihood equation for the *i*th family is*j*th genotype as described in the text. Ignoring the constant and taking natural logarithm of the above expression gives*M/m* are not used and, therefore, do not provide information for estimating disequilibrium.

### Sire Double Heterozygote

Let the sire have genotype *TtMm* at two SNPs, *T/t*, and *M/m* and linkage phase (*TM/tm*). As before, *n _{j,i}* are the genotype counts from offspring from the

*i*th sire family (

*j*=

*TTMM*,

*TTMm*,

*TTmm*,

*TtMM*,

*TtMm*,

*Ttmm*,

*ttMM*,

*ttMm*, and

*ttmm*). The recombination fraction is

*c*, which is assumed to be known without error. The likelihood equation for data of the

*i*th half-sib family is

*j*th genotype as described in the text. In the reduced model, ignoring the constant and taking natural logarithm of the above expression gives

## Appendix B

### Reduced Model for Estimating LD in a Homo-heterozygote Sire Family

In a reduced model, allele frequencies are not estimated simultaneously with haplotype frequencies but are assumed to be known. This likelihood equation assuming known allele frequencies in the dam population is

### Reduced Model for Estimating LD in a Double Heterozygote Sire

Following Gomez-Raya (2001), allele frequencies of *M* and *T* are estimated from the same data by*e.g.*, contiguous SNPs in high density arrays) the recombination fraction between the SNPs is 0, and Equation 4 reduces to

## Footnotes

*Communicating Editor: H. Zhao*

- Received December 7, 2011.
- Accepted February 19, 2012.

- Copyright © 2012 by the Genetics Society of America