Abstract
The idea of traitbased linkage analysis in halfsibs is extended by comparing the frequency of parental marker haplotypes in animals with different phenotypes. This article first presents the likelihood of observing different classes of paternal haplotypes in a halfsib family, where only family members of a certain phenotype (e.g., affected) are genotyped and are fully informative. The likelihood function is then generalized to multiple phenotypic categories. A linear predictor allows for discontinuous as well as for continuous phenotypes and other explanatory variables. Finally, how to incorporate not fully informative offspring and how to analyze super sister families are shown. Maximumlikelihood estimates of all parameters can be found by a NewtonRaphson algorithm, which mimics an iteratively weighted leastsquares procedure. The method allows for any multilocus feasible mapping function and, among others, for situations with selective or nonselective genotyping, single or multiple traits, and continuous or categorical traits. No parameters are required to describe the mode of inheritance and the method copes with virtually any family size. Fields of applications are therefore mapping experiments in species with a high reproductive capacity, such as cattle, pigs, horses, honey bees, trees, and fish.
THE term “traitbased” analysis was coined by Lebowitz et al. (1987) for a certain type of linkage analysis. The basic idea is to compare marker genotypes or marker allele frequencies of different phenotypes as opposed to a comparison of phenotypic means of different marker genotypes. The latter type of analysis has been termed “marker based” (Lebowitzet al. 1987) and includes wellknown techniques such as, e.g., maximumlikelihood or regression interval mapping (Lander and Botstein 1989; Haley and Knott 1992; Martinez and Curnow 1992).
Lebowitz et al. (1987) stated that traitbased analysis may be a useful alternative when interest is centered on a single quantitative trait only and markerbased analysis is not applicable as for polygenic resistance traits, where only a part of the population survives exposure to the stressor. Henshall and Goddard (1999) showed that the probability of a paternal Q allele in daughters of a sire heterozygous Qq at a quantitative trait locus (QTL) shows a logistic relationship with the trait influenced by the QTL if the error within QTL genotype is normally distributed. They used logistic regression for the estimation of QTL effects in selectively genotyped halfsib families, treating marker genotype of progeny (first vs. second paternal marker allele) as the binary response variable and phenotype as the independent variable. Estimates for the QTL effect were unbiased by selecting animals for genotyping from the extreme tails of the distribution of phenotypes. Henshall and Goddard (1999) also demonstrated that multipletrait QTL detection in halfsib families is a relatively simple matter of performing multivariate logistic regression, no matter if selective genotyping has been applied or not.
With halfsib experiments, in treating genotype of the offspring as response, the binary response variable (paternal Q or paternal q) can be observed directly, provided there is complete linkage with the marker under consideration and all progeny are fully informative. The numbers of paternal Q and q alleles cannot be directly observed and counted in situations with recombination, if the marker allele inherited from the sire cannot be determined unequivocally for a part of the sibship or if a pair of flanking markers is available. In these cases it will often be possible to resolve the marker genotypes of the offspring into parental marker haplotypes (or sets of possible parental haplotypes compatible with the marker data) and these haplotypes may alternatively be taken as observations. The distributions of these observations are, however, no longer binomial, but, in the general case, a mixture of different multinomial and binomial distributions.
This article deals first with the likelihood to observe different classes of paternal haplotypes in a halfsib family, where only family members of a certain phenotype (e.g., affected) are genotyped and are fully informative for either two flanking markers or a single marker. The likelihood function is then generalized to multiple phenotypic categories. A linear predictor on the logscale allows for discontinuous as well as for continuous phenotypes and other explanatory variables such as, e.g., sex or age of onset. Finally it is shown how not fully informative offspring can be incorporated, how super sister families with a degree of relationship of 0.75 can be analyzed, and that any multilocus feasible mapping function can be used. Maximumlikelihood estimates of all parameters can be found by a NewtonRaphson algorithm, which can be formulated as an iteratively weighted leastsquares procedure. Application to various experimental designs and hypothesis testing in different situations is discussed.
THEORY
Halfsib setting: In livestock species such as, e.g., dairy cattle, large paternal halfsib families are available for gene mapping purposes. For a dichotomous trait with unknown mode of inheritance a simple Mendelian analysis is not possible. However, the probability of paternal halfsibs to share a certain paternal chromosome segment given the marker data and phenotypes may be evaluated.
As a starting point a single paternal halfsib family with a sire heterozygous Dd at a disease susceptibility locus and two bracketing markers is considered. It is assumed that the marker haplotypes of the sire are known or could be reconstructed with high certainty from a large number of genotyped offspring. Only affected progeny are genotyped.
For the sake of simplicity it is further assumed that all progeny are fully informative and it is therefore possible to observe the paternal haplotype of each offspring unambiguously. In this case we can discriminate between six distinct types of paternal haplotypes: nonrecombinant 11 (type 1) or 22 (type 4) and recombinant 12 (type 2) or 21 (type 3) if a marker bracket is considered and 1 (type 5) or 2 (type 6) if only a single marker is available.
Likelihood of observing paternal haplotypes: The probability of an offspring to express the undesired phenotype if this offspring has inherited a paternal d allele at the susceptibility locus is p_{k} compared to cp_{k} for an offspring with a paternal D allele. The parameter c (c ≥ 0) is the relative change in the disease probability caused by the D allele and depends on the mode of inheritance, the gene frequency of D, the penetrance of all genotypes at the susceptibility locus, and the genetic background. Under the null hypothesis of no linked susceptibility locus c has a value of one. The allele substitution effect in terms of the risk to be affected is then p_{k}  cp_{k}. Using Bayes’ theorem the conditional probability of observing a particular paternal marker haplotype in a halfsib, given the marker haplotypes of the sire and the affected phenotype of the individual, can be derived from the probability tree in Figure 1. The probability h_{1} of observing a 11 haplotype in an affected halfsib is
The likelihood function for a single halfsib family with N_{a} affected offspring is then
The maximum of the loglikelihood at a certain map position can be found by a NewtonRaphson algorithm using the first and second derivatives with respect to c:
Multiple phenotypic categories: The analysis is not necessarily restricted to a single phenotypic category and the inclusion of unaffected animals may be of interest. The probabilities of having inherited one of the six classes of paternal haplotypes if an animal is unaffected are shown in the first column of Table 2. These formulas can be rewritten by introducing the ratio c_{2} = (1  cp_{k})/ (1  p_{k}) and replacing (1  cp_{k}) with (1  p_{k})c_{2}. This yields formulas with the same structure as for affected animals, but with c replaced by c_{2}, as can be seen in detail in the second column of Table 2. It is obvious that c_{2} is also equal to 1 under the null hypothesis. The loglikelihood function l for observations from both categories is composed of contributions of the affected animals and contributions of the unaffected animals,
More generally, if there are r discrete phenotypic categories, the number of halfsibs in each category is N_{r}, and a parameter c_{r} is defined for each category in a similar manner as shown above (with c_{1} = c), the loglikelihood can be written as
Log odds and linear predictor: Another method of observing the problem is to consider the ratio of the two probabilities p(paternal Dmarkers, phenotype) and p(paternal dmarkers, phenotype) for affected halfsibs. If an affected individual has paternal haplotype 11 these probabilities are
Estimation of β: As already mentioned the loglikelihood is a sum of r contributions of single animals or, equivalently, contributions of r groups of animals sharing the same set of explanatory variables (Equation 10). Each contribution has index j.
Extension to partially informative progeny: So far only the best scenario has been considered, where all progeny are fully informative. As an example for the treatment of partially informative progeny one member of the abovementioned paternal halfsib family may have genotypes 1, 1 and 1, 2 at the first and second marker locus, respectively. Its paternal haplotype is therefore either 11 or 12. If the paternal haplotype is 11 then the corresponding maternal haplotype must be 12. Assuming linkage equilibrium between markers, the probability of observing such an individual is h_{1}f_{11}f_{22}, and the alternative of an individual with paternal haplotype 12 and 11 on the maternal chromosome has a probability of h_{2}f_{11}f_{22}, where f_{11}, f_{21}, and f_{22} are the population frequencies of the first allele at the first marker and of alleles 1 and 2 at the second marker, respectively. After defining
In paternal halfsibs with known haplotypes for the sire and ungenotyped dams randomly chosen from the population the weights are functions of the recombination rates between the markers and the population frequencies of the marker alleles. For a partially informative halfsib all possible haplotypes can be enumerated. On each of these possible haplotypes the two markers flanking the trait locus can be treated as they are in fully informative offspring. One of the h_{i} values given in Table 1 must be appropriate, i.e., h_{1}, h_{2}, h_{3}, or h_{4} for the paternal allele combinations 11, 12, 21, or 22, respectively. The matching h_{i} value for a certain member of the set of all possible haplotypes is denoted by h_{g}. When considering several adjacent markers on a chromosome at a time one can express the probability of observing a certain haplotype with index g as
Note that the recombination rate θ_{k} between the markers closest to the trait locus does not contribute to the first product in (38), because it is assumed that the trait locus is between two markers. If the trait locus is exactly at a marker, then the recombination rates of all marker intervals have to be used and lr_{g} becomes
Arbitrary mapping functions: On a certain haplotype, which has been observed in a fully informative halfsib, there are the hidden trait locus and, e.g., two visible flanking markers. These three loci define two intervals on the chromosome. The probability that a crossover has occurred in both intervals can be written as p(1, 1) and the probability that both intervals are free from a crossover as p(0, 0), where 0 and 1 denote no crossover and crossover in an interval, respectively. When a 11 haplotype has been observed and Haldane’s mapping function is applied, then p(1, 1) becomes θ_{1}θ_{2} and p(0, 0) = (1 θ_{1})(1 θ_{2}). In the formula for the probability of observing a 11 haplotype in an affected halfsib (h_{1} in Table 1), a_{1} can therefore be expressed as p(1, 1) and b_{1} as p(0, 0). An examination of Table 1 shows that all a_{i} and b_{i} values can be interpreted as Haldane crossover probabilities. The six different probabilities h_{1}h_{6} in Table 1 can therefore be rewritten by using general expressions for crossover probabilities, which include those derived from the Haldane mapping function or any other suitable mapping function as special cases (Table 3). For known map distances the probability distribution of crossover combinations in a linkage group can be computed as described by Schnell (1961).
In fully informative halfsibs with two flanking markers there are always exactly two possible crossover combinations for a certain observed paternal marker haplotype. The first crossover combination corresponds to a gamete carrying the susceptibility allele D and has probability b_{i} and the second combination corresponds to a chromosome segment with a d allele and has probability a_{i}. As already mentioned above, in halfsibs, which are not fully informative, all possible marker haplotypes at a linkage group of adjacent markers have to be considered. On a linkage group with M marker loci and one trait locus between two markers there are M chromosome segments between loci and 2^{M} different crossover combinations; 50% of these correspond to paternal gametes with a d allele and 50% to paternal gametes with a D allele at the trait locus. These gametes and their crossover probabilities can be enumerated and sorted in such a way that all 2^{M}^{1} dcarrying gametes precede the 2^{M}^{1} Dcarrying gametes. After indexing the sorted gametes with g, the probability for observing the complementary maternal alleles at the considered linkage group can be written as
Identitybydescent probabilities between sibpairs: It is possible to express the likelihood of the previous paragraphs as a function of a parameter t, which is defined as the probability that a progeny shares the first allele of the sire, given the marker data or, in other words, as the markerderived gametic identitybydescent (IBD) probability between the first chromosome of the sire and the paternal chromosome of an offspring. The probability of observing a paternal haplotype with allele 1 is given in Table 1 as
The estimates of t_{j} are of interest, because the probability that two halfsibs with explanatory variables x_{j}_{1} and x_{j}_{2} have a probability of sharing a paternal allele identical by descent is
Computational considerations: The phenomenon of unbounded parameter estimates may arise in applying the iterative equations (27) and (30). Under certain circumstances this is what we expect: if a sire is heterozygous for a recessive gene with full penetrance and ln c is estimated from affected offspring of this sire, as described above, we expect a value of either plus or minus infinity, because 100% of the affected offspring have inherited the recessive allele from the sire. When ln c tends to minus infinity, the contributions to the loglikelihood have limiting values of
EXAMPLE
An example demonstrates the application to socalled super sisters, as they occur in the European honeybee Apis mellifera. In honeybees some traits, e.g., honey yield, can be measured at colony level only. For other traits, which can be measured on individual worker bees, it may be possible to collect a rather large number of individuals showing a certain phenotype. This becomes possible even for a rare phenotype, because a colony comprises tens of thousands of offspring from a single queen. Such traits could be morphological or, e.g., defense behavior against the parasitic mite Varroa jacobsonii. If, in an experiment, the queen has been artificially inseminated with the sperm of a single haploid drone (Harbo 1989), then all workers share the same paternal genotype and they build a family of super sisters with a coefficient of relationship of 0.75 between two of them.
In a hypothetical example tissue from 68 super sisters with the same rare phenotype is available together with tissue of the queen and the drone. If all these animals are genotyped for codominant markers, then the super sisters can be treated like fully informative halfsibs in the analysis because the haploid drone always transmits the same allele. The marker genotypes of the worker bees are given in Table 5. The queen is heterozygous 1, 2 at all markers. The drone’s genotypes at the three consecutive marker loci are 1, 3, and 2, respectively, and allow us to resolve the genotypes of the worker bees unequivocally into maternal haplotypes (Table 5). The frequencies of these maternal haplotypes allow us to assume that the haplotypes of the queen are 111 and 222. The number of super sisters with a particular maternal haplotype is given in the second column of Table 5. The fourth and fifth columns of this table show the index (as defined in Table 1) for the class of the maternal haplotype for the first and the second marker bracket separately. The contributions of single individuals to the likelihood h_{i} are given for the null hypothesis (c = 1) and also for the maximumlikelihood estimate c = 5.234 at a map position of 52 cM. The length of the first marker interval was assumed to be 35 Haldane cM and for the second one 60 Haldane cM. The LODscore profile over both marker brackets is given in Figure 2. It shows that the most probable locus position is at 52 cM.
The maximumlikelihood estimate of c translates to an identityofdescent probability of 0.73 for the maternal gamete at 52 cM, when the marker genotypes are unknown for two randomly chosen affected individuals. Table 6 shows the same IBD probability for all possible pairs of affected progeny with known flanking markers, with values ranging from 0.17 to 0.97.
Under natural conditions a queen mates with several drones. Each of them is the founder of a group of super sisters and therefore the entire colony is a mixture of several super sister families with a common mother. Tissue from the drones is usually not available, because the matings take place far away in the air. Under these circumstances a traitbased linkage analysis with worker bees can be conducted in the same way as for halfsibs, with some of the workers only partially informative. The application of artificial insemination, however, enables us to save tissue from all the drones, regardless of their number. With artificial insemination and semen from more than a single drone, markers can be used for a paternity test to determine the membership of each genotyped worker bee to one of the super sister groups in the colony. Then, by genotyping all drones together with the queen and the workers, the resolution of the progeny’s marker genotypes into maternal haplotypes again becomes unequivocal and the analysis can proceed as described in the example for a single super sister family.
DISCUSSION
In applying the method to partial or wholegenome scans it becomes necessary to account for repeated testing. The simplest method would be to use a predefined significance threshold, e.g., a LOD score of 3 with 1 d.f. (Ott 1991). In the important case of genotyping affected family members only, a permutation test (Churchill and Doerge 1994) cannot be applied, because all progeny remain to have the same phenotype before and after permutation. The distribution of the test statistics under the null hypothesis of no linkage can, however, be achieved by simulation. For this purpose parental marker alleles for each progeny can be drawn from the known haplotypes of the parent of each halfsib family by a random walk along the chromosome and gene drop. The missing alleles of the other parent can be supplemented by choosing them randomly from the population, according to their gene frequency. Of course it is not necessary to perform the last step of this simulation in fully informative halfsibs (e.g., with a cross of inbred lines) or super sisters. If all members of a halfsib family are genotyped, regardless of their phenotype, then a permutation test can be applied. The same is true with selective genotyping, provided that all possible paternal haplotypes are adequately represented and a c value of 1 can be expected in the joint data and, on average, also in the permuted data. This would be the case, for example, in a daughter design where marker genotypes are determined for the best 25% and the worst 25% of the animals.
The null hypothesis H_{0}: c = 1 will usually be adequate in all analyses with only a single phenotypic class in one family. If marker data are available for two distinct phenotypes, then the null hypothesis could either be H_{0}: c_{1} = c_{2} = 1 or H_{0}: c_{1}  c_{2} = 0. The first hypothesis would also be rejected in the case of segregation distortion. With a twofold excess of the frequency of the second paternal haplotype in both phenotypes we would, e.g., expect c_{1} and c_{2} to be equal with a value of 2. Therefore a test for linkage should be performed as a test for differences in the c values of different phenotypes or, in other words, as a test for the trait dependency of c. It may, however, be possible, at least in principle, to compare the two alternative null hypotheses by a likelihoodratio test, to draw inferences on the presence or absence of segregation distortion.
The linear predictor
The proposed maximumlikelihood approach may be viewed as a generalization of the logistic regression procedure of Henshall and Goddard (1999). Under perfect linkage and in fully informative halfsibs the numbers of animals with a paternally derived Q allele and a paternally derived q allele at a QTL can be directly observed. Henshall and Goddard (1999) showed that the trait dependency of these counts can be exploited in a logistic regression analysis. For recombination rates larger than zero the authors suggested that the regression coefficients obtained at the closest markers be interpolated and that expected counts of Q and q animals (derived from the markers) be used to calculate the likelihoodratio test. It has been demonstrated in this article that these approximations are not necessary when paternal marker haplotypes and their frequencies are used as observations instead of the, in most cases, unobservable number of Q and q individuals in a family. The range of possible applications and the biological interpretation of the estimates obtained by logistic regression and the maximumlikelihood approach of this article are, of course, identical. The analysis is “nonparametric” in the sense that there are no parameters, which explicitly describe the mode of inheritance, such as penetrances and gene frequencies. It is, however, possible to derive the c value if the mode of inheritance can be specified in detail. As already mentioned above, the c for affected animals tends toward infinity for a single recessive gene (or zero, depending on the linkage phase). This holds for any value of the penetrance of the homozygous recessive genotype. The use of the limiting values of the contributions to the loglikelihood given in (53) and (54) makes it possible to test for the recessive expression of the trait locus by comparing the likelihood of the data for an infinitely large c value with the maximum of the likelihood, which has been estimated from the data. A significant result would exclude all kinds of recessive singlegene inheritance, regardless of the penetrance of the homozygous recessive genotype. Another interesting pattern is a single, fully dominant gene. In this case the c value for affected animals becomes equal to the gene frequency (or one over the gene frequency for the alternative linkage phase).
Observing the expected c value for different phenotypes is also useful for power calculations and helps in answering the question of which phenotypes should be genotyped preferentially. Usually the c value for a group of affected animals will be larger than the c value for a group of unaffected animals and the experimental power from genotyping a certain number of affected will be higher than from genotyping either the same number of unaffected or a mixture of affected and unaffected. After having obtained a significant linkage, the genotypes of unaffected halfsibs are, however, useful for excluding the existence of segregation distortion, which could lead to false positive results, if only affected sibs are genotyped. In the special case when a sire is heterozygous for a dominant trait locus and is mated to dams from a line in which the recessive allele is fixed, then the c value for the first chromosome and affected progeny becomes equal to the c value for the second chromosome and unaffected animals, and animals of both phenotypes provide the same possibility to map the gene. There may be further examples for such an equivalence, especially in cases with more than a single trait locus.
In many species, e.g., cattle, pigs, honey bees, trees, and fish, the reproductive capacity of one or both sexes is tremendously higher compared to humans, either naturally or by the extensive use of artificial insemination in commercial or experimental populations. In human genetics methods for mapping disease loci by genotyping affected sibs for multiple markers have been implemented, e.g., in the widely used “Genehunter” program (Kruglyaket al. 1996) or derivatives such as “Allegro” (Gudbjartsson and Jonasson 1999). The author’s experience is that these implementations are at present technically restricted to halfsib families sized not larger than ∼20. The proposed traitbased maximumlikelihood analysis makes it possible to analyze simultaneously collections of large and even very large halfsib families comprising only affected individuals practically without any important restrictions due to family size. Beyond this important class of mapping experiments, the range of potential applications is much broader and offers possibilities for linkage detection with any multilocus feasible mapping function in experiments with features such as selective or nonselective genotyping, single or multiple traits, continuous or discontinuous traits, and combinations of them.
Footnotes

Communicating editor: C. Haley
 Received January 11, 2002.
 Accepted May 28, 2002.
 Copyright © 2002 by the Genetics Society of America