Abstract
In line-crossing experiments, deviations from Mendelian segregation ratios are usually observed for some markers. We hypothesize that these deviations are caused by one or more segregation-distorting loci (SDL) linked to the markers. We develop both a maximum-likelihood (ML) method and a Bayesian method to map SDL using molecular markers. The ML mapping is implemented via an EM algorithm and the Bayesian method is performed via Markov chain Monte Carlo (MCMC). The Bayesian mapping is computationally more intensive than the ML mapping but can handle more complicated models such as multiple SDL and a variable number of SDL. Both methods are applied to a set of simulated data and to real data from a cross of two Scots pine trees.
CHROMOSOMAL regions that cause distorted segregation ratios in early life stages may be referred to as segregation-distorting loci (SDL). These distortions are caused either by differential representation of SDL genotypes in gametes before fertilization or by viability differences of SDL genotypes after fertilization but before genotype scoring. In both cases, the observable phenotype is a distortion of marker locus genotypes in chromosomal regions close to the SDL. Hence, regardless of the timing of action of the SDL, mapping of locations and estimation of effects of SDL follow the same statistical treatment.
Let us first discuss mechanisms that cause deviated segregation ratios by altering the gametic proportions. With meiotic drive, gametic proportions become distorted during meiosis because one chromosome type may preferentially end up in the egg nucleus. Meiotic drive is known, e.g., for the maize chromosome 10, where a variant carrying a heterochromatic knob is preferentially transmitted (reviewed in Grant 1975). Alternatively, gametes carrying a certain allele may act to render gametes carrying the homologous chromosome nonfunctional, e.g., the segregation distorter (SD) and sex ratio (SR) loci of Drosophila and the t-alleles of mice (e.g., Hartl and Clark 1997, p. 244ff). Meiotic drive can be a powerful selective force. The t-alleles are maintained in the population, even though they are homozygous lethals, due to their 0.95 probability of being passed to the next generation in heterozygotes. In many species hybridizations, outbreeding depression and segregation distortion have been observed in the F_{2} generation. These are often caused by structural differences between chromosomes (Whitkus 1998), i.e., by events before fertilization.
Haploid life stages can be exposed to selection, especially in plants. In the life cycle of mosses, the haploid life stage (the gametophyte) is dominant over the diploid life stage (the sporophyte). In vascular plants, maize gametophytic mutations indicate that pollen tube growth rates are determined in part by the genotypes of the microgametophytes (reviewed in Grant 1975).
Viability selection after fertilization may be more important than gametic selection. Viability selection is common in consanguineous matings, where inbreeding depression reduces the survival of homozygotes compared to heterozygotes (Charlesworth and Charlesworth 1987). Viability selection gives rise to segregation ratios distorted from 1:2:1 at linked loci. Inbreeding depression is often expressed in very early life stages (Husband and Schemske 1996). In Scots pine, only ~15% of self-fertilized embryos develop into mature seeds, whereas ~75% do so in wind-pollinated seeds (Kärkkäinen et al. 1996). Some aspects of the genetic basis of inbreeding depression require further investigation, e.g., number and effects of loci and degree of dominance. Yet these factors have major consequences for mating system evolution (Charlesworth and Charlesworth 1998), conservation genetics (Hedrick 1994), and plant breeding (e.g., Williams and Savolainen 1996). A biased segregation ratio due to viability differences of genotypes also occurs in the F_{2} generation of wide crosses. This is generally thought to be caused by epistatic interactions.
Often events before fertilization cannot be distinguished from events after fertilization. McGoldrick and Hedgecock (1997) reported that crosses of Crassostrea gigas, the Pacific oyster, produced biased segregation ratios when tested as adults. Later Launey and Hedgecock (1999) showed that, for many loci, the ratios were Mendelian when 6-hr-old larvae were assessed, but the ratios deviated from the Mendelian ratios when the animals were 2 to 3 mo old in the same crosses. Hence, the differences are due to postfertilization viability selection.
Quantitative trait loci (QTL) are usually mapped in agronomically important plants and animals. To increase differences of parental types, and thus to increase the power of mapping, crosses are often conducted between inbred lines or between distantly related cultivars or even between species. As discussed above, these conditions promote segregation distortion.
For molecular characterization of the genetic causes of distorted segregation ratios, mapping of the location and effects of SDL would be desirable. As the phenotype in SDL mapping is different from that of QTL mapping (data in SDL mapping usually consist of frequencies of genotypes among survivors), QTL methods cannot be used for SDL mapping. Development of advanced methods for estimation of locations and effects of SDL has been lagging behind that for QTL mapping. In the past, often a single marker was considered at a time, where only the linkage between one fully informative marker and a single SDL was tested (Sorensen 1967; Servitová and Cetl 1984; Hedrick and Muona 1990; Fu and Ritland 1994a; Kärkkäinen et al. 1999). In a single-marker test, the number of distinguishable genotypic configurations of the marker is at best equal to the number of genotypic configurations of a linked SDL, but the genotypic frequencies of the marker are affected by the recombination fraction in addition to the frequencies of the SDL's genotypic configurations. Hence, for a single-marker test, estimations of the position and effect are confounded.
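The confounding in a single-marker test can be made explicit with a short worked equation. This is a sketch for the backcross case only, under notation assumed here for illustration: π is the frequency of one SDL genotype among survivors, r is the recombination fraction between marker and SDL, and p_m is the expected frequency of the corresponding marker genotype.

```latex
\begin{align*}
p_m &= \pi(1-r) + (1-\pi)\,r \\
    &= \tfrac{1}{2} + \left(\pi - \tfrac{1}{2}\right)(1-2r).
\end{align*}
% Only the product (\pi - 1/2)(1 - 2r) is identifiable from p_m.
% For example, (\pi, r) = (2/3, 0) and (\pi, r) = (3/4, 1/6)
% both give p_m = 2/3.
```

A single observed marker frequency therefore fixes only the product of the two terms; additional linked markers are needed to separate position from effect.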
Errors in marker genotyping may also cause systematic deviations from the expected segregation ratio. Randomly amplified polymorphic DNA (RAPD) markers are often misscored because a faint band may be interpreted as absent. Such misscoring typically affects only a single marker. In contrast, if segregation distortion is caused by SDL, all markers in the vicinity of the SDL will be affected.
Fu and Ritland (1994b), Mitchell-Olds (1995), and Cheng et al. (1996) have developed maximum-likelihood methods for mapping one SDL using flanking markers, i.e., an interval mapping strategy (Lander and Botstein 1989). Given a map of fully informative markers, no missing data, no interference between recombinations, and no more than one SDL per chromosome, this theory can be used to scan the genome for SDL. Under these assumptions, loci outside the interval flanking the SDL contribute no information to the segregation of the SDL. But more than one SDL per chromosome may be present and markers may be only partially informative. Furthermore, due to the effects of SDL, estimation of map distances of markers might become biased (Lorieux et al. 1995a,b; Liu 1998). This might cause the interval mapping method to become inefficient and biased.
The SDL analysis is based on binomial (or multinomial) distributions instead of normal distributions, and hence multiple regression is not readily available and cannot be combined with conventional interval mapping as in the composite interval mapping (CIM; Zeng 1994) or the multiple QTL mapping (MQM) scheme (Jansen and Stam 1994). Therefore, multiple SDL on a single chromosome pose an unsolved theoretical problem. On the other hand, if maps are inferred correctly and if SDL on different chromosomes do not interact epistatically, i.e., SDL effects combine multiplicatively, linkage to an SDL is solely responsible for the phenotype. SDL analysis of one chromosome is therefore usually independent from other chromosomes.
We present a multipoint method for mapping multiple SDL using a backcross design. The multipoint method is developed under both the maximum-likelihood and the Bayesian frameworks.
THEORY
Model: We develop and present the model under a backcross design only, although the method can be applied to other controlled mating designs as well. We assume that the parents that initiate the cross are pure inbred lines. The F_{1} of the cross is backcrossed to one of the parents and a total of N individuals are generated in the backcross (BC) family for mapping. We are interested in mapping loci responsible for segregation distortion using multiple markers that are already mapped on the genome. The data here are the observed marker genotypes (configurations). The parameters, however, are the number of SDL, the locations, and effects of these loci. We assume that all markers are neutral in the sense that their segregations would be Mendelian if there were no linked SDL on the same chromosome. The observed segregation distortions on these neutral markers, however, are caused by one or more SDL near the markers.
Note that the flow of causality is from the SDL to the genotypic configurations of the SDL, then from the genotypic configurations of the SDL to the genotypic configurations of the marker loci, and finally from the genotypic configurations of the marker loci to the observed marker information. We first consider a single SDL. The genotype of the F_{1} is heterozygous and that of a BC individual (generated from F_{1} backcrossed to the first inbred parent) is either heterozygous or homozygous for the allele of the first parent with an unequal probability. The degree of asymmetry in the probability is determined by the effect (size) of the SDL. Define
Given the position (λ) of the SDL on the chromosome, the joint distribution for ϕ_{i} and ϕ_{i1}, … , ϕ_{iM} is
Let I_{i} = [I_{i1}, … , I_{iM}]. Combining formula (6) with the marker information and “summing out” the marker inheritance digits, we get
Maximum likelihood: Having formulated the probability model, we now introduce a maximum-likelihood method to estimate and test the SDL. There are several ways to find the maximum-likelihood estimate of π; we adopt an expectation maximization (EM) algorithm and treat ϕ_{i} as missing data. We treat λ as a known constant for the moment. Let I = [I_{1}, … , I_{N}] and ϕ = [ϕ_{1}, … , ϕ_{N}]. For the EM algorithm we need to determine the logarithm of Pr(I, ϕ | π, λ), i.e.,
Conditional on the data, the position, and the initial value of the parameter, π^{(0)}, the posterior probabilities of ϕ_{i} = 0 and ϕ_{i} = 1 are, respectively,
We can now test the null hypothesis that there is no segregation distortion for the particular location λ. The null hypothesis is formulated as H_{0}: π = ½, which can be tested using the likelihood-ratio test statistic
The maximum-likelihood estimate of the position of the SDL, λ, can be obtained by examining the likelihood-ratio profile along the chromosome, as is commonly done in interval mapping of QTL.
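The EM iteration and the likelihood-ratio profile described above can be sketched for a simplified special case. This is a minimal illustration, not the authors' C++ program. It assumes fully informative markers with no missing data (so that only the markers flanking the putative SDL matter), Haldane's map function, π defined as the frequency of SDL digit 1 among survivors, and a statistic taken as twice the log-likelihood difference between the estimate of π and π = ½. All function names are hypothetical.

```python
import math

def haldane(d):
    """Recombination fraction for map distance d in Morgans (no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * abs(d)))

def flank_probs(markers, marker_pos, lam):
    """Pr(flanking marker digits | SDL digit z) for z = 0 and z = 1.

    marker_pos must be sorted in ascending order. Factors that do not
    involve the SDL are dropped because they cancel in the E-step and
    in the likelihood ratio at a fixed position lam.
    """
    left = [j for j, p in enumerate(marker_pos) if p <= lam]
    right = [j for j, p in enumerate(marker_pos) if p > lam]
    flanks = ([left[-1]] if left else []) + ([right[0]] if right else [])
    f = [1.0, 1.0]
    for j in flanks:
        r = haldane(marker_pos[j] - lam)
        for z in (0, 1):
            f[z] *= (1.0 - r) if markers[j] == z else r
    return f[0], f[1]

def loglik(fs, pi):
    """Log-likelihood of pi (up to a constant) from precomputed flank terms."""
    return sum(math.log(pi * f1 + (1.0 - pi) * f0) for f0, f1 in fs)

def em_pi(fs, pi=0.5, tol=1e-10, max_iter=1000):
    """EM estimate of pi, treating the SDL digit of each individual as missing."""
    for _ in range(max_iter):
        # E-step: posterior probability that the SDL digit equals 1.
        w = [pi * f1 / (pi * f1 + (1.0 - pi) * f0) for f0, f1 in fs]
        # M-step: the new estimate is the mean posterior probability.
        new_pi = sum(w) / len(w)
        converged = abs(new_pi - pi) < tol
        pi = new_pi
        if converged:
            break
    return pi

def lr_statistic(data, marker_pos, lam):
    """Return (pi_hat, LR) for H0: pi = 1/2 at position lam."""
    fs = [flank_probs(m, marker_pos, lam) for m in data]
    pi_hat = em_pi(fs)
    return pi_hat, 2.0 * (loglik(fs, pi_hat) - loglik(fs, 0.5))

def lr_profile(data, marker_pos, grid):
    """Likelihood-ratio profile over a grid of candidate positions."""
    return [(lam,) + lr_statistic(data, marker_pos, lam) for lam in grid]
```

Scanning `lr_profile` over a grid of positions and taking the position with the largest statistic corresponds to the profile approach described above. With partially informative markers or missing data, a multipoint computation over all markers would have to replace `flank_probs`.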
Bayesian analysis: We now introduce the Bayesian analysis of SDL implemented via Markov chain Monte Carlo (MCMC). We first classify variables into observables and unobservables. The observables are the data, denoted by I. The unobservables include parameters and missing information. The parameters here include π and λ, and the missing information consists of the inheritance digits of the SDL and of the markers in the current situation. We always sum over all the missing information, such that inheritance digits will only appear in intermediate steps. The joint posterior distribution of the parameters is
Starting with an initial value for each parameter, {π^{(0)}, λ^{(0)}}, we sample π using the Metropolis-Hastings algorithm (e.g., Gelman et al. 1995). A new proposal, π*, is sampled from a beta proposal distribution J(π* | π^{(0)}) = Beta(π^{(0)}N + 2, (1 − π^{(0)})N + 2). The proposal π* is accepted with probability min{1, a(π*, π^{(0)})}, where
Multiple-SDL model: Consider the joint action of L SDL located on the chromosome of interest. Define the locations of these SDL by λ = {λ_{l}} for l = 1, … , L, in contrast to the single-SDL model where λ is a scalar. Also define the marginal effects of the SDL by π = {π_{l}} for l = 1, … , L. Assume that these SDL act multiplicatively; the joint effect of all the SDL can then be formulated as a product of these marginal effects. Define ϕ_{i} = [ϕ_{i1}, … , ϕ_{iL}] and ϕ_{i} = [ϕ_{i1}, … , ϕ_{iM}] as vectors of inheritance digits of all SDL and marker loci, respectively, for the ith individual. Using Bayes' theorem, the joint posterior distribution of ϕ_{i} can be formulated as
With the Bayesian approach, the number of SDL (L) can be treated as an unknown variable. This involves a change in the dimension of the model. Reversible jump MCMC (Green 1995; Satagopan and Yandell 1996; Heath 1997; Richardson and Green 1997; Sillanpää and Arjas 1998; Stephens and Fisch 1998) is an extension to the Metropolis-Hastings sampler, permitting moves to be made between models with different dimensions. The joint posterior distribution of the parameters is
For adding an SDL, a new location λ_{L+1} and effect π_{L+1} are sampled from their uniform priors for the new SDL. The new sets of parameters are π* = (π^{(0)}, π_{L+1}) and λ* = (λ^{(0)}, λ_{L+1}). We then accept this new SDL with probability min{1, a(L + 1, L)}, where
If the new SDL is accepted, its location and effect are accepted simultaneously; otherwise, the number of SDL remains the same. In the deleting step, a random SDL is proposed to be deleted. Then the SDL are renumbered such that the candidate SDL is the last SDL, i.e., the Lth SDL. The new parameter sets will be π* = (π_{1}^{(0)}, … , π_{L−1}^{(0)}) and λ* = (λ_{1}^{(0)}, … , λ_{L−1}^{(0)}). The proposal is accepted with probability min{1, a(L − 1, L)}, where
APPLICATIONS
To illustrate the method, a simulation study and an analysis of a data set from one cross of two Scots pine (Pinus sylvestris) trees are presented. The simulation study conforms to an inbred line BC situation. In the pine data analysis, we concentrate on the maternal part of the progeny of a single tree, i.e., a pseudo-backcross design. In a backcross it is not possible to distinguish between gametic selection and viability selection after fertilization.
Simulations: In the simulation study, first, a single viability locus that eliminates 50% of the progeny of the heterozygous genotype, i.e., π = 2/3, was placed in the middle of a chromosome of length 1 M; six markers were spaced at regular intervals of 0.2 M along the chromosome; no missing data were considered. In the second simulation, two SDL with the same effects as in the single-SDL situation were placed at locations 0.33 M and 0.67 M, respectively. In both cases, simulations with sample sizes of 500 were repeated five times and results were compared; additionally, simulations with sample sizes of 100 or fewer were also performed. Compared to empirical reports of distortions of marker loci from Mendelian ratios, the simulated effect is high but not unrealistic. The marker map is rather dense and fully informative.
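A single-SDL data set of this kind can be generated along the following lines. This is a minimal sketch rather than the simulation code actually used: it assumes Haldane's map function, codes genotypes as inheritance digits 0 and 1, takes π as the frequency of digit 1 at the SDL among survivors, and draws survivors directly instead of simulating and then culling zygotes. The function names are hypothetical.

```python
import math
import random

def haldane(d):
    """Recombination fraction for map distance d in Morgans (no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * abs(d)))

def simulate_backcross(n, sdl_pos, pi, marker_pos, rng):
    """Simulate n surviving backcross individuals with one SDL.

    marker_pos must be sorted in ascending order. Returns the marker
    digits of each individual and the (normally unobserved) SDL digits.
    """
    # Markers are visited outward from the SDL, first leftward, then rightward.
    left = [j for j, p in enumerate(marker_pos) if p < sdl_pos][::-1]
    right = [j for j, p in enumerate(marker_pos) if p >= sdl_pos]
    genotypes, sdl_digits = [], []
    for _ in range(n):
        # Selection acts on the SDL only: survivors carry digit 1 with prob. pi.
        z = 1 if rng.random() < pi else 0
        g = [0] * len(marker_pos)
        for side in (left, right):
            state, pos = z, sdl_pos
            for j in side:
                # Given the SDL digit, markers follow the usual Markov chain.
                if rng.random() < haldane(marker_pos[j] - pos):
                    state = 1 - state
                g[j] = state
                pos = marker_pos[j]
        genotypes.append(g)
        sdl_digits.append(z)
    return genotypes, sdl_digits

# Settings of the first simulation: six markers, SDL in the middle, pi = 2/3.
markers = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
geno, sdl = simulate_backcross(500, 0.5, 2.0 / 3.0, markers, random.Random(1))
```

Drawing the SDL digit first and the markers afterward mirrors the flow of causality in the model: conditional on the SDL genotype, the markers segregate as under neutrality.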
The outcomes of the analyses of the five simulated data sets were almost identical, so we present only one of them. In the maximum-likelihood (ML) analysis, the number of SDL was fixed to one. The inferred effect and the likelihood-ratio statistic Λ are reported at each location. We also performed an MCMC analysis of the same data. From Figure 1A, we see that the position and effect of the SDL are estimated quite accurately. For the other four simulations, the inferred positions were also mostly between the two middle markers and the estimated effects were close to the true value. Reducing sample sizes did not appreciably change the estimate of location or effect. The likelihood-ratio statistic, however, dropped considerably (results not shown). We do not present the ML results with two SDL, because the model is not appropriate.
With the Bayesian MCMC analysis, the Poisson prior mean was set to μ = 1 and the maximum number of SDL was set to three. The chain length was 10^{5}. The chain was thinned by storing only every 10th cycle. No burn-in period was discarded because the chain reached approximate stationarity very quickly. The posterior probability of the simulated number of SDL (i.e., one or two, respectively) was always between 0.6 and 0.9. In the one-SDL case, frequencies are higher at the center, i.e., close to the simulated position (Figure 1B). Effects are very similar to those estimated with the ML method. In the two-SDL case, posterior distributions of both the locations and the effects are about correct (Figure 1C). It can be easily discerned from the posterior distribution of frequencies that there are actually two SDL present. When the number of individuals was reduced, the posterior probability of the different numbers of SDL approached that of the prior distribution rapidly (data not shown). This corresponds to the decrease in the likelihood-ratio statistic with decreasing sample size.
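The Metropolis-Hastings update of π with the beta proposal Beta(π^{(0)}N + 2, (1 − π^{(0)})N + 2), together with the thinning used here, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the position λ is held fixed, the prior on π is uniform so that it cancels from the ratio, the acceptance ratio is the standard Metropolis-Hastings ratio including the proposal correction, and the log-likelihood is supplied by the caller. The function names are hypothetical.

```python
import math
import random

def log_beta_pdf(x, a, b):
    """Log density of Beta(a, b) at x, for 0 < x < 1."""
    return ((a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

def mh_pi(loglik, n, n_iter, thin, rng, pi=0.5):
    """Sample pi by Metropolis-Hastings; keep every `thin`-th cycle.

    loglik : function returning the log-likelihood of pi (position fixed)
    n      : sample size N, which sets the spread of the beta proposal
    """
    samples = []
    cur_ll = loglik(pi)
    for t in range(n_iter):
        a, b = pi * n + 2.0, (1.0 - pi) * n + 2.0
        prop = rng.betavariate(a, b)
        if 0.0 < prop < 1.0:
            # Parameters of the reverse proposal J(pi | prop).
            a_rev, b_rev = prop * n + 2.0, (1.0 - prop) * n + 2.0
            prop_ll = loglik(prop)
            # Log acceptance ratio: likelihood ratio times the
            # proposal correction (the uniform prior cancels).
            log_a = (prop_ll - cur_ll
                     + log_beta_pdf(pi, a_rev, b_rev)
                     - log_beta_pdf(prop, a, b))
            if rng.random() < math.exp(min(0.0, log_a)):
                pi, cur_ll = prop, prop_ll
        if (t + 1) % thin == 0:
            samples.append(pi)
    return samples

# Toy likelihood: a marker located exactly at the SDL, 60 of 90 with digit 1.
k, n = 60, 90
toy_loglik = lambda p: k * math.log(p) + (n - k) * math.log(1.0 - p)
draws = mh_pi(toy_loglik, n, 20000, 10, random.Random(0))
```

Because the beta proposal is not symmetric, the proposal densities in both directions enter the ratio. In the full sampler, λ and the number of SDL would be updated in further steps within each cycle.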
Pine data: In the second application, data consisted of the megagametophytes of open-pollinated offspring of a single Scots pine (P. sylvestris) tree, P304 (Hurme and Savolainen 1999). Megagametophytes are haploid tissues consisting of the maternal part of the seedling's genome and can be scored at the seedling stage without damaging the seedling. We treated the progeny of this tree as a pseudo-backcross family. Map distances and linkage phases were determined with Mapmaker as described in Hurme and Savolainen (1999). Five RAPD markers from linkage group 2 were used in this family: C02680, G13750, K09750, E09250, and AC15270 at positions 0.038 M, 0.115 M, 0.287 M, 0.461 M, and 0.478 M, respectively. As determined from other crosses, the map length of the whole linkage group was ~0.85 M. The sample size was 73 individuals, and in many individuals some markers were scored as missing.
With the ML analysis, the log-likelihood ratio statistic was appreciable only close to the marker G13750 (Figure 2A). At this location the inferred effect was an excess of the heterozygous genotype of ~0.2 over the Mendelian value of 0.5. For the Bayesian MCMC analysis, the prior distribution was the same as for the simulation study. The posterior probabilities of zero, one, two, and three SDL were 0.01, 0.15, 0.61, and 0.23, respectively. This result is, however, quite sensitive to the prior distribution of SDL number. We report the posterior distributions of both one and two inferred SDL. If a single SDL was inferred, it was most often placed close to marker C02680 (the beginning of the marker region), and the inferred effect was a considerable increase in the second genotype, as in the ML analysis (Figure 2B). If two SDL were inferred, most often location and effect of one of the SDL was similar to the single-SDL case, while the other counteracted its effect at the other end of the linkage group (Figure 2C).
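For comparison with these posterior probabilities, the prior on SDL number can be tabulated. The sketch below assumes that the prior is a Poisson distribution with mean μ = 1 truncated to the values zero through three and renormalized; the text states only the mean and the maximum, so the truncation and renormalization are assumptions made here, and the function name is hypothetical.

```python
import math

def truncated_poisson(mu, max_l):
    """Poisson(mu) probabilities for L = 0..max_l, renormalized to sum to 1."""
    weights = [math.exp(-mu) * mu ** k / math.factorial(k)
               for k in range(max_l + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Prior on the number of SDL under the assumed truncation at three.
prior = truncated_poisson(1.0, 3)
```

Under this assumption the prior probabilities of zero, one, two, and three SDL are 0.375, 0.375, 0.1875, and 0.0625, so the data shift considerable mass from zero and one SDL toward two and three.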
DISCUSSION
Herein, a method for mapping SDL in a backcross is presented. The method makes efficient use of a map of partially or fully informative marker loci by using the multipoint method (Lander and Green 1987; Jiang and Zeng 1997). A maximum-likelihood analysis via an EM algorithm as well as a Markov chain Monte Carlo Bayesian analysis using a reversible jump algorithm for varying the number of loci is presented in detail. Given a dense marker map, the method can be used for precision analysis of positions and effects of the SDL. The best previously available methods (Fu and Ritland 1994b; Mitchell-Olds 1995; Cheng et al. 1996) rely on fully informative markers flanking the putative SDL and assume just one SDL per chromosome.
With our approach, it is possible to efficiently analyze the number, positions, and effects of SDL in organisms for which a high-resolution marker map has been developed and where inbred line crosses can be performed easily. Analysis can be extended easily to a general full-sib family or to the selfing of an outcrossing individual: the dimension changes from two to four, binomial distributions change to multinomial distributions, and the transition probabilities between adjacent loci change. Marker information now contributes to the full or partial identification of four combinations of genotypic configurations. As with the BC case, partial marker information can be defined as the union of compatible cases. All the above changes are rather trivial consequences of the change in dimension but complicate presentation substantially. Additionally, the missing phase information needs to be considered. Furthermore, the multipoint algorithm becomes more important for the full-sib design.
Presently, our method for the backcross can only be used to analyze the SDL currently segregating in the two lines, not those that have been segregating in the ancestral population from which the inbred lines derived. Segregation distortion might have already affected the inbreeding process for creation of the lines. Extrapolation from the current to the ancestral situation is therefore problematic. This problem is even more pressing for recombinant inbred lines, where overrepresentation of chromosomal fragments of one or the other parent is commonly observed (e.g., Lister and Dean 1993) and requires a more elaborate approach.
A distinction needs to be made between segregation distortion before and after fertilization. An SDL acting before fertilization can only alter gametic proportions. Thus genotypic proportions will only be altered indirectly through the combination of gametic proportions, which restricts the achievable combinations of genotypic proportions. On the other hand, SDL acting after fertilization may alter genotypic proportions directly. Thus, many more combinations of genotypic proportions are possible for SDL acting after fertilization. In experimental crosses more complex than the backcross design, inferred genotypic proportions of an SDL may thus render unlikely prefertilization mechanisms of segregation distortion. Two or more SDL acting before fertilization may, however, mimic the effect of SDL acting after fertilization because of the increase in combinatorial possibilities.
In hybrids of species or subspecies, segregation distortion commonly occurs (see, e.g., Whitkus 1998 and references therein). This may be caused by structural rearrangements, e.g., inversions, which constitute a prefertilization mechanism. Alternatively, the segregation distortion may be caused by postfertilization differences in viability between genotypic configurations, most probably caused by epistatic interactions. Our method can be used to detect chromosomal areas that are causing these distortions. But because of the presumed epistasis, relaxation of the assumption of a multiplicative effect of different SDL may be necessary.
Our method may also be used to map loci influencing early viability. This would enhance our understanding of the nature of early inbreeding depression. The method provides another approach for estimating the number and effects of loci causing inbreeding depression. Traditionally, such information has been derived mainly from biometric analysis of crosses (e.g., Dudash and Carr 1998). But as inbreeding depression can be expressed in embryonic life stages not amenable to biometric analysis, application of biometric analysis is limited. To gain insight on these early life stages, sparse maps and single-marker methods have been used to infer the effect of a viability locus influencing inbreeding depression (Sorensen 1967; Servitová and Cetl 1984; Hedrick and Muona 1990; Fu and Ritland 1994a; Kärkkäinen et al. 1999). With single-marker analysis, estimation of position and effect of the SDL is, however, confounded and multiple SDL on a single linkage group cannot be handled at all. Interval methods (Fu and Ritland 1994b; Mitchell-Olds 1995; Cheng et al. 1996) rely on fully informative markers flanking the putative SDL and assume just one SDL per chromosome. Dense linkage maps of fully informative markers may be hard to obtain in closely related individuals that need to be considered in the analysis of inbreeding depression. Like the interval methods, our method requires a dense linkage map of polymorphic markers but is not restricted to fully informative markers; instead it can make efficient use of, e.g., dominant markers.
Only rarely have data sets been gathered for mapping segregation distortion or viability selection (see, however, Harushima et al. 1996 and Kuang et al. 1998). But often in QTL experiments, wide crosses are used to increase differences between parents and thus the power of mapping. Probably for this reason, markers with segregation ratio distortions are commonly observed in data sets used for QTL mapping resulting from wide crosses (e.g., van Ooijen et al. 1994). Segregation ratio distortion is also commonly observed in doubled haploid lines (e.g., Fulton et al. 1997).
Usually generation of a linkage map of marker loci precedes QTL analysis. If a dense map of informative markers is inferred correctly, the bias introduced by segregation distortion into QTL analysis will be negligible. But if recombination fractions or, worse, order of marker loci are inferred incorrectly, basic assumptions of QTL analysis do not hold and results will be imprecise at best. Hence, aside from being interesting in themselves, SDL cause practical problems in QTL projects as observed, e.g., by Sandbrink et al. (1995). Thus, segregation distortion should be accounted for in mapping projects.
Segregation distortion is known to bias estimation of recombination fractions in two-point inference of recombination distances between markers (Lorieux et al. 1995a,b; Liu 1998). If markers are fully informative, estimation of the recombination fraction of only the markers flanking the SDL will be affected. Only in the unlikely case of coincidence of SDL and marker location will no bias be observed. If less than fully informative markers are used, the effects of the distortion are spread out to the smallest interval of fully informative markers flanking the distorted region. As a remedy, markers that show obvious segregation distortion are often excluded from the map. But that reduces coverage of the genome and qualitative or quantitative trait loci might be missed.
Our method can be extended to allow for detection of SDL concurrently with estimation of a linkage map. Cheng et al. (1996) have already developed an EM algorithm to infer positions of two fully informative markers in the presence of a single SDL (an interval method) in a backcross or doubled haploid lines. This could be extended to a multipoint inference of a marker map in the presence of SDL by augmenting the EM or MCMC schemes presented herein so that the markers are allowed to change their positions relative to each other.
The source code for a C++ program and executables for a Sun workstation, with which the above calculations can be performed, are available from Claus Vogl (claus{at}genetics.ucr.edu).
Acknowledgments
We thank Päivi Hurme and Outi Savolainen for the data set and Elja Arjas, Anita de Haan, Mikko Sillanpää, and Nengjun Yi for discussion of this and related issues. Outi Savolainen, Elja Arjas, and Lori Weingartner have commented on earlier versions of this manuscript. We thank Zhao-Bang Zeng and two anonymous reviewers for their patient work, which helped to improve this article a lot. This work was supported by grants from the Environment and Natural Resources Research Council and the Medical Research Council to Outi Savolainen and by the National Institutes of Health Grant GM55321 and the U.S. Department of Agriculture National Research Initiative Competitive Grants Program 97352055075 to S.X.
Footnotes

Communicating editor: Z-B. Zeng
 Received July 13, 1998.
 Accepted April 3, 2000.
 Copyright © 2000 by the Genetics Society of America