Abstract
A new strategy for studying the genome structure and organization of natural populations is proposed on the basis of a combined analysis of linkage and linkage disequilibrium using known polymorphic markers. This strategy exploits a random sample drawn from a panmictic natural population and the openpollinated progeny of the sample. It is established on the principle of gene transmission from the parental to progeny generation during which the linkage between different markers is broken down due to meiotic recombination. The strategy has power to simultaneously capture the information about the linkage of the markers (as measured by recombination fraction) and the degree of their linkage disequilibrium created at a historic time. Simulation studies indicate that the statistical method implemented by the Fisherscoring algorithm can provide accurate and precise estimates for the allele frequencies, recombination fractions, and linkage disequilibria between different markers. The strategy has great implications for constructing a dense linkage disequilibrium map that can facilitate the identification and positional cloning of the genes underlying both simple and complex traits.
WITH improved techniques for highthroughput identification and genotyping of polymorphisms, it has been possible to genotype molecular markers throughout the genome and construct a dense linkage map covering the entire genome (Landegrenet al. 1998; Wanget al. 1998). For species with homozygous inbred lines available, the linkage analysis of markers is based on the recombinations of a particular chromosome region that are created by hybridization between two genetically divergent inbred lines (Mather and Jinks 1982). Such a strategy can directly provide an estimate for the linkage relationship of markers as measured by recombination fraction, because there is clear information about parental linkage phase between alleles of different markers. However, linkage mapping has two major limitations. First, for closely linked markers, there will be few recombinations in a segregating generation and, hence, a dense linkage map will provide little extra information about the localization of target genes, unless the number of individuals of the generation is very large (Darvasiet al. 1993). For example, using linkage analysis, Long et al. (1995) could only map quantitative trait loci (QTL) affecting bristle numbers in Drosophila to regions of ∼510 cM. Second, homozygous inbred lines used to generate the F_{1} parents of a priori known linkage phases for the traditional linkage analysis (Mather and Jinks 1982) are virtually unavailable for natural populations. For many study materials sampled randomly from a natural population, therefore, uncertainty of linkage phase between markers prevents a direct estimate of their recombination fraction.
For natural populations, the degree of nonrandom genetic association or linkage disequilibrium, produced at a historic time by various evolutionary forces such as mutation, drift, selection, and admixture, is estimated to indirectly infer how strongly these markers are linked on the same chromosome. If the linkage disequilibrium of the markers occurred a long time ago, a strong linkage disequilibrium detected may suggest close physical linkage between the markers because linkage disequilibrium decays with time (Kaplanet al. 1995). This principle has tremendous potential for constructing finescale linkage disequilibrium maps for cloning the genes that cause complex qualitative or quantitative traits (reviewed in Templeton 1999). At present, a number of theories or techniques have been well established for linkage disequilibriumbased mapping of target genes in natural populations (Hästbacka et al. 1992, 1994; Risch and Merikangas 1996; Xiong and Guo 1997; Terwilliger and Weiss 1998; Kruglyak 1999; Meuwissen and Goddard 2000).
The success of linkage disequilibrium mapping is the presence of linkage disequilibrium between different loci arising from the covariance of the population frequencies of nonalleles in the same gamete (Lynch and Walsh 1998). The degree and extent of linkage disequilibrium reflect the evolutionary history of a population and its interactions with different evolutionary forces (Hill and Robertson 1968; Nei and Li 1973; Ohta 1982a,b; Epperson and Allard 1987; Petersonet al. 1999; Farniret al. 2000). A number of earlier studies have focused on statistical methods for detecting the existence of linkage disequilibrium. A likelihoodbased procedure was developed by Hill (1974) to estimate the coefficient of linkage disequilibrium between two loci in a finite random mating population. Brown (1975) established a theoretical framework for the sample sizes required for detecting the disequilibrium by the use of data on gametic and zygotic frequencies. Weir and Cockerham (1978) suggested a statistical procedure for calculating the power of testing linkage disequilibrium between different loci of multiple alleles. More recently, Luo (1998) and Luo and Suhai (1999) proposed statistical approaches for testing and estimating linkage disequilibrium between a polymorphic marker and a putative QTL. All of these analyses have laid a necessary foundation for linkage disequilibrium mapping of disease genes in human populations (Hästbacka et al. 1992, 1994; Collins and Morton 1998; Escamillaet al. 1999; Serviceet al. 1999).
A major problem with current strategies for linkage disequilibrium mapping is that they provide little insight into the mechanistic basis of linkage disequilibrium detected in a natural population. Without such knowledge, however, the genomic localization and cloning of genes based on linkage disequilibrium may not be successful, because a strong linkage disequilibrium detected between two genetic loci may be due to the recent occurrence of disequilibrium rather than a close physical map distance of the two loci. In human genetics, the cause of linkage disequilibrium can be revealed through a combined linkage and linkage disequilibrium analysis, as shown by a transmission/disequilibrium testing (TDT) approach (Allison 1997; Rabinowitz 1997; Camp 1998). However, TDT is critically relied upon for nuclear family data with complete records for multiple successive generations. This approach therefore cannot be used for genome mapping in many other situations where no nuclear family records are available or for other undomesticated species, such as wildlife and forest trees. For these situations or species, it is essential for developing a powerful approach that needs no nuclear family but can still provide a simultaneous estimate for linkage and linkage disequilibrium between genetic loci of interest.
In this article, we propose a new strategy for detecting linkage and linkage disequilibrium between polymorphic markers in natural populations. The new strategy is expected to provide a new avenue for studying the evolutionary dynamics of population variation and differentiation. Furthermore, as compared to a pure linkage analysis or linkage disequilibrium analysis, the combined use of linkage and linkage disequilibrium analysis methods can greatly enhance the feasibility of highresolution mapping of genes of interest and their subsequent genetic manipulation. The strategy is presented in two parts, one on dioecious species and the other on monoecious species. Dioecious species including animals, humans, and many forest trees, such as Ginkgo, poplar, and willow, display a single sex for an individual and, therefore, are predominantly outcrossing. Monoecious species comprising most crop and horticultural plants and forest trees such as pine, fir, and spruce carry both sexes on every individual and could be both selfcompatible and outcrossing. We first deal with a simpler dioecious model. A more complicated statistical model for analyzing natural populations of monoecious species will be reported in a forthcoming companion article.
TWOLOCUS MODEL
Population structure theory: Consider a panmictic natural population of a dioecious species in HardyWeinberg equilibrium. In the population, η neutral codominant markers M^{1},..., M^{η} are assumed to be segregating. Let an allele at marker M^{i} (i = 1,..., η), designated by M^{i}_{r} (r = 1,..., n_{i}), have population frequency
Assume that a second marker M^{j} is located on the same chromosome as M^{i}, both markers having a recombination fraction θ^{ij}. These two linked markers are genetically associated in the population with the coefficient of gametic linkage disequilibrium between a pair of nonalleles from the two markers denoted by
If all zygotes can produce gametes for the next generation, there will be a total of n_{i}n_{j} gametes for markers M^{i} and M^{j} at the entire population level. But different zygotic genotypes produce different types of gametes; only the genotypes heterozygous at both markers generate all types of gametes whose relative frequencies are affected by recombination fraction and linkage disequilibrium. Table 1 gives nine zygotic genotypes, their population frequencies, and the frequencies of gametes they produce for the next generation under a simpler biallelic model (see appendix a for derivations). According to the population genetics theory (Nagylaki 1991), the amount of linkage disequilibrium between any two loci is reduced at the rate of recombination frequency after the population mates at random for one generation. The coefficient of linkage disequilibrium in the new generation is changed to be
Sampling theory: A sample of H female plants is randomly selected from the population. The seeds of these sampled plants are collected and germinated into seedlings. In traditional quantitative genetics, these seedlings grown in a regular experimental design initiate a progeny test, which serves as the selection of best parents for the next generations (McKeand and Bridgwater 1998). Both the sampled plants and their progeny are genotyped for molecular markers M^{1},..., M^{η}, each of an arbitrary number of alleles. According to the sampling theory, every randomly selected plant from the original population should be one of the ¼n_{i}n_{j}(n_{i} + 1)(n_{j} + 1) distinguishable genotypes for the two markers M^{i} and M j, each genotype with a frequency of
Estimation theory: The allele frequencies, linkage, and linkage disequilibrium for the markers M^{i} and M^{j} in the original population can be estimated using the random sample. To estimate these unknown genetic parameters associated with the two markers
There have been a number of computational algorithms available to obtain maximumlikelihood estimates (MLEs) of the four unknowns. In this article, the Fisherscoring algorithm based on iterations is employed (Edwards 1984) because it is easy to derive and also very fast. In terms of this algorithm, the estimates at the (τ + 1)th iteration can be expressed by the score function vector S(π^{ij}) and Fisher information matrix I(π^{ij}) (appendix b). The values at the τth iteration are modified by adding to them the scores divided by the information, both evaluated at the τth iteration. This iteration continues until successive iterates differ by less than some specified amount. It is apparent that the appropriateness of the Fisherscoring algorithm relies upon the condition that the information is not zero or the information matrix is nonsingular. In practice, it is always desirable to try several different starting values and to compare the likelihoods found after convergence. After obtaining the MLEs of the unknown parameters, the inverse of
For the purpose of linkage mapping, the degree of linkage between the two markers under consideration is important and should be tested statistically. The hypotheses for testing for linkage are H_{0} (free recombination), θ^{ij} = 0.5 vs. H_{1} (linkage), θ^{ij} ^{1} 0.5. The likelihoodratio (LR) test statistic has the form of
If the null hypothesis of (5a) is accepted, then a significant linkage disequilibrium detected by (5b) indicates that linkage disequilibrium between a pair of markers is not due to their strong linkage. In this case, results from pure linkage disequilibrium mapping (Luo 1998; Luo and Suhai 1999; Meuwissen and Goddard 2000) are ineffective for gene mapping. If nonsignificant linkage disequilibrium is detected for two linked markers, although this may be rare, the two markers are still useful for potential localization of a target gene. Thus, by testing simultaneously for the significance of linkage and linkage disequilibrium, our analytical approach increases both the effectiveness and efficiency of gene mapping in a natural population.
MARKER ORDERING
The principle for a joint linkage and linkage disequilibrium analysis of two markers can be extended to include more than two markers. This extension is based on two assumptions: (1) recombination between any two markers is independent of recombination between any other nonoverlapping two, i.e., no crossover interference; and (2) linkage disequilibrium between one pair of markers is independent of disequilibrium between other pairs. When there are more than two markers, the most likely linkage order should give the highest likelihood value for a particular dataset. With the two assumptions described above, we propose a hidden Markov model to determine an optimal order for different markers (see also Lander and Green 1987).
Assume that all η codominant markers are derived from the same chromosome in a randomly mating population. We use
The coefficient of linkage disequilibrium between a pair of nonalleles r_{i} and r_{i}_{+1} from two adjacent markers is denoted by
Similarly, the MLEs of the unknown vector π can be obtained by the Fisherscoring algorithm based on iterations (appendix b). The hypotheses for linkage and linkage disequilibrium for every two adjacent markers can be tested accordingly. Using the Markov chain model (6), we can only estimate the linkage disequilibria between two adjacent markers and ignore the estimates of disequilibria between distant markers. Such a result may be limited from a population genetic perspective, because one cannot detect all possible linkage disequilibria generated by evolutionary forces. However, this result can definitely facilitate genomic localization and cloning of genes because our objective is to use a nearest marker to manipulate a target gene of interest.
RESULTS
To demonstrate the statistical properties of the method proposed in this article, we analyze examples on the basis of simulations. In these examples, plants for seed collection are supposed to be randomly sampled from a natural population in HardyWeinberg equilibrium. The effects of different sampling schemes and parameter values on the estimates for unknowns are examined, respectively.
Effects of sampling schemes: Assume that the total number (1000) of the openpollinated progeny collected from all sampled plants is fixed. Five different sampling schemes are generated by changing the number of the sampled plants (H), each of which corresponds to a halfsib family, and the size of progeny (N) generated by each sampled plant (Table 3). These five schemes represent few large families, many small families, and moderately sized families of a moderate number. Among all the strategies, the value for each of the genetic parameters

Randomly assign the nine joint genotypes at the two markers M^{i} and M^{j} to the H sampled plants according to multinomial distribution with the probabilities as given in Table 1.

Randomly assign the twomarker genotypes to the openpollinated progeny generated by each sampled plant of a particular marker genotype according to the probabilities of the marker genotypes of the progeny given in Table 2.
Because the estimates for the four unknowns are based on known marker genotypes, a likelihoodbased approach has many desirable properties in the rate of convergence to achieve stable MLEs and the accuracy and power to obtain these estimates (Hill 1974). In this simulation, we compare the predicted variances of the estimates for these parameters from the asymptotic variancecovariance matrix of the MLEs to the empirical estimates calculated from the repeated simulations. The estimates of the parameters based on 100 runs are averaged and their standard errors are calculated. The sampling errors of the estimates based on a single run are calculated using the Fisher information index as described in appendix b.
Table 3 illustrates the MLEs for each of the four unknown parameters and two types of standard errors under different sampling schemes. In all situations, regardless of the combinations of family number (H) and size (N), the MLEs of the allele frequencies for two hypothesized markers using the estimation procedures developed in this article are adequately consistent with their actual values. The same is also true for the MLEs of recombination fraction and linkage disequilibrium between the two markers. Results from statistical tests based on Equations 5a and 5b indicate that the alleles of these two different markers are physically significantly linked and genetically significantly associated in the population.
In this example, the predicted values for standard errors estimated from the inverse of the information matrix are reasonably approximate to their empirical values from multiple simulation runs (Table 3). This may be partly because our parameter estimates are based on complete marker information without missing data (see also Luo and Suhai 1999). As assessed by these two types of estimates for standard errors, the method proposed here has good precision for estimating the population genetic parameters of molecular markers. The power to detect significant linkage or linkage disequilibrium between the two simulated markers is higher for few families of large sizes than for many families of small sizes. But in all sampling schemes, the power is 0.90 or higher.
Effects of linkage and linkage disequilibrium: In this simulation, we assume five biallelic markers with a known order on the same chromosome. These markers are jointly sampled from a natural population in which allele frequency is set to be
Generally, the estimates of allele frequency are not much affected by the degrees of linkage and linkage disequilibrium of markers (Table 4), with consistent results from separate and joint analyses. The estimation precision of recombination fraction and linkage disequilibrium can be much increased when two markers are tightly linked or display low nonrandom association between the allelic frequencies of the markers (Table 4). Both accuracy and precision of parameter estimates from a separate analysis are largely reduced when two markers have loose linkage and strong disequilibrium. However, these can be much improved by using a joint analysis of all the five markers based on a Markov model.
DISCUSSION
The originality of the statistical method proposed in this study is a combined use of the current linkage analysis and linkage disequilibriumbased mapping theory to simultaneously estimate genetic map distances and population genetic associations of markers using random samples drawn from a natural population. Linkage analysis looks for coinheritance of different markers or QTL within a chromosomal region, while linkage disequilibrium looks for differences in the frequency of marker alleles between genotypes of a different marker or different categories of a phenotype. The combined analysis not only can overcome the limitations of linkage analysis, as noted in the Introduction, but also can increase the effectiveness and efficiency of linkage disequilibrium mapping aimed at precise estimation of gene location. The new analytical method can be seen as an extension of linkage disequilibrium mapping for human pedigrees with complete family records toward any types of natural populations.
In this article, a mapping model is developed for dioecious plant species. The progeny of random samples collected from a dioecious population form a series of openpollinated (or halfsib) families each with a common female parent and different male parents. The experimental strategy for including both the sampled plants and their progeny for genome mapping offers a unique opportunity to study the transmission of genes from the parental to progeny generation, which causes the breakdown of linkage through meiotic recombination and, thus, the dissipation of linkage disequilibrium between two markers. Unlike previous strategies for a linkage disequilibrium analysis (Hill 1974; Weir and Cockerham 1978), the new strategy, therefore, can capture the intrinsic relationship between linkage and linkage disequilibrium. In human genetic mapping, simultaneous estimation of linkage and linkage disequilibrium is based on nuclear family data (Allison 1997). Such a strategy cannot take advantage of analyzing random samples from a natural population in that one can collect sample sizes as large as those considered in the simulation study. In practice, the new strategy can make an immediate application to many plant species in which progeny tests have been established in the field for a number of years (McKeand and Bridgwater 1998).
Given a fixed sample size, our simulation study has focused on the influence of different allocations of the samples between and within families on parameter estimation. When a sample size is adequately large, for instance, as is that used in our example, the precise estimation of genetic parameters, allele frequencies, linkage, and linkage disequilibrium for markers can be obtained, irrespective of few large families or many small families. Such an advantage for the strategy proposed in this article results from two reasons. First, our mapping analysis is established on the foundation of both parental generation and openpollinated progeny generation. As a random sample, the parental generation contains as much full information about marker allele frequencies, linkage, and linkage disequilibrium as the original population. Unlike fullsib families, openpollinated families used in our strategy contain full information not only about marker linkage but also about marker population genetic properties due to the contribution of the paternal gametes (pollen) from the population. Second, our linkage analysis of known marker genotypes includes no missing information, a situation not analogous to QTL mapping in which the genotypes at QTL are unknown.
In this study, we implement the Fisherscoring algorithm to obtain the MLEs of unknown parameters defining the likelihood function of a marker dataset. The Fisherscoring algorithm is computationally faster and can be more easily derived (Edwards 1984), as compared to the expectationmaximization (EM) algorithm (Dempsteret al. 1977) or Markov chain Monte Carlo method (MCMC; Hoescheleet al. 1997). Its implementation permits a multiple sampling technique to be used more conveniently. Also, this algorithm can provide the estimates for the asymptotic variances of the parameter estimates. For parameter estimates of known markers with joint genotypes in a multinomial distribution, the asymptotic variances estimated from the Fisher information matrix can adequately describe their sampling errors, especially for large samples, although this may not be an actual case for QTL mapping as seen in Kao and Zeng (1997). In some situations, the calculation of the sampling errors from the Fisher information matrix may encounter negative definitive or singular problems of the matrix. Although these may not be serious for the linkage mapping of known markers, it is advisable to try several different starting values for the parameters to be estimated.
In our experience, a simple Fisherscoring algorithm is sufficient for analyzing informative markers of known genotypes. However, for a real dataset, there may be many marker types of different segregating patterns. Some markers may be dominant and others may be incomplete or misscored. Many questions for treating these noninformative markers are still open. For example, can we extract useful information from these markers to globally enhance our joint linkage and linkage disequilibrium analysis throughout an entire genome? If yes, how do we make this more efficient? Because of the involvement of the markers of missing information, the Fisherscoring algorithm may be insufficient for parameter estimation. The EM algorithm or MCMC methodology should be developed to effectively handle these missing data. In addition, when the idea for a combined linkage and linkage disequilibrium analysis is extended to map QTL of unknown genotypes, which is viewed as a missing data problem, the Fisherscoring algorithm may be very limited. For QTL mapping, more advanced approaches, such as EM algorithm or MCMC, should be developed. Although these approaches are computationally demanding, they can take account of the distribution of multilocus markerQTL genotypes and permit investigators to fit different models of variation at the QTL.
One of the major contributions of this study is to derive general formulas for estimating allelic frequencies, recombination fractions, and linkage disequilibria for multiallelic markers in natural populations. A number of molecular experiments have demonstrated that multiple alleles per genetic locus are very common in undomesticated populations, such as forest trees (Degenet al. 1999). Also, the capacity to detect multiallelic markers is largely enhanced by the development of new biotechnologies such as microsatellites. Analyses of multiallelic markers can be simplified by collapsing them into a few alleles at each locus. But, as found by Weir and Cockerham (1978), this simplification may change the power of detecting linkage disequilibrium and lose some important information about disequilibrium inferences. Thus, it is especially not advisable to use a collapsed set of data when the aim of a study is precise localization of QTL and its subsequent positional cloning.
Our mapping approach here is based on a twopoint analysis. We further extend the simple twopoint analysis to include all markers from the same chromosome through a Markov model. Such a joint twopoint analysis can increase both accuracy and precision of parameter estimation, as demonstrated by a simulation study. For two markers that are not strongly linked but strongly associated between their allelic frequencies, the twopoint analysis excluding other marker information likely has low precision (Table 4). But when other markers are included, the precision of the analysis of these two markers is much increased. Apart from the improvement of the precision of parameter estimation, a joint twopoint analysis based on a Markov model can facilitate the ordering of molecular markers on a chromosome (see also Lander and Green 1987). The other means of ordering markers is to develop a multipoint analysis. Although this is mathematically very complicated, the extension of our method to a multipoint analysis is straightforward. Assume that there are three different markers with an order M^{i}, M^{j}, and M^{k} on the same chromosomes. A threelocus gamete
Many of our species are still in wild states and are of great importance in terms of their economical significance and theoretical values of biological research. For example, as evidenced in Tanksley and McCouch (1997), a number of favorable diseaseresistant genes in crop plants are currently warehoused in natural populations of their wild relative species and can be made useful to humans if the number, effects, and locations of these genes are understood. From an evolutionary perspective, knowledge about the organization and structure of wild populations helps us to understand the genetic mechanisms of population evolution and make reasonable predictions about the dynamic changes of the populations (Barton 2000). Unfortunately, the genelevel studies of genetic architecture and inheritance mode for a complex trait in natural populations are surprisingly rare as a result of the paucity of powerful tools to effectively analyze these populations. Although the method reported here is developed for dioecious species, its extension to monoecious species is possible, but requires an additional mathematical manipulation on outcrossing rate, a population genetic parameter that describes the relative importance of outcrossing pollination to selfing pollination within the same plant. With such powerful mapping approaches available to different kinds of species, we are close to addressing many theoretical or practical genetic questions in depth for natural populations.
Acknowledgments
We are grateful to Dr. George Casella, Dr. Bruce Weir, and Dr. Mark Yang for stimulating discussions about this work; Dr. James Hobert and Dr. Kenneth Portier for helpful readings of this manuscript; and Dr. Gary Churchill and two anonymous referees for thoughtful comments on this manuscript. This research is partially supported by a grant (GM 45344) from the National Institutes of Health.
APPENDIX A: DERIVATION OF GAMETE FREQUENCIES
When a zygotic genotype is homozygous for both markers M^{i} and M^{j}, only one type of gamete is produced and, thus, recombinant and nonrecombinant gametes are mixed. When a zygotic genotype is homozygous for a marker but heterozygous for the other marker, two types of gametes are produced. In this case, one still cannot distinguish between recombinant and nonrecombinant gametes, because each gamete type is mixed. However, for a genotype that is heterozygous at both markers, four types of gametes can be produced. The frequency for each of the four gamete types produced by a twomarker heterozygous genotype is dependent on the recombination fraction of the markers and the frequency with which the heterozygous genotype was yielded through gamete combination in the previous generation. There are two ways for yielding the heterozygous genotype
APPENDIX B: FISHERSCORING ALGORITHMS FOR OBTAINING MLES OF π
For the Fisherscoring algorithm based on iterative steps, the estimates at the (τ + 1)th iteration can be expressed by
Footnotes

Communicating editor: G. A. Churchill
 Received April 19, 2000.
 Accepted October 30, 2000.
 Copyright © 2001 by the Genetics Society of America