Abstract
This article presents methodology for the construction of a linkage map in an autotetraploid species, using either codominant or dominant molecular markers scored on two parents and their full-sib progeny. The steps of the analysis are as follows: identification of parental genotypes from the parental and offspring phenotypes; testing for independent segregation of markers; partition of markers into linkage groups using cluster analysis; maximum-likelihood estimation of the phase, recombination frequency, and LOD score for all pairs of markers in the same linkage group using the EM algorithm; ordering the markers and estimating distances between them; and reconstructing their linkage phases. The information from different marker configurations about the recombination frequency is examined and found to vary considerably, depending on the number of different alleles, the number of alleles shared by the parents, and the phase of the markers. The methods are applied to a simulated data set and to a small set of SSR and AFLP markers scored in a full-sib population of tetraploid potato.
Genetic linkage maps are now available for humans and for a large number of diploid plant and animal species. In contrast, mapping studies in polyploid species are much less advanced, partly due to the complexities in analysis of polysomic inheritance as demonstrated in, for example, Mather (1936), De Winton and Haldane (1931), Fisher (1947), and Bailey (1961). The development of DNA molecular markers [restriction fragment length polymorphisms (RFLPs), amplified fragment length polymorphisms (AFLPs), randomly amplified polymorphic DNAs (RAPDs), simple sequence repeats (SSRs), and single nucleotide polymorphisms (SNPs), etc.] and advances in computer technology have made both theoretical and experimental studies of polysomic inheritance much more feasible than ever before. Some of these markers have recently been used as a fundamental tool to construct genetic linkage maps in polyploid species that display polysomic inheritance (Al-Janabi et al. 1993; Da Silva et al. 1993; Yu and Pauls 1993; Hackett et al. 1998; Brouwer and Osborn 1999), to search for quantitative trait loci (QTL) affecting disease resistance in tetraploid potato (Bradshaw et al. 1998; Meyer et al. 1998), and to investigate population structure in autotetraploid species (Ronfort et al. 1998).
Due to a lack of well-established theory for mapping genetic markers in polyploid species, much research has been based on strategies by which the complexities involved in modeling polysomic inheritance can be avoided. These involve either the use of single-dose (simplex) dominant markers (e.g., AFLPs and RAPDs) that segregate in a simple 1:1 ratio in segregating populations or use of the corresponding diploid relative as an approximation to the polyploid case (Bonierbale et al. 1988; Gebhardt et al. 1989). More recently, Hackett et al. (1998) presented a theoretical and simulation study on linkage analysis of dominant markers of different dosages in a full-sib population of an autotetraploid species, and this approach was used by Meyer et al. (1998) to develop a linkage map in tetraploid potato.
The use of codominant markers, particularly those with a high degree of polymorphism such as SSRs, is known to improve the efficiency and accuracy of linkage analysis in diploid species (Terwilliger et al. 1992; Jiang and Zeng 1997). In polyploid species, the relationship between the parental genotype and the phenotype as shown by the gel band pattern is less clear-cut, due to the possibilities of different dosages of alleles, and this provides extra complexity as explained in Luo et al. (2000). The aim of the present study is to develop methodology for constructing linkage maps of codominant or dominant genetic markers in autotetraploid species under chromosomal segregation, i.e., the random pairing of four homologous chromosomes to give two bivalents. The complications arising from quadrivalent or trivalent plus univalent formation are not considered in this article. A series of problems involved in tetrasomic linkage analysis are addressed. Statistical properties of the methods are investigated by theoretical analysis or simulation study, and some experimental data from a tetraploid potato study are used to illustrate the use of the theory and methods in analyzing breeding experiments.
THEORY OF LINKAGE MAP CONSTRUCTION
Model and notation: The theoretical analysis considers a full-sib family derived from crossing two autotetraploid parental lines. Let Mi (i = 1, ..., m) be m marker loci (with dominant or codominant inheritance). Let G1 and G2 be the genotypes at the marker loci for the two parental individuals, respectively. Gi (i = 1, 2) can be expressed as an m × 4 matrix. Because two tetraploid individuals have at most eight distinct alleles, we represent each element of Gi as a letter A-H or O, where O represents the null allele due to mutation within primer sequences (see, for example, Callen et al. 1993). It is important to note that allele A at marker locus 1 is different from allele A at marker locus 2.
When we are considering linked loci, it is often necessary to specify how the alleles at different loci are grouped into homologous chromosomes, i.e., the linkage phases of the alleles. Alleles linked on the same homologous chromosome will appear in the same column of the matrix Gi. For a two-locus genotype with four different alleles at each locus, one possible genotype is
We define P1 and P2 to be the phenotypes of the two parents, i.e., their gel band patterns at the marker loci. Pi (i = 1, 2) can be denoted by an m × 8 matrix, each of whose elements takes a value of 1, indicating presence of a band at the corresponding gel position, or 0, indicating absence of a band. These matrices carry no information about phase. The jth rows of Gi and Pi correspond to locus Mj. Let OMi be the n × 8 matrix of phenotypes of the n offspring at the marker locus Mi.
In general, there is no simple one-to-one relationship between the phenotype and the genotype of markers scored in tetraploid individuals. There are two reasons for this. First, a multiple dosage of an allele cannot be distinguished from a single dosage on the basis of the gel band pattern. Second, some alleles may not be revealed as the presence of a corresponding gel band, i.e., the null alleles. Table 1 summarizes the relationship between genotype and phenotype at a marker locus in which all possible cases of null alleles and multiple dosages of identical alleles are taken into account. It can be seen from Table 1 that there may be four, six, four, or one corresponding genotype(s) if the parental phenotype shows one, two, three, or four bands. An individual genotype can be uniquely inferred from its phenotype if and only if the individual carries four different alleles and these alleles are also observed as four distinct bands.
Table 1. The relationship between marker phenotypes and genotypes at a single locus for an individual
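The genotype counts quoted above (four, six, four, or one genotype for one, two, three, or four bands) can be checked by brute-force enumeration. The following Python sketch is illustrative only: the function names `phenotype` and `genotypes_for_bands` are hypothetical, and it assumes that a genotype is an unordered set of four alleles in which the null allele O never produces a band.

```python
from itertools import combinations_with_replacement

NULL = "O"  # null allele: present in the genotype, invisible on the gel


def phenotype(genotype):
    """Return the set of visible bands for a four-allele genotype.

    Dosage is lost (AABO and ABBO both show bands A and B) and the
    null allele contributes no band.
    """
    return frozenset(a for a in genotype if a != NULL)


def genotypes_for_bands(bands):
    """List every tetraploid genotype consistent with the observed bands."""
    alleles = sorted(bands) + [NULL]
    target = frozenset(bands)
    return [
        "".join(g)
        for g in combinations_with_replacement(alleles, 4)
        if phenotype(g) == target
    ]


if __name__ == "__main__":
    for bands in ("A", "AB", "ABC", "ABCD"):
        print(bands, genotypes_for_bands(bands))
```

Only the four-band phenotype maps back to a single genotype, which is why the offspring data are needed to resolve the other cases.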
Luo et al. (2000) recently developed a method for predicting the probability distribution of genotypes of a pair of parents at a codominant (for example, RFLPs, microsatellites) or dominant (for example, AFLPs, RAPDs) marker locus on the basis of their and their progeny’s phenotypes scored at that locus. This approach infers the number of possible configurations of the parental genotypes with the corresponding probabilities, conditional on the parental and offspring phenotypes. For each of the predicted parental genotypic configurations, the expected number of offspring phenotypes and their frequencies can be calculated and compared to the observed frequencies. Results from a simulation study and analysis of experimental data showed that in many circumstances both the parental genotypes can be correctly identified with a probability of nearly 1. A tetrasomic linkage analysis can then be carried out using the most probable parental genotype, or using each of a set of possible parental genotypes in turn if more than one genotype is consistent with all the phenotypic data. This is illustrated in the following analyses of data from simulation and experimental studies.
The steps of the linkage analysis are (i) the prediction of the parental genotype(s) that are consistent with the parental and offspring phenotype data using the method described in Luo et al. (2000); (ii) the detection of linkage between pairs of marker loci and their partition into linkage groups; (iii) the estimation of linkage phase, recombination frequency, and LOD score for pairs of markers within each linkage group; and (iv) the ordering of markers within each linkage group. The power to detect linkage and the variance of the estimates of the recombination frequency are shown to vary considerably with parental configuration and phase, and this will be examined.
Test for independent segregation of loci: The first step of the linkage analysis is to test whether pairs of loci are segregating independently. We propose that this may be investigated for each pair of markers by representing their joint segregation in a two-way contingency table and testing for independent segregation, as discussed by various authors (e.g., Maliepaard et al. 1997) for diploid crosses. Let $n_{ij}$ be the observed number of progeny with the ith (i = 1, 2,..., I) marker phenotype at the first locus and the jth (j = 1, 2,..., J) marker phenotype at the second locus. The expected number under independent segregation is $e_{ij} = n_{i\cdot} n_{\cdot j}/n$, where $n_{i\cdot} = \sum_j n_{ij}$ and $n_{\cdot j} = \sum_i n_{ij}$ are the row and column totals and $n$ is the total number of progeny. Independence can then be tested with Pearson's chi-square statistic $X^2 = \sum_i \sum_j (n_{ij} - e_{ij})^2/e_{ij}$ or the likelihood-ratio statistic $G^2 = 2\sum_i \sum_j n_{ij} \log(n_{ij}/e_{ij})$, each referred to a chi-square distribution with (I - 1)(J - 1) d.f.
The power of Pearson’s chi-square test to detect linkage was examined for 100 simulations of each of a range of configurations, linkage phases, and true recombination frequencies. For true recombination frequencies r ≤ 0.2, the power was generally 100% (i.e., the hypothesis of independent segregation was always rejected) for a significance level α = 0.01 and >90% for r ≤ 0.3. The exceptions to this were configurations with alleles restricted to simplex repulsion or duplex mixed configurations; e.g., for cross AB/AA/BA/BB × CC/CD/DD/DC, with all alleles in duplex mixed configurations, and a true recombination frequency of 0.2, independent segregation was rejected for 3/100 simulations. When the markers were genuinely unlinked, the rejection rate for a significance level α = 0.05 was found to be close to 5% for all configurations examined.
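As a minimal illustration of the independence test, the Python sketch below computes Pearson's chi-square statistic and the likelihood-ratio statistic for a two-way table of offspring counts. The function name `independence_statistics` and the example table are hypothetical, and the sketch assumes that every row and column total is positive. The significance level would then be obtained from a chi-square distribution with the returned degrees of freedom.

```python
from math import log


def independence_statistics(table):
    """Return (X2, G2, df) for a two-way contingency table.

    table[i][j] is the number of offspring with phenotype i at the
    first locus and phenotype j at the second locus.
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    x2 = 0.0
    g2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            e_ij = row_tot[i] * col_tot[j] / n  # expected count under independence
            x2 += (n_ij - e_ij) ** 2 / e_ij
            if n_ij > 0:  # empty cells contribute nothing to G2
                g2 += 2.0 * n_ij * log(n_ij / e_ij)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, g2, df


if __name__ == "__main__":
    # Hypothetical counts for two loci with two marker phenotypes each.
    print(independence_statistics([[10, 20], [20, 10]]))
```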
Partition of loci into linkage groups: Cluster analysis is a suitable technique to partition the marker loci into linkage groups, so that a marker segregates independently of markers in different linkage groups and shows a significant association with at least some of the other markers within its linkage group. The above test statistics depend on the number of marker phenotypes at each locus, but the significance level of the test for independent segregation is comparable for all pairs and could be regarded as a distance between loci. Although it ranges from 0 for the most tightly linked loci to 1, the range (0, 0.05) is of most interest for indicating pairs of loci that are likely to be linked. We therefore prefer to transform the significance level, say s, to a measure of distance that gives more discrimination between the distances of most interest. The transformation $d = 1 - 10^{-2s}$, which maps the range of the significance level (0, 0.05) to the range of the distance measure (0, 0.21), was used here, although many alternative transformations are possible. Different clustering methods will give slightly different dendrograms: the nearest-neighbor cluster analysis adds a marker to a cluster according to its distance to the closest marker in the cluster, but can combine large groups on the strength of one marker from each subgroup. We prefer to compare the dendrogram from nearest-neighbor cluster analysis with that from average linkage cluster analysis to avoid such “chaining.” Inspection of the clustering at distances corresponding to different levels of significance will indicate how the marker loci should be partitioned into linkage groups. In practice, the criterion for partitioning the dendrogram into different linkage groups can be determined as the distance measure by which significant linkage is inferred. However, the Bonferroni correction for the overall significance level may be necessary to take the multiple linkage tests into account.
The calculation of recombination frequencies and LOD scores then proceeds for each linkage group in turn.
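The distance transformation and the nearest-neighbor grouping can be sketched as follows. This is a simplified illustration rather than the cluster analysis used in the study: instead of building a full dendrogram, it cuts the nearest-neighbor (single linkage) clustering at one threshold, and the function names and example significance levels are hypothetical.

```python
def significance_to_distance(s):
    """Transform a significance level s into the distance d = 1 - 10^(-2s)."""
    return 1.0 - 10.0 ** (-2.0 * s)


def single_linkage_groups(names, dist, threshold):
    """Group markers joined by any chain of pairwise distances <= threshold.

    dist maps (marker_a, marker_b) to a distance. Merging on a single
    close pair is what makes nearest-neighbor clustering prone to chaining.
    """
    parent = {m: m for m in names}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m

    for (a, b), d in dist.items():
        if d <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for m in names:
        groups.setdefault(find(m), []).append(m)
    return sorted(groups.values())


if __name__ == "__main__":
    names = ["M1", "M2", "M3", "M4"]
    # Hypothetical significance levels of the pairwise independence tests.
    pvals = {("M1", "M2"): 0.0001, ("M2", "M3"): 0.001, ("M1", "M3"): 0.2,
             ("M1", "M4"): 0.6, ("M2", "M4"): 0.5, ("M3", "M4"): 0.9}
    dist = {pair: significance_to_distance(s) for pair, s in pvals.items()}
    print(single_linkage_groups(names, dist, significance_to_distance(0.01)))
```

In the example, M1 and M3 fall in the same group through M2 even though their own test is not significant, which is the behavior that the comparison with average linkage clustering is meant to expose.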
Table 2. Percentage points for the distribution of 500 replicates of test statistics for independent segregation of two loci with parental genotypes AA/BB/CC/DD × EE/FF/GG/HH
Calculation of segregation probabilities: One of the major difficulties in linkage analysis with tetraploid species is to calculate the conditional distribution of the offspring genotypes, and hence phenotypes, at two linked loci for any given pair of parental genotypes. This involves consideration of a large number of segregation and recombination events. In this section, a general computer-based algorithm is described to compute the probability distribution.
For simplicity but without loss of generality, we use A and B for two loci in this section and subscripts to represent the alleles. Consider a parental genotype $A_iB_i/A_jB_j/A_kB_k/A_lB_l$. During gametogenesis of the individual, three equally likely pairs of bivalents can be generated, i.e., $A_iB_i/A_jB_j//A_kB_k/A_lB_l$, $A_iB_i/A_kB_k//A_jB_j/A_lB_l$, and $A_iB_i/A_lB_l//A_jB_j/A_kB_k$, where “//” is used to distinguish paired homologous chromosomes. The gametes created from each of these pairs of bivalents can be sorted into three classes: (i) nonrecombinants, $A_\xi B_\xi A_\eta B_\eta$ ($\xi \neq \eta$; $\xi$ and $\eta$ may be i, j, k, or l), four gametic genotypes, each of which has a frequency of $(1 - r)^2/4$; (ii) single recombinants, $A_\xi B_\eta A_\gamma B_\gamma$ ($\xi$, $\eta$, and $\gamma$ all different; each may be i, j, k, or l), eight gametic genotypes, each with a frequency of $r(1 - r)/4$; (iii) double recombinants, $A_\xi B_\eta A_\gamma B_\delta$ ($\xi$, $\eta$, $\gamma$, and $\delta$ all different; each may be i, j, k, or l), four gametic genotypes, each with a frequency of $r^2/4$. Thus, when the three possible pairs of bivalents are considered, a general form for frequency of the gametic genotype i can be written as
To evaluate the coefficients $y_{ij}$ manually is obviously very tedious. A computer algorithm was developed to calculate the offspring’s genotypic distribution for any given pair of tetraploid parental genotypes. The computer subroutine outputs the number of all possible distinct offspring genotypes k and {$y_{ij}$} (i = 1, 2,..., k) from the two parental genotypes. For example, if two parental genotypes are AA/BB/BB/OB and CA/DA/EC/EO, there are a total of 225 possible genotypes in their offspring. Many of these offspring genotypes correspond to the same phenotype. Thus, the phenotypic distribution of the offspring can be readily derived by combining the probabilities of those genotypes that result in the same phenotype, so that the general formula for the probability of zygote phenotype i is $f_i = \sum_{g \in i} \sum_{j=0}^{4} y_{gj} r^j (1 - r)^{4-j}/144$ (5), where the first sum runs over the offspring genotypes g that give phenotype i.
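The gametic part of this calculation can be illustrated with a short enumeration. The Python sketch below is not the authors' subroutine; it is a minimal reconstruction under the stated model (random bivalent pairing, no double reduction, recombination frequency r between the two loci), and the names `PAIRINGS`, `bivalent_products`, `gamete_coefficients`, and `gamete_frequencies` are hypothetical. Each gamete is given integer coefficients c_0, c_1, c_2 such that its frequency is the sum over j of c_j r^j (1 - r)^(2 - j), divided by 12.

```python
from collections import defaultdict
from itertools import product

# The three equally likely ways of pairing four homologues into two bivalents.
PAIRINGS = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]


def bivalent_products(h1, h2):
    """Chromosomes a bivalent can transmit, with the crossover count (0 or 1)."""
    return [
        ((h1[0], h1[1]), 0), ((h2[0], h2[1]), 0),  # nonrecombinant
        ((h1[0], h2[1]), 1), ((h2[0], h1[1]), 1),  # recombinant
    ]


def gamete_coefficients(parent):
    """Map each gamete to [c0, c1, c2]; parent is four (allele_A, allele_B) pairs."""
    coeffs = defaultdict(lambda: [0, 0, 0])
    for (i, j), (k, l) in PAIRINGS:
        first = bivalent_products(parent[i], parent[j])
        second = bivalent_products(parent[k], parent[l])
        for (c1, x1), (c2, x2) in product(first, second):
            gamete = tuple(sorted((c1, c2)))  # the two chromosomes are unordered
            coeffs[gamete][x1 + x2] += 1
    return dict(coeffs)


def gamete_frequencies(parent, r):
    """Evaluate the gamete frequencies at recombination frequency r."""
    return {
        g: sum(c[j] * r ** j * (1 - r) ** (2 - j) for j in range(3)) / 12.0
        for g, c in gamete_coefficients(parent).items()
    }


if __name__ == "__main__":
    parent = [("A1", "B1"), ("A2", "B2"), ("A3", "B3"), ("A4", "B4")]
    for gamete, c in sorted(gamete_coefficients(parent).items()):
        print(gamete, c)
```

Combining the gamete distributions of the two parents, and then pooling genotypes that share a phenotype, gives offspring probabilities of degree four in r with the denominator 144 = 12 × 12.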
Maximum-likelihood estimate of r: If the parental genotypes and their linkage phase are known, the joint expected phenotypic distribution of their offspring can be derived using the method suggested above. The corresponding observed offspring phenotypes at the marker loci can be recognized as a random sample from a multinomial distribution with probabilities $f_i$ (i = 1, 2,..., k) and sample size $n = \sum_{i=1}^{k} n_i$, where k is the number of possible phenotypes and $n_i$ is the observed number of offspring in the ith phenotype class. Thus, the log-likelihood of the recombination frequency, r, given the observed data at loci Mi and Mj, is given, up to an additive constant that does not depend on r, by $L(r) = \sum_{i=1}^{k} n_i \log f_i$. (6)
The maximum-likelihood estimate (MLE) of the recombination frequency r may be obtained by solving $dL/dr = \sum_{i=1}^{k} (n_i/f_i)(df_i/dr) = 0$. (7)
Table 3. Phenotypic distribution of a full-sib family from crossing two autotetraploid genotypes AA/BB/BB/OB and CA/DA/EC/EO
In Equation 5, define $z_{ij} = \sum_{g \in i} y_{gj} r^j (1 - r)^{4-j}/144$, so that the probability of phenotype i is $f_i = \sum_{j=0}^{4} z_{ij}$. Substituting this into Equation 7, and noting that $dz_{ij}/dr = z_{ij}(j - 4r)/[r(1 - r)]$, we obtain $r = \frac{1}{4n} \sum_{i=1}^{k} \frac{n_i}{f_i} \sum_{j=0}^{4} j z_{ij}$. Because the right-hand side itself depends on r, this equation is solved iteratively.
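The iterative solution of the likelihood equation can be sketched as an EM-type fixed-point iteration: each offspring carries four chromosomes that may be recombinant, so the update sets r to the expected number of recombinant chromosomes, given the phenotypes, divided by 4n. The Python code below is a minimal sketch, not the authors' program. The names `class_terms`, `em_recombination`, and `lod_score` are hypothetical, `coeffs[i][j]` stands for the sum over g in i of y_gj/144, and the three-class example is invented so that the exact answer (r = 0.2) is known.

```python
from math import log10


def class_terms(coeffs, r):
    """z[i][j] = coeffs[i][j] * r^j * (1 - r)^(4 - j) for each phenotype class i."""
    return [[c * r ** j * (1 - r) ** (4 - j) for j, c in enumerate(row)]
            for row in coeffs]


def em_recombination(counts, coeffs, r=0.1, tol=1e-10, max_iter=1000):
    """Iterate r <- sum_i n_i E[j | class i] / (4 n) from a start inside (0, 1).

    Assumes every observed class has positive probability at the current r.
    """
    n = sum(counts)
    for _ in range(max_iter):
        z = class_terms(coeffs, r)
        expected = sum(
            n_i * sum(j * z_ij for j, z_ij in enumerate(z_i)) / sum(z_i)
            for n_i, z_i in zip(counts, z) if n_i > 0
        )
        r_new = expected / (4.0 * n)
        if abs(r_new - r) < tol:
            return r_new
        r = r_new
    return r


def lod_score(counts, coeffs, r_hat):
    """log10 likelihood at r_hat minus log10 likelihood at r = 0.5."""
    def loglik(r):
        return sum(n_i * log10(sum(z_i))
                   for n_i, z_i in zip(counts, class_terms(coeffs, r)) if n_i > 0)
    return loglik(r_hat) - loglik(0.5)


if __name__ == "__main__":
    # Hypothetical classes: j = 0; j = 1 or 2; j = 3 or 4 recombinant chromosomes.
    coeffs = [[1, 0, 0, 0, 0], [0, 4, 6, 0, 0], [0, 0, 0, 4, 1]]
    counts = [4096, 5632, 272]  # expected counts out of 10,000 at r = 0.2
    r_hat = em_recombination(counts, coeffs)
    print(r_hat, lod_score(counts, coeffs, r_hat))
```

Because the likelihood may have more than one stationary point, the starting value matters in general; in practice the iteration would be repeated for each candidate linkage phase.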
Estimation of parental pairwise linkage phases: In the above analyses, it was assumed that the parental genotypes and their linkage phases were known. In practice, only the parental and offspring phenotypes are observable. As pointed out in the Model and notation section, Luo et al. (2000) calculate the genotypic distribution for any pair of tetraploid parents at a single dominant or codominant marker locus using data on the marker phenotypes scored on the parents and their offspring. However, the method does not provide information about the linkage phases of the alleles the parents carry at different loci. Knowledge about the linkage phase of the parental genotypes is not only required in the linkage analysis, but it is also important in using the map information in locating QTL (e.g., Lander and Botstein 1989) or optimizing schemes of marker-assisted selection for quantitative traits (Luo et al. 1997).
The number of possible different linkage phases depends on the number of distinct alleles at each locus and increases exponentially with the number of loci under consideration. We therefore consider here the phase for each pair of linked loci and use these as building blocks to estimate the multilocus linkage phase. In a two-locus system of tetrasomic inheritance, an individual genotype may have a maximum of 4 × 3 × 2 = 24 distinct linkage phases, and for a pair of individuals there may be a maximum of 24 × 24 = 576 distinct linkage phase configurations. A Fortran-90 computer subroutine was developed to work out all possible linkage phase configurations for any given pair of parental genotypes G1 and G2 at any two loci i and j. Let S1 and S2 be possible two-locus linkage phases for parents 1 and 2, respectively. The likelihood of r, S1, and S2, given the observed phenotypic data OMi and OMj at the loci, may be written as
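The count of distinct two-locus linkage phases can be reproduced by permuting the alleles at the second locus against those at the first and discarding arrangements that differ only in the order of the chromosomes. The Python sketch below is an illustration, not the Fortran-90 subroutine; the function name `linkage_phases` is hypothetical.

```python
from itertools import permutations


def linkage_phases(locus1, locus2):
    """Return the distinct phases of a two-locus tetraploid genotype.

    locus1 and locus2 are the four alleles at each locus (e.g. "AAOO").
    A phase is the unordered set of four chromosomes, each written as
    (allele at locus 1, allele at locus 2).
    """
    phases = set()
    for perm in permutations(locus2):
        # Sorting removes the arbitrary order of the four chromosomes.
        phases.add(tuple(sorted(zip(locus1, perm))))
    return sorted(phases)


if __name__ == "__main__":
    print(len(linkage_phases("ABCD", "EFGH")))  # four distinct alleles per locus
    for phase in linkage_phases("AAOO", "BBOO"):  # duplex at both loci
        print("/".join(a + b for a, b in phase))
```

With four distinct alleles at each locus the function returns the maximum of 24 phases; repeated alleles reduce the number, for example to the coupling, mixed, and repulsion phases for two duplex loci. The phase configurations of a cross are the pairs formed from the two parental lists.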
Ordering the markers: The above analyses give the maximum-likelihood estimate of the recombination frequency and the linkage phase for each pair of markers in a linkage group. This information can be used to order the markers in linkage groups and to calculate map distances between them. One possible approach, the least-squares method for estimation of multilocus map distances as implemented in the JoinMap linkage software (Stam and Van Ooijen 1995), was examined by Hackett et al. (1998) in a simulation study of dominant markers in a tetraploid population. They concluded that the reconstructed marker order and map distance using the JoinMap analysis of simulation data was in good agreement with the simulated ones. The same method was used here.
Estimation of parental multilocus linkage phases: Once the markers have been ordered, we need to reconstruct the phase of the complete linkage group. Direct prediction of the multilocus linkage phase in tetrasomic linkage analysis is not feasible: the number of possible phase configurations is huge, and there is no appropriate theory of multilocus linkage analysis for tetraploid species. Here we propose an intuitive algorithm to predict the multilocus parental linkage phase in the tetrasomic linkage analysis, on the basis of the range of likelihood values of the alternative linkage phases obtained in the above two-locus analysis. Let dij be the difference in the log-likelihood value between the most likely and the second most likely linkage phases predicted for the marker loci i and j on a linkage group. The phase of the marker pair with the largest log-likelihood difference dij is reconstructed first, and further markers are then placed relative to this pair, placing markers with large dij before those with smaller dij. There may be a contradiction between the phase of two markers estimated directly and the phase estimated when each of the pair is referred to a third marker; we reject an overall configuration with such contradictions for a pair with large dij, but accept the overall configuration if dij is close to zero.
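One possible reading of the placement rule is a greedy ordering: start from the pair with the largest dij, then repeatedly add the unplaced marker that has the largest dij with any marker already placed. The Python sketch below implements only this ordering step under that assumption; it does not merge the phases themselves or check for contradictions, and the function name `placement_order` and the example values are hypothetical.

```python
def placement_order(d):
    """Return the order in which marker pairs are used to build the phase.

    d maps (marker_i, marker_j) to the log-likelihood difference between
    the most likely and second most likely pairwise phase. Each step is
    (anchor already placed, marker added, d for that pair).
    """
    sym = {}
    for (i, j), value in d.items():
        sym[(i, j)] = sym[(j, i)] = value

    first = max(d, key=d.get)  # best-resolved pair goes first
    placed = [first[0], first[1]]
    remaining = {m for pair in d for m in pair} - set(placed)
    steps = [(first[0], first[1], d[first])]

    while remaining:
        candidates = [(m, p) for m in remaining for p in placed if (m, p) in sym]
        if not candidates:  # no pairwise estimate links the rest to the group
            break
        m, p = max(candidates, key=lambda mp: sym[mp])
        steps.append((p, m, sym[(m, p)]))
        placed.append(m)
        remaining.remove(m)
    return steps


if __name__ == "__main__":
    d = {("L1", "L2"): 12.0, ("L2", "L3"): 8.5, ("L1", "L3"): 0.4, ("L3", "L4"): 5.0}
    for step in placement_order(d):
        print(step)
```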
INFORMATION AND POWER OF THE MAXIMUM-LIKELIHOOD ESTIMATION
The information of the maximum-likelihood estimate of the recombination frequency r is given by
Hackett et al. (1998) demonstrated that the simplex coupling linkage phase was the most informative for estimating recombination frequency among dominant marker configurations. For this the information is n/[r(1 - r)] and the information content of other configurations is examined relative to this by means of the relative information
It has been shown by Agresti (1990, pp. 98, 241) that G² has an approximate large-sample noncentral chi-square distribution with 1 d.f. and the noncentrality parameter in the present context is
For two parents, there are 128 configurations at a single locus where the parents share one or more alleles, which are informative about recombination in both parents. This count does not include permutations of the parents; i.e., AAOO × AOOO and AOOO × AAOO are considered as the same configuration. To consider all pairs of such loci, and to allow for the different phases, would give a very large number of configurations. We therefore examined the information and power of the likelihood-ratio test for each configuration when linked to a locus with eight alleles, ABCD × EFGH. The most informative configurations are those with seven or eight alleles: AA/BB/CC/DD × EE/FF/GG/HH and AA/BB/CC/DD × EA/FE/GF/HG, which are four times as informative as the simplex coupling configuration for all values of the recombination frequency. For many configurations, the relative information varies with the recombination frequency. At a recombination frequency of 0.2, 20 of the configurations examined were less informative than simplex coupling: these configurations were characterized by a small number of alleles occurring as simplex or duplex in each parent. The least informative configuration was AA/BA/CO/DO × EA/FA/GO/HO, with a relative information of 0.14. There was a strong linear relationship between the information and the noncentrality of the likelihood-ratio test, for example, a correlation of 0.996 using a recombination frequency of 0.2. For a recombination frequency of 0.2 and a population of 200 offspring, the power of the likelihood-ratio test was >0.9 for all configurations except the least informative AA/BA/CO/DO × EA/FA/GO/HO, although the power decreases with decreasing population size or increasing marker separation.
When the two parents do not share any alleles, the information can be calculated for each parent separately and then summed. The most informative configuration for a single parent is AA/BB/CC/DD, which is twice as informative as the simplex coupling configuration for all values of the recombination frequency. The relationship between the information and the noncentrality is the same as for two parents with shared alleles. The least informative configurations are some of those with a single informative allele: duplex-duplex mixed (AA/AO/OA/AA, relative information = 0.04), simplex repulsion (AO/OA/OO/OO, relative information = 0.07), and duplex-duplex repulsion (AO/AO/OA/OA, relative information = 0.11). Some configurations with two informative alleles also have very low information, for example, AO/AB/BA/OA, where the two duplex alleles at each locus are in repulsion and so are the two simplex alleles. The relationship between the relative information and the recombination frequency is illustrated for a range of configurations in Figure 1.
As the information depends on the configurations of both loci and on their phase, it is difficult to exclude any single-locus configurations as uninformative. The more alleles at a locus, the more informative it is likely to be, especially if these alleles are present in a simplex configuration. A locus with an allele that occurs in both parents is likely to have a low information content, unless we are considering a configuration such as AA/OO/OO/OO × AA/OO/OO/OO, with single-dose alleles in coupling in both parents. The configurations with low power for detecting nonindependent segregation by Pearson’s chi-square test also had low information and low power in the likelihood-ratio test.
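The relative information of a configuration can be evaluated numerically once its phenotype class probabilities are available as functions of r. The Python sketch below uses the standard expected information of a multinomial sample, n times the sum over classes of (df_i/dr)²/f_i, with a central-difference derivative. The function names are hypothetical, and the simplex coupling class probabilities are an assumption derived here from the bivalent model (one parent AB/OO/OO/OO crossed to a parent carrying only null alleles); they reproduce the information n/[r(1 - r)] quoted above.

```python
def fisher_information(class_probs, r, n=1, h=1e-6):
    """Expected information n * sum_i (df_i/dr)^2 / f_i at recombination frequency r.

    class_probs(r) returns the list of phenotype class probabilities.
    The derivative is approximated by a central difference of width 2h.
    """
    lower, upper, mid = class_probs(r - h), class_probs(r + h), class_probs(r)
    return n * sum(
        ((b - a) / (2.0 * h)) ** 2 / f
        for a, b, f in zip(lower, upper, mid) if f > 0
    )


def simplex_coupling(r):
    """Assumed classes for AB/OO/OO/OO x nulliplex: both bands, neither, A only, B only."""
    return [(1 - r) / 2, (1 - r) / 2, r / 2, r / 2]


def relative_information(class_probs, r):
    """Information per offspring relative to simplex coupling, 1 / [r (1 - r)]."""
    return fisher_information(class_probs, r) * r * (1 - r)


if __name__ == "__main__":
    for r in (0.05, 0.2, 0.4):
        print(r, relative_information(simplex_coupling, r))
```

Passing the class probabilities of any other configuration, for example ones assembled from an enumeration of gametes, gives its relative information at the chosen r.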
SIMULATION STUDY
To validate the theoretical analyses presented above and to investigate the statistical properties of the methods, we conducted a simulation study.
Simulation model: Computer programs were developed to simulate meiosis in a tetraploid individual with any genotype at the simulated marker loci, random pairing of four homologous chromosomes to give two bivalents (i.e., no double reduction), random sampling of gametes from meiosis, random union of gametes randomly sampled from the gamete pool, and generation of the phenotype from any given individual genotype. In a single meiosis, the “random walk” procedure suggested by Crosby (1973) was extended to simulate genetic recombination between linked loci. Chiasmata interference, sexual differentiation in recombination frequency, and segregation distortion were assumed to be absent in the simulation model.
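A minimal version of such a simulation can be written in a few lines. The Python sketch below is not the authors' program and does not claim to reproduce the procedure of Crosby (1973): it assumes random bivalent pairing, one transmitted chromosome per bivalent, and independent crossovers between adjacent loci with the given recombination frequencies (no interference). All function names and the example genotypes are hypothetical.

```python
import random


def meiosis(parent, rec, rng):
    """Return a gamete (two chromosomes) from a tetraploid parent.

    parent: four chromosomes, each a list of alleles (one per locus).
    rec:    recombination frequencies between adjacent loci.
    """
    order = list(range(4))
    rng.shuffle(order)  # random pairing into two bivalents
    gamete = []
    for a, b in ((order[0], order[1]), (order[2], order[3])):
        pair = (parent[a], parent[b])
        strand = rng.randrange(2)  # homologue transmitted at the first locus
        chrom = [pair[strand][0]]
        for k, r in enumerate(rec, start=1):
            if rng.random() < r:  # crossover in the interval before locus k
                strand = 1 - strand
            chrom.append(pair[strand][k])
        gamete.append(chrom)
    return gamete


def offspring(parent1, parent2, rec, rng):
    """Random union of one gamete from each parent (four chromosomes)."""
    return meiosis(parent1, rec, rng) + meiosis(parent2, rec, rng)


def locus_phenotype(genotype, k, null="O"):
    """Bands visible at locus k: the distinct non-null alleles."""
    return sorted({chrom[k] for chrom in genotype if chrom[k] != null})


if __name__ == "__main__":
    rng = random.Random(1)
    p1 = [list("AA"), list("BB"), list("CO"), list("OC")]
    p2 = [list("DA"), list("EB"), list("OO"), list("AO")]
    child = offspring(p1, p2, rec=[0.1], rng=rng)
    print(child, locus_phenotype(child, 0), locus_phenotype(child, 1))
```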
Figure 1. Relative information about the recombination frequency for different parental genotype configurations.
A full-sib family was simulated by crossing two tetraploid parental lines. Twenty-two codominant marker loci were generated, 10 linked on the first chromosome, 5 on each of the second and third chromosomes, and 2 isolated loci that were independent of the rest. The simulated parental genotypes at each of the marker loci were determined by sampling independently from six possible alleles whose population frequencies were assumed to be 0.3 (allele A), 0.2 (allele B), 0.2 (allele C), 0.1 (allele D), 0.1 (allele E), and 0.1 (null allele O), respectively. Loci with more than six alleles were not simulated, as these appear to be rare in practice (R. C. Meyer, personal communication). The main purpose for choosing parental genotypes in such a way is to test the theory and method on a general basis. The parental genotypes at these marker loci and the recombination frequencies between the adjacent loci are shown in Table 4. It should be noted that the alleles listed in the same column for loci on the same chromosome have the same linkage phase. The phenotypes of the two parents and 200 offspring (a realistic number for actual experiments) were scored at all 22 marker loci. To elucidate statistical properties, some pairs of these loci were studied in 100 repeated simulation trials.
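The sampling of parental genotypes from the stated allele frequencies can be sketched as follows. This is an illustration under the assumption that each of the four alleles at each locus is drawn independently; the constant and function names are hypothetical.

```python
import random

ALLELES = ["A", "B", "C", "D", "E", "O"]  # O is the null allele
FREQS = [0.3, 0.2, 0.2, 0.1, 0.1, 0.1]    # population frequencies from the text


def sample_parent(n_loci, rng):
    """Return four chromosomes, each a list of n_loci independently sampled alleles."""
    return [
        [rng.choices(ALLELES, weights=FREQS)[0] for _ in range(n_loci)]
        for _ in range(4)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    parent = sample_parent(10, rng)
    # One column per chromosome, one row per locus, in the A/B/C/D style.
    for k in range(10):
        print("/".join(chrom[k] for chrom in parent))
```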
Analysis of the simulated data: The genotypes of the two parents were predicted for each of the loci using the method proposed by Luo et al. (2000), on the basis of the phenotypes of the parents and their offspring. The predicted parental genotypes are tabulated in Table 4 together with the corresponding probabilities. It can be seen that the parental genotypes at 18 of the 22 marker loci were diagnosed correctly with a prediction probability of nearly 1.0. However, there were two almost equally likely parental genotypes predicted for the marker loci L2, L5, L20, and L22. For locus L2 the parental phenotypes are the same (1110000), but the most likely parental genotypes are different (AABC and ABCC) and it is not possible at this stage to tell which parent has which genotype. Both genotypes at this locus were used in the linkage analysis. For the other three loci, allele A is present for all offspring, and this is consistent with more than one configuration with multiple dosages of A. The dosages of the informative alleles are the same for the two possible configurations for L5, L20, and L22, and so estimates of recombination frequencies are the same for the two configurations.
Table 4. The simulated parental genotypes (G1 and G2), their corresponding phenotypes (P1 and P2), and the most likely parental genotypes (Ĝ1 × Ĝ2) predicted at 22 simulated marker loci
Pearson’s chi-square tests of independence were performed for all possible pairs of these marker loci using the test statistic given in Equation 1. Figure 2 displays the significance probabilities, transformed to distances as described previously, as dendrograms calculated using nearest-neighbor cluster analysis and average linkage cluster analysis. The nearest-neighbor analysis shows the three clusters (loci L1-L10, L13-L17, and L18-L22) have each grouped at a distance of zero. Loci L11 and L12 remained isolated until the distance exceeded 0.13. However, the three linkage groups also merge at a very small distance. Inspection of the significance levels shows that this is due to a single (spurious) significant association between L1 and L18. Inspection of the average linkage cluster analysis shows that the same initial groupings form more slowly, but that locus L18 is clearly associated with L19 and L20, and the average distance between locus L18 and loci L1-L10 is large. The distantly linked group L18-L22 finally merges at a large distance using average linkage cluster analysis, but inspection of the significance levels shows highly significant associations between 6 of the 10 pairs of this group, and we proceed assuming that they form a linkage group.
Figure 2. Cluster analysis of the 22 simulated loci, using (a) nearest-neighbor cluster analysis and (b) average linkage cluster analysis.
Table 5. The maximum-likelihood estimates of pairwise recombination frequencies (the upper diagonal) and the LOD scores (the second rows of the lower diagonal) calculated for the most likely parental phases for loci L1-L10
Linkage analysis was performed on all pairs of loci within each linkage group. For brevity only the results from the largest linkage group (loci L1-L10) are presented. Table 5 shows the significance of the independence test (the first rows of the lower diagonal), the maximum-likelihood estimates of recombination frequencies (the upper diagonal) for the most likely phase, and the corresponding LOD scores (the second rows of the lower diagonal) among the pairs of loci. It can be seen that the true parental genotype at L2 has consistently higher LOD scores in its pairings with L1, L3, L4, L6, and L7 (i.e., all the highly significant linkages) than the other predicted parental genotype (L2′) with the parental genotypes reversed. The estimate of the recombination frequency and LOD score were unaffected by the choice between the alternative genotypes for locus L5. Most cases where the independence test was significant (P < 0.05) corresponded to LOD scores >3, although a small number of pairs with a significant independence test (e.g., L1, L8) had large recombination frequencies and lower LODs.
The maximum-likelihood estimates of the pairwise recombination frequencies and the LOD scores in Table 5 were used to construct a linkage map of these genetic marker loci using the JoinMap linkage software (Stam and Van Ooijen 1995), as summarized in Figure 3. The best-fitting map from JoinMap placed loci L1-L10 in the correct order, except that the positions of marker loci L7 and L8 were reversed relative to the simulated order. The estimated map distances of the linkage group agreed well with the simulated ones.
The linkage phases of the parental genotypes were reconstructed using the procedure described in the above analysis. Table 6 illustrates the parental linkage phases at every pair of loci with a difference dij > 3 in the log-likelihood between the most likely and second most likely phase. Locus L5 does not appear in Table 6, as there was only one phase with a recombination frequency <0.5 in each case. The reconstructed linkage phases of the parental genotypes are shown in Figure 3. This reconstruction uses the most likely phase for all pairs except for four [(L1, L8), (L1, L9), (L2, L7), (L6, L9)]. For these four pairs, the largest difference in the log-likelihood between the most likely phase and the reconstructed phase was 0.56, and the difference in the estimates of the recombination frequency was always <0.01. The reconstructed phase is identical to that simulated.
Figure 3. The best-fitted map, the estimated map distance (in centimorgans), and parental linkage phases reconstructed from the codominant marker loci L1-L10 from the simulation study.
Linkage maps of loci L13-L17 and L18-L22 were estimated using the same approach. In each case the order and phase were reconstructed correctly.
To investigate the reliability of the pairwise linkage phase estimation, separate simulation trials were performed. The simulated recombination frequencies were 0.05, 0.1, and 0.3 and the sample size was 200. Figure 4 illustrates the maximum-likelihood estimate of r between L9 and L10 (Figure 4a), and between L7 and L10 (Figure 4b), and the corresponding LOD scores calculated at all possible parental linkage phase configurations. It can be seen that the correct parental linkage phases were the most likely when the marker loci were closely linked (i.e., r ≤ 0.1), although the difference in the likelihood value between the most likely and the second most likely linkage phases reduced as the value of r increased. When the loci were loosely linked (i.e., r = 0.3), the most likely parental linkage phase could differ from the simulated phase, but when this occurred the MLE of r at the most likely phase was always very close to that calculated at the simulated phase.
Further simulations were carried out to examine the power to detect linkage and the bias in estimates of the recombination frequency. Table 7 shows the means and standard deviations of the maximum-likelihood estimates of recombination frequencies for 100 replicate simulations of some pairs of marker loci considered above. Linkage was detected as significant (P < 0.05) by both the independence test and the likelihood-ratio test with a frequency ≥90% when the recombination frequency r ≤ 0.3, except for the least informative pair (L2, L5). For r = 0.5, the frequency of significant tests was close to 5%. The means of the MLEs of r were close to the corresponding simulated values for r ≤ 0.3. For r = 0.5, the estimates were biased downward because the most likely phase is selected for each data set. The parental linkage phases at the marker loci were correctly predicted in at least 89% of simulations for the cases with r ≤ 0.3.
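The downward bias at r = 0.5 can be illustrated with a deliberately simplified toy simulation. It assumes only two candidate phases that are mirror images of one another, so that selecting the more likely phase amounts to keeping the smaller of the observed recombinant fraction and its complement. This is a diploid-style assumption made for illustration, not the tetraploid model used for Table 7, and the function name and defaults are hypothetical.

```python
import random


def mean_selected_estimate(n_offspring=200, n_reps=100, seed=1):
    """Mean estimate of r for unlinked loci after picking the likelier phase.

    Toy model: two mirror-image phases, so phase selection keeps
    min(raw, 1 - raw). Not the tetraploid likelihood of the article.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        # With unlinked loci each gamete looks recombinant with probability 0.5.
        recombinants = sum(rng.random() < 0.5 for _ in range(n_offspring))
        raw = recombinants / n_offspring
        # Selecting the more likely phase never yields an estimate above 0.5.
        total += min(raw, 1.0 - raw)
    return total / n_reps


print(round(mean_selected_estimate(), 3))
```

Because the selected estimate can never exceed 0.5 but can fall below it, its mean across replicates lies below the true value of 0.5 even though each raw fraction is unbiased.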
Table 6. The most likely parental genotypic linkage phases (S1 and S2) and the difference (dij) in log-likelihood value between the most likely and the second most likely linkage phases of the marker loci Li and Lj
LINKAGE ANALYSIS OF EXPERIMENTAL DATA IN AUTOTETRAPLOID POTATO
Some preliminary data from the Scottish Crop Research Institute were used to test this approach, using five SSR marker loci (STM0017, STM1017, STM1051, STM1052, and STM1102) and six AFLP marker loci (e35m61-18, e35m61-21, e37m39-14, e39m61-7, e46m37-12, and p46m37-12) scored on 77 offspring from a cross between two parental lines: the advanced potato breeding line 12601abl and the cultivar Stirling (Bradshaw et al. 1998). Details of scoring the DNA molecular markers are described in Meyer et al. (1998) and Milbourne et al. (1998). Preliminary analysis of the AFLP markers (Meyer et al. 1998), and of the SSR markers in diploid and tetraploid populations (Milbourne et al. 1998), suggested that these markers are all on the same linkage group.
Table 8 summarizes the parental phenotypes and the phenotype distribution of the offspring at the marker loci. Of a total of 77 offspring scored at these marker loci, there were 73, 73, 72, and 70 progeny whose phenotypes at the marker loci STM1017, STM1051, STM1052, and STM1102, respectively, were unambiguously observed. The phenotypic data were used to predict the parental genotypes using the method of Luo et al. (2000). The predicted parental genotypes at the marker loci are also shown in Table 8 together with the corresponding prediction probabilities and the χ2 values of the goodness-of-fit test. The number of possible genotype configurations with probability ≥0.1 varies from 1 (at STM0017, STM1051, and the AFLP marker loci) up to 8 (STM1017). For STM1017, all that can be deduced is that allele 1 occurs in a simplex condition in parent 1.
Figure 4. The maximum-likelihood estimates of recombination frequencies and the corresponding LOD scores for all possible linkage phases for (a) loci L9 and L10 and (b) loci L7 and L10. Arrows indicate the true linkage phases.
The independence tests were performed for all possible pairs of the marker loci and the significance probabilities of the tests are listed as the first rows of the lower diagonal in Table 9, along with the results of a pairwise linkage analysis. The maximum-likelihood estimates of recombination frequency between the pairs of marker loci are listed on the upper diagonal and the LOD scores are given in the second rows of the lower diagonal of Table 9. Both possible genotypes for locus STM1052 are shown, as these gave slightly different estimates of the recombination frequencies and LOD scores. The likelihood for genotype AABO × AACO was always larger than for the other genotype. The use of the alternative parental genotypes at loci STM1017 and STM1102 did not affect the estimates of recombination frequencies and LOD scores.
Table 7. Mean and standard deviation of the maximum-likelihood estimate of recombination frequency and the empirical statistical power for detecting the linkage based on 100 simulations
Table 8. Phenotypes of five SSR and six AFLP marker loci scored on two parents (P1, Stirling; P2, 12601abl) and their progeny and the predicted parental genotypes G1 and G2 at these marker loci
The MLEs of the pairwise recombination frequencies and the LOD scores were used to map the marker loci using JoinMap. The 11 markers were mapped as a linkage group with a length of 48.9 cM (using genotype AABO × AACO for STM1052). The order was the same using the alternative genotype, and the calculated length in this case was 48.7 cM. The allelic linkage phases of the parental genotypes at the marker loci were reconstructed as described. The linkage map and the reconstructed phases are illustrated in Figure 5. For loci STM1017 and STM1102, the dosage of the alleles that are present for all offspring is uncertain, but the phase of the segregating alleles can be reconstructed. For STM0017, there is uncertainty about the phase for parent 2 as this marker is well separated from the other SSR markers that are informative about parent 2, although allele A of this marker is unlikely to be linked in coupling to the simplex alleles of the other SSR markers (A for STM1102, C for STM1052, and D for STM1051). The only difference between the inferred phase in Figure 5 and the most likely phase is for the pair STM0017 and STM1102, for which the inferred phase has a log-likelihood 1.16 less than the most likely, although both phases correspond to loose linkages (recombination frequency ≈ 0.4, LOD 0.90). The conclusion that the SSR markers form a single linkage group agrees with the analysis of these markers in a diploid cross (Milbourne et al. 1998).
Table 9. The maximum-likelihood estimates of pairwise recombination frequencies (the upper diagonal) among five SSR and six AFLP marker loci in autotetraploid potato, their corresponding LOD scores (the second rows of the lower diagonal), and the significance level of the independence tests (the first rows of the lower diagonal)
DISCUSSION
In this article we have developed the methodology for constructing linkage maps of codominant or dominant genetic markers in autotetraploid species under chromosomal segregation, i.e., the random pairing of four homologous chromosomes to give two bivalents. Our strategy has the following steps:
Identify which parental genotype(s) are consistent with the parental and offspring phenotype data.
For each pair of loci, calculate Pearson’s chi-square statistic for independent segregation, and its significance.
Use cluster analysis, based on the significance, to partition the loci into linkage groups. For each linkage group in turn, proceed as follows:
For each pair of loci, calculate the recombination frequency and the LOD score for all possible phases. The EM algorithm allows this to be done for any parental genotype configuration.
For each pair, identify the phase with the largest likelihood and estimate the difference in log-likelihood dij between the most likely and second most likely phase.
Use the recombination frequencies and LOD scores for the most likely phases to order the loci and calculate distances between them.
Reconstruct the linkage phase for the complete linkage group, using pairs in order of decreasing dij.
Check that the inferred linkage phases are the most likely ones for all pairs with a substantial difference in log-likelihood, for example dij > 3.
For pairs where the inferred phase is not the most likely phase, compare estimates of the recombination frequencies and LOD scores. Recalculate the linkage map on the basis of the inferred phase if necessary.
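The first screening steps of this strategy, the pairwise independence test and the partition into linkage groups, can be sketched as follows. This is a minimal illustration under stated assumptions: phenotypes are given as one label per offspring, the conversion of the chi-square statistic to a significance probability is omitted, and loci are grouped by chaining significant pairs (a single-linkage style rule), which may differ from the cluster analysis actually used. The function names are hypothetical.

```python
from collections import Counter


def pearson_chi_square(pheno_a, pheno_b):
    """Pearson chi-square statistic and degrees of freedom for two loci.

    pheno_a and pheno_b hold one phenotype label per offspring, in the
    same offspring order.
    """
    n = len(pheno_a)
    joint = Counter(zip(pheno_a, pheno_b))
    rows = Counter(pheno_a)
    cols = Counter(pheno_b)
    stat = 0.0
    for a in rows:
        for b in cols:
            # Expected count under independent segregation.
            expected = rows[a] * cols[b] / n
            stat += (joint[(a, b)] - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(cols) - 1)
    return stat, df


def linkage_groups(loci, significant_pairs):
    """Group loci connected by any chain of significant pairs (union-find)."""
    parent = {locus: locus for locus in loci}

    def find(x):
        # Follow parent links to the group representative, compressing paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in significant_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for locus in loci:
        groups.setdefault(find(locus), []).append(locus)
    return sorted(groups.values())


stat, df = pearson_chi_square(["A", "A", "B", "B"], ["X", "X", "Y", "Y"])
print(stat, df)
print(linkage_groups(["L1", "L2", "L3", "L4"], [("L1", "L2"), ("L2", "L3")]))
```

Chaining significant pairs is the most permissive grouping rule; a stricter criterion would split groups that are joined only through a single weak link.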
In the simulated and experimental data sets, there have been examples of loci for which more than one genotype for the parents is possible. This occurred for three reasons. First, some alleles may be present in all offspring, and so are uninformative, for example, simulated locus L5. Alleles A and C are present in parents 1 and 2, respectively, and in all offspring; only allele D (present in parent 2 only) segregates in the offspring in a 1:1 ratio. The parental genotypes are consistent with either AAAA × CCCD or AAAO × CCCD and, as all information about linkage comes from the segregating allele D, the choice between these two genotypes has no consequence for the estimation of the map. This will be the situation if all the possible genotypes have the same configuration for the segregating alleles. Second, the parents may have the same phenotypes but different genotypes, e.g., simulated locus L2 where genotypes AABC × ABCC and ABCC × AABC are possible. In this case, comparison of the likelihoods for the two possible genotypes segregating jointly with a linked, informative marker should resolve the issue. For locus L2, the likelihood of the joint segregation data with loci L1, L3, L4, L6, and L7 was consistently higher for the true genotype AABC × ABCC than for the alternative genotype ABCC × AABC. Third, the offspring phenotypes may be compatible with more than one possible genotype configuration, with different configurations for the segregating alleles, e.g., STM1052. In this case, the best approach is to calculate and compare the maps using each genotype. For the experimental data used here, the differences in the maps were negligible, but this may not always be so.
Figure 5. The linkage map and parental linkage phases reconstructed from five SSR and six AFLP markers using a full-sib family from two autotetraploid potato lines (Stirling and SCRI clone 12601abl). (?) An allele unresolved in the linkage analysis. The phase of marker STM0017 in parent 12601abl (shown in braces {}) cannot be resolved.
The examination of the information of different configurations shows that, as expected, there are many configurations of codominant markers that are more informative than the simplex coupling configuration, which is the most informative configuration for a dominant marker, as demonstrated in Hackett et al. (1998). Markers with many different alleles are most informative, and markers with multiple doses of alleles or alleles shared by both parents are less informative in general, but linkage phase also contributes, and so it is difficult to reject any locus configuration as uninformative for mapping purposes.
The reconstruction of the parental genotypes involves a test for double reduction. Luo et al. (2000) showed that the power of this test was high for detecting double reduction, but no significant double reduction was found in the experimental data. Little work has been done on the theory for predicting the joint segregation probabilities under a two-loci tetrasomic inheritance model when double reduction occurs, and we have not attempted to include it in the linkage analysis at present. However, double reduction is known to occur in potato (Bradshaw and Mackay 1994). It has also been observed that in potato, while bivalents predominate, low frequencies of quadrivalents, trivalents, and univalents occur (Swaminathan and Howard 1953). In autotetraploid alfalfa, in contrast, Bingham and McCoy (1988) found that most cells have the full complement of 16 bivalents at metaphase I. We hope to explore these complications in a future publication. In the meantime it is worth exploring the use of the current simple model on as wide a range of real data as possible.
Inference of linkage phase is a complicated issue in linkage analysis for diploids and even more so for polyploids, particularly when multiple loci have to be considered simultaneously. In this study, a likelihood-based approach was proposed to search over all possible linkage phase configurations of any given pair of tetraploid parental genotypes at two loci for the most likely one. For closely linked and/or informative pairs of loci, the difference between the most likely and the second most likely phase is clear-cut, and then the actual phase was predicted adequately. However, several phases may be nearly equally likely when the loci are loosely linked or the genotypic pair is less informative. In the cases examined here, phases with similar likelihoods had similar inferred recombination frequencies. Because of this, it is reasonable to calculate the linkage map using the recombination frequencies and LOD scores for the most likely phases for each pair, reconstruct the phase for the whole group, and then compare the estimates of the recombination frequencies at the inferred and most likely phases where these differ. For the simulated data, the difference in the estimates of the recombination frequency was always <0.01. We did not find any case where a difference in phase between the inferred and most likely one caused a nonnegligible difference in the estimate of the recombination frequency, but the possibility of this should be borne in mind.
This analysis has reconstructed the map on the basis of pairwise analyses. A least-squares method, implemented in the JoinMap software, was used to calculate multipoint map distances. A practical strategy is suggested for constructing the phase for the entire linkage group from the estimated pairwise phases and for checking its consistency. This approach reconstructed the complete phases correctly for the three linkage groups of our simulation study and gave a consistent phase for the experimental data. In theory, our approach could be improved by the use of a multilocus linkage analysis and phase analysis. There have been several approaches to multilocus linkage analysis in diploids. Prominent among them is the hidden Markov chain model proposed by Lander and Green (1987). The multilocus approach takes into consideration the cosegregation of genes at several linked loci simultaneously, and problems such as missing marker data and incomplete information of some markers (for example, dominant markers) can be appropriately addressed in the analysis. The basic principle of the multilocus linkage analysis in diploids would be extendable to tetraploids, but innovative theoretical efforts would have to be invested to model a more complicated stochastic process of multilocus crossovers under tetrasomic inheritance, and more efficient numerical algorithms have to be developed to analyze the model. In particular, methods have to be developed to handle the large number of possible linkage phases, which for a tetraploid genotype has a maximum of $(4!)^{m-1}$ for m linked loci, or $(4!)^{2(m-1)}$ phases for the two parents. A possible way to tackle the problem might be the use of the Markov property of recombinant events over the linked marker loci as demonstrated for diploids in Jiang and Zeng (1997). The Markov model allows the division of all marker loci under question into groups flanked by fully informative markers, thus reducing the scale of the modeling problem.
In practice, the scarcity of fully informative markers (i.e., with eight distinct alleles) in tetraploid species may be a difficulty.
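To give a sense of scale for the multilocus phase problem, the short sketch below evaluates the maximum phase count, reading it as 4! raised to the power m - 1 for one parent and to the power 2(m - 1) for two parents. The function name is hypothetical, and the count is an upper bound reached only when all four alleles are distinguishable at every locus.

```python
import math


def max_phase_count(n_loci, n_parents=1):
    """Upper bound on the number of linkage phases for tetraploid parents.

    Each locus after the first can place its four alleles on the four
    homologous chromosomes in 4! ways relative to the first locus.
    """
    if n_loci < 1 or n_parents < 1:
        raise ValueError("n_loci and n_parents must be at least 1")
    return math.factorial(4) ** (n_parents * (n_loci - 1))


for m in (2, 5, 10):
    print(m, max_phase_count(m), max_phase_count(m, n_parents=2))
```

Even for a pair of loci there are up to 576 joint phases across the two parents, which is why an exhaustive search is feasible pairwise but not for a whole linkage group.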
A direct utility of the marker linkage maps is to map QTL. Doerge and Craig (2000) recently proposed a model selection strategy for quantitative trait locus analysis in polyploids. Though their method was suitable only for a single-marker QTL linkage test, the study highlighted aspects of difficulties and tools necessary to investigate QTL mapping in polyploids. As they have recognized, QTL analysis under a setting of multiple-marker loci will present entirely new challenges. The methodologies developed in the present article open another window for viewing and tackling the complexities of polyploid linkage analysis with quantitative trait loci.
APPENDIX: DERIVATION OF THE INFORMATION MEASURE
By definition,
Acknowledgments
We thank two anonymous reviewers and Dr. Z-B. Zeng for the comments and criticisms that have been very helpful in improving the manuscript. We are grateful for useful discussions with Dr. R. C. Meyer. This research was financially supported by a research grant from the UK Biotechnology and Biological Sciences Research Council. Z.W.L. was also supported by China’s “973” program, the National Science Foundation, the QiuShi Foundation, and the Changjiang Scholarship; the other authors were supported by the Scottish Executive Rural Affairs Department.
Footnotes
- Communicating editor: Z-B. Zeng
- Received May 1, 2000.
- Accepted November 30, 2000.
- Copyright © 2001 by the Genetics Society of America