Abstract

A method for estimating the nucleotide diversity from AFLP data is developed by using the relationship between the number of nucleotide changes and the proportion of shared bands. The estimation equation is based on the assumption that GC-content is 0.5. Computer simulations, however, show that this method gives a reasonably accurate estimate even when GC-content deviates from 0.5, as long as the number of nucleotide changes per site (nucleotide diversity) is small. As an example, the nucleotide diversity of the wild yam, Dioscorea tokoro, was estimated. The estimated nucleotide diversity is 0.0055, which is larger than estimations from nucleotide sequence data for Adh and Pgi.

THE amplified fragment length polymorphism (AFLP) technique, developed by Vos et al. (1995), is a powerful tool for DNA fingerprinting of organismal genomes. In principle, it is a combination of RFLP and PCR techniques. Briefly, DNA is digested with two restriction enzymes (EcoRI and MseI in the original protocol), and double-stranded oligonucleotide adapters are ligated to the restriction sites. PCR primers complementary to the adapters and restriction sites are used for the amplification of fragments that are flanked by the adapters. A subset of fragments is selectively amplified by PCR primers that have 2- or 3-base extensions into the restriction fragments. Only those fragments that perfectly match the primer sequences can be amplified by PCR. Therefore the complexity of PCR amplicons is reduced. In fact, DNA fingerprints consisting of 50 to 100 restriction fragments can be detected after separation in a denaturing polyacrylamide gel. Relative ease of implementation, large number of polymorphisms detected per gel, small amount of genomic DNA required, and high reproducibility of DNA fingerprint patterns recommend AFLP as an attractive method to study DNA polymorphism in general.

Although AFLP has been increasingly applied to linkage mapping of genomes in various organisms (Thomaset al. 1995; Maheswaranet al. 1997), its application to population genetics and evolution is still limited (Hillet al. 1996; Maughamet al. 1996; Sharmaet al. 1996). In relevant studies, AFLP patterns were compared between individuals, and their similarity was described by the similarity index (percentage of shared fragments among the total fragments). These indices were used to generate a distance matrix, and further to reconstruct phylogenetic trees, although they do not increase linearly with divergence time. To our knowledge, no attempt has been made to date to use AFLP data for estimating the number of nucleotide changes per site between the genomes of two individuals.

Here, we report the application of the AFLP technique for estimating the nucleotide diversity (π), defined as the average number of pairwise nucleotide changes per site (Nei and Li 1979). To date, methods are available for estimating nucleotide diversity from DNA sequence (Nei and Tajima 1981; Tajima and Nei 1984), RFLP data (Nei and Li 1979; Nei and Tajima 1981), and RAPD data (Clark and Lanigan 1993), but not from AFLP data. The method for estimating the nucleotide diversity from AFLP data, reported here for the first time, might be generally useful for genetic diversity studies.

ESTIMATION METHOD

For estimation of the nucleotide diversity from AFLP data, we consider a random nucleotide sequence under the Jukes and Cantor model (Jukes and Cantor 1969), where the frequencies of four bases (G, A, T, and C) are equal (0.25). Following Nei and Li (1979) and Clark and Lanigan (1993), we assume that changes in DNA sequence are caused only by the nucleotide changes and we ignore the effect of other factors such as insertion and deletion. We denote the rate of nucleotide change per site per generation by μ. We consider a model for a haploid genome here, although the AFLP technique is usually applied to diploid species. An application to a diploid genome is presented in the next section.

The nucleotide diversity (π) in a sample of n haploid individuals can be estimated by averaging the estimated numbers of nucleotide changes (d) over all the pairs in the sample. Namely, π can be estimated by π^=2n(n1)i<jd^ij, (1) where ij is the estimated number of nucleotide changes between the ith and jth haploid individuals. Note that the estimated value is presented with a circumflex.

First, we consider the probability that a fragment is conserved by time t. If we follow the original protocol, in the AFLP technique, we have three classes of PCR products: those flanked by EcoRI-adapters in both sides, those flanked by EcoRI- and MseI-adapters, and those flanked by MseI-adapters in both sides. As only EcoRI-primers are labeled, the first and second classes of fragments are visible on the autoradiograph. We call these two classes of fragments type 1 and type 2 fragments, respectively. Let Q1(L) and Q2(L) be the probabilities that type 1 and type 2 fragments with L nucleotides are conserved by time t. Note that L is not the real length of the amplified fragment, but L represents the nucleotide length of the fragment excluding the length of the adapter sequences. In other words, L is the length of sequence that originated from the genomic DNA. If no nucleotide change occurs at both primer sites and no new restriction site appears between them, the fragment can be conserved. Let c1 and c2 be the numbers of the selected bases of EcoRI- and MseI-primers, respectively. Under the Jukes and Cantor model, the probability (p) that the nucleotide at a particular site is the same as that t generations ago is given by p = [1 + 3 exp(-4μt/3)]/4 (Jukes and Cantor 1969). Therefore, the probability that the EcoRI-primer site (length of recognition sequence of EcoRI + c1 bp) remains by time t, p1 , is given by p-6-c1, and that for the MseI-primer site (length of recognition sequence of MseI + c2 bp), p2 , is given by p-4-c2. Denote by b1 the probability that a new EcoRI restriction site appears in a given 6-bp nucleotide sequence that is not originally an EcoRI site. The probability that one or more nucleotide substitutions occur by time t in this 6-bp sequence is 1 - p1 = 1 - p6, and the probability that a new EcoRI site forms, a1, is 0.256. Then, following Nei and Li (1979) and Nei and Tajima (1983), b1 is given by b1=a1(1p1). (2) In the same way, the probability that a new MseI site appears in a given 4-bp sequence, b2, is also obtained. Because the probability that one or more nucleotide substitutions occur in this 4-bp sequence by time t is 1 - p2 = 1 - p4, b2 becomes b2=a2(1p2), (3) where a2 is the probability that a new MseI site forms in the 4-bp sequence (a2 = 0.254). In a fragment with L nucleotides, there are L - 6 + 1 possible 6-bp sequences and L - 4 + 1 possible 4-bp sequences. Then, using p1 , p2 , b1, and b2, we have Q1(L) and Q2(L), which are approximately given by Q1(L)=p12(1b1)L6+1(1b2)L4+1, (4) and Q2(L)=p1p2(1b1)L6+1(1b2)L4+1. (5) Equations 4 and 5 are approximates because the events during which a new restriction site appears are considered to be independent for all the 6- or 4-bp sequences. Apparently, these events are not independent. For example, if a new EcoRI site forms in a 6-bp sequence, say the sequence between nucleotide positions x and x + 5 (x is the nucleotide position number from the 5′ end of the fragment), a new EcoRI site never forms in the 6-bp sequences that start with the position x - 5, x - 4,..., x - 1, x + 1,..., x + 5. However, (4) and (5) can be good approximations (Nei and Li 1979).

Next, we consider the distribution of L. Assume that L is restricted within a range between Lmin and Lmax. Lmin and Lmax mean the minimum and maximum nucleotide lengths of the fragments, respectively, which can be scored on the autoradiograph. Let G1(L) be the distribution of L of type 1 fragment and a1 be the probability that a 6 + c1-bp sequence matches EcoRI-primer (a1=0.256+c1) . Then G1(L) is given by G1(L)=g1(L)L=LminLmaxg1(L), (6) where g1(L) is approximately given by g1(L)=a1(1a1)L6+1(1a2)L4+1. (7) If we denote (1 - a1)(1 - a2) by A, (7) can be rewritten as g1(L)=a1(1a1)5(1a2)3AL, (8) and (6) becomes G1(L)=(1A)ALLmin1ALmaxLmin+1. (9) In the same way, we can obtain G2(L), the distribution of L of type 2 fragment. Let a2 be the probability that a 4 + c2-bp sequence matches MseI-primer (a2=0.254+c2) . Then, we have G2(L)=g2(L)L=LminLmaxg2(L), (10) where g2(L) is approximately given by g2(L)=a2(1a1)5(1a2)3AL. (11) After some calculations, (10) becomes G2(L)=(1A)ALLmin1ALmaxLmin+1, (12) indicating that the distributions of L of types 1 and 2 fragments follow the same geometric distribution in the interval between Lmin and Lmax.

Finally, we consider the relationship between the number of nucleotide changes (d) and the expected proportion of shared bands (F) for a pair of haploid individuals. Denote by R1 the average probability that a type 1 fragment is conserved by time t in both lineages of a pair of haploid individuals. When they diverged t generations ago, the expectation of d is 2μt. Therefore, R1 is written as the average of Q1 (L)2 weighted by G1(L) in the interval between Lmin and Lmax. Namely, R1=L=LminLmaxG1(L)Q1(L)2=(1A)p14(1b1)2Lmin10(1b2)2Lmin61ALmaxLmin+1×1[A(1b1)2(1b2)2]LmaxLmin+11A(1b1)2(1b2)2. (13) In the same way, the average probability that a type 2 fragment is conserved in both haploid individuals, R2, is given by R2=L=LminLmaxG2(L)Q2(L)2=(1A)p12p22(1b1)2Lmin10(1b2)2Lmin61ALmaxLmin+1×1[A(1b1)2(1b2)2]LmaxLmin+11A(1b1)2(1b2)2. (14) Because the expected ratio of the number of type 1 fragments to that of type 2 fragments is a12a2 , the probability that a fragment is conserved by both of haploid individuals is given by R=a1R1+2a2R2a1+2a2. (15)

Here, let us consider the relationship between F and R. In RFLP analysis, Nei and Li (1979) used the relationship F = R. In AFLP analysis, a number of bands can appear. In this case, when a pair of haploid individuals are compared, there is a possibility that both haploid individuals share a particular band on an autoradiograph, but the band has not originated from the same region on the chromosome. This is because more than two fragments with the same length can appear from the different regions. Namely, there may be some bands that are shared by a pair of haploid individuals by chance. Therefore we have F > R, and F is given by F=R+C, (16) where C is the expected proportion of bands shared by chance. Let m be the expected number of bands scored. Because the expected number of bands that is conserved in both lineages of the pair of haploid individuals is Rm, the remaining (1 - R)m bands have a possibility to be shared by chance. The probability that a band with length L is shared by chance is G1(L) {= G2(L)}, and the distribution of L also follows G1(L). Hence, C is given by C=(1R)mL=LminLmaxG1(L)2=(1R)mG, (17) where G=L=LminLmaxG1(L)2=(1A)[1A2(LmaxLmin+1)](1+A)(1ALmaxLmin+1)2. (18) From (16) and (17), we have F=R+(1R)mG. (19)

From the relationship between F and d (= 2μt), we can estimate d from F. Let n be the number of haploid individuals investigated and ij be the estimated proportion of shared bands when the ith and jth haploid individuals are compared. Following Nei and Li (1979), ij is given by F^ij=2mij(mi+mj), (20) where mi and mj are the observed numbers of bands scored in the ith and jth haploid individuals and mij is the observed number of bands shared by both haploid individuals. Because we can estimate dij from (19), the nucleotide diversity (π) is obtained by averaging ij as shown in (1).

There is another method for estimating π, in which the average of ij () is used. Namely, we have F=2i<jF^ijn(n1). (21a) If is substituted for F in (19), we can estimate π directly (Nei and Miller 1990). Nei and Miller (1990) suggested that π estimated by this method is virtually the same value as that estimated by (1), when π is relatively small (Nei and Miller 1990). Apparently, this method is more convenient because (19) is used only once in this case. Equation 19 is too complex to calculate by hand. A computer program for estimating π is available on request.

F can be also estimated by F=2[Σi<jnmij][n(n1)][Σinmi]n=2Σi<jnmij(n1)Σinmi. (21b) This method uses the averages of mij and mi to estimate F. We can also estimate π from F . In the AFLP analysis, F appears to be almost the same as , because the numbers of bands for all haploid individuals are relatively large and not so different from each other.

COMPUTER SIMULATION

In the above equations we have made several assumptions and approximations. To know the accuracy of the present method, a computer simulation was conducted. The procedure of the simulation is as follows. A random ancestral sequence with the length of M million bp is constructed. The sequence consists of four nucleotides, A, T, G, and C with a given GC-content (g). On this sequence, random mutations are generated. The number of mutations is determined by following the Poisson distribution with mean μt. As models of mutation, we used the equal-input and equal-output models in Tajima and Nei (1982). The mutation rates used in the simulation are as follows, where we denote the mutation rate from nucleotide X to Y by μXY. In the equal-input model with g = 0.33, μAT = μTA = μGA = μGT = μCA = μCT = 6μ/13 and μAG = μAC = μTG = μTC = μGC = μCG = 3μ/13. In the equal-input model with g = 0.67, μAT = μTA = μGA = μGT = μCA = μCT = 3μ/13 and μAG = μAC = μTG = μTC = μGC = μCG = 6μ/13. In the equal-output model with g = 0.33, μAT = μAG = μAC = μTA = μTG = μTC = 3μ/4 and μGA = μGT = μGC = μCA = μCT = μCG = 3μ/2. In the equal-output model with g = 0.67, μAT = μAG = μAC = μTA = μTG = μTC = 3μ/2 and μGA = μGT = μGC = μCA = μCT = μCG = 3μ/4. Apparently, all the mutation rates are μ/4 when g = 0.5 in both models. This mutational process is carried out twice so that two descendant sequences are obtained. For these two sequences, the AFLP fragments are detected and the lengths of the fragments (L) are scored if LminLLmax, and the proportion of the shared bands (fragments) is calculated by (20).

The results of the simulation for M = 1.6 and g = 0.5 are shown in Figure 1. The selective base of EcoRI-primer was A and that of MseI-primer was G, so that c1 = 1 and c2 = 1. The number of replications for a given d was 1000. Note that the equal-input and equal-output models result in the same model when g = 0.5. The average number of bands (m) that can be scored was ∼38. Figure 1A shows the average of with the theoretical expectation obtained by (19). It is shown that the average of is very close to the expected value. From , d is estimated by (19), and the average of is plotted in Figure 1B. is very close to the true d. The variance of increases as d increases, although the variance of is nearly constant.

It is known that GC-content is not 0.5 in many organisms. By computer simulation, we investigated whether the relationship between d and F presented by Equation 19 holds when GC-content deviates from 0.5. Note that this formula assumes that GC-content is 0.5. Two values of GC-content were investigated (g = 0.33 and 0.67). Since GC-content affects the number of bands (m), the genome size (M) was adjusted so that m ≈ 38 (M = 1.3 and 5.8 for g = 0.33 and 0.67, respectively). From , d was estimated by (19). In Figure 2, the average of is plotted with true d. When g = 0.33, is smaller than the true value (Figure 2A). On the other hand, is larger than the true value when g = 0.67 (Figure 2B). The deviation of from true d is larger in the equal-output model than in the equal-input model, indicating that the degree of the deviation of from true d depended on the mutation model. However, if d < 0.025, is very close to the true value in our simulation even when g = 0.33 and 0.67, suggesting that Equation 19 is quite useful in a range of GC-content between 0.33 and 0.67 when d is small.

Figure 1.

—The results of simulation where genome size (M) = 1.6 million bp and GC-content (g) = 0.5 are assumed. The average number of observed bands is ∼38. (A) The average of is shown with SD. The solid line represents the expectation of F calculated by (19). (B) obtained by (19) is shown with SD. The solid line represents the expectation of d.

APPLICATIONS

Using the relationship between F and d, we estimated the nucleotide diversity in Dioscorea tokoro. D. tokoro is a dioecious, diploid, wild yam species distributed in East Asia. The AFLP data are unpublished results of R. Terauchi and G. Kahl. Two individuals [DT5 (female) and DT7 (male)], collected from Wakayama Prefecture in Japan, were investigated. For linkage analysis, they have segregation data of AFLP patterns in their F1 progenies. In the present article, we estimate the nucleotide diversity in these two individuals, DT5 and DT7 (corresponding to four haploid individuals) from the AFLP data.

Figure 2.

—The effect of GC-content (g) on the estimate of d (). obtained by (19) is shown with SD. The solid line represents the expectation of d. The genome size is adjusted for the number of bands (m) to be ∼38. (A) g = 0.33 and M = 1.3. (B) g = 0.67 and M = 5.8. (□) The result of the equal-input model; (▪) the result of the equal-output model.

Table 1 summarizes the results of AFLP detected between DT5 and DT7 for 14 primer combinations. PCR primers complementary to EcoRI- and MseI-adapters have two and three selective bases at their 3′ ends, respectively. As there are segregation data among progeny (R. Terauchi and G. Kahl, unpublished results), it was possible to distinguish the homozygous (indicated by ++) and heterozygous (+-) states of the fragments. Thus the combinations of the AFLP genotypes for DT5 and DT7 could be classified into eight classes. The number of AFLP fragments (bands) detected for each primer combination ranged from 48 to 102, with a total of 897 fragments for 14 primer combinations. About 76% of bands were homozygous (++) for both individuals.

From Table 1, F was calculated as follows. Note that (21b) is not applicable because D. tokoro is a diploid. Because we have data of diploid individuals, it is necessary to consider the diploid individual as a unit of two haploid genomes. Fortunately, in this example, we know from F1 data whether the scored band is homozygous or heterozygous (Table 1). Here, consider the banding patterns of n diploid individuals, which consist of a total of K types of bands. If we focus on a particular band (for example, the xth band), we know the number of haploid genomes that have this band on the autoradiograph. Denote this number by Sx, where Sx ranges from 1 to 2n. Let us consider the probability, Hx, that the band is shared by two haploid genomes randomly chosen from the sample. There are (2n2) ways to choose a pair of haploid genomes among the sample, of which (Sx2) pairs share the band. Then, we have H^x=(Sx2)(2n2)=Sx(Sx1)2n(2n1). (22) Considering all the K types of bands, therefore, we can obtain the average proportion of the shared bands (F) for a pair of haploid genomes in the sample. Namely, F=x=1KH^xx=1K(Sx2n), (23) where the denominator of the right side is the average number of bands per haploid genome. From (19), then, we can estimate π using F .

In this case, F was calculated to be 0.914. Then we have π^=0.0055 from (19). The sampling variance of π^ was computed by the jackknife method (Efron 1982) following Nei and Miller (1990), which was 1.19 × 10-8.

The nucleotide diversities of six Lens species were calculated. The data are taken from Table 2 of Sharma et al. (1996). As all the six species are selfing species, we can directly calculate by averaging Fij. The obtained is summarized in Table 2. From , the nucleotide diversity was calculated by (19), and the results are also shown in Table 2. The estimated nucleotide diversity ranges from 0.0048 to 0.0220. The sampling variance was also estimated by the jackknife method. Maugham et al. (1996) analyzed AFLP patterns in two species of Glycine (soybean), where they used PstI (six-base recognition enzyme) instead of EcoRI. Because their PstI-primer has three selected bases, c1 = 3 and c2 = 3 are given. Then, using (19), the nucleotide diversities in Glycine max and G. soja are estimated to be 0.0077 and 0.0233, respectively (Table 2).

View this table:
TABLE 1

Results of AFLP analysis of the D. tokoro genome

In the case of D. tokoro, we know whether the scored band is homozygous or heterozygous, because we have data of F1 progeny. If such data are not available, we cannot use (23) for estimating . In this case, we have to use the frequency of the band in the population. The following procedure is essentially the same as in Stephens et al. (1992). Denote the expected frequency of the xth band by fx (1 ≤ xK), where K is the number of types of scored bands. Consider that n diploid individuals are sampled from a population, and assume that the population is in Hardy-Weinberg equilibrium. Let Sx be the number of (diploid) individuals that have the xth band (1 ≤ Sxn). Then, we have E(Sxn)=fx2+2fx(1fx)=2fxfx2. (24) Using this relationship with Haldane’s correction (Haldane 1956), fx is estimated by f^x=14(nSx)+14n+1. (25) Let hx be the probability that the xth band is shared by two haploid genomes randomly chosen from the population, so that hx corresponds to the homozygosity of the xth band (hx=fx2) . From (24), hx can also be estimated by h^x=2f^xSxn, (26) where x is given by (25). Therefore, F is given by F=x=1Kh^xx=1Kf^x, (27) where the denominator of the right side is the expected number of bands per haploid genome. Using the above F , we can calculate the nucleotide diversity.

View this table:
TABLE 2

Nucleotide diversity in Lens and Glycine

DISCUSSION

In this study, we developed a method for estimating nucleotide diversity (π) from AFLP data. Although Equation 19 is very complex to calculate, the computer simulation indicates that this equation gives a good estimate of d as shown in Figure 1. The variance of the estimate increases with d, indicating that the estimate is not as reliable when d is large.

Our method was directly applied to the AFLP data set from D. tokoro. The estimated value of π was 0.0055 ± 0.0001 (SD). This value was compared with those in two gene regions of D. tokoro, which were estimated from DNA sequences by Terauchi et al. (1997). Table 3 shows the estimated π from DNA sequences. The sampling variance of the estimated π from DNA sequences is also calculated by Equation 32 in Tajima (1983). As shown in Table 3, π estimated from AFLP is larger than π from DNA sequences, except for Adh introns. Apparently, π from AFLP represents the nucleotide diversity of the total genome of D. tokoro. It is known that in eukaryote genomes many regions have little or no functions, and that in such regions the selective constraint may be very weak in comparison with functional regions (Kimura 1983; Nei 1987). Therefore, we can consider that π for the total genome may be larger than that for a specific coding region.

Another explanation for the large value of π based on AFLP data is the effect of insertions and deletions, which are assumed to be very rare events and are neglected in this study. If insertion and deletion events are not rare, π estimated by our method might be an overestimate. This problem also appears in estimation of π from RFLP data without a restriction map (Nei and Li 1979) and from RAPD data (Clark and Lanigan 1993). The degree of overestimation depends on the ratio of the rate of indel to that of nucleotide substitution, which might vary among organisms. Unfortunately, it is not always possible to know the ratio. When the ratio is not known, the present method should be used with caution.

View this table:
TABLE 3

Nucleotide diversity in D. tokoro

To investigate the amount of intraspecific variation, the AFLP pattern of D. tokoro was analyzed. As expected from the results with other plant species (Voset al. 1995), on the average 55.8 bands per primer pair were obtained for 14 primer combinations, indicating that this technique is very efficient for surveying a large number of DNA fragments. Because a number of fragments were analyzed simultaneously, the sampling variance of the estimated nucleotide diversity was relatively small, although the sample size is small. If the AFLP technology is used for large-scale population surveys, it can provide a reliable estimate of the amount of nucleotide variation.

Acknowledgments

The authors thank Naohiko Miyashita and Akira Kawabe for their comments and suggestions. This work was supported in part by a grantin-aid from the Ministry of Education, Science, Sports, and Culture of Japan.

Footnotes

  • Communicating editor: A. G. Clark

  • Received July 16, 1998.
  • Accepted November 12, 1998.

LITERATURE CITED

View Abstract