Abstract
A method for estimating the nucleotide diversity from AFLP data is developed by using the relationship between the number of nucleotide changes and the proportion of shared bands. The estimation equation is based on the assumption that GC content is 0.5. Computer simulations, however, show that this method gives a reasonably accurate estimate even when GC content deviates from 0.5, as long as the number of nucleotide changes per site (nucleotide diversity) is small. As an example, the nucleotide diversity of the wild yam, Dioscorea tokoro, was estimated. The estimated nucleotide diversity is 0.0055, which is larger than estimates from nucleotide sequence data for Adh and Pgi.
The amplified fragment length polymorphism (AFLP) technique, developed by Vos et al. (1995), is a powerful tool for DNA fingerprinting of organismal genomes. In principle, it is a combination of RFLP and PCR techniques. Briefly, DNA is digested with two restriction enzymes (EcoRI and MseI in the original protocol), and double-stranded oligonucleotide adapters are ligated to the restriction sites. PCR primers complementary to the adapters and restriction sites are used for the amplification of fragments that are flanked by the adapters. A subset of fragments is selectively amplified by PCR primers that have 2- or 3-base extensions into the restriction fragments. Only those fragments that perfectly match the primer sequences can be amplified by PCR. Therefore the complexity of PCR amplicons is reduced. In fact, DNA fingerprints consisting of 50 to 100 restriction fragments can be detected after separation in a denaturing polyacrylamide gel. Relative ease of implementation, large number of polymorphisms detected per gel, small amount of genomic DNA required, and high reproducibility of DNA fingerprint patterns recommend AFLP as an attractive method to study DNA polymorphism in general.
Although AFLP has been increasingly applied to linkage mapping of genomes in various organisms (Thomas et al. 1995; Maheswaran et al. 1997), its application to population genetics and evolution is still limited (Hill et al. 1996; Maugham et al. 1996; Sharma et al. 1996). In relevant studies, AFLP patterns were compared between individuals, and their similarity was described by the similarity index (percentage of shared fragments among the total fragments). These indices were used to generate a distance matrix, and further to reconstruct phylogenetic trees, although they do not increase linearly with divergence time. To our knowledge, no attempt has been made to date to use AFLP data for estimating the number of nucleotide changes per site between the genomes of two individuals.
Here, we report the application of the AFLP technique for estimating the nucleotide diversity (π), defined as the average number of pairwise nucleotide changes per site (Nei and Li 1979). To date, methods are available for estimating nucleotide diversity from DNA sequence (Nei and Tajima 1981; Tajima and Nei 1984), RFLP data (Nei and Li 1979; Nei and Tajima 1981), and RAPD data (Clark and Lanigan 1993), but not from AFLP data. The method for estimating the nucleotide diversity from AFLP data, reported here for the first time, might be generally useful for genetic diversity studies.
ESTIMATION METHOD
For estimation of the nucleotide diversity from AFLP data, we consider a random nucleotide sequence under the Jukes and Cantor model (Jukes and Cantor 1969), where the frequencies of four bases (G, A, T, and C) are equal (0.25). Following Nei and Li (1979) and Clark and Lanigan (1993), we assume that changes in DNA sequence are caused only by the nucleotide changes and we ignore the effect of other factors such as insertion and deletion. We denote the rate of nucleotide change per site per generation by μ. We consider a model for a haploid genome here, although the AFLP technique is usually applied to diploid species. An application to a diploid genome is presented in the next section.
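The nucleotide diversity (π) in a sample of n haploid individuals can be estimated by averaging the estimated numbers of nucleotide changes (d) over all the pairs in the sample. Namely, π can be estimated by π̂ = [2/(n(n − 1))] Σ_{i<j} d̂_{ij}, where d̂_{ij} is the estimate of d between the ith and jth haploid individuals and the sum is taken over all n(n − 1)/2 pairs.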
First, we consider the probability that a fragment is conserved by time t. If we follow the original protocol, in the AFLP technique, we have three classes of PCR products: those flanked by EcoRI adapters on both sides, those flanked by EcoRI and MseI adapters, and those flanked by MseI adapters on both sides. As only EcoRI primers are labeled, the first and second classes of fragments are visible on the autoradiograph. We call these two classes of fragments type 1 and type 2 fragments, respectively. Let Q_{1}(L) and Q_{2}(L) be the probabilities that type 1 and type 2 fragments with L nucleotides are conserved by time t. Note that L is not the real length of the amplified fragment; it is the nucleotide length of the fragment excluding the adapter sequences. In other words, L is the length of sequence that originated from the genomic DNA. If no nucleotide change occurs at either primer site and no new restriction site appears between them, the fragment is conserved. Let c_{1} and c_{2} be the numbers of the selected bases of EcoRI and MseI primers, respectively. Under the Jukes and Cantor model, the probability (p) that the nucleotide at a particular site is the same as that t generations ago is given by p = [1 + 3 exp(−4μt/3)]/4 (Jukes and Cantor 1969). Therefore, the probability that the EcoRI primer site (length of recognition sequence of EcoRI + c_{1} bp) remains by time t is p^{6 + c_{1}}, because the EcoRI recognition sequence (GAATTC) is 6 bp long.
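The per-site and per-primer-site probabilities described above can be checked numerically. The following is a minimal sketch, assuming the standard Jukes and Cantor form p = [1 + 3 exp(−4μt/3)]/4 and independent change at each site; the function names are hypothetical, and the MseI site length (TTAA, 4 bp) is included only by analogy with the EcoRI case.

```python
import math

ECORI_SITE_LEN = 6  # GAATTC
MSEI_SITE_LEN = 4   # TTAA (assumed to be treated like the EcoRI site)


def jc_same_base_prob(mu_t):
    """Probability that a site carries the same base as t generations ago
    under the Jukes-Cantor model, where mu_t is the product mu * t."""
    return (1.0 + 3.0 * math.exp(-4.0 * mu_t / 3.0)) / 4.0


def primer_site_conserved_prob(mu_t, site_len, n_selective):
    """Probability that every base of a primer site (recognition sequence
    plus selective bases) is unchanged, assuming independent sites."""
    return jc_same_base_prob(mu_t) ** (site_len + n_selective)


# Example: EcoRI primer with one selective base (c1 = 1), mu * t = 0.005
p_eco = primer_site_conserved_prob(0.005, ECORI_SITE_LEN, 1)
```

With μt = 0 both functions return 1, and as μt grows the per-site probability approaches 1/4, the value expected for unrelated random sequences with equal base frequencies.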
Next, we consider the distribution of L. Assume that L is restricted to the range between L_{min} and L_{max}, the minimum and maximum nucleotide lengths of the fragments that can be scored on the autoradiograph. Let G_{1}(L) be the distribution of L of type 1 fragments and
Finally, we consider the relationship between the number of nucleotide changes (d) and the expected proportion of shared bands (F) for a pair of haploid individuals. Denote by R_{1} the average probability that a type 1 fragment is conserved by time t in both lineages of a pair of haploid individuals. When they diverged t generations ago, the expectation of d is 2μt. Therefore, R_{1} is written as the average of Q_{1}(L)^{2} weighted by G_{1}(L) in the interval between L_{min} and L_{max}. Namely,
Here, let us consider the relationship between F and R. In RFLP analysis, Nei and Li (1979) used the relationship F = R. In AFLP analysis, a large number of bands can appear. In this case, when a pair of haploid individuals is compared, there is a possibility that both haploid individuals share a particular band on an autoradiograph, but the band has not originated from the same region on the chromosome. This is because two or more fragments with the same length can arise from different regions. Namely, there may be some bands that are shared by a pair of haploid individuals by chance. Therefore we have F > R, and F is given by
From the relationship between F and d (= 2μt), we can estimate d from F. Let n be the number of haploid individuals investigated and F̂_{ij} be the estimated proportion of shared bands when the ith and jth haploid individuals are compared. Following Nei and Li (1979), F̂_{ij} is given by
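The proportion of shared bands can be illustrated with the similarity index of Nei and Li (1979), which is twice the number of shared bands divided by the total number of bands in the two profiles. The sketch below is illustrative only: each profile is represented as a set of hypothetical band identifiers (here, fragment lengths), and the function names are assumptions rather than notation from this article.

```python
from itertools import combinations


def shared_band_proportion(bands_i, bands_j):
    """Nei and Li (1979) index: 2 * (shared bands) / (bands in i + bands in j)."""
    bands_i, bands_j = set(bands_i), set(bands_j)
    total = len(bands_i) + len(bands_j)
    if total == 0:
        raise ValueError("both band profiles are empty")
    return 2 * len(bands_i & bands_j) / total


def mean_shared_band_proportion(profiles):
    """Average of the pairwise index over all pairs of profiles."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        raise ValueError("at least two profiles are required")
    return sum(shared_band_proportion(a, b) for a, b in pairs) / len(pairs)


# Hypothetical band profiles (fragment lengths in bp) for three haploid genomes
profiles = [
    {100, 150, 200, 250},
    {100, 150, 300},
    {100, 150, 200, 250},
]
f_12 = shared_band_proportion(profiles[0], profiles[1])  # 2 * 2 / (4 + 3)
f_mean = mean_shared_band_proportion(profiles)
```

Representing each profile as a set makes the shared bands a simple set intersection; in practice a band would usually be identified by both primer combination and fragment length.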
There is another method for estimating π, in which the average of F̂_{ij} (F̃) is used. Namely, we have
F can also be estimated by
COMPUTER SIMULATION
In the above equations we have made several assumptions and approximations. To assess the accuracy of the present method, a computer simulation was conducted. The procedure of the simulation is as follows. A random ancestral sequence with the length of M million bp is constructed. The sequence consists of four nucleotides, A, T, G, and C with a given GC content (g). On this sequence, random mutations are generated. The number of mutations is determined by following the Poisson distribution with mean μt. As models of mutation, we used the equal-input and equal-output models in Tajima and Nei (1982). The mutation rates used in the simulation are as follows, where we denote the mutation rate from nucleotide X to Y by μ_{XY}. In the equal-input model with g = 0.33, μ_{AT} = μ_{TA} = μ_{GA} = μ_{GT} = μ_{CA} = μ_{CT} = 6μ/13 and μ_{AG} = μ_{AC} = μ_{TG} = μ_{TC} = μ_{GC} = μ_{CG} = 3μ/13. In the equal-input model with g = 0.67, μ_{AT} = μ_{TA} = μ_{GA} = μ_{GT} = μ_{CA} = μ_{CT} = 3μ/13 and μ_{AG} = μ_{AC} = μ_{TG} = μ_{TC} = μ_{GC} = μ_{CG} = 6μ/13. In the equal-output model with g = 0.33, μ_{AT} = μ_{AG} = μ_{AC} = μ_{TA} = μ_{TG} = μ_{TC} = 3μ/4 and μ_{GA} = μ_{GT} = μ_{GC} = μ_{CA} = μ_{CT} = μ_{CG} = 3μ/2. In the equal-output model with g = 0.67, μ_{AT} = μ_{AG} = μ_{AC} = μ_{TA} = μ_{TG} = μ_{TC} = 3μ/2 and μ_{GA} = μ_{GT} = μ_{GC} = μ_{CA} = μ_{CT} = μ_{CG} = 3μ/4. Apparently, all the mutation rates are μ/4 when g = 0.5 in both models. This mutational process is carried out twice so that two descendant sequences are obtained. For these two sequences, the AFLP fragments are detected and the lengths of the fragments (L) are scored if L_{min} ≤ L ≤ L_{max}, and the proportion of the shared bands (fragments) is calculated by (20).
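The first steps of this procedure (constructing a random ancestral sequence with GC content g and deriving two descendant sequences by random mutation) can be sketched as follows. This is a minimal illustration, not the program used for the simulations: it covers only the case g = 0.5 for mutation, where every nucleotide change is equally likely; it assumes that μt is a per-site quantity, so that the expected total number of mutations is μt times the sequence length; and the function names and the short sequence length are hypothetical. Fragment detection is not included.

```python
import math
import random

BASES = "ATGC"


def random_sequence(length, gc, rng):
    """Draw a random sequence whose expected GC content is `gc`."""
    weights = [(1 - gc) / 2, (1 - gc) / 2, gc / 2, gc / 2]  # A, T, G, C
    return rng.choices(BASES, weights=weights, k=length)


def poisson(lam, rng):
    """Poisson variate by Knuth's method (adequate for modest means)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def mutate(seq, mu_t, rng):
    """Apply a Poisson number of mutations (mean mu_t * len(seq)) at random
    sites; each hit replaces the base by one of the other three bases."""
    out = list(seq)
    n_hits = poisson(mu_t * len(out), rng)
    for _ in range(n_hits):
        pos = rng.randrange(len(out))
        out[pos] = rng.choice([b for b in BASES if b != out[pos]])
    return out, n_hits


def simulate_pair(length, gc, mu_t, seed):
    """Return an ancestral sequence and two independently mutated descendants."""
    rng = random.Random(seed)
    ancestor = random_sequence(length, gc, rng)
    desc1, _ = mutate(ancestor, mu_t, rng)
    desc2, _ = mutate(ancestor, mu_t, rng)
    return ancestor, desc1, desc2
```

Because a site can be hit more than once, the number of sites that differ between a descendant and the ancestor can be smaller than the number of mutations applied.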
The results of the simulation for M = 1.6 and g = 0.5 are shown in Figure 1. The selective base of the EcoRI primer was A and that of the MseI primer was G, so that c_{1} = 1 and c_{2} = 1. The number of replications for a given d was 1000. Note that the equal-input and equal-output models result in the same model when g = 0.5. The average number of bands (m) that can be scored was ∼38. Figure 1A shows the average of F̂ with the theoretical expectation obtained by (19). It is shown that the average of F̂ is very close to the expected value. From F̂, d is estimated by (19), and the average of d̂ is plotted in Figure 1B. d̂ is very close to the true d. The variance of d̂ increases as d increases, although the variance of F̂ is nearly constant.
It is known that GC content is not 0.5 in many organisms. By computer simulation, we investigated whether the relationship between d and F presented by Equation 19 holds when GC content deviates from 0.5. Note that this formula assumes that GC content is 0.5. Two values of GC content were investigated (g = 0.33 and 0.67). Since GC content affects the number of bands (m), the genome size (M) was adjusted so that m ≈ 38 (M = 1.3 and 5.8 for g = 0.33 and 0.67, respectively). From F̂, d was estimated by (19). In Figure 2, the average of d̂ is plotted with true d. When g = 0.33, d̂ is smaller than the true value (Figure 2A). On the other hand, d̂ is larger than the true value when g = 0.67 (Figure 2B). The deviation of d̂ from true d is larger in the equal-output model than in the equal-input model, indicating that the degree of the deviation of d̂ from true d depends on the mutation model. However, if d < 0.025, d̂ is very close to the true value in our simulation even when g = 0.33 and 0.67, suggesting that Equation 19 is quite useful in a range of GC content between 0.33 and 0.67 when d is small.
APPLICATIONS
Using the relationship between F and d, we estimated the nucleotide diversity in Dioscorea tokoro. D. tokoro is a dioecious, diploid, wild yam species distributed in East Asia. The AFLP data are unpublished results of R. Terauchi and G. Kahl. Two individuals [DT5 (female) and DT7 (male)], collected from Wakayama Prefecture in Japan, were investigated. For linkage analysis, segregation data of AFLP patterns in their F_{1} progenies are available. In the present article, we estimate the nucleotide diversity in these two individuals, DT5 and DT7 (corresponding to four haploid individuals), from the AFLP data.
Table 1 summarizes the results of AFLP detected between DT5 and DT7 for 14 primer combinations. PCR primers complementary to EcoRI and MseI adapters have two and three selective bases at their 3′ ends, respectively. As there are segregation data among progeny (R. Terauchi and G. Kahl, unpublished results), it was possible to distinguish the homozygous (indicated by ++) and heterozygous (+) states of the fragments. Thus the combinations of the AFLP genotypes for DT5 and DT7 could be classified into eight classes. The number of AFLP fragments (bands) detected for each primer combination ranged from 48 to 102, with a total of 897 fragments for 14 primer combinations. About 76% of bands were homozygous (++) for both individuals.
From Table 1,
In this case,
The nucleotide diversities of six Lens species were calculated. The data are taken from Table 2 of Sharma et al. (1996). As all the six species are selfing species, we can directly calculate F̃ by averaging F_{ij}. The obtained F̃ is summarized in Table 2. From F̃, the nucleotide diversity was calculated by (19), and the results are also shown in Table 2. The estimated nucleotide diversity ranges from 0.0048 to 0.0220. The sampling variance was also estimated by the jackknife method. Maugham et al. (1996) analyzed AFLP patterns in two species of Glycine (soybean), where they used PstI (six-base recognition enzyme) instead of EcoRI. Because their PstI primer has three selected bases, c_{1} = 3 and c_{2} = 3 are given. Then, using (19), the nucleotide diversities in Glycine max and G. soja are estimated to be 0.0077 and 0.0233, respectively (Table 2).
In the case of D. tokoro, we know whether the scored band is homozygous or heterozygous, because we have data of F_{1} progeny. If such data are not available, we cannot use (23) for estimating F̃. In this case, we have to use the frequency of the band in the population. The following procedure is essentially the same as in Stephens et al. (1992). Denote the expected frequency of the xth band by f_{x} (1 ≤ x ≤ K), where K is the number of types of scored bands. Consider that n diploid individuals are sampled from a population, and assume that the population is in Hardy-Weinberg equilibrium. Let S_{x} be the number of (diploid) individuals that have the xth band (1 ≤ S_{x} ≤ n). Then, we have
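As an illustration of how a band frequency can be recovered from dominant presence/absence data, the sketch below uses the standard Hardy-Weinberg argument: a diploid individual lacks the band only if both of its haploid genomes lack it, so the expected fraction of individuals showing the band is 1 − (1 − f_{x})^{2}. This is an assumed, simplified estimator (no small-sample correction), the function name is hypothetical, and it may differ in detail from the equation used in this article.

```python
import math


def band_frequency_hwe(s_x, n):
    """Estimate the frequency f_x of a dominant band from the number s_x of
    diploid individuals (out of n) that show it, assuming Hardy-Weinberg
    equilibrium: s_x / n estimates 1 - (1 - f_x)**2."""
    if n <= 0 or not 0 <= s_x <= n:
        raise ValueError("need n > 0 and 0 <= s_x <= n")
    return 1.0 - math.sqrt(1.0 - s_x / n)


# Example: a band seen in 3 of 4 individuals gives f_x = 1 - sqrt(0.25) = 0.5
f_example = band_frequency_hwe(3, 4)
```

When every sampled individual shows the band (S_{x} = n), this estimator returns 1, which is an upper bound rather than a precise estimate for small n.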
DISCUSSION
In this study, we developed a method for estimating nucleotide diversity (π) from AFLP data. Although Equation 19 is very complex to calculate, the computer simulation indicates that this equation gives a good estimate of d as shown in Figure 1. The variance of the estimate increases with d, indicating that the estimate is not as reliable when d is large.
Our method was directly applied to the AFLP data set from D. tokoro. The estimated value of π was 0.0055 ± 0.0001 (SD). This value was compared with those in two gene regions of D. tokoro, which were estimated from DNA sequences by Terauchi et al. (1997). Table 3 shows the estimated π from DNA sequences. The sampling variance of the estimated π from DNA sequences was calculated by Equation 32 in Tajima (1983). As shown in Table 3, π estimated from AFLP is larger than π from DNA sequences, except for Adh introns. Apparently, π from AFLP represents the nucleotide diversity of the total genome of D. tokoro. It is known that in eukaryote genomes many regions have little or no function, and that in such regions the selective constraint may be very weak in comparison with functional regions (Kimura 1983; Nei 1987). Therefore, we can consider that π for the total genome may be larger than that for a specific coding region.
Another explanation for the large value of π based on AFLP data is the effect of insertions and deletions, which are assumed to be very rare events and are neglected in this study. If insertion and deletion events are not rare, π estimated by our method might be an overestimate. This problem also appears in estimation of π from RFLP data without a restriction map (Nei and Li 1979) and from RAPD data (Clark and Lanigan 1993). The degree of overestimation depends on the ratio of the rate of indel to that of nucleotide substitution, which might vary among organisms. Unfortunately, it is not always possible to know the ratio. When the ratio is not known, the present method should be used with caution.
To investigate the amount of intraspecific variation, the AFLP pattern of D. tokoro was analyzed. As expected from the results with other plant species (Vos et al. 1995), on average 55.8 bands per primer pair were obtained for 14 primer combinations, indicating that this technique is very efficient for surveying a large number of DNA fragments. Because a number of fragments were analyzed simultaneously, the sampling variance of the estimated nucleotide diversity was relatively small, although the sample size is small. If the AFLP technology is used for large-scale population surveys, it can provide a reliable estimate of the amount of nucleotide variation.
Acknowledgments
The authors thank Naohiko Miyashita and Akira Kawabe for their comments and suggestions. This work was supported in part by a grant-in-aid from the Ministry of Education, Science, Sports, and Culture of Japan.
Footnotes

Communicating editor: A. G. Clark
 Received July 16, 1998.
 Accepted November 12, 1998.
 Copyright © 1999 by the Genetics Society of America