Genetics, Vol. 167, 1915-1928, August 2004, Copyright © 2004
doi:10.1534/genetics.103.015693
Significance Tests and Weighted Values for AFLP Similarities, Based on Arabidopsis in Silico AFLP Fragment Length Distributions
Wim J. M. Koopman*,1 and
Gerrit Gort
* Nationaal Herbarium Nederland–Wageningen Branch, Biosystematics Group, Wageningen University, 6703 BL Wageningen, The Netherlands
Biometris, Wageningen University and Research Centre, 6700 AA, Wageningen, The Netherlands
1 Corresponding author: Plant Research International B.V., 6700 AA, Wageningen, The Netherlands.
E-mail: wim.koopman{at}wur.nl
Many AFLP studies include relatively unrelated genotypes that contribute noise to data sets instead of signal. We developed: (1) estimates of expected AFLP similarities between unrelated genotypes, (2) significance tests for AFLP similarities, enabling the detection of unrelated genotypes, and (3) weighted similarity coefficients, including band position information. Detection of unrelated genotypes and use of weighted similarity coefficients will make the analysis of AFLP data sets more informative and more reliable. Test statistics and weighted coefficients were developed for total numbers of shared bands and for Dice, Jaccard, Nei and Li, and simple matching (dis)similarity coefficients. Theoretical and in silico AFLP fragment length distributions (FLDs) were examined as a basis for the tests. The in silico AFLP FLD based on the Arabidopsis thaliana genome sequence was the most appropriate for angiosperms. The G + C content of the selective nucleotides in the in silico AFLP procedure significantly influenced the FLD. Therefore, separate test statistics were calculated for AFLP procedures with high, average, and low G + C contents in the selective nucleotides. The test statistics are generally applicable for angiosperms with a G + C content of approximately 35–40%, but represent conservative estimates for genotypes with higher G + C contents. For the latter, test statistics based on a rice genome sequence are more appropriate.
AFLP is a DNA fingerprinting technique developed by Keygene N.V. (VOS et al. 1995). The technique consists of four steps: (1) digestion of DNA with two restriction enzymes, (2) ligation of double-stranded oligonucleotide adapters to the restriction fragments, (3) selective PCR amplification of the ligated fragments with specific PCR primers that have selective nucleotides at their 3' end, and (4) separation of the amplified fragments on a denaturing polyacrylamide gel. On this gel, the fragments are separated by their length. Inclusion of a base-pair ladder enables determination of the exact length of each fragment.
In recent years, AFLPs have become a popular tool for relationship studies (MUELLER and LAREESA WOLFENBARGER 1999). In these studies, the AFLPs are scored as dominant anonymous markers. Dominant scoring of AFLPs means that each fragment is scored as either present or absent and that the fragments are assumed to occur independently of each other. Scoring as anonymous markers means that the fragments are recognized only by their length, while their sequence is unknown. Fragments of the same length, which are comigrating on a gel, are assumed to be identical. The fraction of fragments comigrating across genotypes, expressed in some way by a similarity or dissimilarity coefficient, is used as a measure for genetic or phenetic relationship. Various coefficients have been developed to quantify (dis)similarity, mainly differing in the weighting of comigrating relative to noncomigrating fragments (see, e.g., NEI and LI 1979; ROHLF 1993).
The assumption that all comigrating fragments are identical is an oversimplification of the actual situation (VEKEMANS et al. 2002). In reality, a certain fraction of fragments will be comigrating by chance only, while having distinct sequences. Because these fragments will be scored as identical, their presence leads to an overestimation of the similarity among genotypes. The presence of nonidentical fragments comigrating across genotypes was demonstrated in actual data sets of Solanum tuberosum (ROUPPE VAN DER VOORT et al. 1997), Carduinae thistles (O'HANLON and PEAKALL 2000), and Hordeum species (EL-RABEY et al. 2002). The presence of nonidentical fragments comigrating within genotypes was demonstrated in Beta (HANSEN et al. 1999) and Glycine max (MEKSEM et al. 2001). The proportion of comigrating nonidentical fragments ranged from at least 10% within genotypes or among closely related genotypes (ROUPPE VAN DER VOORT et al. 1997; HANSEN et al. 1999; MEKSEM et al. 2001) to 100% for pairs of genotypes from more distantly related taxa (O'HANLON and PEAKALL 2000). Given the proportions of comigrating nonidentical bands, a serious overestimation of pairwise similarities among genotypes can be expected. Indeed, KARP et al. (1996) noted that the occurrence of nonidentical comigrating AFLP fragments may pose serious problems for the application of AFLPs in relationship studies, but the issue was largely ignored in literature thereafter.
In this study, we quantify the occurrence of nonidentical comigrating AFLP fragments for AFLP procedures with restriction enzymes EcoRI/MseI. The estimates are used to (1) determine the expected numbers of comigrating nonidentical bands and (2) develop significance tests for AFLP similarities. As a basis for the significance tests we determine and evaluate theoretical AFLP fragment length distributions based on INNAN et al. (1999) and in silico AFLP fragment length distributions (FLDs) based on the complete Arabidopsis thaliana (L.) Heynh. genome sequence (ARABIDOPSIS GENOME INITIATIVE 2000). Using the A. thaliana (hereafter, Arabidopsis) FLD, we estimate the probability distribution of the number of nonidentical AFLP bands comigrating across genotypes. From this distribution, we determine expectations and 95 and 99% critical values for band numbers and (dis)similarity coefficients Dice, Jaccard, Nei and Li, and simple matching (NEI and LI 1979; ROHLF 1993). The critical values can be used to test the significance of a given pairwise similarity among angiosperm genotypes. If desired, genotypes that do not contribute significant relationship information can be removed from a data set. Determination of the expected numbers of comigrating nonidentical bands also yielded information on the underlying band length distribution probabilities. However, the usual similarities calculated using the Dice, Jaccard, Nei and Li, and simple matching coefficients ignore this information, assuming identical probabilities for all bands. As an alternative, we propose similarity coefficients that weight the AFLP bands according to their band length distribution probabilities. It is expected that the use of the significance tests and weighted similarities will make the analysis of AFLP data sets more informative and more reliable.
General strategy:
The number of nonidentical AFLP bands comigrating across genotypes depends on the number of bands scored for each genotype, the number of possible band lengths for the genotypes (i.e., the number of discrete band positions possible within a selected scoring range), and the length distribution of the AFLP fragments. Note that one AFLP band may contain multiple fragments (discussed later). In empirical data sets, the number of possible band positions and the number of bands for each genotype are known; only the FLD remains to be determined. The distribution can be obtained in several ways, e.g., (1) derived from AFLP band data in empirical data sets, (2) calculated using theoretical FLDs, and (3) determined in silico, if representative genome sequence data (preferably entire genomes) are available.

The use of empirical data involves the risk of introducing methodological error into the calculations resulting from the AFLP procedure itself. Such errors may include, e.g., biases in fragment amplification or in scoring of bands. Theoretically derived or in silico-generated FLDs do not have this drawback.
Theoretical distributions may be preferred over in silico distributions, because they are exactly formulated, using explicit assumptions and parameter settings. In this article, we examine the length distribution for AFLP fragments proposed by INNAN et al. (1999) as a theoretical basis on which to estimate the proportion of nonidentical bands comigrating across genotypes. To our knowledge, no alternative AFLP FLD has been proposed yet.
Use of in silico AFLP FLDs has the drawback that the distribution itself has to be estimated from the available genome data. Therefore, it is inherently subject to uncertainty because of estimation error and limited by the availability and representativeness of the genome data. However, in silico AFLP data also have two major advantages. First, the AFLP fragments represent an actual genome. Thus, their distribution is not subject to assumptions that underlie theoretical models. Second, when the procedure is performed properly, no fragments will be lost due to methodological errors, and all possible fragments will be represented in the AFLP data set. Here, we examine an in silico FLD based on the genome sequence of the model plant Arabidopsis as an alternative to the theoretical distribution of INNAN et al. (1999). All statistical procedures were performed in SAS Release 8.00 (SAS Institute, Cary, NC).
Theoretical AFLP fragment length distributions:
INNAN et al. (1999) describe AFLP FLDs for EcoRI and MseI restriction enzymes under the assumption of (1) a random nucleotide sequence under the Jukes and Cantor model [equal base frequencies C = A = T = G = 0.25, and all substitutions equally likely (JUKES and CANTOR 1969)]; (2) nucleotide changes as sole cause of changes in DNA sequence; and (3) a haploid genome. They showed that both EcoRI/EcoRI and EcoRI/MseI fragments follow the same truncated geometric distribution
$$P(L) = \frac{(1 - A)\,A^{L - L_{\min}}}{1 - A^{L_{\max} - L_{\min} + 1}}, \qquad L_{\min} \le L \le L_{\max},$$
in which $L$ is the length of the AFLP fragments, $L_{\min}$ and $L_{\max}$ are the minimum and maximum possible lengths of the fragments considered, and $A$ = (1 − probability of formation of new EcoRI site)(1 − probability of formation of new MseI site). The probability of formation of a restriction site equals the multiplied relative frequencies of the individual nucleotides required for such a site (GAATTC for EcoRI, TTAA for MseI). Under the assumption of equal frequencies of occurrence for all four nucleotides as made by INNAN et al. (1999), $A = (1 - 0.25^{6})(1 - 0.25^{4})$. To examine the influence of nucleotide frequencies on the AFLP FLD, we calculated distributions for various ratios of A + T vs. G + C. A literature survey revealed that the G + C contents of the majority of plants ranged between 35 and 50% (see, e.g., MARIE and BROWN 1993; BAROW and MEISTER 2002). However, various plant groups showed different G + C contents. The average G + C content was 37% for gymnosperms, 40% for dicotyledons, 41% for ferns, 44% for monocotyledons, and 45% for algae. Viscum album possibly occupies a special position with only 30% G + C (NAGL and STEIN 1989), although MARIE and BROWN (1993) reported 39% G + C. We covered the G + C range by calculating separate AFLP FLDs for 35, 40, 45, and 50% G + C. The nucleotide frequencies of $A$ in the formula of INNAN et al. (1999) were adjusted accordingly, with equal splitting of percentages over A + T and G + C nucleotides. For easy comparison with empirical data sets, all fragment and band lengths that are reported in this article include adapter sequences.
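The calculation described above can be sketched in a few lines of Python. This is an illustration only, not the authors' SAS code: the function names are hypothetical, the scoring window of 50 to 500 bp is an assumed example, and the distribution is taken to be proportional to A^L within the window, which is what a truncated geometric distribution with ratio A amounts to.

```python
def site_free_probability(gc):
    """A = (1 - P[EcoRI site]) * (1 - P[MseI site]) for a G + C fraction gc.

    G + C and A + T percentages are split equally over the two nucleotides,
    as described in the text.
    """
    g = c = gc / 2.0
    a = t = (1.0 - gc) / 2.0
    p_ecori = g * a * a * t * t * c   # GAATTC
    p_msei = t * t * a * a            # TTAA
    return (1.0 - p_ecori) * (1.0 - p_msei)


def truncated_geometric(gc, l_min, l_max):
    """Return {L: P(L)} with P(L) proportional to A**L on [l_min, l_max]."""
    big_a = site_free_probability(gc)
    weights = {length: big_a ** length for length in range(l_min, l_max + 1)}
    total = sum(weights.values())
    return {length: w / total for length, w in weights.items()}


# Assumed example window: 50-500 bp, at the two extremes of the G + C range.
fld_35 = truncated_geometric(0.35, 50, 500)
fld_50 = truncated_geometric(0.50, 50, 500)
```

Under these assumptions, `fld_35` puts more probability on short fragments than `fld_50`, in line with the trend described for Figure 1.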
Figure 1 depicts the AFLP FLDs for 35–50% G + C. The distributions show that the probability that a fragment will occur decreases with increasing fragment length for all G + C contents. The shape of the distribution is also influenced by the base composition: low G + C contents yield relatively high frequencies of smaller fragments, while high G + C contents yield relatively high frequencies of longer fragments. The uniform distribution (all fragment lengths equally likely) is given as a reference.
Arabidopsis in silico AFLP fragment length distributions:
Sequence data of the entire Arabidopsis genome sequence were obtained from The Institute for Genomic Research through the web site at http://www.tigr.org. The Arabidopsis in silico AFLP was performed using the restriction enzyme sequences of EcoRI/MseI without any selective nucleotides. The probability distribution of the fragment lengths was estimated by fitting a cubic smoothing spline and rescaling properly, using SAS PROC IML. The smoothing parameter of the spline (200.000) was chosen by eye. The more objective approach of cross-validation (SAS PROC INSIGHT) resulted in an unsatisfactory smoothing level and a spline oscillating around the one chosen by eye. The smoothing spline and the relative frequency distribution of the in silico AFLP fragments are depicted in Figure 2. Fragment lengths range from 32 to 1024 bp.
To compare the in silico AFLP FLD with the theoretical distribution of INNAN et al. (1999), we calculated a theoretical distribution using the nucleotide frequencies from the Arabidopsis genome sequence (G = C = 0.18 and A = T = 0.32 for all five chromosomes). Figure 2 shows a clear difference between the theoretical and the in silico FLD. Compared to the theoretical distribution, the in silico distribution shows a lack of smaller bands (<179 bp) and an excess of larger bands (>179 bp). The difference may originate in the nucleotide sequence model employed by INNAN et al. (1999), which was probably too simple to adequately describe the Arabidopsis in silico FLD (see DISCUSSION). Given the limitations of the theoretical model and the fact that, in contrast, the Arabidopsis in silico FLD reflects an actual genome sequence, we consider the Arabidopsis distribution to be the more accurate basis for our significance tests for AFLP similarities.
The in silico AFLP FLD was generated without selective nucleotides to obtain the highest possible number of AFLP fragments. In practice, however, selective nucleotides are always employed in AFLP procedures on plants. To test the influence of selective nucleotides on the AFLP FLD, we performed additional in silico AFLP runs with three +1/+1 selective nucleotide combinations: A/C (the most commonly used single-nucleotide combination), T/A (the nucleotides with the highest frequency in the Arabidopsis genome), and C/G (the nucleotides with the lowest frequency in the Arabidopsis genome). A two-sample Kolmogorov-Smirnov test (SAS PROC NPAR1WAY) showed a significant influence of T/A (P = 0.002) and C/G (P = 0.001) selective nucleotides on the FLD. The distribution for selective nucleotides A/C did not differ significantly from that without selective nucleotides (P = 0.62). Figure 2 illustrates the influence of selective nucleotides on the in silico AFLP FLD. The use of T/A selective nucleotides results in an overrepresentation of shorter fragments (<107 bp) and an underrepresentation of longer fragments (>107 bp). The use of C/G selective nucleotides results in an overrepresentation of longer fragments (>107 bp) and an underrepresentation of shorter fragments (<107 bp). The difference indicates that selection of AFLP fragments using selective nucleotides is not a random process (see DISCUSSION).
Each fragment in an AFLP profile contains a discrete number of nucleotides. If properly measured, the length of a fragment equals this number of nucleotides. Given the discrete nature of the AFLP fragment lengths, the AFLP FLDs are discrete distributions. In Figures 2 and 4, however, the AFLP FLDs appear as continuous distributions, because the large number of possible lengths makes it impossible to visualize the actual discreteness. For the in silico AFLPs without selective nucleotides, Figures 2 and 4 show both the smoothed discrete FLDs (line A in Figure 2; lines A and B in Figure 4) and the nonsmoothed discrete FLDs (probability in each length class depicted by a dot). All statistical procedures in this study are based on the discrete smoothed distributions. As a consequence, band lengths used as input for the statistical tests developed in our study should be discrete (i.e., integer) values.
AFLP fragments and AFLP bands:
Similarities in AFLP patterns result from fragments that are comigrating across genotypes, and two types of such fragments can be distinguished: first, fragments that share the same sequence and originate from the same loci (comigrating identical fragments; these fragments reflect the genetic similarity among genotypes); and second, fragments having different sequences, originating from different loci (comigrating nonidentical fragments; these fragments comigrate by chance only, and do not reflect genetic similarity). Genotypes that are too distantly related for the AFLP technique to detect any relationship information (called "unrelated" hereafter) share only the second type of fragments. Therefore, an estimate of the number of nonidentical fragments comigrating across genotypes is an estimate of the lower boundary for fragment similarity to indicate relationship. We use this number to derive test statistics for significance tests on pairwise AFLP similarities between genotypes.

In an ideal situation, each AFLP band consists of only one AFLP fragment, enabling a one-to-one translation of AFLP fragments into AFLP bands. In that case, test statistics for significance tests can be based directly on the numbers of nonidentical fragments comigrating across genotypes. In practice, however, an AFLP band often contains multiple fragments that are comigrating within the same genotype. As a result, identical bands comigrating across genotypes may contain both identical and nonidentical fragments, while nonidentical bands comigrating across genotypes each may contain multiple nonidentical fragments. The phenomenon of nonidentical comigrating fragments (both within and across genotypes) is known as size homoplasy (VEKEMANS et al. 2002). In most relationship studies this size homoplasy is ignored, and only the presence or absence of AFLP bands is recorded. As a result, the similarities calculated in these studies are based on AFLP band similarities rather than on AFLP fragment similarities. For significance tests to be readily applicable in such relationship studies, the test statistics should be derived from the numbers and positions of nonidentical bands comigrating across genotypes. To account for the size homoplasy, however, information on the numbers and positions of nonidentical fragments comigrating across genotypes should be included as well. We constructed a series of significance tests that meet both demands. To our knowledge, there is no straightforward analytical procedure to calculate the relationship between the numbers of AFLP fragments and numbers of AFLP bands. Therefore, we estimated this relationship using Monte Carlo simulations.
Significance tests for pairwise AFLP band similarities:
The significance tests for pairwise AFLP band similarities were developed in three steps. In the first step, probability distributions, P, of the numbers of nonidentical bands comigrating across genotypes were determined. In the second step, from P the expectation, standard deviation, and approximate critical values (95 and 99%) of numbers of nonidentical bands comigrating across genotypes were determined. In the third step, the same quantities were determined for four widely employed (dis)similarity coefficients.
- For each pairwise comparison, two independent AFLP band patterns were generated with the appropriate numbers of bands (e.g., 50 and 60). The band patterns were generated by randomly drawing fragments from the smoothed Arabidopsis AFLP FLD. Note that the fragments are drawn only from the part of the Arabidopsis AFLP FLD corresponding to the scoring range of interest (e.g., 50–500 bp). The numbers of fragments needed for each band pattern were often higher than the numbers of bands in the patterns, because some of the fragments ended up in the same bands. The difference between the numbers of fragments and the numbers of bands indicates the amount of size homoplasy in the band pattern (see also Nonidentical AFLP fragments comigrating within genotypes). To determine the number of fragments to be drawn from the AFLP FLD in an unbiased way, we repeatedly drew a fragment count from a uniform distribution. Next, a number of fragments equal to the fragment count was drawn from the smoothed Arabidopsis FLD, and the resulting number of AFLP bands was determined. The procedure was repeated until the appropriate numbers of bands (e.g., 50 and 60) were reached in both AFLP patterns. For these numbers of bands, the number of bands comigrating across both AFLP patterns was determined and recorded. The entire procedure was repeated 1,000,000 times, and the probability distribution P was estimated from the scores of all 1,000,000 replications.
- In the second step, expected numbers of nonidentical bands comigrating across genotypes (i.e., expected numbers of bands comigrating by chance), standard deviation, and approximate critical values (95 and 99%) were determined from the probability distribution P. Because the variables under study are discrete, exact 95 and 99% critical values could not be calculated. Instead, approximate values were determined by interpolation.
- In most relationship studies, similarity among genotypes is reported using (dis)similarity coefficients rather than numbers of comigrating bands. These coefficients somehow express the proportion of comigrating relative to noncomigrating bands. A literature survey showed that the majority of studies employed Dice similarity (DICE 1945) or Nei and Li distance (NEI and LI 1979), while Jaccard (JACCARD 1908) and simple matching (SOKAL and SNEATH 1963) similarity are also widely employed. For a given pair of genotypes, let $x_i = 0$ when no AFLP band is present at position $i$ in genotype 1, and $x_i = 1$ when an AFLP band is present at position $i$ in genotype 1. Likewise, $y_i = 0$ or 1 for genotype 2. For a scoring range $1, \ldots, N$, let $s_i = 1$ when a certain band position is scored in a data set and $s_i = 0$ when a band position is not scored. Let $a = \sum_{i=1}^{N} s_i x_i y_i$, $b = \sum_{i=1}^{N} s_i x_i (1 - y_i)$, $c = \sum_{i=1}^{N} s_i (1 - x_i) y_i$, and $d = \sum_{i=1}^{N} s_i (1 - x_i)(1 - y_i)$. Then Dice = 2a/(2a + b + c), Jaccard = a/(a + b + c), and simple matching = (a + d)/(a + b + c + d). Nei and Li = (1 − Dice). To make our tests readily applicable in relationship studies employing the above coefficients, we used the numbers of nonidentical bands comigrating across genotypes to get (dis)similarity values. The recalculations involved two steps. First, probability distributions for all four coefficients were calculated, on the basis of the probability distribution of the number of comigrating bands, P. Next, expected values and approximate critical values (95 and 99%) were determined from these distributions as described previously.
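The four coefficients named in the third step can be illustrated with a short Python sketch. It assumes the usual contingency counts for presence/absence data (a: band in both genotypes, b: band in genotype 1 only, c: band in genotype 2 only, d: band in neither), restricted to scored positions. The function names and the example band vectors are hypothetical.

```python
def contingency(x, y, s=None):
    """Count a, b, c, d over scored positions for two 0/1 band vectors."""
    if s is None:
        s = [1] * len(x)  # assume every position is scored
    a = sum(si * xi * yi for xi, yi, si in zip(x, y, s))
    b = sum(si * xi * (1 - yi) for xi, yi, si in zip(x, y, s))
    c = sum(si * (1 - xi) * yi for xi, yi, si in zip(x, y, s))
    d = sum(si * (1 - xi) * (1 - yi) for xi, yi, si in zip(x, y, s))
    return a, b, c, d


def dice(a, b, c, d):
    return 2 * a / (2 * a + b + c)


def jaccard(a, b, c, d):
    return a / (a + b + c)


def simple_matching(a, b, c, d):
    return (a + d) / (a + b + c + d)


def nei_li(a, b, c, d):
    return 1 - dice(a, b, c, d)


# Made-up example: five band positions, all scored.
x = [1, 1, 0, 0, 1]
y = [1, 0, 1, 0, 1]
counts = contingency(x, y)
```

For this toy pair the counts are a = 2, b = 1, c = 1, d = 1. The 0/0 match enters simple matching only, so it does not affect the Dice and Jaccard values.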
The entire procedure has been incorporated in the computer program AFLSIM, which can be downloaded from http://www.dpw.wur.nl/biosys/AFLSIM_UK.html. The program can be used to test the significance of AFLP similarities in empirical data sets with scoring ranges between 34 and 1024 bp (related to the limits of the Arabidopsis AFLP FLD). The minimum number of AFLP bands per genotype should be 1, and the maximum equals half the number of band positions available within the employed scoring range. Band lengths should be input as discrete (i.e., integer) values. As an example, Figure 3 and Table 1 show results for the widely employed scoring range 50–500 bp and an AFLP procedure with A/C selective nucleotides. Figure 3 shows the relationship between the number of bands scored in each of two genotypes and the expected number of bands shared. Table 1 gives an overview of the test statistics. The expected (dis)similarities in the table indicate the level of (dis)similarity expected in unrelated genotypes. Pairwise (dis)similarities exceeding the critical values indicate significant phenetic or genetic similarity.
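The Monte Carlo idea behind the distribution P can be sketched as follows. This is not the AFLSIM implementation and it makes several simplifying assumptions: the FLD is a placeholder (uniform over 50 to 500 bp) because the smoothed Arabidopsis FLD is not reproduced here; fragments are drawn one at a time until the required number of bands is reached, which is simpler than first drawing a fragment count from a uniform distribution; and far fewer replications are used than the 1,000,000 reported. All function names are hypothetical.

```python
import random
from itertools import accumulate


def simulate_pattern(rng, positions, cum_weights, n_bands):
    """Draw fragments from the FLD until n_bands distinct band positions exist.

    Returns the set of band positions and the number of fragments drawn;
    the difference between the two reflects size homoplasy.
    """
    bands = set()
    n_fragments = 0
    while len(bands) < n_bands:
        bands.add(rng.choices(positions, cum_weights=cum_weights)[0])
        n_fragments += 1
    return bands, n_fragments


def shared_band_distribution(fld, n1, n2, reps=2000, seed=1):
    """Estimate P(number of bands shared by chance) for band counts n1, n2."""
    rng = random.Random(seed)
    positions = list(fld)
    cum_weights = list(accumulate(fld.values()))
    tally = {}
    for _ in range(reps):
        bands1, _ = simulate_pattern(rng, positions, cum_weights, n1)
        bands2, _ = simulate_pattern(rng, positions, cum_weights, n2)
        shared = len(bands1 & bands2)
        tally[shared] = tally.get(shared, 0) + 1
    return {k: v / reps for k, v in sorted(tally.items())}


def expected_value(dist):
    return sum(k * p for k, p in dist.items())


# Placeholder FLD (assumption): uniform over 451 positions, 50-500 bp.
uniform_fld = {length: 1 / 451 for length in range(50, 501)}
dist = shared_band_distribution(uniform_fld, 50, 60)
mean_shared = expected_value(dist)
```

With a uniform FLD the expected number of shared bands is simply 50 × 60 / 451, which gives a quick sanity check of the simulation. A skewed FLD such as the Arabidopsis one would be passed in the same way, as a mapping from fragment length to probability.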
For the calculations in Table 1, we assumed that all band positions available in the scoring range were present in the data set. As a result, a relatively large proportion of the band positions showed 0/0 matches (i.e., no band present in either of the genotypes compared). Because 0/0 matches are counted as similarity in the simple matching coefficient, this causes a relatively high minimum simple matching value (Table 1, bottom, column 10). The number of 0/0 matches does not influence the Dice, Nei and Li, and Jaccard similarity. Consequently, the theoretical minimum value of these coefficients is always 0, regardless of the number of 0/0 matches in the data set.
The maximum possible (dis)similarity values (given the observed band numbers; see Table 1) illustrate an often overlooked peculiarity of Dice, Jaccard, Nei and Li, and simple matching pairwise (dis)similarities: they can be unity (or 0 in the case of Nei and Li distance) only when AFLP band numbers in both genotypes are identical. Table 1 shows that the maximum possible similarity rapidly decreases with increasing difference in band number between genotypes. Comparison with the critical values corresponding to the unequal band numbers shows that such (dis)similarities, although low, may still be significant.
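The dependence of the maximum similarity on band numbers follows directly from the coefficient formulas: at most min(n1, n2) bands can be shared, so the remaining bands are necessarily unique to one genotype. A small Python sketch makes this concrete; the function name is hypothetical and the band numbers are example values only.

```python
def max_similarities(n1, n2, n_positions):
    """Upper bounds on Dice, Jaccard and simple matching for band counts n1, n2.

    The bound is reached when every band of the genotype with fewer bands
    is shared with the other genotype.
    """
    a = min(n1, n2)               # largest possible number of shared bands
    b = n1 - a                    # bands left unique to genotype 1
    c = n2 - a                    # bands left unique to genotype 2
    d = n_positions - a - b - c   # positions empty in both genotypes
    dice_max = 2 * a / (2 * a + b + c)
    jaccard_max = a / (a + b + c)
    simple_matching_max = (a + d) / n_positions
    return dice_max, jaccard_max, simple_matching_max


# Example values: 50 vs. 60 bands in a 50-500 bp scoring range (451 positions).
unequal = max_similarities(50, 60, 451)
equal = max_similarities(60, 60, 451)
```

In this example the maximum Dice similarity is 100/110 and the maximum Jaccard similarity is 50/60, whereas equal band numbers allow all three coefficients to reach 1.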
Nonidentical AFLP fragments comigrating within genotypes:
When simulating band patterns for the probability distribution P, we were surprised by the high amount of size homoplasy. The number of bands containing multiple fragments was much higher than we intuitively anticipated. However, the phenomenon that a co-occurrence of events (in this case the appearance of two AFLP fragments of equal length) is more likely than intuitively expected is well known in statistics and commonly referred to as the birthday paradox. The paradox is often summarized as follows: in a group of only 23 persons, the probability of at least one coinciding birthday, assuming uniformly distributed birthdays over all 365 days of the year, is already >0.5.

Translated to AFLP patterns for a scoring range of, e.g., 50–500 bp (451 positions), this means that only 26 fragments are needed to have a probability >0.5 that at least one AFLP band contains multiple fragments. In reality, however, the probability distribution of fragment lengths is highly skewed instead of uniform (Figure 2), rendering even higher probabilities of fragments with identical lengths (MUNFORD 1977).
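The birthday calculation for the uniform case can be reproduced with a few lines of Python. The sketch assumes equally likely band positions, as in the classical paradox; the function names are hypothetical.

```python
def prob_shared_position(n_fragments, n_positions):
    """P(at least two of n_fragments share a position), uniform positions."""
    p_all_distinct = 1.0
    for k in range(n_fragments):
        p_all_distinct *= (n_positions - k) / n_positions
    return 1.0 - p_all_distinct


def min_fragments_for_collision(n_positions, threshold=0.5):
    """Smallest number of fragments for which the probability exceeds threshold."""
    n = 1
    while prob_shared_position(n, n_positions) <= threshold:
        n += 1
    return n


birthdays = min_fragments_for_collision(365)   # classical birthday paradox
aflp_bands = min_fragments_for_collision(451)  # 50-500 bp scoring range
```

Under the uniform assumption this returns 23 for 365 positions and 26 for 451 positions, matching the figures quoted in the text. A skewed FLD would lower the number of fragments needed.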
Analogous to the situation for nonidentical AFLP bands comigrating across genotypes, the number of nonidentical AFLP fragments comigrating within a genotype (i.e., the amount of size homoplasy) depends on the number of bands scored, the number of discrete band positions available within the scoring range, and the AFLP FLD. Table 2 illustrates the size homoplasy for a wide series of scoring ranges and band numbers. The table shows that the amount of size homoplasy increases with increasing numbers of bands and with decreasing scoring range. In empirical data sets, the occurrence of multiple fragments in AFLP bands has already been demonstrated for Beta and G. max (HANSEN et al. 1999; MEKSEM et al. 2001).
Weighted similarity coefficients including band position information:
In the previous sections, a procedure was developed to test the significance of AFLP-based similarities. The procedure can be used to test similarities that were calculated according to various well-known similarity coefficients. The relationship between band length and band presence is incorporated in the tests using the Arabidopsis AFLP FLD. However, this relationship is not accounted for in the similarity coefficients themselves, since all bands are equally weighted in the existing coefficients.

To make the existing similarity coefficients more informative, we propose an adjustment of these coefficients by weighting the bands with the inverse probabilities of their occurrence in an AFLP profile. The rationale behind this is that long bands have a smaller probability of occurring than short bands do, and therefore they have a larger probability of contributing reliable information to a data set. Consequently, long bands should contribute more to the overall similarity values. A proper weighting scheme can be derived from the Arabidopsis AFLP FLD. In the section on Arabidopsis in silico AFLP FLDs, we demonstrated that the Arabidopsis AFLP FLD is a reliable basis for describing the probabilities of occurrence of AFLP fragments and hence of AFLP bands. Therefore, the inverse probabilities from the Arabidopsis AFLP FLD are the logical basis for constructing weighted similarity coefficients.
The weighted coefficients are constructed in two steps, analogous to the construction of the unweighted coefficients. In the first step, weighted similarities are calculated for numbers of bands shared between two genotypes ($a_w$), for numbers of bands unique to one of the genotypes ($b_w$ and $c_w$), and for band positions that are not occupied in either of the genotypes ($d_w$). Again, for a given pair of genotypes, let $x_i = 0$ when no AFLP band is present at position $i$ in genotype 1, and $x_i = 1$ when an AFLP band is present at position $i$ in genotype 1. Likewise, $y_i = 0$ or 1 for genotype 2. For a scoring range $1, \ldots, N$, let $s_i = 1$ when a certain band position is scored in a data set, and $s_i = 0$ when a band position is not scored. Then, $a_w = \sum_{i=1}^{N} s_i x_i y_i w_{a,i}$, $b_w = \sum_{i=1}^{N} s_i x_i (1 - y_i) w_{b,i}$, $c_w = \sum_{i=1}^{N} s_i (1 - x_i) y_i w_{c,i}$, and $d_w = \sum_{i=1}^{N} s_i (1 - x_i)(1 - y_i) w_{d,i}$; with inverse weights $w_{a,i} = 1/(p_i q_i)$, $w_{b,i} = 1/[p_i (1 - q_i)]$, $w_{c,i} = 1/[(1 - p_i) q_i]$, and $w_{d,i} = 1/[(1 - p_i)(1 - q_i)]$; with $p_i$ the probability that genotype 1 has a band at position $i$; and $q_i$ the probability that genotype 2 has a band at position $i$. The band probabilities are derived from the fragment probabilities in the Arabidopsis in silico AFLP FLD according to $p_i = 1 - [1 - p(\text{fragment at } i)]^{N_1}$, and $q_i = 1 - [1 - p(\text{fragment at } i)]^{N_2}$, where $N_1$ and $N_2$ are the total numbers of fragments in the scoring range in genotypes 1 and 2, respectively. The number of fragments $N$ for each genotype depends on the scoring range, the total number of bands within the scoring range, and the fragment length distribution and was determined by Monte Carlo simulation as described in Significance tests for pairwise AFLP band similarities. In the second step, weighted similarity coefficients are calculated according to: weighted Dice = $2a_w/(2a_w + b_w + c_w)$, weighted Jaccard = $a_w/(a_w + b_w + c_w)$, and weighted simple matching = $(a_w + d_w)/(a_w + b_w + c_w + d_w)$. Weighted Nei and Li = (1 − weighted Dice).
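A minimal Python sketch of the weighting scheme follows. It rests on a stated assumption: each band position is weighted by the inverse probability of the observed presence/absence combination, i.e., 1/(p q) for a shared band, 1/(p (1 − q)) and 1/((1 − p) q) for bands unique to genotype 1 or 2, and 1/((1 − p)(1 − q)) for a position empty in both. The function names, the fragment probabilities, and the fragment numbers in the example are hypothetical, not values from the Arabidopsis FLD.

```python
def band_probability(fragment_prob, n_fragments):
    """P(band at a position) = 1 - (1 - P(fragment at position)) ** N."""
    return 1.0 - (1.0 - fragment_prob) ** n_fragments


def weighted_counts(x, y, fragment_probs, n1, n2, s=None):
    """Weighted a, b, c, d for 0/1 band vectors x and y.

    fragment_probs[i] must lie strictly between 0 and 1, otherwise a weight
    would require division by zero.
    """
    if s is None:
        s = [1] * len(x)  # assume every position is scored
    aw = bw = cw = dw = 0.0
    for xi, yi, si, fi in zip(x, y, s, fragment_probs):
        if not si:
            continue
        p = band_probability(fi, n1)
        q = band_probability(fi, n2)
        if xi and yi:
            aw += 1.0 / (p * q)
        elif xi:
            bw += 1.0 / (p * (1.0 - q))
        elif yi:
            cw += 1.0 / ((1.0 - p) * q)
        else:
            dw += 1.0 / ((1.0 - p) * (1.0 - q))
    return aw, bw, cw, dw


def weighted_dice(aw, bw, cw, dw):
    return 2 * aw / (2 * aw + bw + cw)


# Made-up example: four positions ordered from short (common) to long (rare).
probs = [0.4, 0.3, 0.2, 0.1]
x = [1, 1, 0, 1]
y = [1, 0, 1, 1]
wd = weighted_dice(*weighted_counts(x, y, probs, n1=3, n2=3))
```

In this sketch a shared band at the rare (long) position contributes more to $a_w$ than a shared band at the common (short) position, which is the behavior the rationale above calls for.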
The Arabidopsis sequence as a model system:
The test statistics in this study are based on in silico AFLP FLDs from the Arabidopsis genome sequence. This sequence is generally considered to be representative of the genome of an angiosperm species (e.g., ARABIDOPSIS GENOME INITIATIVE 2000; BARNES 2002), and therefore the test statistics based on the Arabidopsis genome sequence should be valid for angiosperms in general.
A limitation of the Arabidopsis sequence is that a significant part is still missing. According to the ARABIDOPSIS GENOME INITIATIVE (2000), approximately 8.5% of the genome has not yet been aligned (approximately 10 of an estimated 125 Mb). This 8.5% mainly consists of repeat sequences in centromeric and rDNA regions. Genetic mapping studies in Arabidopsis (e.g., ALONSO-BLANCO et al. 1998) showed a clustering of AFLP fragments around the centromeres, which could indicate that the actual percentage of AFLP fragments missing from the Arabidopsis AFLP FLD is much higher than the 8.5% of missing sequence. In a recent study, however, PETERS et al. (2001) found that Arabidopsis SacI/MseI in silico AFLP fragments do not cluster around the centromeres, but are evenly dispersed over the genome. They argued that the apparent overrepresentation of AFLP fragments in genetic mapping studies must originate in a higher mutation frequency in the (peri)centromeric regions rather than in an actual overrepresentation of AFLP fragments. Assuming that the findings of PETERS et al. (2001) are representative for AFLP fragments in general, the missing 8.5% of repeat regions in the Arabidopsis genome sequence corresponds to 8.5% of missing AFLP fragments in the Arabidopsis AFLP FLD. These missing regions contain mainly repeat sequences. Estimating the influence of the missing repeats on the Arabidopsis AFLP FLD is highly speculative, but one could argue that their influence on the significance tests may be only limited. Given the fact that the average size of the individual repeat units is relatively small, the size of AFLP fragments resulting from restriction sites in the repeat regions will also be small. The possible underrepresentation of small fragments will mainly influence the lower part of the Arabidopsis AFLP FLD. In most AFLP studies, these smaller fragments are discarded. Consequently, they do not influence the results.
Specific features of the Arabidopsis genome that may limit its general applicability as a model system for angiosperms are its small size (120 Mb) and its relatively low G + C content (36%). We examined the representativity of the Arabidopsis sequence using sequences of Oryza sativa L. Apart from sequences of Arabidopsis, sequences of O. sativa L. subspecies indica (YU et al. 2002) and japonica (FENG et al. 2002; GOFF et al. 2002; SASAKI et al. 2002) are the only complete angiosperm sequences presently available. However, at the time of our study the O. sativa sequences were still very fragmented. We used sequences from chromosomes 3 (43.4% G + C) and 10 (43.6% G + C) of O. sativa subsp. japonica (hereafter, rice), covering nearly complete chromosomes contained in a limited number of BAC assemblies. Sequence data were obtained from the web site of The Institute for Genomic Research at http://www.tigr.org. To generate the rice FLD, we performed the in silico AFLP as described for Arabidopsis, without selective nucleotides. Vector sequences and sequences of suspect origin were removed from the BAC assemblies prior to in silico AFLP, using the National Center for Biotechnology Information VecScreen web tool. The probability distribution of the AFLP fragment lengths was estimated by fitting a cubic smoothing spline as before. The smoothing spline and the relative frequency distribution of the rice in silico AFLP fragments are depicted in Figure 4. Fragment sizes range from 32 to 1024 bp.
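The in silico AFLP step described above can be sketched as follows. This is a minimal illustration and not the authors' program: it assumes EcoRI (GAATTC) and MseI (TTAA) as the enzyme pair, keeps only fragments flanked by one site of each enzyme, uses no selective nucleotides, and measures length as the distance between the starts of the two recognition sites (adapter and primer lengths are ignored). The function names are hypothetical.

```python
import re

ECORI = "GAATTC"  # rare cutter recognition site
MSEI = "TTAA"     # frequent cutter recognition site


def find_sites(seq, site):
    """Return start positions of every occurrence of site, overlaps included."""
    return [m.start() for m in re.finditer("(?=" + site + ")", seq)]


def in_silico_aflp(seq):
    """Return lengths of fragments bounded by one EcoRI and one MseI site.

    Assumption: length is the distance between the starts of two adjacent
    recognition sites; real fragment lengths also depend on the exact cut
    positions and on the adapters ligated to the fragment ends.
    """
    seq = seq.upper()
    sites = [(pos, "E") for pos in find_sites(seq, ECORI)]
    sites += [(pos, "M") for pos in find_sites(seq, MSEI)]
    sites.sort()
    lengths = []
    for (pos1, enz1), (pos2, enz2) in zip(sites, sites[1:]):
        if enz1 != enz2:  # MseI-MseI and EcoRI-EcoRI fragments are skipped
            lengths.append(pos2 - pos1)
    return lengths


# Toy sequence with sites at positions 0 (EcoRI), 56 and 90 (MseI), 114 (EcoRI).
toy = "GAATTC" + "G" * 50 + "TTAA" + "C" * 30 + "TTAA" + "G" * 20 + "GAATTC"
fragment_lengths = in_silico_aflp(toy)
```

Tallying the returned lengths per base pair gives a relative frequency distribution of the kind to which a smoothing spline could then be fitted.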
The Arabidopsis FLD without selective nucleotides is included as a reference. A two-sample Kolmogorov-Smirnov test showed that the rice FLD differs significantly from the Arabidopsis FLDs with A/C, T/A, or without selective nucleotides (P < 0.0001), but not from that with C/G selective nucleotides (P = 0.09). The most obvious reason for the difference is the high G + C content of the rice sequences relative to those of Arabidopsis. As predicted by the theoretical model of INNAN et al. (1999), the higher G + C content in rice yields a more even FLD. Additionally, there may be other genome differences between rice and Arabidopsis that influence the AFLP FLD. Most notably, these could be differences related to the evolutionarily distinct position of Poaceae within the angiosperms (e.g., MONTERO et al. 1990; DEVOS et al. 1999; FREELING 2001). However, the influence of these additional factors cannot be studied separately from that of G + C content until more evolutionarily distinct genome sequences with similar nucleotide compositions become available.
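For readers who want to repeat this kind of comparison on their own fragment lengths, the two-sample Kolmogorov-Smirnov statistic can be computed as in the sketch below. It is a plain-Python illustration under stated assumptions: the inputs are two lists of fragment lengths, and the decision uses the standard large-sample critical value rather than an exact P value. The function names are hypothetical, and the article does not state which software the authors used.

```python
from bisect import bisect_right
import math


def ks_statistic(x, y):
    """Largest absolute difference between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = 0.0
    for value in set(xs) | set(ys):
        fx = bisect_right(xs, value) / n  # empirical CDF of x at value
        fy = bisect_right(ys, value) / m  # empirical CDF of y at value
        d = max(d, abs(fx - fy))
    return d


def ks_critical(n, m, alpha=0.05):
    """Large-sample critical value for the two-sample KS statistic."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c_alpha * math.sqrt((n + m) / (n * m))


def distributions_differ(x, y, alpha=0.05):
    """True if the KS statistic exceeds the approximate critical value."""
    return ks_statistic(x, y) > ks_critical(len(x), len(y), alpha)
```

The large-sample approximation is reasonable for in silico data sets containing thousands of fragments; for small samples an exact test would be preferable.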
Comparison of the test statistics for Arabidopsis and rice in the scoring range 50–500 bp (supplemental Table 3, available at http://www.dpw.wur.nl/biosys/AFLSIM_UK.html) showed that the expected number of nonidentical bands comigrating across genotypes is on average 10% lower for rice. Although the numbers are in the same order of magnitude, the difference between Arabidopsis and rice illustrates the need for more than one model species. Given that Arabidopsis and rice cover most of the G + C range for angiosperms, together they probably suffice as model species for the angiosperms in general. Therefore, we propose that the test statistics based on the Arabidopsis sequence be considered generally applicable for angiosperms with G + C contents between 35 and 40%, and tests based on the rice sequence be considered generally applicable for angiosperms with G + C contents between 40 and 50%. For angiosperms with unknown G + C content, the test statistics for the Arabidopsis genome can be applied as a conservative test. Test statistics based on a more complete rice genome sequence will be developed at a later stage.
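The proposed rule for choosing between the two sets of test statistics can be written down as a small helper. This is only a sketch: the function names are hypothetical, assigning exactly 40% to Arabidopsis is an assumption (the text gives both ranges as ending or starting at 40%), and values outside 35 to 50% return None because the text proposes no model for them.

```python
def gc_percent(seq):
    """Percentage of G and C among the unambiguous bases of a DNA sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def choose_model(gc=None):
    """Return the model species whose test statistics apply.

    gc is the genomic G + C content in percent, or None if unknown.
    """
    if gc is None:
        return "Arabidopsis"  # conservative choice for unknown G + C content
    if 35.0 <= gc <= 40.0:    # assumption: the 40% boundary goes to Arabidopsis
        return "Arabidopsis"
    if 40.0 < gc <= 50.0:
        return "rice"
    return None               # outside the range covered by the two models
```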
The discrepancy between the theoretical and the in silico distribution may be explained by two assumptions made by INNAN et al. (1999). The first is that of a random nucleotide sequence under the JUKES and CANTOR (1969) model. In actual genomes the nucleotides are not randomly distributed, but organized in distinct patterns of dinucleotides and oligonucleotides (NUSSINOV 1981, 1991). At a larger scale, the genome is organized in isochores, showing large blocks of G + C-rich sequences alternating with large blocks of more A + T-rich sequences (SALINAS et al. 1988; MATASSI et al. 1989; MONTERO et al. 1990). Moreover, the Jukes and Cantor model assumes equal base frequencies and equal substitution probabilities among all nucleotides, while in reality base frequencies are unequal and substitution rates vary. The second assumption that may explain the deviation between the theoretical and the in silico distribution is that of nucleotide changes as the sole cause of changes in DNA sequence. Under this second assumption, processes such as insertions and deletions are ignored. Obviously, this is a simplification of the dynamics in actual genomes, as was already noted by INNAN et al. (1999). Both assumptions introduce restrictions in the model of INNAN et al. (1999) that may be too limiting to allow for an adequate description of an AFLP FLD.
Our analysis of the Arabidopsis in silico AFLP FLD demonstrated that the type of selective nucleotides influences the shape of the distribution. Use of only G + C nucleotides favors the selection of long fragments over short ones, yielding a relatively even distribution of fragments over length classes. Use of only A + T nucleotides favors the selection of short fragments over long ones, giving a more asymmetrical distribution. The effect probably results from the isochore structure of the genome in combination with the nucleotide composition of the recognition sites of the restriction enzymes. The enzymes employed in this study are a frequent cutter (MseI) and a rare cutter (EcoRI). Because MseI cuts are much more frequent than EcoRI cuts, the average AFLP fragment size will be determined mainly by the frequency of MseI cuts. The restriction site of MseI contains no G + C nucleotides, and therefore this enzyme will preferentially cut in A + T-rich isochores. Given this preference, and the fact that fragment size is inversely proportional to the frequency of cuts, AFLP fragments resulting from A + T-rich isochores will on average be smaller than fragments resulting from other parts of the genome. Because these fragments originate in A + T-rich stretches of the genome, the fragments themselves will contain relatively high proportions of A + T nucleotides. Conversely, fragments resulting from G + C-rich isochores will on average be longer and contain relatively high proportions of G + C nucleotides (the relation between fraction G + C and fragment length in the Arabidopsis in silico AFLP data is approximately G + C = 0.34379 + 0.00012036 × length). Using T/A selective nucleotides in the AFLP procedure will favor the shorter A + T-rich sequences over the longer G + C-rich sequences, yielding an asymmetric AFLP FLD with mainly short sequences.
Using C/G selective nucleotides will favor G + C-rich sequences, yielding a more even distribution of AFLP fragments over length classes. The FLD resulting from an AFLP procedure with A/C selective nucleotides did not differ significantly from the FLD generated without selective nucleotides, illustrating that the selective nucleotides effect is avoided when mixed A + T/G + C selective nucleotides are used.
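The isochore argument can be made concrete with a small calculation. The sketch below combines the linear relation quoted above with a deliberately simple model of MseI cutting: bases are treated as independent, with A = T and G = C frequencies, which is an assumption made only for illustration (actual genomes do not have randomly distributed nucleotides). The function names are hypothetical.

```python
def expected_gc_fraction(length_bp):
    """G + C fraction predicted by the linear relation quoted in the text."""
    return 0.34379 + 0.00012036 * length_bp


def msei_site_probability(gc_fraction):
    """Probability that TTAA starts at a given position.

    Assumption: independent bases with freq(A) = freq(T) = (1 - GC) / 2.
    """
    at_each = (1.0 - gc_fraction) / 2.0
    return at_each ** 4


def mean_msei_spacing(gc_fraction):
    """Approximate mean distance in bp between MseI sites under the same model."""
    return 1.0 / msei_site_probability(gc_fraction)


# Compare an A + T-rich stretch with a compositionally balanced one.
spacing_at_rich = mean_msei_spacing(0.36)
spacing_balanced = mean_msei_spacing(0.50)
```

Under this model the expected spacing between MseI sites grows from roughly 95 bp at 36% G + C to 256 bp at 50% G + C, which is the direction of the effect described above: A + T-rich regions are cut more often and so yield shorter fragments.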
On the basis of the Arabidopsis in silico AFLP FLDs, the numbers of nonidentical bands comigrating across genotypes were calculated as a basis for significance tests for AFLP similarities. Table 1 shows that the proportion of nonidentical bands comigrating across genotypes increases with the number of bands scored per genotype. When 10 bands are scored in each genotype and A/C selective nucleotides are used, the proportion of comigrating nonidentical bands is 4%. For 30 bands, this proportion is 12%, for 60 bands it is 22%, for 90 bands it is 31%, and for 120 bands it is 40%. The increase results from the fact that the probability for nonidentical AFLP fragments to comigrate at the same position increases with increasing numbers of total fragments. Relative to the proportion of comigrating nonidentical bands for A/C nucleotides, the proportions for T/A selective nucleotides are somewhat higher (4, 13, 24, 33, and 42%), while the proportions for C/G nucleotides are somewhat lower (4, 10, 20, 29, and 37%). However, all are in the same order of magnitude. The differences for the various combinations of selective nucleotides probably result from selection bias due to the isochore structure of the genome and the use of different types of selective nucleotides, as discussed before.
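Why the proportion rises with the number of bands, and why a skewed FLD gives more chance comigration than an even one, can be illustrated with a simplified occupancy model. This sketch is not the calculation behind Table 1: it assumes that each genotype's fragments are independent draws from the length distribution, that several fragments at one position are scored as a single band, and, in the example, a uniform distribution over 451 positions (50 to 500 bp) chosen purely for illustration. The function name is hypothetical.

```python
def expected_comigrating(p, n_fragments):
    """Expected number of positions where two unrelated genotypes both show a band.

    p           -- probabilities per length class, summing to 1
    n_fragments -- fragments drawn independently per genotype (assumption)
    """
    total = 0.0
    for p_i in p:
        band_present = 1.0 - (1.0 - p_i) ** n_fragments  # P(band at this position)
        total += band_present ** 2  # the two genotypes are independent
    return total


# Uniform FLD over the scoring range 50-500 bp (illustrative assumption).
uniform = [1.0 / 451] * 451
shared_10 = expected_comigrating(uniform, 10)
shared_100 = expected_comigrating(uniform, 100)

# A skewed distribution concentrates fragments and raises the expectation.
skewed_small = expected_comigrating([0.7, 0.1, 0.1, 0.1], 2)
even_small = expected_comigrating([0.25] * 4, 2)
```

Because the expected number of shared positions grows roughly with the square of the number of fragments, the proportion of comigrating bands per genotype grows roughly linearly at first, in line with the trend in Table 1.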
The high numbers of nonidentical comigrating bands apparent from Table 1 and supplemental Table 3 illustrate that overestimation of phenetic or genetic similarities based on AFLP band patterns is a serious problem when 50–100 bands per genotype are scored, as recommended by VOS et al. (1995). However, even for lower numbers of bands per genotype, a considerable percentage of comigrating bands are nonidentical. Therefore, overestimation of similarities based on AFLP band patterns cannot be completely ruled out by limiting the number of bands within a scoring range. However, the influence of the overestimation on the final analyses can be diminished by using corrected similarities, or weighted similarities, or by removing from the data sets those genotypes without any significant similarity to other genotypes. This article provides the procedures that enable this, all of which are available in the program AFLSIM. The procedures can be applied in, e.g., genetic diversity studies or phylogenetic studies, which often include less-related genotypes as reference groups. For any genotype to be useful as a reference, at least some genetic similarity with the group under study is required. In many genetic diversity studies, however, the genetic similarities between the groups under study and the reference group are below the 95% critical values indicated in our tests. Such similarities, usually on the order of 0.15 or 0.20, are mistakenly taken to indicate a proper level of similarity for a reference group. To select a proper reference group, pairwise similarities between genotypes in the reference group and in the group under study should be tested, and at least some similarities between genotypes of both groups should be significant. Reference genotypes without significant similarity to the group under study should be discarded prior to further analysis.
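The screening of reference genotypes recommended here can be sketched in a few lines. This is an illustration and not AFLSIM: genotypes are represented as sets of scored band positions, similarity is the Dice coefficient, and critical_value is a hypothetical single threshold supplied by the user. In practice the critical value depends on the number of bands scored and on the selective nucleotides used, so it should be taken from the appropriate table or program output.

```python
def dice(a, b):
    """Dice similarity between two sets of band positions."""
    total = len(a) + len(b)
    return 2.0 * len(a & b) / total if total else 0.0


def jaccard(a, b):
    """Jaccard similarity between two sets of band positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def screen_references(study, references, critical_value):
    """Keep reference genotypes with at least one significant similarity.

    study, references -- dicts mapping genotype name to a set of band positions
    critical_value    -- hypothetical threshold; real values depend on band numbers
    """
    kept = {}
    for name, bands in references.items():
        if any(dice(bands, other) > critical_value for other in study.values()):
            kept[name] = bands
    return kept


study_group = {"s1": {100, 120, 150, 200}}
reference_group = {"r1": {100, 120, 150, 300}, "r2": {400, 410, 420, 430}}
usable = screen_references(study_group, reference_group, critical_value=0.3)
```

In this toy example only r1 is retained, because r2 shares no bands with the study group and so falls below any positive threshold.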
By enabling the detection of unrelated genotypes and by the use of corrected and weighted similarity values, application of the procedures proposed in this article will make the analysis of AFLP data sets more informative and more reliable.
ALONSO-BLANCO, C., A. J. M. PEETERS, M. KOORNNEEF, C. LISTER, C. DEAN et al., 1998 Development of an AFLP based linkage map of Ler, Col and Cvi Arabidopsis thaliana ecotypes and construction of a Ler/Cvi recombinant inbred line population. Plant J. 14: 259–271.
ARABIDOPSIS GENOME INITIATIVE, 2000 Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796–815.
BARNES, S., 2002 Comparing Arabidopsis to other flowering plants. Curr. Opin. Plant Biol. 4: 16.
BAROW, M., and A. MEISTER, 2002 Lack of correlation between AT frequency and genome size in higher plants and the effect of nonrandomness of base sequences on dye binding. Cytometry 47: 1–7.
DEVOS, K. M., J. BEALES, Y. NAGAMURA and T. SASAKI, 1999 Arabidopsis-rice: Will colinearity allow gene prediction across the eudicot-monocot divide? Genome Res. 9: 825–829.
DICE, L. R., 1945 Measures of the amount of ecological association between species. Ecology 26: 297–302.
EL-RABEY, H. A., A. BADR, R. SCHAFER-PREGL, W. MARTIN and F. SALAMINI, 2002 Speciation and species separation in Hordeum L. (Poaceae) resolved by discontinuous molecular markers. Plant Biol. 4: 567–575.
FENG, Q., Y. ZHANG, P. HAO, S. WANG, G. FU et al., 2002 Sequence and analysis of rice chromosome 4. Nature 420: 316–320.
FREELING, M., 2001 Grasses as a single genetic system. Reassessment 2001. Plant Physiol. 125: 1191–1197.
GOFF, S. A., D. RICKE, T. H. LAN, G. PRESTING, R. WANG et al., 2002 A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296: 92–100.
HANSEN, M., T. KRAFT, M. CHRISTIANSSON and N.-O. NILSSON, 1999 Evaluation of AFLP in Beta. Theor. Appl. Genet. 98: 845–852.
INNAN, H., R. TERAUCHI, G. KAHL and F. TAJIMA, 1999 A method for estimating nucleotide diversity from AFLP data. Genetics 151: 1157–1164.
JACCARD, P., 1908 Nouvelles recherches sur la distribution florale. Bull. Soc. Vaudoise Sci. Nat. 44: 223–270.
JUKES, T. H., and C. R. CANTOR, 1969 Evolution of protein molecules, pp. 21–132 in Mammalian Protein Metabolism, edited by H. N. MUNRO. Academic Press, New York.
KARP, A., O. SEBERG and M. BUIATTI, 1996 Molecular techniques in the assessment of botanical diversity. Ann. Bot. 78: 143–149.
MARIE, D., and S. C. BROWN, 1993 A cytometric exercise in plant DNA histograms, with 2C values for 70 species. Biol. Cell 78: 41–51.
MATASSI, G., L. M. MONTERO, J. SALINAS and G. BERNARDI, 1989 The isochore organization and the compositional distribution of homologous coding sequences in the nuclear genome of plants. Nucleic Acids Res. 17: 5273–5290.
MEKSEM, K., E. RUBEN, D. HYTEN, K. TRIWITAYAKORN and D. A. LIGHTFOOT, 2001 Conversion of AFLP bands into high-throughput DNA markers. Mol. Genet. Genomics 265: 207–214.
MONTERO, L. M., J. SALINAS, G. MATASSI and G. BERNARDI, 1990 Gene distribution and isochore organization in the nuclear genome of plants. Nucleic Acids Res. 18: 1859–1867.
MUELLER, U. G., and L. LAREESA WOLFENBARGER, 1999 AFLP genotyping and fingerprinting. Trends Ecol. Evol. 14: 389–394.
MUNFORD, A. G., 1977 A note on the uniformity assumption in the birthday problem. Am. Stat. 31: 119.
NAGL, W., and B. STEIN, 1989 DNA characterization in host-specific Viscum album subspecies (Viscaceae). Plant Syst. Evol. 166: 243–248.
NEI, M., and W.-H. LI, 1979 Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc. Natl. Acad. Sci. USA 76: 5269–5273.
NUSSINOV, R., 1981 Nearest neighbor nucleotide patterns. J. Biol. Chem. 256: 8458–8462.
NUSSINOV, R., 1991 Compositional variations in DNA sequences. Comput. Appl. Biosci. 7: 287–293.
O'HANLON, P. C., and R. PEAKALL, 2000 A simple method for the detection of size homoplasy among amplified fragment length polymorphism fragments. Mol. Ecol. 9: 815–816.
PETERS, J. L., H. CONSTANDT, P. NEYT, G. CNOPS, J. ZETHOF et al., 2001 A physical amplified fragment-length polymorphism map of Arabidopsis. Plant Physiol. 127: 1579–1589.
ROHLF, F. J., 1993 NTSYS-pc, Numerical Taxonomy and Multivariate Analysis System. Exeter Software, Setauket, NY.
ROUPPE VAN DER VOORT, J. N. A. M., P. VAN ZANDVOORT, H. J. VAN ECK, R. T. FOLKERTSMA, R. C. B. HUTTEN et al., 1997 Use of allele specificity of comigrating AFLP markers to align genetic maps from different potato genotypes. Mol. Gen. Genet. 255: 438–447.
SALINAS, J., G. MATASSI, L. M. MONTERO and G. BERNARDI, 1988 Compositional compartmentalization and compositional patterns in the nuclear genomes of plants. Nucleic Acids Res. 16: 4269–4285.
SASAKI, T., T. MATSUMOTO, K. YAMAMOTO, K. SAKATA, T. BABA et al., 2002 The genome sequence and structure of rice chromosome 1. Nature 420: 312–316.
SOKAL, R. R., and P. H. A. SNEATH, 1963 Principles of Numerical Taxonomy. W. H. Freeman, San Francisco/London.
VEKEMANS, X., T. BEAUWENS, M. LEMAIRE and I. ROLDAN-RUIZ, 2002 Data from amplified fragment length polymorphism (AFLP) markers show indication of size homoplasy and of a relationship between degree of homoplasy and fragment size. Mol. Ecol. 11: 139–151.
VOS, P., R. HOGERS, M. BLEEKER, M. REIJANS, T. VAN DE LEE et al., 1995 AFLP: a new technique for DNA fingerprinting. Nucleic Acids Res. 23: 4407–4414.
YU, J., S. N. HU, J. WANG, K. S. G. WONG, S. G. LI et al., 2002 A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296: 79–92.





