## Abstract

Many AFLP studies include relatively unrelated genotypes that contribute noise to data sets instead of signal. We developed: (1) estimates of expected AFLP similarities between unrelated genotypes, (2) significance tests for AFLP similarities, enabling the detection of unrelated genotypes, and (3) weighted similarity coefficients, including band position information. Detection of unrelated genotypes and use of weighted similarity coefficients will make the analysis of AFLP data sets more informative and more reliable. Test statistics and weighted coefficients were developed for total numbers of shared bands and for Dice, Jaccard, Nei and Li, and simple matching (dis)similarity coefficients. Theoretical and *in silico* AFLP fragment length distributions (FLDs) were examined as a basis for the tests. The *in silico* AFLP FLD based on the *Arabidopsis thaliana* genome sequence was the most appropriate for angiosperms. The *G* + *C* content of the selective nucleotides in the *in silico* AFLP procedure significantly influenced the FLD. Therefore, separate test statistics were calculated for AFLP procedures with high, average, and low *G* + *C* contents in the selective nucleotides. The test statistics are generally applicable for angiosperms with a *G* + *C* content of ∼35–40%, but represent conservative estimates for genotypes with higher *G* + *C* contents. For the latter, test statistics based on a rice genome sequence are more appropriate.

AFLP is a DNA fingerprinting technique developed by Keygene N.V. (Vos *et al.* 1995). The technique consists of four steps: (1) digestion of DNA with two restriction enzymes, (2) ligation of double-stranded oligonucleotide adapters to the restriction fragments, (3) selective PCR amplification of the ligated fragments with specific PCR primers that have selective nucleotides at their 3′ end, and (4) separation of the amplified fragments on a denaturing polyacrylamide gel. On this gel, the fragments are separated by their length. Inclusion of a base-pair ladder enables determination of the exact length of each fragment.

In recent years, AFLPs have become a popular tool for relationship studies (Mueller and LaReesa Wolfenbarger 1999). In these studies, the AFLPs are scored as dominant anonymous markers. Dominant scoring of AFLPs means that each fragment is scored as either present or absent and that the fragments are assumed to occur independently of each other. Scoring as anonymous markers means that the fragments are recognized only by their length, while their sequence is unknown. Fragments of the same length, which are comigrating on a gel, are assumed to be identical. The fraction of fragments comigrating across genotypes, expressed in some way by a similarity or dissimilarity coefficient, is used as a measure for genetic or phenetic relationship. Various coefficients have been developed to quantify (dis)similarity, mainly differing in the weighting of comigrating relative to noncomigrating fragments (see, *e.g.*, Nei and Li 1979; Rohlf 1993).

The assumption that all comigrating fragments are identical is an oversimplification of the actual situation (Vekemans *et al.* 2002). In reality, a certain fraction of fragments will be comigrating by chance only, while having distinct sequences. Because these fragments will be scored as identical, their presence leads to an overestimation of the similarity among genotypes. The presence of nonidentical fragments comigrating across genotypes was demonstrated in actual data sets of *Solanum tuberosum* (Rouppe van der Voort *et al.* 1997), Carduinae thistles (O'Hanlon and Peakall 2000), and *Hordeum* species (El-Rabey *et al.* 2002). The presence of nonidentical fragments comigrating within genotypes was demonstrated in *Beta* (Hansen *et al.* 1999) and *Glycine max* (Meksem *et al.* 2001). The proportion of comigrating nonidentical fragments ranged from at least 10% within genotypes or among closely related genotypes (Rouppe van der Voort *et al.* 1997; Hansen *et al.* 1999; Meksem *et al.* 2001) to 100% for pairs of genotypes from more distantly related taxa (O'Hanlon and Peakall 2000). Given the proportions of comigrating nonidentical bands, a serious overestimation of pairwise similarities among genotypes can be expected. Indeed, Karp *et al.* (1996) noted that the occurrence of nonidentical comigrating AFLP fragments may pose serious problems for the application of AFLPs in relationship studies, but the issue was largely ignored in the literature thereafter.

In this study, we quantify the occurrence of nonidentical comigrating AFLP fragments for AFLP procedures with restriction enzymes *Eco*RI/*Mse*I. The estimates are used to (1) determine the expected numbers of comigrating nonidentical bands and (2) develop significance tests for AFLP similarities. As a basis for the significance tests we determine and evaluate theoretical AFLP fragment length distributions based on Innan *et al.* (1999) and *in silico* AFLP fragment length distributions (FLDs) based on the complete *Arabidopsis thaliana* (L.) Heynh. genome sequence (Arabidopsis Genome Initiative 2000). Using the *A. thaliana* (hereafter, Arabidopsis) FLD, we estimate the probability distribution of the number of nonidentical AFLP bands comigrating across genotypes. From this distribution, we determine expectations and 95 and 99% critical values for band numbers and (dis)similarity coefficients Dice, Jaccard, Nei and Li, and simple matching (Nei and Li 1979; Rohlf 1993). The critical values can be used to test the significance of a given pairwise similarity among angiosperm genotypes. If desired, genotypes that do not contribute significant relationship information can be removed from a data set. Determination of the expected numbers of comigrating nonidentical bands also yielded information on the underlying band length distribution probabilities. However, the usual similarities calculated using the Dice, Jaccard, Nei and Li, and simple matching coefficients ignore this information, assuming identical probabilities for all bands. As an alternative, we propose similarity coefficients that weight the AFLP bands according to their band length distribution probabilities. It is expected that the use of the significance tests and weighted similarities will make the analysis of AFLP data sets more informative and more reliable.

## METHODS AND RESULTS

### General strategy:

The number of nonidentical AFLP bands comigrating across genotypes depends on the number of bands scored for each genotype, the number of possible band lengths for the genotypes (*i.e.*, the number of discrete band positions possible within a selected scoring range), and the length distribution of the AFLP fragments. Note that one AFLP band may contain multiple fragments (discussed later). In empirical data sets, the number of possible band positions and the number of bands for each genotype are known; only the FLD remains to be determined. The distribution can be obtained in several ways, *e.g.*, (1) derived from AFLP band data in empirical data sets, (2) calculated using theoretical FLDs, and (3) determined *in silico*, if representative genome sequence data (preferably entire genomes) are available.

The use of empirical data involves the risk of introducing methodological error into the calculations resulting from the AFLP procedure itself. Such errors may include, *e.g.*, biases in fragment amplification or in scoring of bands. Theoretically derived or *in silico*-generated FLDs do not have this drawback.

Theoretical distributions may be preferred over *in silico* distributions, because they are exactly formulated, using explicit assumptions and parameter settings. In this article, we examine the length distribution for AFLP fragments proposed by Innan *et al.* (1999) as a theoretical basis on which to estimate the proportion of nonidentical bands comigrating across genotypes. To our knowledge, no alternative AFLP FLD has been proposed yet.

Use of *in silico* AFLP FLDs has the drawback that the distribution itself has to be estimated from the available genome data. Therefore, it is inherently subject to uncertainty because of estimation error and limited by the availability and representativeness of the genome data. However, *in silico* AFLP data also have two major advantages. First, the AFLP fragments represent an actual genome. Thus, their distribution is not subject to assumptions that underlie theoretical models. Second, when the procedure is performed properly, no fragments will be lost due to methodological errors, and all possible fragments will be represented in the AFLP data set. Here, we examine an *in silico* FLD based on the genome sequence of the model plant Arabidopsis as an alternative to the theoretical distribution of Innan *et al.* (1999). All statistical procedures were performed in SAS Release 8.00 (SAS Institute, Cary, NC).

### Theoretical AFLP fragment length distributions:

Innan *et al.* (1999) describe AFLP FLDs for *Eco*RI and *Mse*I restriction enzymes under the assumptions of (1) a random nucleotide sequence under the Jukes and Cantor model [equal base frequencies *C* = *A* = *T* = *G* = 0.25, and all substitutions equally likely (Jukes and Cantor 1969)]; (2) nucleotide changes as sole cause of changes in DNA sequence; and (3) a haploid genome. They showed that both *Eco*RI/*Eco*RI and *Eco*RI/*Mse*I fragments follow the same truncated geometric distribution $P(L) = (1 - A)\,A^{L} / (A^{L_{\min}} - A^{L_{\max}+1})$, in which *L* is the length of the AFLP fragments, *L*_{min} and *L*_{max} are the minimum and maximum possible lengths of the fragments considered, and *A* = (1 − probability of formation of new *Eco*RI site)(1 − probability of formation of new *Mse*I site). The probability of formation of a restriction site equals the multiplied relative frequencies of the individual nucleotides required for such a site (*GAATTC* for *Eco*RI, *TTAA* for *Mse*I). Under the assumption of equal frequencies of occurrence for all four nucleotides as made by Innan *et al.* (1999), *A* = (1 − 0.25^{6})(1 − 0.25^{4}).

To examine the influence of nucleotide frequencies on the AFLP FLD, we calculated distributions for various ratios of *A* + *T vs. G* + *C*. A literature survey revealed that the *G* + *C* contents of the majority of plants ranged between 35 and 50% (see, *e.g.*, Marie and Brown 1993; Barow and Meister 2002). However, various plant groups showed different *G* + *C* contents. The average *G* + *C* content was 37% for gymnosperms, 40% for dicotyledons, 41% for ferns, 44% for monocotyledons, and 45% for algae. *Viscum album* possibly occupies a special position with only 30% *G* + *C* (Nagl and Stein 1989), although Marie and Brown (1993) reported 39% *G* + *C*. We covered the *G* + *C* range by calculating separate AFLP FLDs for 35, 40, 45, and 50% *G* + *C*. The nucleotide frequencies of *A* in the formula of Innan *et al.* (1999) were adjusted accordingly, with equal splitting of percentages over *A* + *T* and *G* + *C* nucleotides. For easy comparison with empirical data sets, all fragment and band lengths that are reported in this article include adapter sequences.
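As an illustration of how the *G* + *C* content enters the theoretical FLD, the following minimal Python sketch computes *A* and a truncated geometric distribution with probabilities proportional to *A*^{*L*}. The function names and the 50–500 bp range are assumptions made for this example only; because the distribution is normalized over the chosen range, a constant length offset such as the adapter length does not change the result.

```python
def site_probability(site, freqs):
    """Probability that a random position starts the given recognition site."""
    p = 1.0
    for base in site:
        p *= freqs[base]
    return p


def geometric_parameter(gc_content):
    """A = (1 - P[EcoRI site]) * (1 - P[MseI site]) for a given G+C fraction."""
    at = (1.0 - gc_content) / 2.0  # frequency of A and of T (equal split)
    gc = gc_content / 2.0          # frequency of G and of C (equal split)
    freqs = {"A": at, "T": at, "G": gc, "C": gc}
    p_ecori = site_probability("GAATTC", freqs)
    p_msei = site_probability("TTAA", freqs)
    return (1.0 - p_ecori) * (1.0 - p_msei)


def truncated_geometric_fld(gc_content, l_min, l_max):
    """Return {length: probability} with P(L) proportional to A**L on [l_min, l_max]."""
    a = geometric_parameter(gc_content)
    weights = {length: a ** length for length in range(l_min, l_max + 1)}
    total = sum(weights.values())
    return {length: w / total for length, w in weights.items()}


# Compare the four G+C contents considered in the text (range is illustrative).
for gc in (0.35, 0.40, 0.45, 0.50):
    fld = truncated_geometric_fld(gc, 50, 500)
    print(f"G+C = {gc:.2f}: A = {geometric_parameter(gc):.5f}, "
          f"P(50 bp) = {fld[50]:.5f}, P(500 bp) = {fld[500]:.5f}")
```

In this sketch a lower *G* + *C* content raises the probability of a *TTAA* site, lowers *A*, and so shifts probability toward shorter fragments.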

Figure 1 depicts the AFLP FLDs for 35–50% *G* + *C*. The distributions show that the probability that a fragment will occur decreases with increasing fragment length for all *G* + *C* contents. The shape of the distribution is also influenced by the base composition: low *G* + *C* contents yield relatively high frequencies of smaller fragments, while high *G* + *C* contents yield relatively high frequencies of longer fragments. The uniform distribution (all fragment lengths equally likely) is given as a reference.

### Arabidopsis *in silico* AFLP fragment length distributions:

Sequence data for the entire Arabidopsis genome were obtained from The Institute for Genomic Research through the web site at http://www.tigr.org. The Arabidopsis *in silico* AFLP was performed using the restriction enzyme sequences of *Eco*RI/*Mse*I without any selective nucleotides. The probability distribution of the fragment lengths was estimated by fitting a cubic smoothing spline and rescaling properly, using SAS PROC IML. The smoothing parameter of the spline (200.000) was chosen by eye. The more objective approach of cross-validation (SAS PROC INSIGHT) resulted in an unsatisfactory smoothing level and a spline oscillating around the one chosen by eye. The smoothing spline and the relative frequency distribution of the *in silico* AFLP fragments are depicted in Figure 2. Fragment lengths range from 32 to 1024 bp.
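The idea of an *in silico* AFLP without selective nucleotides can be sketched in a few lines of Python. This is a simplified illustration, not the procedure used for the FLDs in this article: it assumes that a fragment is the stretch between two adjacent cut positions belonging to different enzymes, it measures length between cut positions without adapters, and all function names are hypothetical. Both recognition sites are palindromic, so scanning one strand is sufficient.

```python
import re
from collections import Counter

# Recognition sites and cut offsets (EcoRI: G^AATTC, MseI: T^TAA).
ENZYMES = {"EcoRI": ("GAATTC", 1), "MseI": ("TTAA", 1)}


def cut_sites(seq):
    """Return sorted (cut_position, enzyme) pairs for one strand of seq."""
    cuts = []
    for name, (site, offset) in ENZYMES.items():
        # A zero-width lookahead also reports overlapping occurrences.
        for match in re.finditer(f"(?={site})", seq):
            cuts.append((match.start() + offset, name))
    return sorted(cuts)


def ecori_msei_fragment_lengths(seq):
    """Lengths of fragments flanked by one EcoRI cut and one MseI cut."""
    cuts = cut_sites(seq.upper())
    lengths = []
    for (pos1, enz1), (pos2, enz2) in zip(cuts, cuts[1:]):
        if enz1 != enz2:  # skip EcoRI/EcoRI and MseI/MseI fragments
            lengths.append(pos2 - pos1)
    return lengths


def relative_frequencies(lengths):
    """Relative frequency of each fragment length (the unsmoothed FLD)."""
    counts = Counter(lengths)
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}


# Toy sequence: one EcoRI site followed by two MseI sites.
toy = "CC" + "GAATTC" + "A" * 8 + "TTAA" + "G" * 5 + "TTAA" + "CC"
lengths = ecori_msei_fragment_lengths(toy)
print(lengths, relative_frequencies(lengths))
```

For a real genome the resulting relative frequencies would still need smoothing, as described above, before being used as a probability distribution.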

To compare the *in silico* AFLP FLD with the theoretical distribution of Innan *et al.* (1999), we calculated a theoretical distribution using the nucleotide frequencies from the Arabidopsis genome sequence (*G* = *C* = 0.18 and *A* = *T* = 0.32 for all five chromosomes). Figure 2 shows a clear difference between the theoretical and the *in silico* FLD. Compared to the theoretical distribution, the *in silico* distribution shows a lack of smaller bands (<179 bp) and an excess of larger bands (>179 bp). The difference may originate in the nucleotide sequence model employed by Innan *et al.* (1999), which was probably too simple to adequately describe the Arabidopsis *in silico* FLD (see discussion). Given the limitations of the theoretical model and the fact that, in contrast, the Arabidopsis *in silico* FLD reflects an actual genome sequence, we consider the Arabidopsis distribution to be the more accurate basis for our significance tests for AFLP similarities.

The *in silico* AFLP FLD was generated without selective nucleotides to obtain the highest possible number of AFLP fragments. In practice, however, selective nucleotides are always employed in AFLP procedures on plants. To test the influence of selective nucleotides on the AFLP FLD, we performed additional *in silico* AFLP runs with three +1/+1 selective nucleotide combinations: *A*/*C* (the most commonly used single-nucleotide combination), *T*/*A* (the nucleotides with the highest frequency in the Arabidopsis genome), and *C*/*G* (the nucleotides with the lowest frequency in the Arabidopsis genome). A two-sample Kolmogorov-Smirnov test (SAS PROC NPAR1WAY) showed a significant influence of *T*/*A* (*P* = 0.002) and *C*/*G* (*P* = 0.001) selective nucleotides on the FLD. The distribution for selective nucleotides *A*/*C* did not differ significantly from that without selective nucleotides (*P* = 0.62). Figure 2 illustrates the influence of selective nucleotides on the *in silico* AFLP FLD. The use of *T*/*A* selective nucleotides results in an overrepresentation of shorter fragments (<107 bp) and an underrepresentation of longer fragments (>107 bp). The use of *C*/*G* selective nucleotides results in an overrepresentation of longer fragments (>107 bp) and an underrepresentation of shorter fragments (<107 bp). The difference indicates that selection of AFLP fragments using selective nucleotides is not a random process (see discussion).
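For readers without access to SAS, the comparison of two fragment length samples can be illustrated with a pure-Python two-sample Kolmogorov-Smirnov sketch. It uses the standard asymptotic series for the *P* value, so results may differ slightly from those of SAS PROC NPAR1WAY; the function name and the example data are hypothetical.

```python
import math
from bisect import bisect_right


def ks_two_sample(sample1, sample2):
    """Return (D, asymptotic two-sided P value) for two samples."""
    xs, ys = sorted(sample1), sorted(sample2)
    n1, n2 = len(xs), len(ys)
    d = 0.0
    # D is the largest gap between the two empirical distribution functions.
    for value in sorted(set(xs) | set(ys)):
        f1 = bisect_right(xs, value) / n1
        f2 = bisect_right(ys, value) / n2
        d = max(d, abs(f1 - f2))
    lam = math.sqrt(n1 * n2 / (n1 + n2)) * d
    if lam < 0.2:
        # The series converges slowly here and the P value is 1 to machine precision.
        return d, 1.0
    p = 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                  for j in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


# Hypothetical fragment lengths (bp) from two in silico AFLP runs.
run_a = [52, 60, 75, 88, 101, 140, 190, 260, 330, 410]
run_b = [120, 150, 180, 220, 260, 300, 350, 400, 450, 490]
d_stat, p_value = ks_two_sample(run_a, run_b)
print(f"D = {d_stat:.3f}, P = {p_value:.4f}")
```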

Each fragment in an AFLP profile contains a discrete number of nucleotides. If properly measured, the length of a fragment equals this number of nucleotides. Given the discrete nature of the AFLP fragment lengths, the AFLP FLDs are discrete distributions. In Figures 2 and 4, however, the AFLP FLDs appear as continuous distributions, because the large number of possible lengths makes it impossible to visualize the actual discreteness. For the *in silico* AFLPs without selective nucleotides, Figures 2 and 4 show both the smoothed discrete FLDs (line A in Figure 2; lines A and B in Figure 4) and the nonsmoothed discrete FLDs (probability in each length class depicted by a dot). All statistical procedures in this study are based on the discrete smoothed distributions. As a consequence, band lengths used as input for the statistical tests developed in our study should be discrete (*i.e.*, integer) values.

### AFLP fragments and AFLP bands:

Similarities in AFLP patterns result from fragments that are comigrating across genotypes, and two types of such fragments can be distinguished: first, fragments that share the same sequence and originate from the same loci (comigrating identical fragments; these fragments reflect the genetic similarity among genotypes); and second, fragments having different sequences, originating from different loci (comigrating nonidentical fragments; these fragments comigrate by chance only, and do not reflect genetic similarity). Genotypes that are too distantly related for the AFLP technique to detect any relationship information (called “unrelated” hereafter) share only the second type of fragments. Therefore, an estimate of the number of nonidentical fragments comigrating across genotypes is an estimate of the lower boundary for fragment similarity to indicate relationship. We use this number to derive test statistics for significance tests on pairwise AFLP similarities between genotypes.

In an ideal situation, each AFLP band consists of only one AFLP fragment, enabling a one-to-one translation of AFLP fragments into AFLP bands. In that case, test statistics for significance tests can be based directly on the numbers of nonidentical fragments comigrating across genotypes. In practice, however, an AFLP band often contains multiple fragments that are comigrating within the same genotype. As a result, identical bands comigrating across genotypes may contain both identical and nonidentical fragments, while nonidentical bands comigrating across genotypes each may contain multiple nonidentical fragments. The phenomenon of nonidentical comigrating fragments (both within and across genotypes) is known as size homoplasy (Vekemans *et al.* 2002). In most relationship studies this size homoplasy is ignored, and only the presence or absence of AFLP bands is recorded. As a result, the similarities calculated in these studies are based on AFLP band similarities rather than on AFLP fragment similarities. For significance tests to be readily applicable in such relationship studies, the test statistics should be derived from the numbers and positions of nonidentical bands comigrating across genotypes. To account for the size homoplasy, however, information on the numbers and positions of nonidentical fragments comigrating across genotypes should be included as well. We constructed a series of significance tests that meet both demands. To our knowledge, there is no straightforward analytical procedure to calculate the relationship between the numbers of AFLP fragments and numbers of AFLP bands. Therefore, we estimated this relationship using Monte Carlo simulations.

### Significance tests for pairwise AFLP band similarities:

The significance tests for pairwise AFLP band similarities were developed in three steps. In the first step, probability distributions, *P*, of the numbers of nonidentical bands comigrating across genotypes were determined. In the second step, from *P* the expectation, standard deviation, and approximate critical values (95 and 99%) of numbers of nonidentical bands comigrating across genotypes were determined. In the third step, the same quantities were determined for four widely employed (dis)similarity coefficients.

For each pairwise comparison, two independent AFLP band patterns were generated with the appropriate numbers of bands (*e.g.*, 50 and 60). The band patterns were generated by randomly drawing fragments from the smoothed Arabidopsis AFLP FLD. Note that the fragments are drawn only from the part of the Arabidopsis AFLP FLD corresponding to the scoring range of interest (*e.g.*, 50–500 bp). The numbers of fragments needed for each band pattern were often higher than the numbers of bands in the patterns, because some of the fragments ended up in the same bands. The difference between the numbers of fragments and the numbers of bands indicates the amount of size homoplasy in the band pattern (see also *Nonidentical AFLP fragments comigrating within genotypes*). To determine the number of fragments to be drawn from the AFLP FLD in an unbiased way, we repeatedly drew a fragment count from a uniform distribution. Next, a number of fragments equal to the fragment count was drawn from the smoothed Arabidopsis FLD, and the resulting number of AFLP bands was determined. The procedure was repeated until the appropriate numbers of bands (*e.g.*, 50 and 60) were reached in both AFLP patterns. For these numbers of bands, the number of bands comigrating across both AFLP patterns was determined and recorded. The entire procedure was repeated 1,000,000 times, and the probability distribution *P* was estimated from the scores of all 1,000,000 replications.

In the second step, expected numbers of nonidentical bands comigrating across genotypes (*i.e.*, expected numbers of bands comigrating by chance), standard deviation, and approximate critical values (95 and 99%) were determined from the probability distribution *P*. Because the variables under study are discrete, exact 95 and 99% critical values could not be calculated. Instead, approximate values were determined by interpolation.

In most relationship studies, similarity among genotypes is reported using (dis)similarity coefficients rather than numbers of comigrating bands. These coefficients somehow express the proportion of comigrating relative to noncomigrating bands. A literature survey showed that the majority of studies employed Dice similarity (Dice 1945) or Nei and Li distance (Nei and Li 1979), while Jaccard (Jaccard 1908) and simple matching (Sokal and Sneath 1963) similarity are also widely employed. For a given pair of genotypes, let *x*_{i} = 0 when no AFLP band is present at position *i* in genotype 1, and *x*_{i} = 1 when an AFLP band is present at position *i* in genotype 1. Likewise, *y*_{i} = 0 or 1 for genotype 2. For a scoring range 1–*N*, let *s*_{i} = 1 when a certain band position is scored in a data set and *s*_{i} = 0 when a band position is not scored. Let $a = \sum_{i=1}^{N} s_i x_i y_i$, $b = \sum_{i=1}^{N} s_i x_i (1 - y_i)$, $c = \sum_{i=1}^{N} s_i (1 - x_i) y_i$, and $d = \sum_{i=1}^{N} s_i (1 - x_i)(1 - y_i)$. Then Dice = 2*a*/(2*a* + *b* + *c*), Jaccard = *a*/(*a* + *b* + *c*), and simple matching = (*a* + *d*)/(*a* + *b* + *c* + *d*). Nei and Li = (1 − Dice). To make our tests readily applicable in relationship studies employing the above coefficients, we used the numbers of nonidentical bands comigrating across genotypes to get (dis)similarity values. The recalculations involved two steps. First, probability distributions for all four coefficients were calculated, on the basis of the probability distribution of the number of comigrating bands, *P*. Next, expected values and approximate critical values (95 and 99%) were determined from these distributions as described previously.
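The Monte Carlo estimation of *P* can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the AFLSIM implementation: the stand-in FLD is a hypothetical geometric decay rather than the smoothed Arabidopsis FLD, the upper bound of the uniform fragment-count draw (`max_extra`) is an arbitrary choice, the critical value is taken as the smallest count whose cumulative probability reaches the level (no interpolation), and far fewer replications are used than the 1,000,000 mentioned above.

```python
import random
from collections import Counter
from itertools import accumulate


def draw_band_pattern(lengths, cum_weights, n_bands, rng, max_extra=30):
    """Draw fragment sets until one occupies exactly n_bands distinct lengths."""
    while True:
        # Fragment count drawn uniformly; the upper bound is an assumption.
        n_fragments = rng.randint(n_bands, n_bands + max_extra)
        fragments = rng.choices(lengths, cum_weights=cum_weights, k=n_fragments)
        bands = set(fragments)  # fragments of equal length merge into one band
        if len(bands) == n_bands:
            return bands


def simulate_shared_bands(fld, n1, n2, n_rep, seed=1):
    """Estimate P(number of bands shared by chance) for band numbers n1 and n2."""
    rng = random.Random(seed)
    lengths = sorted(fld)
    cum_weights = list(accumulate(fld[length] for length in lengths))
    counts = Counter()
    for _ in range(n_rep):
        bands1 = draw_band_pattern(lengths, cum_weights, n1, rng)
        bands2 = draw_band_pattern(lengths, cum_weights, n2, rng)
        counts[len(bands1 & bands2)] += 1
    return {k: counts[k] / n_rep for k in sorted(counts)}


def mean_and_critical_value(dist, level=0.95):
    """Mean of dist and the smallest count with cumulative probability >= level."""
    mean = sum(k * p for k, p in dist.items())
    cumulative = 0.0
    for k in sorted(dist):
        cumulative += dist[k]
        if cumulative >= level:
            return mean, k
    return mean, max(dist)


# Stand-in FLD on 50-500 bp (hypothetical geometric decay, not the Arabidopsis spline).
demo_fld = {length: 0.99 ** length for length in range(50, 501)}
dist = simulate_shared_bands(demo_fld, 50, 60, n_rep=500)
mean, crit95 = mean_and_critical_value(dist, 0.95)
print(f"expected shared bands: {mean:.2f}; approximate 95% critical value: {crit95}")
```

Each shared-band count *a* in the estimated distribution maps directly to a coefficient value for fixed band numbers (for example, Dice = 2*a*/(*n*1 + *n*2)), which is how a distribution of counts can be converted into a distribution of (dis)similarities.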

The entire procedure has been incorporated in the computer program AFLSIM, which can be downloaded from http://www.dpw.wur.nl/biosys/AFLSIM_UK.html. The program can be used to test the significance of AFLP similarities in empirical data sets with scoring ranges between 34 and 1024 bp (related to the limits of the Arabidopsis AFLP FLD). The minimum number of AFLP bands per genotype should be 1, and the maximum equals half the number of band positions available within the employed scoring range. Band lengths should be input as discrete (*i.e.*, integer) values. As an example, Figure 3 and Table 1 show results for the widely employed scoring range 50–500 bp and an AFLP procedure with *A*/*C* selective nucleotides. Figure 3 shows the relationship between the number of bands scored in each of two genotypes and the expected number of bands shared. Table 1 gives an overview of the test statistics. The expected (dis)similarities in the table indicate the level of (dis)similarity expected in unrelated genotypes. Pairwise (dis)similarities exceeding the critical values indicate significant phenetic or genetic similarity.

For the calculations in Table 1, we assumed that all band positions available in the scoring range were present in the data set. As a result, a relatively large proportion of the band positions showed 0/0 matches (*i.e.*, no band present in either of the genotypes compared). Because 0/0 matches are counted as similarity in the simple matching coefficient, this causes a relatively high minimum simple matching value (Table 1, bottom, column 10). The number of 0/0 matches does not influence the Dice, Nei and Li, and Jaccard similarity. Consequently, the theoretical minimum value of these coefficients is always 0, regardless of the number of 0/0 matches in the data set.
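The effect of 0/0 matches can be made concrete with a short Python sketch of the four coefficients. The helper names and the toy band patterns are hypothetical; the same two profiles are compared once over 5 scored positions and once over 451 positions (a 50–500 bp scoring range).

```python
def band_counts(x, y, scored=None):
    """Return (a, b, c, d): 1/1, 1/0, 0/1 and 0/0 matches over scored positions."""
    if scored is None:
        scored = [1] * len(x)
    a = sum(s * xi * yi for s, xi, yi in zip(scored, x, y))
    b = sum(s * xi * (1 - yi) for s, xi, yi in zip(scored, x, y))
    c = sum(s * (1 - xi) * yi for s, xi, yi in zip(scored, x, y))
    d = sum(s * (1 - xi) * (1 - yi) for s, xi, yi in zip(scored, x, y))
    return a, b, c, d


def dice(a, b, c, d=0):
    return 2 * a / (2 * a + b + c)


def jaccard(a, b, c, d=0):
    return a / (a + b + c)


def simple_matching(a, b, c, d):
    return (a + d) / (a + b + c + d)


def nei_li(a, b, c, d=0):
    return 1 - dice(a, b, c)


# Two profiles with three bands each, sharing one band.
x_short, y_short = [1, 1, 1, 0, 0], [0, 0, 1, 1, 1]
# The same bands embedded in a 451-position scoring range (extra 0/0 matches).
x_long = x_short + [0] * 446
y_long = y_short + [0] * 446

for label, x, y in (("5 positions", x_short, y_short),
                    ("451 positions", x_long, y_long)):
    counts = band_counts(x, y)
    print(label, counts,
          f"Dice={dice(*counts):.3f}",
          f"Jaccard={jaccard(*counts):.3f}",
          f"SM={simple_matching(*counts):.3f}")
```

Dice and Jaccard are unchanged by the additional empty positions, whereas simple matching rises from 0.2 to about 0.99. The functions divide by zero when neither genotype has any band; a production version would need to handle that case.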

The maximum possible (dis)similarity values (given the observed band numbers; see Table 1) illustrate an often overlooked peculiarity of Dice, Jaccard, Nei and Li, and simple matching pairwise (dis)similarities: they can be unity (or 0 in the case of Nei and Li distance) only when AFLP band numbers in both genotypes are identical. Table 1 shows that the maximum possible similarity rapidly decreases with increasing difference in band number between genotypes. Comparison with the critical values corresponding to the unequal band numbers shows that such (dis)similarities, although low, may still be significant.
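The upper limits follow directly from the coefficient definitions: at best, every band of the genotype with fewer bands is shared, so *a* = min(*n*1, *n*2) and *b* + *c* = |*n*1 − *n*2|. A minimal Python sketch, assuming all positions in the scoring range are scored (the function name is hypothetical):

```python
def extreme_similarities(n1, n2, n_positions):
    """Best attainable coefficient values for band numbers n1 and n2."""
    shared = min(n1, n2)       # all bands of the smaller profile are shared
    unshared = abs(n1 - n2)    # surplus bands can never be matched
    max_dice = 2 * shared / (n1 + n2)
    return {
        "max_dice": max_dice,
        "max_jaccard": shared / max(n1, n2),
        "max_simple_matching": (n_positions - unshared) / n_positions,
        "min_nei_li": 1 - max_dice,
    }


for n1, n2 in ((50, 50), (50, 60), (20, 60)):
    values = extreme_similarities(n1, n2, 451)
    print(n1, n2, {name: round(v, 3) for name, v in values.items()})
```

With 50 and 60 bands, for example, Dice cannot exceed 100/110 ≈ 0.909 and Jaccard cannot exceed 50/60 ≈ 0.833.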

### Nonidentical AFLP fragments comigrating within genotypes:

When simulating band patterns for the probability distribution *P*, we were surprised by the high amount of size homoplasy. The number of bands containing multiple fragments was much higher than we intuitively anticipated. However, the phenomenon that a co-occurrence of events (in this case the appearance of two AFLP fragments of equal length) is more likely than intuitively expected is well known in statistics and commonly referred to as the birthday paradox. The paradox is often summarized as follows: in a group of only 23 persons, the probability of at least one coinciding birthday, assuming uniformly distributed birthdays over all 365 days of the year, is already >0.5.

Translated to AFLP patterns for a scoring range of, *e.g.*, 50–500 bp (451 positions), this means that only 26 fragments are needed to have a probability >0.5 that at least one AFLP band contains multiple fragments. In reality, however, the probability distribution of fragment lengths is highly skewed instead of uniform (Figure 2), rendering even higher probabilities of fragments with identical lengths (Munford 1977).
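Both figures, 23 persons for 365 days and 26 fragments for 451 band positions, can be checked with a few lines of Python under the uniform assumption (function names are hypothetical):

```python
def collision_probability(n_items, n_positions):
    """P(at least two of n_items share a position), uniform positions."""
    p_no_collision = 1.0
    for k in range(n_items):
        p_no_collision *= (n_positions - k) / n_positions
    return 1.0 - p_no_collision


def min_items_for_collision(n_positions, threshold=0.5):
    """Smallest number of items with collision probability above threshold."""
    n = 1
    while collision_probability(n, n_positions) <= threshold:
        n += 1
    return n


print(min_items_for_collision(365))  # birthdays
print(min_items_for_collision(451))  # 50-500 bp scoring range
```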

Analogous to the situation for nonidentical AFLP bands comigrating across genotypes, the number of nonidentical AFLP fragments comigrating within a genotype (*i.e.*, the amount of size homoplasy) depends on the number of bands scored, the number of discrete band positions available within the scoring range, and the AFLP FLD. Table 2 illustrates the size homoplasy for a wide series of scoring ranges and band numbers. The table shows that the amount of size homoplasy increases with increasing numbers of bands and with decreasing scoring range. In empirical data sets, the occurrence of multiple fragments in AFLP bands has already been demonstrated for Beta and *G. max* (Hansen *et al.* 1999; Meksem *et al.* 2001).
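The two trends can be illustrated with the expected number of occupied band positions when a given number of fragments is drawn independently from an FLD. The sketch below uses a uniform stand-in FLD purely for illustration, so it understates the homoplasy expected under the skewed Arabidopsis FLD, and it gives only the mean number of bands for a given number of fragments; it does not replace the simulations used for Table 2. Function names are hypothetical.

```python
def expected_bands(fragment_probs, n_fragments):
    """Expected number of distinct band positions occupied by n_fragments."""
    # A position is occupied unless all n_fragments miss it.
    return sum(1.0 - (1.0 - p) ** n_fragments for p in fragment_probs)


def expected_homoplasy(fragment_probs, n_fragments):
    """Expected number of fragments hidden in bands shared with another fragment."""
    return n_fragments - expected_bands(fragment_probs, n_fragments)


def uniform_fld(n_positions):
    return [1.0 / n_positions] * n_positions


for n_positions in (451, 201):          # e.g., 50-500 bp and 50-250 bp
    for n_fragments in (30, 60, 90):
        fld = uniform_fld(n_positions)
        print(n_positions, n_fragments,
              f"bands={expected_bands(fld, n_fragments):.1f}",
              f"homoplasy={expected_homoplasy(fld, n_fragments):.1f}")
```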

### Weighted similarity coefficients including band position information:

In the previous sections, a procedure was developed to test the significance of AFLP-based similarities. The procedure can be used to test similarities that were calculated according to various well-known similarity coefficients. The relationship between band length and band presence is incorporated in the tests using the Arabidopsis AFLP FLD. However, this relationship is not accounted for in the similarity coefficients themselves, since all bands are equally weighted in the existing coefficients.

To make the existing similarity coefficients more informative, we propose an adjustment of these coefficients by weighting the bands with the inverse probabilities of their occurrence in an AFLP profile. The rationale behind this is that long bands have a smaller probability of occurring than short bands do, and therefore they have a larger probability of contributing reliable information to a data set. Consequently, long bands should contribute more to the overall similarity values. A proper weighting scheme can be derived from the Arabidopsis AFLP FLD. In the section on Arabidopsis *in silico* AFLP FLDs, we demonstrated that the Arabidopsis AFLP FLD is a reliable basis for describing the probabilities of occurrence of AFLP fragments and hence of AFLP bands. Therefore, the inverse probabilities from the Arabidopsis AFLP FLD are the logical basis for constructing weighted similarity coefficients.

The weighted coefficients are constructed in two steps, analogous to the construction of the unweighted coefficients. In the first step, weighted similarities are calculated for numbers of bands shared between two genotypes (*a*_{w}), for numbers of bands unique to one of the genotypes (*b*_{w} and *c*_{w}), and for band positions that are not occupied in either of the genotypes (*d*_{w}). Again, for a given pair of genotypes, let *x*_{i} = 0 when no AFLP band is present at position *i* in genotype 1, and *x*_{i} = 1 when an AFLP band is present at position *i* in genotype 1. Likewise, *y*_{i} = 0 or 1 for genotype 2. For a scoring range 1–*N*, let *s*_{i} = 1 when a certain band position is scored in a data set, and *s*_{i} = 0 when a band position is not scored. Then, $a_w = \sum_{i=1}^{N} s_i w_{a,i} x_i y_i$, $b_w = \sum_{i=1}^{N} s_i w_{b,i} x_i (1 - y_i)$, $c_w = \sum_{i=1}^{N} s_i w_{c,i} (1 - x_i) y_i$, and $d_w = \sum_{i=1}^{N} s_i w_{d,i} (1 - x_i)(1 - y_i)$; with inverse weights $w_{a,i} = 1/(p_i q_i)$, $w_{b,i} = 1/[p_i (1 - q_i)]$, $w_{c,i} = 1/[(1 - p_i) q_i]$, and $w_{d,i} = 1/[(1 - p_i)(1 - q_i)]$; with *p*_{i} the probability that genotype 1 has a band at position *i*; and *q*_{i} the probability that genotype 2 has a band at position *i*. The band probabilities are derived from the fragment probabilities in the Arabidopsis *in silico* AFLP FLD according to *p*_{i} = 1 − [1 − *p*(fragment at *i*)]^{*N*1}, and *q*_{i} = 1 − [1 − *p*(fragment at *i*)]^{*N*2}, where *N*1 and *N*2 are the total numbers of fragments in the scoring range in genotypes 1 and 2, respectively. The number of fragments *N* for each genotype depends on the scoring range, the total number of bands within the scoring range, and the fragment length distribution and was determined by Monte Carlo simulation as described in *Significance tests for pairwise AFLP band similarities*. In the second step, weighted similarity coefficients are calculated according to: weighted Dice = 2*a*_{w}/(2*a*_{w} + *b*_{w} + *c*_{w}), weighted Jaccard = *a*_{w}/(*a*_{w} + *b*_{w} + *c*_{w}), and weighted simple matching = (*a*_{w} + *d*_{w})/(*a*_{w} + *b*_{w} + *c*_{w} + *d*_{w}). Weighted Nei and Li = (1 − weighted Dice).
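A minimal Python sketch of the weighting idea is given below. It assumes that each band position is weighted by the inverse of the probability of the observed presence/absence combination, treating the two genotypes as independent: 1/(*p*_{i}*q*_{i}) for a shared band, 1/[*p*_{i}(1 − *q*_{i})] and 1/[(1 − *p*_{i})*q*_{i}] for unique bands, and 1/[(1 − *p*_{i})(1 − *q*_{i})] for a 0/0 match. These weight forms, the function names, and the four-position toy FLD are assumptions of this example.

```python
def band_probabilities(fragment_probs, n_fragments):
    """p_i = 1 - [1 - p(fragment at i)]**N for every band position i."""
    return [1.0 - (1.0 - f) ** n_fragments for f in fragment_probs]


def weighted_counts(x, y, p, q, scored=None):
    """Return (a_w, b_w, c_w, d_w) using inverse-probability weights (assumed form)."""
    if scored is None:
        scored = [1] * len(x)
    a_w = b_w = c_w = d_w = 0.0
    for s, xi, yi, pi, qi in zip(scored, x, y, p, q):
        if not s:
            continue                      # position not scored in the data set
        if xi and yi:
            a_w += 1.0 / (pi * qi)
        elif xi:
            b_w += 1.0 / (pi * (1.0 - qi))
        elif yi:
            c_w += 1.0 / ((1.0 - pi) * qi)
        else:
            d_w += 1.0 / ((1.0 - pi) * (1.0 - qi))
    return a_w, b_w, c_w, d_w


def weighted_dice(a_w, b_w, c_w, d_w=0.0):
    return 2.0 * a_w / (2.0 * a_w + b_w + c_w)


def weighted_jaccard(a_w, b_w, c_w, d_w=0.0):
    return a_w / (a_w + b_w + c_w)


def weighted_simple_matching(a_w, b_w, c_w, d_w):
    return (a_w + d_w) / (a_w + b_w + c_w + d_w)


# Toy FLD over four positions: position 0 is common (short), position 3 is rare (long).
fragment_probs = [0.4, 0.3, 0.2, 0.1]
p = band_probabilities(fragment_probs, 3)   # genotype 1, N1 = 3 fragments
q = band_probabilities(fragment_probs, 3)   # genotype 2, N2 = 3 fragments

genotype1 = [1, 0, 0, 1]
shares_common = [1, 1, 0, 0]   # shares only the common band at position 0
shares_rare = [0, 1, 0, 1]     # shares only the rare band at position 3

for label, genotype2 in (("common band shared", shares_common),
                         ("rare band shared", shares_rare)):
    counts = weighted_counts(genotype1, genotype2, p, q)
    print(label, f"weighted Dice = {weighted_dice(*counts):.3f}")
```

Both comparisons have an unweighted Dice similarity of 0.5, but the weighted value is higher when the shared band is the rare one, which is the intended behavior of the weighting.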

### The Arabidopsis sequence as a model system:

The test statistics in this study are based on *in silico* AFLP FLDs from the Arabidopsis genome sequence. This sequence is generally considered to be representative of the genome of an angiosperm species (*e.g.*, Arabidopsis Genome Initiative 2000; Barnes 2002), and therefore the test statistics based on the Arabidopsis genome sequence should be valid for angiosperms in general.

A limitation of the Arabidopsis sequence is that a significant part is still missing. According to the Arabidopsis Genome Initiative (2000), ∼8.5% of the genome has not yet been aligned (∼10 of an estimated 125 Mb). This 8.5% mainly consists of repeat sequences in centromeric and rDNA regions. Genetic mapping studies in Arabidopsis (*e.g.*, Alonso-Blanco *et al.* 1998) showed a clustering of AFLP fragments around the centromeres, which could indicate that the actual percentage of AFLP fragments missing from the Arabidopsis AFLP FLD is much higher than the 8.5% of missing sequence. In a recent study, however, Peters *et al.* (2001) found that Arabidopsis *Sac*I/*Mse*I *in silico* AFLP fragments do not cluster around the centromeres, but are evenly dispersed over the genome. They argued that the apparent overrepresentation of AFLP fragments in genetic mapping studies must originate in a higher mutation frequency in the (peri)centromeric regions rather than in an actual overrepresentation of AFLP fragments. Assuming that the findings of Peters *et al.* (2001) are representative for AFLP fragments in general, the missing 8.5% of repeat regions in the Arabidopsis genome sequence corresponds to 8.5% of missing AFLP fragments in the Arabidopsis AFLP FLD. These missing regions contain mainly repeat sequences. Estimating the influence of the missing repeats on the Arabidopsis AFLP FLD is highly speculative, but one could argue that their influence on the significance tests may be only limited. Given the fact that the average size of the individual repeat units is relatively small, the size of AFLP fragments resulting from restriction sites in the repeat regions will also be small. The possible underrepresentation of small fragments will mainly influence the lower part of the Arabidopsis AFLP FLD. In most AFLP studies, these smaller fragments are discarded. Consequently, they do not influence the results.

Specific features of the Arabidopsis genome that may limit its general applicability as a model system for angiosperms are its small size (120 Mb) and its relatively low *G* + *C* content (36%). We examined the representativeness of the Arabidopsis sequence using sequences of *Oryza sativa* L. Apart from sequences of Arabidopsis, sequences of *O. sativa* L. subspecies *indica* (Yu *et al.* 2002) and *japonica* (Feng *et al.* 2002; Goff *et al.* 2002; Sasaki *et al.* 2002) are the only complete angiosperm sequences presently available. However, at the time of our study the *O. sativa* sequences were still very fragmented. We used sequences from chromosomes 3 (43.4% *G* + *C*) and 10 (43.6% *G* + *C*) of *O. sativa* subsp. *japonica* (hereafter, rice), covering nearly complete chromosomes contained in a limited number of BAC assemblies. Sequence data were obtained from the web site of The Institute for Genomic Research at http://www.tigr.org. To generate the rice FLD, we performed the *in silico* AFLP as described for Arabidopsis, without selective nucleotides. Vector sequences and sequences of suspect origin were removed from the BAC assemblies prior to *in silico* AFLP, using the National Center for Biotechnology Information VecScreen web tool. The probability distribution of the AFLP fragment lengths was estimated by fitting a cubic smoothing spline as before. The smoothing spline and the relative frequency distribution of the rice *in silico* AFLP fragments are depicted in Figure 4. Fragment sizes range from 32 to 1024 bp.
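The *in silico* AFLP step can be illustrated with a minimal Python sketch. It is not the procedure used in this study: it assumes that only fragments bounded by one *Eco*RI site (GAATTC) and one *Mse*I site (TTAA) are retained, it measures fragment length as the distance between the start positions of neighboring recognition sites, and it ignores adapters, primers, and selective nucleotides. The function names are hypothetical.

```python
import re
from collections import Counter

ECORI = "GAATTC"  # recognition site of the rare cutter EcoRI
MSEI = "TTAA"     # recognition site of the frequent cutter MseI


def find_sites(seq, site):
    """Return 0-based start positions of every (possibly overlapping) site."""
    return [m.start() for m in re.finditer(f"(?={site})", seq)]


def in_silico_aflp(seq):
    """Return lengths of fragments flanked by one EcoRI and one MseI site.

    Simplifying assumption: length is the distance between the start
    positions of two neighboring recognition sites.
    """
    seq = seq.upper()
    sites = [(pos, "E") for pos in find_sites(seq, ECORI)]
    sites += [(pos, "M") for pos in find_sites(seq, MSEI)]
    sites.sort()
    lengths = []
    for (pos_a, enz_a), (pos_b, enz_b) in zip(sites, sites[1:]):
        if enz_a != enz_b:  # skip MseI/MseI and EcoRI/EcoRI fragments
            lengths.append(pos_b - pos_a)
    return lengths


def relative_frequencies(lengths):
    """Relative frequency of each fragment length (a raw, unsmoothed FLD)."""
    counts = Counter(lengths)
    total = len(lengths)
    return {length: n / total for length, n in sorted(counts.items())}
```

The raw relative frequencies returned here would still need smoothing (a cubic smoothing spline in this study) before being used as a probability distribution.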

The Arabidopsis FLD without selective nucleotides is included as a reference. A two-sample Kolmogorov-Smirnov test showed that the rice FLD differs significantly from the Arabidopsis FLDs with *A*/*C*, *T*/*A*, or without selective nucleotides (*P* < 0.0001), but not from that with *C*/*G* selective nucleotides (*P* = 0.09). The most obvious reason for the difference is the high *G* + *C* content of the rice sequences relative to those of Arabidopsis. As predicted by the theoretical model of Innan *et al.* (1999), the higher *G* + *C* content in rice yields a more even FLD. Additionally, there may be other genome differences between rice and Arabidopsis that influence the AFLP FLD. Most notably, these could be differences related to the evolutionarily distinct position of Poaceae within the angiosperms (*e.g.*, Montero *et al.* 1990; Devos *et al.* 1999; Freeling 2001). However, the influence of these additional factors cannot be studied separately from that of *G* + *C* content until more evolutionarily distinct genome sequences with similar nucleotide compositions become available.
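For readers who want to repeat this kind of comparison on their own fragment lists, the following is a minimal Python sketch of a two-sample Kolmogorov-Smirnov test. The *P* value uses the standard asymptotic Kolmogorov series with a small-sample correction; whether this matches the implementation used for the values reported above is not known, and the function name is hypothetical. Fragment lengths are integers with many ties, for which the test tends to be conservative.

```python
import math


def ks_two_sample(x, y):
    """Return (D, approximate P value) for the two-sample KS test."""
    x, y = sorted(x), sorted(y)
    n, m = len(x), len(y)
    i = j = 0
    d = 0.0
    # Walk through both samples and track the largest gap between
    # the two empirical cumulative distribution functions.
    while i < n and j < m:
        v = min(x[i], y[j])
        while i < n and x[i] <= v:
            i += 1
        while j < m and y[j] <= v:
            j += 1
        d = max(d, abs(i / n - j / m))
    n_eff = n * m / (n + m)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.2:
        # The series does not converge near zero; P is effectively 1 here.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)
```

Here `x` and `y` would be the lists of fragment lengths from the two *in silico* AFLP runs being compared.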

Comparison of the test statistics for Arabidopsis and rice in the scoring range 50–500 bp (supplemental Table 3, available at http://www.dpw.wur.nl/biosys/AFLSIM_UK.html) showed that the expected number of nonidentical bands comigrating across genotypes is on average 10% lower for rice. Although the numbers are in the same order of magnitude, the difference between Arabidopsis and rice illustrates the need for more than one model species. Because Arabidopsis and rice cover most of the *G* + *C* range for angiosperms, together they probably suffice as model species for the angiosperms in general. Therefore, we propose that the test statistics based on the Arabidopsis sequence be considered generally applicable for angiosperms with *G* + *C* contents between ∼35 and 40%, and that the test statistics based on the rice sequence be considered generally applicable for angiosperms with *G* + *C* contents between ∼40 and 50%. For angiosperms with unknown *G* + *C* content, the test statistics for the Arabidopsis genome can be applied as a conservative test. Test statistics based on a more complete rice genome sequence will be developed at a later stage.

## DISCUSSION

Theoretical and *in silico* AFLP FLDs were examined as a basis for significance tests for AFLP similarities. Comparison of the theoretical AFLP FLD of Innan *et al.* (1999) with an FLD based on *in silico* AFLP of the complete Arabidopsis genome sequence demonstrated that the theoretical distribution is not representative of that of an actual genome. This is not in accordance with Vekemans *et al.* (2002), who concluded that the theoretical distribution of Innan *et al.* (1999) was representative of empirical distributions of *Phaseolus lunatus* and *Lolium perenne* in a scoring range between 75 and 450 bp. The difference in conclusions may be explained by (1) errors in the empirical data sets, resulting from the AFLP procedure (discussed previously), and (2) fragment numbers in the empirical data sets (801 and 1599, respectively) being too low to yield a representative FLD. The variation in the FLD resulting from the low numbers of fragments probably obscured systematic differences between the theoretical and empirical distributions. In this study, the Arabidopsis *in silico* AFLP FLDs are based on much larger numbers of fragments (23,556 between 75 and 450 bp), enabling a more detailed comparison. This new comparison demonstrated a clear discrepancy between the theoretical and the *in silico* distributions, indicating that theoretical distributions based on Innan *et al.* (1999) do not adequately describe AFLP FLDs based on an actual genome.

The discrepancy between the theoretical and the *in silico* distribution may be explained by two assumptions made by Innan *et al.* (1999). The first is that of a random nucleotide sequence under the Jukes and Cantor (1969) model. In actual genomes the nucleotides are not randomly distributed, but organized in distinct patterns of dinucleotides and oligonucleotides (Nussinov 1981, 1991). At a larger scale, the genome is organized in isochores, with large blocks of *G* + *C*-rich sequences alternating with large blocks of more *A* + *T*-rich sequences (Salinas *et al.* 1988; Matassi *et al.* 1989; Montero *et al.* 1990). Moreover, the Jukes and Cantor model assumes equal base frequencies and equal probabilities of substitution among all nucleotides, while in reality base frequencies are unequal and substitution rates vary. The second assumption that may explain the deviation between the theoretical and the *in silico* distribution is that of nucleotide changes as the sole cause of changes in DNA sequence. Under this second assumption, processes such as insertions and deletions are ignored. Obviously, this is a simplification of the dynamics in actual genomes, as was already noted by Innan *et al.* (1999). Both assumptions introduce restrictions in the model of Innan *et al.* (1999) that may be too limiting to allow for an adequate description of an AFLP FLD.

Our analysis of the Arabidopsis *in silico* AFLP FLD demonstrated that the type of selective nucleotides influences the shape of the distribution. Use of only *G* + *C* nucleotides favors the selection of long fragments over short ones, yielding a relatively even distribution of fragments over length classes. Use of only *A* + *T* nucleotides favors the selection of short fragments over long ones, giving a more asymmetrical distribution. The effect probably results from the isochore structure of the genome in combination with the nucleotide composition of the restriction sites. The enzymes employed in this study are a frequent cutter (*Mse*I) and a rare cutter (*Eco*RI). Because *Mse*I cuts are much more frequent than *Eco*RI cuts, the average AFLP fragment size will be determined mainly by the frequency of *Mse*I cuts. The restriction site of *Mse*I contains no *G* + *C* nucleotides, and therefore this enzyme will preferentially cut in *A* + *T*-rich isochores. Given this preference of the frequent-cutting *Mse*I enzyme for *A* + *T*-rich isochores, and the fact that the fragment size is inversely proportional to the frequency of cuts, AFLP fragments resulting from *A* + *T*-rich isochores will on average be smaller than fragments resulting from other parts of the genome. Because these fragments originate in *A* + *T*-rich stretches of the genome, the fragments themselves will contain relatively high proportions of *A* + *T* nucleotides. Conversely, fragments resulting from *G* + *C*-rich isochores will on average be longer and contain relatively high proportions of *G* + *C* nucleotides (the relation between fraction *G* + *C* and fragment length in the Arabidopsis *in silico* AFLP data is approximately *G* + *C* = 0.34379 + 0.00012036 × length). Using *T*/*A* selective nucleotides in the AFLP procedure will favor the shorter *A* + *T*-rich sequences over the longer *G* + *C*-rich sequences, yielding an asymmetric AFLP FLD with mainly short sequences. Using *C*/*G* selective nucleotides will favor *G* + *C*-rich sequences, yielding a more even distribution of AFLP fragments over length classes. The FLD resulting from an AFLP procedure with *A*/*C* selective nucleotides did not differ significantly from the FLD generated without selective nucleotides, illustrating that the selective nucleotides effect is avoided when mixed *A* + *T*/*G* + *C* selective nucleotides are used.
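The argument above can be made concrete with a small Python sketch. The first function evaluates the linear relation quoted in the text. The second is a deliberately crude illustration, not part of the analysis in this study: it assumes independent nucleotides with P(G) = P(C) and P(A) = P(T) within a region of given local *G* + *C* content, which is exactly the kind of random-sequence assumption that real genomes violate. It is included only to show the direction of the effect, namely that *Mse*I sites become rarer as local *G* + *C* content rises. Function names are hypothetical.

```python
def gc_fraction_from_length(length_bp):
    """Approximate G+C fraction of an Arabidopsis in silico AFLP fragment,
    using the linear relation quoted in the text."""
    return 0.34379 + 0.00012036 * length_bp


def site_probability(site, gc):
    """Probability of a recognition site starting at a given position,
    ASSUMING independent nucleotides with P(G)=P(C)=gc/2, P(A)=P(T)=(1-gc)/2."""
    p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    prob = 1.0
    for base in site:
        prob *= p[base]
    return prob


# Expected spacing between sites (1 / probability) for three
# hypothetical local G+C contents.
for gc in (0.30, 0.36, 0.44):
    msei = 1 / site_probability("TTAA", gc)
    ecori = 1 / site_probability("GAATTC", gc)
    print(f"G+C={gc:.2f}: MseI every ~{msei:.0f} bp, EcoRI every ~{ecori:.0f} bp")
```

Under this simplified model the expected spacing between *Mse*I sites grows with local *G* + *C* content, consistent with longer fragments originating from *G* + *C*-rich regions.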

On the basis of the Arabidopsis *in silico* AFLP FLDs, the numbers of nonidentical bands comigrating across genotypes were calculated as a basis for significance tests for AFLP similarities. Table 1 shows that the proportion of nonidentical bands comigrating across genotypes increases with the number of bands scored per genotype. When 10 bands are scored in each genotype and *A*/*C* selective nucleotides are used, the proportion of comigrating nonidentical bands is ∼4%. For 30 bands, this proportion is 12%, for 60 bands it is 22%, for 90 bands it is 31%, and for 120 bands it is 40%. The increase results from the fact that the probability for nonidentical AFLP fragments to comigrate at the same position increases with increasing numbers of total fragments. Relative to the proportion of comigrating nonidentical bands for *A*/*C* nucleotides, the proportions for *T*/*A* selective nucleotides are somewhat higher (4, 13, 24, 33, and 42%), while the proportions for *C*/*G* nucleotides are somewhat lower (4, 10, 20, 29, and 37%). However, all are in the same order of magnitude. The differences for the various combinations of selective nucleotides probably result from selection bias due to the isochore structure of the genome and the use of different types of selective nucleotides, as discussed before.
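Why the proportion rises with the number of bands can be seen from a simplified Python sketch. It is not the calculation behind Table 1: it assumes that each genotype's fragments are drawn independently from the FLD, that two fragments comigrate only when their lengths are identical, and, for the example, a hypothetical uniform FLD over 50–500 bp. Because the Arabidopsis FLDs are not uniform, the printed proportions will not reproduce the values in Table 1; only the upward trend is meaningful.

```python
def expected_chance_comigration(fld, n_bands):
    """Expected number of positions at which two unrelated genotypes both
    show a band, if each genotype's n_bands fragments are drawn
    independently from the fragment length distribution `fld`
    (a dict mapping fragment length to probability)."""
    expected = 0.0
    for p in fld.values():
        p_band = 1.0 - (1.0 - p) ** n_bands  # at least one fragment at this length
        expected += p_band * p_band          # band present in both genotypes
    return expected


# Hypothetical uniform FLD over the 50-500 bp scoring range (451 positions).
uniform_fld = {length: 1 / 451 for length in range(50, 501)}

for n in (10, 30, 60, 90, 120):
    shared = expected_chance_comigration(uniform_fld, n)
    print(n, round(shared / n, 3))  # approximate proportion of comigrating bands
```

Replacing `uniform_fld` with a smoothed *in silico* FLD would make the result depend on the selective nucleotides, since a more skewed distribution concentrates fragments in fewer length classes.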

The high numbers of nonidentical comigrating bands apparent from Table 1 and supplemental Table 3 illustrate that overestimation of phenetic or genetic similarities based on AFLP band patterns is a serious problem when 50–100 bands per genotype are scored, as recommended by Vos *et al.* (1995). However, even for lower numbers of bands per genotype, a considerable percentage of comigrating bands are nonidentical. Therefore, overestimation of similarities based on AFLP band patterns cannot be completely ruled out by limiting the number of bands within a scoring range. However, the influence of the overestimation on the final analyses can be diminished by using corrected similarities, or weighted similarities, or by removing from the data sets those genotypes without any significant similarity to other genotypes. This article provides the procedures that enable this, all of which are available in the program AFLSIM. The procedures can be applied in, *e.g.*, genetic diversity studies or phylogenetic studies, which often include less-related genotypes as reference groups. For any genotype to be useful as a reference, at least some genetic similarity with the group under study is required. In many genetic diversity studies, however, the genetic similarities between the groups under study and the reference group are below the 95% critical values indicated in our tests. Such similarities, usually in the order of 0.15 or 0.20, are mistakenly taken to indicate a proper level of similarity for a reference group. To select a proper reference group, pairwise similarities between genotypes in the reference group and in the group under study should be tested, and at least some similarities between genotypes of both groups should be significant. Reference genotypes without significant similarity to the group under study should be discarded prior to further analysis.
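The screening of reference genotypes described above can be sketched in Python as follows. This is an illustration, not the AFLSIM implementation: band patterns are represented as sets of scored band positions (assumed nonempty), the Dice and Jaccard coefficients are computed in their usual form, and `critical_value` is a hypothetical parameter that the user would look up for the relevant number of bands and selective nucleotides. It is not computed here.

```python
def dice(a, b):
    """Dice similarity: 2 * shared bands / (bands in a + bands in b)."""
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard similarity: shared bands / bands present in a or b."""
    return len(a & b) / len(a | b)


def usable_references(references, study_group, critical_value, coeff=dice):
    """Keep reference genotypes whose similarity to at least one genotype
    of the study group exceeds `critical_value`.

    references, study_group: dicts mapping genotype name to a set of
    scored band positions.
    """
    return [name for name, bands in references.items()
            if any(coeff(bands, other) > critical_value
                   for other in study_group.values())]


# Made-up band patterns and a made-up critical value, for illustration only.
study = {"S1": {100, 150, 200, 250}}
refs = {"R1": {100, 150, 300}, "R2": {400, 410}}
print(usable_references(refs, study, critical_value=0.3))
```

In this toy example only `R1` shares bands with the study group, so `R2` would be discarded before further analysis.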

By enabling the detection of unrelated genotypes and by the use of corrected and weighted similarity values, application of the procedures proposed in this article will make the analysis of AFLP data sets more informative and more reliable.

## Acknowledgments

We thank Ronald van den Berg of the Biosystematics Group, Wageningen University, The Netherlands; Hans de Jong of the Laboratory of Genetics, Wageningen University, The Netherlands; and two anonymous reviewers for useful suggestions. Herman van Eck of the Laboratory of Plant Breeding, Wageningen University, The Netherlands, is acknowledged for discussing the theoretical aspects of the AFLP technique; and Qiaoping Yuan of The Institute for Genomic Research (TIGR) is acknowledged for providing additional information on the Arabidopsis and rice genome sequences. Sequence data of Arabidopsis and rice were obtained from TIGR through their web site at http://www.tigr.org.

## Footnotes

Communicating editor: M. A. F. Noor

- Received March 26, 2003.
- Accepted April 29, 2004.

- Genetics Society of America