- Split View
-
Views
-
Cite
Cite
Armando Caballero, Humberto Quesada, Emilio Rolán-Alvarez, Impact of Amplified Fragment Length Polymorphism Size Homoplasy on the Estimation of Population Genetic Diversity and the Detection of Selective Loci, Genetics, Volume 179, Issue 1, 1 May 2008, Pages 539–554, https://doi.org/10.1534/genetics.107.083246
- Share Icon Share
Abstract
AFLP markers are becoming one of the most popular tools for genetic analysis in the fields of evolutionary genetics and ecology and conservation of genetic resources. The technique combines a high-information content and fidelity with the possibility of carrying out genomewide scans. However, a potential problem with this technique is the lack of homology of bands with the same electrophoretic mobility, what is known as fragment-size homoplasy. We carried out a theoretical analysis aimed at quantifying the impact of AFLP homoplasy on the estimation of within- and between-neutral population genetic diversity in a model of a structured finite population with migration among subpopulations. We also investigated the performance of a currently used method (DFDIST software) to detect selective loci from the comparison between genetic differentiation and heterozygosis of dominant molecular markers, as well as the impact of AFLP homoplasy on its effectiveness. The results indicate that the biases produced by homoplasy are: (1) an overestimation of the frequency of the allele determining the presence of the band, (2) an underestimation of the degree of differentiation between subpopulations, and (3) an overestimation or underestimation of the heterozygosis, depending on the allele frequency of the markers. The impact of homoplasy is quickly diminished by reducing the number of fragments analyzed per primer combination. However, substantial biases on the expected heterozygosity (up to 15–25%) may occur with ∼50–100 fragments per primer combination. The performance of the DFDIST software to detect selective loci from dominant markers is highly dependent on the number of selective loci in the genome and their average effects, the estimate of genetic differentiation chosen to be used in the analysis, and the critical bound probability used to detect outliers. Overall, the results indicate that the software should be used with caution. AFLP homoplasy can produce a reduction of up to 15% in the power to detect selective loci.
THE amplified fragment length polymorphism (AFLP) technique (Voset al. 1995) is becoming one of the most popular methods in the fields of conservation and evolutionary genetics and ecology (Mueller and Wolfenbarger 1999; Bensch and Akesson 2005; Boninet al. 2007; Meudt and Clarke 2007), as it combines a high reproducibility and information content with the possibility of making genomewide screenings. Because of the anonymous nature of the fragments generated by the AFLP technique, however, one major concern is the incidence of size homoplasy due to the lack of homology of comigrating fragments. This implies that fragments of a given size migrating in a band may involve more than one locus of the genome and, therefore, the inferences obtained from the band can produce misleading conclusions.
Several empirical approaches have been used to estimate levels of homoplasy in AFLP data sets. A few studies have demonstrated the presence of homoplasy in AFLP data by sequencing comigrating fragments within the same individual or from different individuals of the same or different species (RouppevanderVoortet al. 1997; Peterset al. 2001; El-Rabeyet al. 2002; Rombautset al. 2003; Mechandaet al. 2004; Mendelson and Shaw 2005). Hansenet al. (1999) and O'Hanlon and Peakall (2000) developed a simplified method for detecting size homoplasy comparing the AFLP banding patterns resulting from several rounds of selective amplification using PCR primers differing in the number of selective nucleotides. This method revealed that the proportion of comigrating nonhomologous fragments within single individuals of sugar beet was 13% (Hansenet al. 1999) and as high as 100% for comparisons among pairs of individuals from distantly related taxa of Carduinae thistles (O'Hanlon and Peakall 2000). Moreover, it has been shown that in interspecific studies of Echinacea (Mechandaet al. 2004), Hordeum (El-Rabeyet al. 2002), and Carduinae (O'Hanlon and Peakall 2000), homoplasy increases with increasing taxonomic rank. Empirical approaches, however, are technically demanding and time-consuming, and only a small number of fragments have been examined (Meudt and Clarke 2007).
Computer simulations have allowed surveying a larger number of markers by modeling the distribution of fragment lengths and generating probabilities of comigration and homoplasy. The size distribution of AFLP fragments is strongly asymmetrical, with a much higher proportion of smaller than larger fragments, particularly so for genomes with low GC content (Innanet al. 1999; Vekemanset al. 2002; Koopman and Gort 2004; Gortet al. 2006). Thus, the possible impact of homoplasy is larger for small- than for large-size fragments. For example, Vekemanset al. (2002) analyzed AFLP fragments from samples of one population of Phaseolus lunatus and two populations of Lolium perenne and divided the observed fragments into four size categories (75–124, 125–199, 200–299, and 300–450 bp). From this experimental data and from simulated data, they found a highly significant negative correlation between fragment size and the frequency of the marker allele associated with the fragments, which suggests that there is an upward bias in the allelic frequency for the small-size fragments. From the empirical data they also showed that the estimates of nucleotide diversity within populations of each species and those of nucleotide divergence between populations of L. perenne increased with increasing fragment sizes, suggesting that homoplasy produces an underestimation of both within- and between-population diversity estimated from AFLP fragments.
Koopman and Gort (2004) found that size homoplasy determines a lack of concordance between the empirical AFLP fragment-length distributions and the predicted distributions generated in silico for the Arabidopsis thaliana genome. A general observation is that the larger the number of bands detected per primer combination of selective nucleotides, the larger the possible magnitude of homoplasy (Vekemanset al. 2002; Koopman and Gort 2004; Gortet al. 2006). This is in agreement with the observations of Althoffet al. (2007), who used a computer program that mimics the AFLP procedure to examine the homology of fragments in whole genomes from eight species representing a range of genome sizes. They found that, for a given primer combination, both the number of fragments and the degree of homoplasy increased with genome size. The average number of homoplasious fragments was 11% for small genomes (<400 Mb), but increased to 41.5% for large genomes (>2 Gb).
The empirical and simulation studies described above indicate that homoplasy is a real component of AFLP data sets. This suggests that homoplasy may produce biases in the estimation of genetic diversity from AFLP markers, but the magnitude and direction of the bias remains to be investigated in detail. In addition, the use of AFLPs to identify selective loci in genome scans is becoming an active area of research in evolutionary studies (Wildinget al. 2001; Storz and Nachman 2003; Campbell and Bernatchez 2004; Scotti-Saintagneet al. 2004; Storzet al. 2004; Storz and Dubach 2004; Acheréet al. 2005; Beaumont 2005; Nielsen 2005; Storz 2005; Boninet al. 2006; Jumpet al. 2006; Mealor and Hild 2006; Murray and Hare 2006; Milleret al. 2007; Papaet al. 2007; Nosilet al. 2008; Smithet al. 2008). Genomewide scans of selective loci generally use information on heterozygosity and genetic differentiation (Beaumont 2005; Storz 2005), and the putative biases in these parameters due to homoplasy could produce some misinterpretations. The purpose of this work was to carry out a comprehensive theoretical analysis on the impact of fragment-size homoplasy upon the estimation of genetic diversity within and between populations. We also investigated the performance of a currently used method (DFDIST software) to detect selective loci from the comparison between genetic differentiation and heterozygosis of dominant molecular markers, as well as the impact of AFLP homoplasy on its effectiveness.
ANALYTICAL AND SIMULATION PROCEDURE
An analytical procedure was used to simulate either neutral or selected AFLP fragments in a subdivided population. Genetic-diversity parameters of the population were calculated for all simulated AFLP fragments and for those that would actually be observed in an experimental setting, where fragments of the same size comigrate in the same electrophoretic band (homoplasy). This allowed us to assess the bias imposed by homoplasy on the estimation of the genetic-diversity parameters in the population. An independent method based on coalescent simulations of DNA sequences was used to check the analytical procedure. In the second part of the study, a program used to detect selective loci, based on estimates of population genetic diversity, was used to check the impact of homoplasy on the detection of selective AFLP markers. Each of these procedures is detailed below.
Allele frequency distributions for neutral loci:
Assignment to AFLP fragments:
True loci vs. experimental fragments:
Irrespective of the length of the fragments, all simulated loci were obviously identified individually. Thus, true fragment frequencies and differentiation could be calculated assuming Hardy–Weinberg equilibrium. This group of fragments is called true fragments or loci hereafter. In an experimental setting, however, PCR fragments (bands) may correspond to different loci with the same fragment length (homoplasy). Fragments detected as different PCR bands are called experimental fragments or bands hereafter. The number of experimental bands depends on the number of true loci and the distribution of fragment sizes (Equation 2). For example, if g = 0.5, for NL = 1000 true loci there are ∼345 experimental bands, whereas for NL = 200 true loci there are ∼151 experimental bands (see also Figure 4 of Vekemanset al. 2002 for a graph showing the expected number of detectable fragments as a function of the total number of true fragments). The number of true loci considered varied from 60 to 1000.
Coalescent simulations:
To check the previous analytical procedure by a different, independent method, simulations were carried out by the retrospective coalescent approach. The program SIMCOAL 2.1.2 (Excoffieret al. 2000; http://cmpg.unibe.ch/software/simcoal2/) was used to simulate DNA sequences corresponding to two subpopulations of equal size (2N = 1000 haploid individuals) separated for a period of 500 generations and carrying out reciprocal migration at a rate m per generation. Twenty sequences were sampled from each subpopulation and cut in fragments at the motifs GAATTC (enzyme EcoRI) and TTAA (enzyme MseI). The number of nucleotides simulated was chosen so as to produce a total number of ∼1000 fragments (true loci). In particular, 34,000 nucleotides were required (distributed in four chromosomes with 8600 nucleotides each). The default transition rate of 0.333 and no recombination rate within chromosomes were used. To compare this coalescent method and the previous approach for a particular common scenario, a migration rate m = 0.0008 between the two subpopulations and a mutation rate per nucleotide u = 0.00011 were found to produce an average allelic frequency of true loci of
Selective loci:
Identification of selective fragments by population comparisons:
To assess the impact of homoplasy on the identification of selective loci in AFLP genome scans, data obtained with the analytical approach above was analyzed using the DFDIST software (http://www.rubic.rdg.ac.uk/∼mab/stuff/), a modification for dominant markers of the software developed by Beaumont and Nichols (1996). Briefly, this program implements a Bayesian method developed by Zhivotovsky (1999) to estimate allelic frequencies from the proportion of recessive phenotypes in the sample. It also estimates the Weir and Cockerham (1984),FST between the subgroups defined in the sample. Coalescent simulations are then performed to generate an FST null sampling distribution from upon the neutral expectations. The simulated FST data are used to identify those loci that may not fit the neutral drift model given their unusually low or high FST values. The software DFDIST is the method most frequently used when using AFLPs (see Scotti-Saintagneet al. 2004; Acheréet al. 2005; Boninet al. 2006; Jumpet al. 2006; Mealor and Hild 2006; Murray and Hare 2006; Milleret al. 2007; Papaet al. 2007; Nosilet al. 2008; Smithet al. 2008).
The parameter conditions used in the DFDIST analyses were as follows:
The critical frequency for the most common allele was 0.99 (loci where the most frequent allele had a frequency ≥0.99 were excluded).
The scale for the Zhivotovsky (1999) parameters for estimating allele frequencies was 0.25 (it was checked that the estimation of frequencies was accurate with this parameter value).
The number of resamplings used to obtain the confidence intervals for outliers was 10,000 (several runs were made with 100,000 resamplings and results did not change).
The smoothing proportion used was 0.04.
The estimate of average FST to be used in the DFDIST simulations was obtained in three ways: (i) the average estimated FST calculated by the Ddatacal program (one of the programs of the Dfdist package); (ii) a trimmed mean FST (provided also by the Ddatacal program) obtained excluding 30% of the highest and 30% of the lowest FST values (this trimmed mean FST is supposed to be an estimate of the average “neutral” FST uninfluenced by outlier loci; Boninet al. 2006); and (iii) the median of FST values from the file data_fst_outfile provided by the Ddatacal program, after removing those loci for which the heterozygosity is <0.1. This was the procedure used in the simulations of Beaumont and Balding (2004).
Different values of the parameter 4Neμ were used (from 0.004 to 0.1), finding no differences in the performance of the method within this range. Thus, a value of 0.04 was used throughout.
The critical probability for detecting outliers was α = 5 or 0.05%.
To make fair comparisons between true and experimental data sets, a given number of runs were pooled in each case to provide the same number of true and experimental loci (bands) for analysis with the program DFDIST (see Table 1). For example, for the case with NL = 1000 true fragments, for which only ∼345 experimental fragments become available from each run because of fragment-size homoplasy, three simulation replicates (that would be equivalent to using three different primer combinations in a PCR setting) were pooled. From each replicate, all experimental fragments (∼345) and 345 randomly taken true fragments were used. Thus, ∼345 × 3 = 1035 true or experimental loci were analyzed by the DFDIST program and compared. It should be noted that although the proportion of the genome with selective loci is 5%, the 1035 experimental fragments implied 15% of selective loci because each experimental fragment represents on average three true loci (i.e., the 1035 experimental fragments actually correspond to ∼1035 × 3 = 3105 true loci). An analogous procedure was carried out for other numbers of true loci (see Table 1).
. | . | . | % of homoplasious bandsd . | . | Total no. of bands (true or experimental)e . | Expected no. of bands with at least a selective locus . | |
---|---|---|---|---|---|---|---|
NL per primera . | Selective loci (5%)b . | Bands per primerc . | No. of runs (primers) . | True bands . | Experimental bands . | ||
1000 | 50 | 345 | 74 | 3 | 1035 | 52 | 150 |
200 | 10 | 151 | 26 | 7 | 1057 | 53 | 70 |
100 | 5 | 87 | 14 | 12 | 1044 | 52 | 60 |
60 | 3 | 55 | 9 | 19 | 1045 | 52 | 57 |
. | . | . | % of homoplasious bandsd . | . | Total no. of bands (true or experimental)e . | Expected no. of bands with at least a selective locus . | |
---|---|---|---|---|---|---|---|
NL per primera . | Selective loci (5%)b . | Bands per primerc . | No. of runs (primers) . | True bands . | Experimental bands . | ||
1000 | 50 | 345 | 74 | 3 | 1035 | 52 | 150 |
200 | 10 | 151 | 26 | 7 | 1057 | 53 | 70 |
100 | 5 | 87 | 14 | 12 | 1044 | 52 | 60 |
60 | 3 | 55 | 9 | 19 | 1045 | 52 | 57 |
Number of true loci assumed per simulation replicate (equivalently to per primer combination).
Number of selective loci per primer combination (5% of NL).
Number of experimental bands observed per primer combination (average over 100 replicates assuming a GC content of g = 0.5).
Percentage of homoplasious experimental bands, i.e., percentage of experimental bands that include two or more true loci.
Product between number of bands per primer combination and number of primer combinations.
. | . | . | % of homoplasious bandsd . | . | Total no. of bands (true or experimental)e . | Expected no. of bands with at least a selective locus . | |
---|---|---|---|---|---|---|---|
NL per primera . | Selective loci (5%)b . | Bands per primerc . | No. of runs (primers) . | True bands . | Experimental bands . | ||
1000 | 50 | 345 | 74 | 3 | 1035 | 52 | 150 |
200 | 10 | 151 | 26 | 7 | 1057 | 53 | 70 |
100 | 5 | 87 | 14 | 12 | 1044 | 52 | 60 |
60 | 3 | 55 | 9 | 19 | 1045 | 52 | 57 |
. | . | . | % of homoplasious bandsd . | . | Total no. of bands (true or experimental)e . | Expected no. of bands with at least a selective locus . | |
---|---|---|---|---|---|---|---|
NL per primera . | Selective loci (5%)b . | Bands per primerc . | No. of runs (primers) . | True bands . | Experimental bands . | ||
1000 | 50 | 345 | 74 | 3 | 1035 | 52 | 150 |
200 | 10 | 151 | 26 | 7 | 1057 | 53 | 70 |
100 | 5 | 87 | 14 | 12 | 1044 | 52 | 60 |
60 | 3 | 55 | 9 | 19 | 1045 | 52 | 57 |
Number of true loci assumed per simulation replicate (equivalently to per primer combination).
Number of selective loci per primer combination (5% of NL).
Number of experimental bands observed per primer combination (average over 100 replicates assuming a GC content of g = 0.5).
Percentage of homoplasious experimental bands, i.e., percentage of experimental bands that include two or more true loci.
Product between number of bands per primer combination and number of primer combinations.
Analyses were carried out for different values of FST and mean selection coefficients (
RESULTS
Impact of homoplasy on neutral gene diversity and differentiation among subpopulations:
Table 2 compares the results obtained with the analytical method and the simulation (SIMCOAL) method with the objective of checking the former. The scenario corresponds to a population subdivided into two subpopulations (n = 2) of size N = 500 each and a migration rate such that the differentiation between subpopulations is FST = 0.1. The average allelic frequency in both cases was set up to be
. | Analytical approach . | Simulation approach (SIMCOAL) . | ||||
---|---|---|---|---|---|---|
. | True . | Experimental . | Bias . | True . | Experimental . | Bias . |
No. loci | 1000.0 | 345.5 | 1000.5 | 351.3 | ||
\({\bar{p}}\) | 0.100 | 0.251 | 2.503 | 0.102 | 0.249 | 2.445 |
HS | 0.147 | 0.286 | 1.941 | 0.140 | 0.266 | 1.894 |
FST | 0.100 | 0.090 | 0.893 | 0.100 | 0.092 | 0.925 |
. | Analytical approach . | Simulation approach (SIMCOAL) . | ||||
---|---|---|---|---|---|---|
. | True . | Experimental . | Bias . | True . | Experimental . | Bias . |
No. loci | 1000.0 | 345.5 | 1000.5 | 351.3 | ||
\({\bar{p}}\) | 0.100 | 0.251 | 2.503 | 0.102 | 0.249 | 2.445 |
HS | 0.147 | 0.286 | 1.941 | 0.140 | 0.266 | 1.894 |
FST | 0.100 | 0.090 | 0.893 | 0.100 | 0.092 | 0.925 |
Bias is the ratio between true and experimental estimates for the different parameters. Results are based on 100 replicates for each approach. Standard errors <0.002 for
. | Analytical approach . | Simulation approach (SIMCOAL) . | ||||
---|---|---|---|---|---|---|
. | True . | Experimental . | Bias . | True . | Experimental . | Bias . |
No. loci | 1000.0 | 345.5 | 1000.5 | 351.3 | ||
\({\bar{p}}\) | 0.100 | 0.251 | 2.503 | 0.102 | 0.249 | 2.445 |
HS | 0.147 | 0.286 | 1.941 | 0.140 | 0.266 | 1.894 |
FST | 0.100 | 0.090 | 0.893 | 0.100 | 0.092 | 0.925 |
. | Analytical approach . | Simulation approach (SIMCOAL) . | ||||
---|---|---|---|---|---|---|
. | True . | Experimental . | Bias . | True . | Experimental . | Bias . |
No. loci | 1000.0 | 345.5 | 1000.5 | 351.3 | ||
\({\bar{p}}\) | 0.100 | 0.251 | 2.503 | 0.102 | 0.249 | 2.445 |
HS | 0.147 | 0.286 | 1.941 | 0.140 | 0.266 | 1.894 |
FST | 0.100 | 0.090 | 0.893 | 0.100 | 0.092 | 0.925 |
Bias is the ratio between true and experimental estimates for the different parameters. Results are based on 100 replicates for each approach. Standard errors <0.002 for
Figure 2 shows the bias for the parameters estimated with the experimental fragments relative to the true loci for a range of values of genetic differentiation between subpopulations (FST) for a scenario of NL = 1000 true loci (continuous line) and 200 loci (broken line). These correspond to ∼345 and 151 experimental fragments, i.e., observed PCR bands (see Table 1). For 1000 true fragments the biases in the average allelic frequency and heterozygosity are very substantial and almost invariable for different values of FST. The bias in the estimated level of differentiation is maximal at low values of FST and decreases at higher ones. Note, however, that this bias is a relative one, and the absolute bias (absolute difference in FST between experimental and true fragments) is higher for intermediate values of FST. All biases are substantially lower with 200 true fragments, as the degree of homoplasy is relatively small in this case.
Figure 3 shows the bias of the experimental parameters relative to the true ones under a range of situations. The first column of graphs shows the biases for different values of the GC content (g). The continuous line refers to NL = 1000 true loci (345 bands) and the broken line to 200 loci (151 bands). The graphs show that the bias is generally larger for small values of g, as expected, but the variation is not very large. The second column of graphs shows results for different average allele frequencies in the whole set of subpopulations (
Because small-length fragments are more frequent (see Figure 1B), the bias in the estimates provided by the experimental fragments is larger the shorter the fragment length. This is illustrated in Figure 4, where the estimates of parameters from experimental fragments (open circles) and true loci (solid circles) are shown for fragments of different lengths. The left column of graphs corresponds to an average allelic frequency of
The previous results are somewhat conservative because they refer to genomes with a GC content of g = 0.5. For many species, however, the overall genome GC content is lower, e.g., rice (0.43), humans (0.41), Arabidopsis (0.35) (Yuet al. 2002), Caenorhabditis (0.35) (Steinet al. 2003), Drosophila (0.43) (Akashiet al. 1998), and the biases incurred can be larger. For example, for g = 0.35 and assuming
Detection of selective loci:
Assessment of the efficiency of the DFDIST software for detecting selective loci:
We first assessed the efficiency of the DFDIST software for detecting selective loci considering only the true loci, i.e., without yet introducing the problem of homoplasy. Figure 5 shows results for a variable proportion of selective loci assumed in the genome (from 0 to 10%), considering an average selection coefficient for the selective genes of
Figure 6 presents results analogous to those of Figure 5 but using a critical probability value of α = 0.05% to detect outliers. The percentage of outliers detected gives an approximately right estimate of the true proportion of selective loci in the genome only for scenarios with a large
Figure 7 extends the above results assuming a trimmed mean FST and 5% of selective loci in the genome for a wider range of average selection coefficients, a different value of neutral FST (0.1), and different sample sizes (M). The percentage of outliers detected with a critical probability of α = 5% is again ∼10% for FST = 0.025 (Figure 7A), irrespective of the average selection coefficient, and this value is reduced to ∼7% for FST = 0.1 (Figure 7D). Runs corresponding to a higher value of FST = 0.2 actually produce even lower proportions of outliers (data not shown). For a critical probability of α = 0.05 (Figure 7, G and J) the correct proportion of selective loci (5%) is predicted only for a large average selection coefficient (
Impact of homoplasy on the detection of selective loci:
Because the levels of heterozygosity and FST are affected by homoplasy (Figures 2–Figures 4), and these are the parameters used for the detection of selective loci from molecular data, we investigated the impact of homoplasy on the detection of selective loci by comparing the analyses of true and experimental fragments. Table 3 gives average parameters from true loci and estimated parameters from experimental fragments in a case where 5% of the true loci (i.e., 50 of 1000) have an average selection coefficient of
. | All loci . | Selected loci . | ||||
---|---|---|---|---|---|---|
. | True . | Experimental . | Bias . | True . | Experimental . | Bias . |
No. loci | 1000.0 | 346.2 | 50.0 | 46.3 | ||
\({\bar{p}}\) | 0.211 | 0.454 | 2.148 | 0.405 | 0.657 | 1.620 |
HS | 0.255 | 0.338 | 1.325 | 0.224 | 0.273 | 1.218 |
FST | 0.134 | 0.110 | 0.824 | 0.434 | 0.228 | 0.525 |
. | All loci . | Selected loci . | ||||
---|---|---|---|---|---|---|
. | True . | Experimental . | Bias . | True . | Experimental . | Bias . |
No. loci | 1000.0 | 346.2 | 50.0 | 46.3 | ||
\({\bar{p}}\) | 0.211 | 0.454 | 2.148 | 0.405 | 0.657 | 1.620 |
HS | 0.255 | 0.338 | 1.325 | 0.224 | 0.273 | 1.218 |
FST | 0.134 | 0.110 | 0.824 | 0.434 | 0.228 | 0.525 |
Five percent of the loci (50) are selective with a selection coefficient sampled from an exponential distribution with average
. | All loci . | Selected loci . | ||||
---|---|---|---|---|---|---|
. | True . | Experimental . | Bias . | True . | Experimental . | Bias . |
No. loci | 1000.0 | 346.2 | 50.0 | 46.3 | ||
\({\bar{p}}\) | 0.211 | 0.454 | 2.148 | 0.405 | 0.657 | 1.620 |
HS | 0.255 | 0.338 | 1.325 | 0.224 | 0.273 | 1.218 |
FST | 0.134 | 0.110 | 0.824 | 0.434 | 0.228 | 0.525 |
. | All loci . | Selected loci . | ||||
---|---|---|---|---|---|---|
. | True . | Experimental . | Bias . | True . | Experimental . | Bias . |
No. loci | 1000.0 | 346.2 | 50.0 | 46.3 | ||
\({\bar{p}}\) | 0.211 | 0.454 | 2.148 | 0.405 | 0.657 | 1.620 |
HS | 0.255 | 0.338 | 1.325 | 0.224 | 0.273 | 1.218 |
FST | 0.134 | 0.110 | 0.824 | 0.434 | 0.228 | 0.525 |
Five percent of the loci (50) are selective with a selection coefficient sampled from an exponential distribution with average
To assess the overall impact of homoplasy, Figure 8 presents results analogous to those of Figures 5–7, but comparing true and experimental fragments. The solid bars refer to the analysis of a set of ∼1000 fragments that are true loci, whereas the open bars refer to the corresponding analysis of ∼1000 fragments that are experimental fragments, hypothetically obtained with a variable number of primer combinations. Thus, for example, the first open bar in each graph (55 bands per primer for g = 0.5 or 50 for g = 0.35) would refer to a case where there are 19 primers producing 55 (or 50) experimental bands each (see Table 1). The second bar (87 or 76 bands per primer depending on g) corresponds to a case where there are 12 primers producing 87 (or 76) experimental bands each, and so on. Higher levels of homoplasy, corresponding to larger numbers of experimental bands, imply a slight overestimation of the percentage of outliers (Figure 8, A, D, and G) and an underestimation of the percentage of selective loci that are detected (Figure 8, C, F, and I). Homoplasy has no appreciable impact on the percentage of outliers carrying true selective loci (Figure 8, B, E, and H). The overestimation of the percentage of outliers detected can be understood from the information presented in Table 1. Note that a larger number of experimental bands implies a progressively larger number of true loci (NL). Thus, for example, for NL = 1000 (implying 345 experimental bands for g = 0.5 or 252 for g = 0.35), the 1000 experimental bands analyzed correspond to ∼3000 true loci, from which ∼150 are selected (5%), as in this case each band includes an average of three true loci of the same fragment length. Thus, ∼15% of the experimental fragments could include at least one selective true fragment. This explains why the percentage of outliers is larger for the experimental fragments than for the true loci.
DISCUSSION
AFLP homoplasy and gene diversity and differentiation among subpopulations:
Our results demonstrate that the biases due to fragment homoplasy can be rather high for the estimates of
Our simulations indicate that the bias associated with size homoplasy rapidly increases with the number of true loci. Voset al. (1995) recommended that, to avoid size homoplasy, the number of fragments or bands in an AFLP profile must be between 50 and 100, whereas Gortet al. (2006) suggested that even band numbers as low as 20 are not a guarantee of the absence of band homoplasy. These results are in line with an empirical study on the sugar beet by Hansenet al. (1999), who scored 456 bands in 16 AFLP lanes, giving an average of 28.5 bands per line. They reported that 13.2% of the bands were likely to be homoplasious. In our study, the comparison between the biases associated with the estimation of population parameters with a number of true loci ranging from 60 to 1000 indicates that ∼100 experimental fragments per primer combination may represent an upper limit to fragment homology among bands. From a literature review of 90 AFLP surveys on diverse animal, plant, fungi, and bacteria species (in the period 2004–2006), we found that the average number of fragments detected per primer combination ranges from 4 (Enjalbertet al. 2005) to 453 (Kalita and Malek 2006). If we take the median value of 58.8 as the more common number of detected fragments per primer combination, biases associated with the estimation of population parameters such as heterozygosity, FST and allele frequency are almost negligible in most studies. However, 18 (20%) studies showed an average number of fragments per primer combination >100 and are expected to have a high risk of biases due to size homoplasy (e.g., Laitunget al. 2004; Snället al. 2004; Albachet al. 2006; Fanget al. 2006; Kalita and Malek 2006; Kiddet al. 2006; Troellet al. 2006). Furthermore, as shown above, the overestimation of the expected heterozygosity can be substantial (15–25%) even for as low as 50–75 fragments per primer combination.
Vekemanset al. (2002) demonstrated that smaller-sized fragments are more likely to be homoplasious. Indeed, it is possible that the high degree of monomorphism often observed for small (<125 bp) fragments could be due to elevated levels of comigration of nonhomologous fragments (Vekemanset al. 2002; Busselet al. 2005). Consistent with this, our results show that the bias due to size homoplasy increases in the small-size classes (Figure 4), although there is no clear cutoff point. Thus, shorter fragments have lower quality data. Inclusion of fragment-length information in the estimation of heterozygosity,
An additional difficulty is that the relationship between fragment size and homoplasy bias is highly dependent on the average allele frequency, leading to either overestimates or underestimates of heterozygosity. For example, Vekemanset al. (2002) deduced that, for small-size fragments, homoplasy would produce underestimation of both within- and between-population genetic diversity. This is in agreement with the results observed in Figure 4 with high
Our analysis demonstrates a moderate influence of the genome GC content on the homoplasy bias. The enzymes employed in this study are a frequent cutter (MseI) and a rare cutter (EcoRI) commonly used in AFLP surveys. Because MseI cuts are more frequent than EcoRI cuts, the average AFLP fragment size will be determined mainly by the frequency of MseI cuts. The restriction motif of MseI contains no G + C nucleotides, thus increasing the contribution of the small-sized class to the proportion of nonhomologous comigrating bands at low GC contents. This effect has also been observed with experimental data and probably results from the isochore structure of the genome in combination with the restriction motif recognized by the enzymes (i.e., MseI will preferably cut in A + T-rich isochores; Koopman and Gort 2004). Vekemanset al. (2002) found that the substitution of the commonly used MseI enzyme by a restriction enzyme recognizing a sequence motif with a balanced GC content would make the AFLP size distribution insensitive to variation in GC content. Hence, the effect of GC content on the homoplasy bias will depend on the choice of the restriction enzyme and may be avoided by using enzymes containing a balanced GC content for the shortest restriction motif, such as, for example, TaqI (Vekemanset al. 2002). However, the impact of homoplasy cannot be removed altogether even if a set of enzymes is used that guarantees a flat distribution of fragment lengths. This can be seen in Figure 3. When g = 0.7, the distribution of fragment lengths is almost uniform for the EcoRI–MseI tandem (see Figure 1C) but, even in this case, it is predicted to overestimate
The model investigated assumes that the level of homoplasy depends only on the number of fragments evaluated (see Table 1) and, therefore, is independent of the level of differentiation between populations or the average allele frequency of the markers. It is however possible that increasing differentiation is associated with increasing homoplasy, as has been repeatedly observed empirically (O'Hanlon and Peakall 2000; El-Rabeyet al. 2002; Mechandaet al. 2004). O'Hanlon and Peakall (2000) showed a positive correlation of the proportion of nonhomologous bands with species divergence and a negative relationship between the number of comigrating fragments and DNA sequence divergence. For distantly related species, many of the fragments tend to be unique to a species rather than shared among species, and a portion of these would be homoplasious. Thus, as the number of comigrating AFLP bands decreased, the chance that they were nonhomologous increased. If this can be generalized, the impact of homoplasy on the bias of population genetic diversity among distantly related species could be of much higher magnitude than that among conspecific populations.
To avoid the effects of homoplasy on the estimates of neutral variation, Vekemanset al. (2002) suggested (i) avoiding fragments in the small-size classes, (ii) using fragments of known map position that should have documented Mendelian inheritance when possible, and (iii) increasing the number of selective nucleotides in the primer sequences to reduce the number of fragments and thus the probability of nonhomology. However, although the exclusion of fragments of small size (i.e., <125 bp) would help to reduce levels of homoplasy by approximately one-third (Gortet al. 2006), there could still be fragments of larger size that are homoplasious. Alternatively, mapping studies are technically demanding, time-consuming, and out of the scope of many AFLP surveys (Meudt and Clarke 2007). The reduction of the number of scored bands per AFLP profile is also in conflict with the consideration that very small numbers of bands are undesirable in the sense that they reduce the information content of the AFLP profile. Therefore, even for a low number of bands per lane, a considerable percentage of comigrating bands are still nonhomologous (Vekemanset al. 2002; Gortet al. 2006).
Detection of selective loci from gene diversity and differentiation:
On the basis of the classical Lewontin and Krakauer (1973) multilocus neutrality test, several related approaches have recently been proposed for the detection of selective loci in genome scans (Bowcocket al. 1991; Beaumont and Nichols 1996; Vitaliset al. 2001; Schlötterer 2002; Beaumont and Balding 2004). One of the most popular methods, and for which software to analyze dominant data is available (DFDIST), is that from Beaumont and Nichols (1996). Beaumont and Balding (2004) investigated, by computer simulations, the performance of the codominant version of the method (FDIST program) in comparison with a hierarchical-Bayesian approach, in a structured population scenario for a set of biallelic codominant loci predominantly neutral but some of them subject to directional or balancing (heterozygote advantage) selection. They concluded that both methods could identify loci subject to directional selection when the selection coefficient was at least five times the migration rate, but that neither method was able to identify loci subject to balancing selection. We extended the analysis of Beaumont and Balding (2004) by evaluating the performance of the DFDIST software, considering dominant markers, a distribution of variable selective effects, and a scenario of two subpopulations with different levels of neutral genetic differentiation, where a variable proportion of the loci (that was set up from 1 to 10% of the genome) are subject to directional selection in one subpopulation. We assessed a number of alternatives used in different empirical or theoretical studies:
A high critical probability threshold level of 5% vs. a much lower value of 0.05%: Most authors have considered critical values of 5% (Scotti-Saintagneet al. 2004; Acheréet al. 2005; Boninet al. 2006; Mealor and Hild 2006; Milleret al. 2007; Papaet al. 2007; Nosilet al. 2008; Smithet al. 2008) or 1% (Jumpet al. 2006; Murray and Hare 2006; Papaet al. 2007; Nosilet al. 2008).
The mean FST used to build the null model neutral distribution was a trimmed mean (used by Boninet al. 2006; Milleret al. 2007; Nosilet al. 2008; Smithet al. 2008), the raw (nontrimmed) mean (used by Scotti-Saintagneet al. 2004; Mealor and Hild 2006), or the median FST (used in simulations by Beaumont and Balding 2004).
We found that the potentiality of the method to estimate the proportion of the genome subject to selection through the proportion of outliers detected has some reliability only for the case where (i) the selection coefficients are very large (
It is highly remarkable that when a high critical probability threshold (α = 5%) is applied, the proportion of outliers detected is basically independent of the proportion of selective loci in the genome (Figure 5) as well as of the value of
In most studies the main objective of the genome scans is to detect true selective loci among the outliers. Our results suggest that a low critical probability (α = 0.05%) is necessary (Figures 5–7) for this objective to be achieved successfully, but the method is not reliable for small values of
Beaumont and Balding (2004) simulations indicated an appreciable discrimination for selective loci when the selective effect is larger than about five times the migration rate. Thus, they found that ∼60% of outliers were truly selective loci considering selection coefficients of s = 0.02, and this rate of success was improved up to 80–100% for selective effects between 0.05 and 0.2 (results deduced from their Table 3). This result is in agreement with our observations. For a neutral FST = 0.025 the migration rate was m = 0.006, so that selective effects larger than 5m = 0.03 should be more likely detected. Figure 7H indicates that the percentage of selective loci in outliers is close to 100% for
We also found that subpopulation sample size has some impact on the detection of selective loci, but this is limited beyond M = 40 individuals or so (Figure 7, M–O). This is in agreement with the conclusion of Beaumont and Nichols (1996) that there is little improvement in the performance of the FDIST method by having sample sizes larger than 25 diploids per subpopulation.
We can conclude that the detection of selective loci among outliers depends enormously on the conditions of the analysis and the proportion and effects of the selective loci. Using low critical probability threshold values (e.g., α = 0.05%) and a trimmed or the median FST seem the most appropriate procedures. We have not investigated the performance of an iterative process to obtain the mean FST of the null neutral model, suggested by Beaumont and Nichols (1996) (see Beaumont and Balding 2004), whereby the raw mean FST is first calculated, then the outliers removed, and the process repeated until no outliers are detected (see, e.g., Papaet al. 2007 for an empirical application). However, we have seen that a high degree of success is accomplished only when the proportion of selective loci in the genome is large and their average effect is also large. If these circumstances do not hold, the number of false positives can be very substantial and, therefore, the sequential removal of outliers is unlikely to work successfully. Another more reliable way to detect truly selective loci may be by combining the results of different tests (Vasemägiet al. 2005; Boninet al. 2006), but this has yet to be checked by further analyses and simulations.
Impact of homoplasy on the detection of selective loci:
The impact of homoplasy on the identification of selective loci was investigated assuming different numbers of loci and, consequently, of experimental fragments analyzed per primer combination (Figure 8). The main observation is that homoplasy causes a slight overestimation in the proportion of outliers and a more substantial underestimation in the proportion of selective loci detected. This latter can reach up to 15% when the number of experimental bands per primer combination is on the order of ≥75–100 (Figure 8, C, F, and I). Thus, this aspect inherent to AFLP data should be taken into consideration in genomewide scans of selective loci.
Two additional important issues should be taken into account when using AFLP fragments in the detection of selective loci. First, in the cases of substantial homoplasy, some bands detected as outliers, even if they really correspond to true selective genes, do not necessarily carry a single selective fragment but they can also carry one or more neutral fragments. Therefore, a further experimental analysis of these bands should be carried out to disentangle selective and neutral comigrating fragments. The second issue is that the above threshold range refers to the number of experimental bands observed at the population level. If sample sizes are very small and the frequency of the allele producing the band is also low, the number of experimental bands observed can be much lower than the real one in the population, producing a false impression that the impact of homoplasy in the study will be low. For example, assuming genotypic frequencies at Hardy–Weinberg equilibrium and being
Footnotes
Communicating editor: J. B. Walsh
Acknowledgement
We thank Juan Galindo and two anonymous referees for useful comments on the manuscript. This work was funded by the Ministerio de Educación y Ciencia and Fondos Feder (CGL2006-13445-C02/BOS, CGL2004-03920/B0S), Plan Estratégico del Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (CPE03-004-C2), and Xunta de Galicia.
References
Acheré, V., J. M. Favre, G. Nesnard and S. Jeandroz,
Akashi, H., R. M. Kliman and A. Eyre-Walker,
Albach, D. C., P. Schönswetter and A. Tribsch,
Althoff, D. M., M. A. Gitzendanner and K. A. Segraves,
Beaumont, M. A.,
Beaumont, M. A., and D. J. Balding,
Beaumont, M. A., and R. A. Nichols,
Bensch, S., and M. Akesson,
Bonin, A., P. Taberlet, C. Miaud and F. Pompanon,
Bonin, A., D. Ehrich and S. Manel,
Bowcock, A. M., J. R. Kidd, J. L. Mountain, J. M. Hebert, L. Carotenuto et al.,
Bussel, J. D., M. Waycott and J. A. Chappill,
Campbell, D., and L. Bernatchez,
El-Rabey, H. A., A. Badr, R. Schaffer-Pregl, W. Martin and F. Salamini,
Enjalbert, J., X. Duan, M. Leconte, M. S. Hovmøller and C. de Vallavielle-pope,
Excoffier, L., J. Novembre and S. Schneider,
Fang, J., J. Tao and C. T. Chao,
Fernández, B, A. García-Dorado and A. Caballero,
Gort, G, W. J. M. Koopman and A. Stein,
Hansen, M., T. Kraft, M. Christiansen and N.-O. Nilsson,
Innan, H., R. Terauchi, G. Kahl and F. Tajima,
Jump, A. S., J. M. Hunt, J. A. Martínez-Izquierdo and J. Peñuelas,
Kalita, M., and W. Malek,
Kidd, M. R., C. E. Kidd and T. D. Kocher,
Koopman, W. J. M., and G. Gort,
Laitung, B., E. Chauvet, N. Feau, K. Fève, L. Chikhi et al.,
Lewontin, R. C., and J. Krakauer,
Mealor, B. A., and A. L. Hild,
Mechanda, S. M., B. R. Baum, D. A. Johnson and J. T. Arnason,
Mendelson, T. C., and K. L. Shaw,
Meudt, H. M., and A. C. Clarke,
Miller, N. J., M. Ciosi, T. W. Sappington, S. T. Ratcliffe, J. L. Spencer et al.,
Mueller, U. G., and L. L. Wolfenbarger,
Murray, M. C., and M. P. Hare,
Nei, M.,
Nielsen, R.,
Nosil, P., S. P. Egan and D. J. Funk,
O'Hanlon, P. C., and R. Peakall,
Papa, R., E. Bellucci, M. Rossi, S. Leonardi, D. Rau et al.,
Peters, J. L., H. Constandt, P. Neyt, G. Cnops, J. Zethof et al.,
Rombauts, S., Y. van de Peer and P. Rouzé,
Rouppe van der Voort, J. N., P. van Zandvoort, H. J. van Eck, R. T. Folkertsma, R. C. Hutten et al.,
Schlötterer, C.,
Scotti-Saintagne, C., S. Mariette, I. Porth, P. G. Goicoechea, T. Barreneche et al.,
Smith, T. B., B. Milá, G. F. Grether, H. Slabbekoorn, I. Sepil et al.,
Snäll, T., J. Fogelqvist, P.J. Ribeiro and M. Lascoux,
Stein, L. D., Z. Bao, D. Blasiar, T. Blumenthal, M. R. Brent et al.,
Storz, J. F.,
Storz, J. F., and J. M. Dubach,
Storz, J. F., and M. W. Nachman,
Storz, J. F., B. A. Payseur and M. W. Nachman,
Troell, K., A. Engstrom, D. A. Morrison, J. G. Mattsson and J. Hoglund,
Vasemägi, A., J. Nilsson and C. R. Primmer,
Vekemans, X., T. Beauwens, M. Lemaire and I. Roldán-Ruiz,
Vitalis, R., K. Dawson and P. Boursot,
Vos, P., R. Hogers, M. Bleeker, T. van de Lee, M. Hornes et al.,
Weir, B., and C. Cockerham,
Wilding, C. S., R. K. Butlin and J. Grahame,
Wright, S.,
Yu, J., S. Hu, J. Wang, G. K.-S. Wong, S. Li et al.,
Zhivotovsky, L. A.,