Abstract
A new method for assigning individuals of unknown origin to populations, based on the genetic distance between individuals and populations, was compared to two existing methods based on the likelihood of multilocus genotypes. The distribution of the assignment criterion (genetic distance or genotype likelihood) for individuals of a given population was used to define the probability that an individual belongs to the population. Using this definition, it becomes possible to exclude a population as the origin of an individual, a useful extension of the currently available assignment methods. Using simulated data based on the coalescent process, the different methods were evaluated, varying the time of divergence of populations, the mutation model, the sample size, and the number of loci. Likelihood-based methods (especially the Bayesian method) always performed better than distance methods. Other things being equal, genetic markers were always more efficient when evolving under the infinite allele model than under the stepwise mutation model, even for equal values of the differentiation parameter Fst. Using the Bayesian method, a 100% correct assignment rate can be achieved by scoring ca. 10 microsatellite loci (H ≈ 0.6) on 30–50 individuals from each of 10 populations when the Fst is near 0.1.
INTRASPECIFIC or infraspecific taxonomy has long been based on genetic studies using moderately variable markers (Ayala 1975; Nei 1975; Nevo 1978; Nei and Graur 1984). Populations have been discriminated mainly through differences in allele frequencies and occasionally through diagnostic alleles. Nowadays, molecular technology provides almost unlimited numbers of highly variable markers (e.g., microsatellite loci) for analyzing the genetic variability of populations and the evolutionary structure of species (Jarne and Lagoda 1996; Estoup and Angers 1998). Although it is still possible to use methods based on allele frequencies, the high number of alleles segregating at these loci quickly differentiates all individuals, even with a small number of loci and a large number of individuals analyzed.
Following the suggestion of Vrana and Wheeler (1992), one is thus tempted to consider the individual, not the population, as the ultimate taxon, replacing allele frequencies by individual multilocus genotypes. Considering that individuals from the same population will have more similar genotypes, one can evaluate the way in which they cluster on the basis of their genotypes. The first examples of this approach are the analyses performed by Bowcock et al. (1994) and Estoup et al. (1995), who transposed phylogenetic-like methods to the individual level to study the genetic differentiation of populations. The resulting neighbor-joining trees exhibited clusters of individuals belonging to different infraspecific taxa such as subspecies, populations, or even subfamilies. They provided an idea or a measure [through the index of classification of Estoup et al. (1995)] of how well the individuals of a given taxon were clustered. They could also be used to see which individuals were correctly classified in their population, although that was not among the objectives of these studies.
In contrast, the question of assigning an individual to a population was specifically addressed in several other studies with purposes such as classifying individual fish (Taylor et al. 1994) or honey bees (Cornuet et al. 1996), evaluating population differentiation in polar bears (Paetkau et al. 1995), comparing dispersal rates between sexes in shrews (Favre et al. 1997), or detecting recent immigration in humans (Rannala and Mountain 1997). Thus, assigning an individual to a given group on the basis of its genotype can have a wide range of applications in population genetics and the range extends to several other fields such as forensics, conservation genetics, and stock management (reviewed in Waser and Strobeck 1998).
In the few examples cited above, a rather large variety of methods have been used, which can be grouped into two categories. The first category includes “general methods,” i.e., methods that can be applied to almost any kind of data. Two methods of this category, discriminant analysis and neural networks, have been used and their performances compared by Taylor et al. (1994) and Cornuet et al. (1996), with somewhat contradictory results. The second category includes “genetic methods,” which can be applied only to genotype data (Paetkau et al. 1995; Favre et al. 1997; Rannala and Mountain 1997). All are based on the likelihood that the multilocus genotype of the individual to be assigned occurs in each of two or more candidate taxa. These methods have proven to be quite effective for various applications (Waser and Strobeck 1998). However, the computation of the likelihood relies on two explicit assumptions: loci should be at Hardy-Weinberg equilibrium and at linkage equilibrium. Thus, it would be useful to develop other “genetic methods” based on less restrictive assumptions, and hence not subject to the same limitations, and to compare their performances to those of the existing methods.
A common limitation of all existing assignment methods is that they always designate a single population as the probable source of the individual being assigned. They simply answer the question: among these particular populations, which is the most likely to be the individual's population of origin? If the population of origin of the individual is not represented in the set of reference populations, the methods will still designate a (wrong) population of origin. In other words, the existing genetic methods do not provide any clear indication of the confidence we can put in the designated population. In some contexts, it can be more important to exclude a given population than to designate a most likely one. The questions of the confidence in the choice or the exclusion of a population would be solved if we had a measure of the probability that the individual belongs to a population. We might then exclude a population because the probability that the individual belongs to it is lower than a given threshold.
In this article, we propose a new assignment method based on genetic distances between an individual and a population without the assumptions of Hardy-Weinberg and linkage equilibrium. The performance of this method is evaluated for different genetic distances and is compared to the assignment methods of Paetkau et al. (1995) and Rannala and Mountain (1997). For all these genetic methods, we also present an extension that provides a measure of the probability that an individual belongs to a given population.
METHODS
We first present three methods for assigning individuals to populations. Each assignment method has a corresponding “exclusion” method; because the principle of exclusion is common to all of them, it is discussed only once. All methods are based on the knowledge of the multilocus genotypes of representative samples taken from the candidate populations and of the individual(s) to be assigned.
Assignment methods: The frequency method: This method, first presented by Paetkau et al. (1995), assigns an individual to the population in which the individual's genotype is most likely to occur. Suppose that J independent loci have been typed in the I reference populations and in the individuals to be assigned. The frequency of allele k at locus j in population i is pijk. Assuming Hardy-Weinberg equilibrium, the likelihood of a genotype AkAk′ occurring in the ith population at the jth locus is proportional to (pijk)2 if k = k′ and to 2 pijk pijk′ otherwise. Because the J loci are assumed to be independent, the likelihood of a multilocus genotype occurring in a given population is the product of likelihoods for each locus. Then, the method consists of three steps: (i) computing the required allelic frequencies in all candidate populations; (ii) computing the likelihoods of the individual's multilocus genotype occurring in each population; and (iii) assigning the individual to the population in which the likelihood of the individual's genotype is the highest.
In addition to the two assumptions of Hardy-Weinberg equilibrium and independence of loci, there is a third, unstated assumption: that the allelic frequencies deduced from the population samples are close to their exact values. A particular case, noted by Paetkau et al. (1995), arises when one allele in an individual is absent from a candidate population sample. In this case, the estimate of the corresponding allelic frequency in the candidate population is equal to zero, leading to a likelihood of zero and hence eliminating this population de facto. However, the allele in question may simply be rare in the population, so that it was not represented in the sample; this should not a priori eliminate the population. To circumvent this drawback of the method, Paetkau et al. (1995) suggested systematically adding the individual to all population samples, thus eliminating null frequencies. On their web site (http://www.biology.ualberta.ca/jbrzusto/Doh.html), they propose two other ways of dealing with this difficulty: null frequencies can be replaced either by a (small) constant value or by the inverse of the number of gene copies sampled in each population.
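As an illustration only, the following Python sketch (function and variable names are hypothetical, not taken from the authors' program) computes the frequency-method likelihood of a multilocus genotype in each candidate population, replacing null sample frequencies by a small constant as in one of the options above.

    # Sketch of the frequency method; hypothetical names, not the authors' code.
    # freqs[i][j] is a dict {allele: sample frequency} for locus j in population i;
    # genotype[j] = (k, kprime) is the tested individual's genotype at locus j.

    def frequency_method_likelihoods(genotype, freqs, floor=0.01):
        """Likelihood of the multilocus genotype in each candidate population."""
        likelihoods = []
        for pop in freqs:                              # loop over candidate populations
            lik = 1.0
            for j, (k, kp) in enumerate(genotype):
                pk = pop[j].get(k, 0.0) or floor       # replace null frequencies by a floor
                pkp = pop[j].get(kp, 0.0) or floor
                lik *= pk * pk if k == kp else 2.0 * pk * pkp
            likelihoods.append(lik)
        return likelihoods

    # The individual is assigned to the population with the highest likelihood,
    # e.g., max(range(len(freqs)), key=lambda i: likelihoods[i]).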
The Bayesian method: This second method, similar to the previous one, is largely inspired by Rannala and Mountain (1997), who used a Bayesian approach to detect immigrants by using multilocus genotypes (more precisely, the Bayesian approach concerns essentially the derivation of the probability density of population allele frequencies from sample allele frequencies). Assuming an equal prior probability density for the allelic frequencies of each locus in each population, Rannala and Mountain showed that the marginal probability of observing an individual with genotype AkAk′ at locus j in population i was equal to (formula 9 of Rannala and Mountain)

2(nijk + 1/Kj)(nijk′ + 1/Kj)/[(nij + 1)(nij + 2)] if k ≠ k′ (1)

and

(nijk + 1/Kj)(nijk + 1/Kj + 1)/[(nij + 1)(nij + 2)] if k = k′, (2)

where nijk is the number of copies of allele k observed at locus j in the sample from population i, nij = Σk nijk is the corresponding number of gene copies sampled, and Kj is the number of alleles at locus j.
This method is performed in the same way as the frequency method by simply replacing the formulas for computing the likelihood of a genotype by the above formulas 1 and 2. Note that the difficulty raised by null frequencies in the previous method disappears here because of the coefficient 1/Kj, which results from the computations.
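A minimal Python sketch of this computation (hypothetical names; only an illustration of formulas 1 and 2 above) is the following, where counts_i[j] holds the sample allele counts nijk at locus j in population i and K[j] is the number of alleles at locus j.

    # Sketch of the Bayesian genotype probability (formulas 1 and 2 above);
    # illustrative names, not the authors' code.

    def bayesian_likelihood(genotype, counts_i, K):
        """Marginal probability of a multilocus genotype in one population."""
        prob = 1.0
        for j, (k, kp) in enumerate(genotype):
            n_j = sum(counts_i[j].values())            # n_ij, gene copies sampled
            nk = counts_i[j].get(k, 0)
            denom = (n_j + 1.0) * (n_j + 2.0)
            if k == kp:                                 # homozygote, formula 2
                prob *= (nk + 1.0 / K[j]) * (nk + 1.0 / K[j] + 1.0) / denom
            else:                                       # heterozygote, formula 1
                nkp = counts_i[j].get(kp, 0)
                prob *= 2.0 * (nk + 1.0 / K[j]) * (nkp + 1.0 / K[j]) / denom
        return prob

    # An allele absent from the sample (nk = 0) still receives a nonzero
    # probability through the 1/K[j] term, as stated in the text.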
The distance method: Whereas the two previous methods are based on the probability of observing a given genotype in the various reference populations, the distance method assigns the individual to the “closest” population. Because data here are genotypes, the distance will be a genetic distance. Numerous genetic distances have been defined (cf. Nei 1987 for a review). Most of them are interpopulation distances (e.g., Nei's distances, Cavalli-Sforza and Edwards chord distance, etc.), and at least one is an interindividual distance (DAS, shared allele distance; Chakraborty and Jin 1993). However, we need distances between an individual and a population. Thus, for the interpopulation distances, we simply considered the individual to be assigned as a sample of two genes (possible values of allelic frequencies are 0, 0.5, and 1). For the shared allele distance, the distance of an individual to a population was taken as the average of distances between the individual and the members of the population sample.
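As an illustration of the interindividual option, the sketch below (hypothetical names) computes the shared allele distance between two multilocus genotypes and then averages it over the population sample to obtain an individual-to-population distance; the allele-sharing count per locus follows the usual definition of the DAS distance.

    # Sketch of the shared allele distance (DAS) of an individual to a population;
    # hypothetical names. A genotype is a list of (allele, allele) pairs, one per locus.

    def das(geno1, geno2):
        """Shared allele distance between two multilocus genotypes."""
        shared = 0.0
        for (a, b), (c, d) in zip(geno1, geno2):
            # alleles shared at this locus: sum of the minimum copy numbers per allele
            s = sum(min((a, b).count(x), (c, d).count(x)) for x in set((a, b)))
            shared += s / 2.0
        return 1.0 - shared / len(geno1)

    def individual_to_population_das(individual, sample):
        """Average DAS between the individual and each member of the population sample."""
        return sum(das(individual, member) for member in sample) / len(sample)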
Apart from providing a different basis for conducting assignments, the distance method can be adapted to different categories of genetic markers. For instance, some distances have been defined especially for microsatellites, such as the (δμ)2 of Goldstein et al. (1995) or the distance of Shriver et al. (1997). We could then expect better results with some distances than with others, depending on how the markers evolve. Finally, note that in contrast to the previous methods, the distance-based methods do not require Hardy-Weinberg equilibrium or absence of linkage disequilibrium among loci. By that, we only mean that there is no reference to these two hypotheses in the computation of distances (in contrast with the likelihood-based methods); these hypotheses may still be required if the genetic distances are used for other purposes, such as estimating divergence times.
Exclusion methods: The above three methods have two characteristics in common: (i) they are based on a criterion relating the individual to each population (e.g., a genetic distance between the individual and the population), the best candidate population being the one with the highest/lowest value of the criterion; and (ii) they always designate a population to which the individual can be assigned, because there is always a most likely or a closest population in any reference set. However, the set of reference populations may not include the true population of origin of the individual. Therefore, a measure of confidence that the individual truly belongs to a given population is needed. This can be achieved by comparing the value of the criterion of the individual (relative to the given population) with values of the criterion for individuals that belong to the population. More precisely, we need to locate the criterion value of the individual within the distribution of values for individuals of the population. If the individual's criterion is well outside the distribution, it seems logical to consider that the individual does not belong to the population. Furthermore, the proportion of the distribution with values “worse” (higher for a distance criterion or lower for a probability/likelihood criterion) than the tested individual's value can be considered as a measure of the probability that this individual belongs to the population. For instance, suppose that the distance between the tested individual and the population is 0.9 and that 97% of the distribution of distances between population individuals and the population is <0.9. We consider that the probability that the tested individual belongs to the population is only 3%. Note that it is possible to use an exclusion method as an assignment method (the individual is assigned to the population for which its probability of belonging is the highest).
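A minimal sketch of this computation (hypothetical names) is given below: given the distribution of the criterion for genotypes of the population (its computation is discussed in the next paragraph), the probability of belonging is the proportion of that distribution that is “worse” than the tested individual's value, i.e., larger for a distance criterion and smaller for a likelihood criterion.

    # Sketch of the exclusion test; hypothetical names.

    def probability_of_belonging(value, within_pop_values, higher_is_worse=True):
        """Proportion of the criterion distribution 'worse' than the tested value."""
        if higher_is_worse:     # distance criterion
            worse = sum(1 for v in within_pop_values if v >= value)
        else:                   # likelihood criterion
            worse = sum(1 for v in within_pop_values if v <= value)
        return worse / len(within_pop_values)

    # A population is excluded when this probability falls below a chosen
    # threshold (0.01 and 0.001 are used in the simulations below).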
The question is how to compute the distribution of the chosen criterion in each population. Taking only the individuals sampled in each population cannot provide the appropriate distribution, because the number of genotype combinations becomes very large even with a moderate number of loci and alleles (e.g., five loci with three alleles each give 7776 possible diploid genotypes). One way to generate this distribution without examining all genotype combinations (weighted by their probability of occurrence) is to simulate multilocus genotypes by randomly drawing alleles according to their frequencies in the population. However, only the frequencies in population samples are known. A first option is simply to use the population sample frequencies, and this is what was done in the following. One can also follow Rannala and Mountain (1997). This amounts to replacing the allelic frequencies pijk (= nijk/nij) by (nijk + 1/Kj)/(nij + 1) when drawing the first of the two alleles (at locus j in population i) and by (nijk + 1/Kj + 1)/(nij + 2) or (nijk′ + 1/Kj)/(nij + 2) when drawing the second allele, according to whether the first allele drawn was allele k or another allele, respectively (formulas 23, 24, and 25 in Rannala and Mountain 1997). Figure 1 provides examples of such distributions of the criterion (here minus the decimal logarithm of the genotype likelihood according to the Bayesian method). In Figure 1, A and B, we considered two populations that diverged 200 and 2000 generations ago, respectively. The histograms on the left represent the distribution of the log-likelihood that multilocus genotypes, simulated according to population 1 allele frequencies, occur in population 1, whereas the histograms on the right represent the distribution of the log-likelihood that genotypes simulated according to population 2 allele frequencies occur in population 1. Only the left histograms are computed and used in actual analyses. The right histograms are shown here only as examples of genotypes of alien populations that may (or may not) be excluded by the method.
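The genotype simulation itself can be sketched as follows (hypothetical names): the first allele is drawn with weight nijk + 1/Kj and the second with an additional +1 for the allele drawn first; the denominators nij + 1 and nij + 2 are constant across alleles and can therefore be omitted from the weights.

    # Sketch of simulating one multilocus genotype from one population sample,
    # following the adjusted frequencies of Rannala and Mountain (1997,
    # formulas 23-25); hypothetical names, not the authors' code.
    import random

    def simulate_genotype(counts_i, K):
        genotype = []
        for j, locus_counts in enumerate(counts_i):     # counts_i[j] = {allele: n_ijk}
            alleles = list(locus_counts)
            w1 = [locus_counts[a] + 1.0 / K[j] for a in alleles]
            first = random.choices(alleles, weights=w1)[0]
            w2 = [locus_counts[a] + 1.0 / K[j] + (1.0 if a == first else 0.0)
                  for a in alleles]
            second = random.choices(alleles, weights=w2)[0]
            genotype.append((first, second))
        return genotype

    # Repeating this, e.g., 1000 times and computing the criterion for each
    # simulated genotype yields the reference distribution used in the exclusion test.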
Figure 1.—Examples of distributions of the assignment criterion from simulated genotype data. A and B correspond to the situation in which two populations diverged 200 and 2000 generations ago, respectively. Parameter values were as follows: 10 independent loci; mutation rate, 0.0005; effective population size, 1000; sample size, 50 diploid individuals; mutation model, IAM. In A and B, the histogram on the left represents the distribution of the log-likelihood that genotypes, simulated according to population 1 allele frequencies, occur in population 1, whereas the histogram on the right represents the distribution of the log-likelihood that genotypes simulated according to population 2 allele frequencies occur in population 1.
Simulation procedures: To compare the performances of the different methods, we generated samples from 10 populations by simulating the coalescent process of genes and then making diploid genotypes by pairing gene copies at random within a population. This allows the generation of population samples while controlling various factors such as mutation rates, effective population sizes, time of divergence, sample sizes, and mutation model of markers. To simulate the coalescent process of genes in more than one population, we followed the method of Simonsen et al. (1995).
Data files were simulated with 10 populations diverging simultaneously from a common ancestral population. Each of the 11 populations (10 observed + 1 ancestral) was modeled with an effective population size of 1000 diploid individuals. The mutation rate was set to 0.0005 (an average value for microsatellites, reviewed in Estoup and Angers 1998) for all loci. These loci were independent and evolved according to one of two classical mutation models, the infinite allele model (IAM, Kimura and Crow 1964) and the stepwise mutation model (SMM, Ohta and Kimura 1973). The data files were set up with 20 loci, but some analyses considered only the first 5 or 10 of these 20 loci. In some analyses, the times of population divergence were set to three different values (20, 200, and 2000 generations) and the sample sizes were equal to 10, 30, or 90 diploid individuals per population.
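The two mutation models differ only in how a mutation transforms an allele, as in the simple sketch below (hypothetical names; the coalescent machinery itself is not shown): under the IAM every mutation creates an allele never seen before, whereas under the SMM it adds or removes one repeat unit.

    # Sketch of the two mutation models applied to a single gene copy; hypothetical names.
    import itertools, random

    _new_allele = itertools.count(start=10**6)       # supplies never-seen allele labels

    def mutate(allele, model):
        if model == "IAM":
            return next(_new_allele)                 # infinite allele model: brand-new allele
        if model == "SMM":
            return allele + random.choice((-1, 1))   # stepwise mutation model: +/- one repeat
        raise ValueError("unknown mutation model")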
In some other analyses, the performance of assignment/exclusion methods was assessed as a function of the Fst parameter, a widely used measure of interpopulation genetic differentiation (Wright 1951). Data files were simulated with increasing times of divergence according to a geometric progression (ratio = 2^0.25) starting at 10 generations (10, 11, 14, 16, 20, 22, 28, 32, 40, …, 48,709). For each data file, Fst was estimated through the θ estimator of Weir and Cockerham (1984). The resulting relationships were approximated through a logistic regression using Statistica/w 5.0 (StatSoft 1997).
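As an illustration of the curve fitting (the original analysis used Statistica; the function and parameter names below are hypothetical), a logistic curve relating the assignment score to Fst could be fitted in Python as follows.

    # Sketch of fitting a logistic curve to (Fst, assignment score) points;
    # hypothetical names, assumes numpy and scipy are available.
    import numpy as np
    from scipy.optimize import curve_fit

    def logistic(fst, a, b):
        return 1.0 / (1.0 + np.exp(-(a + b * fst)))

    def fit_score_vs_fst(fst_values, scores):
        """Return the fitted parameters (a, b) of the logistic curve."""
        params, _ = curve_fit(logistic, np.asarray(fst_values, dtype=float),
                              np.asarray(scores, dtype=float), p0=(-2.0, 30.0))
        return params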
RESULTS
Comparison of assignment methods: In the first analysis, we considered all possible combinations of mutation models, sample sizes, numbers of loci, and times of divergence (i.e., 2 × 3 × 3 × 3 = 54 combinations). The corresponding data files were analyzed with four assignment methods when loci evolved under the IAM and five methods when they evolved under the SMM. The five methods included the frequency method, the Bayesian method, and three different distance methods based on the shared allele distance, the Cavalli-Sforza and Edwards chord distance, and the (δμ)2 of Goldstein et al. (1995). The latter distance method, developed for SMM loci, was not considered for data files based on the IAM.
The performance of a given method was measured as the average proportion of individuals correctly assigned to their population across 50 data files, each including from 100 to 900 individuals. All individuals were tested using the “leave one out” procedure (Efron 1983); i.e., they were individually excluded from their population sample when performing their assignment. For the frequency method, null allele frequencies were systematically set to 0.01. This value is arbitrary but corresponds approximately to the precision obtained on allele frequencies with the average sample size [87 genes = (20 + 60 + 180)/3] per population in our simulations.
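The scoring procedure can be sketched as follows (hypothetical names): the tested individual is removed from its own population sample before assignment, and the score is the proportion of individuals returned to the correct population.

    # Sketch of the "leave one out" scoring of an assignment method; hypothetical names.
    # samples[i] is the list of genotypes sampled from population i;
    # assign(genotype, samples) returns the index of the designated population.

    def leave_one_out_score(samples, assign):
        correct = total = 0
        for i, sample in enumerate(samples):
            for n, individual in enumerate(sample):
                reduced = list(samples)
                reduced[i] = sample[:n] + sample[n + 1:]   # drop the tested individual
                correct += (assign(individual, reduced) == i)
                total += 1
        return correct / total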
Figure 2.—Percentage of individuals assigned to the correct population (y-axis) as a function of the assignment method (x-axis) under various conditions. On the x-axis, methods noted B, F, C, D, and G correspond to the Bayesian, frequency, Cavalli-Sforza and Edwards (1967) chord distance, shared allele distance, and Goldstein et al.'s (1995) (δμ)2 distance methods, respectively. Each individual dot is the average value over 50 data sets made of 10 simulated population samples each (see text for details). (Left) Loci evolving under the IAM model of mutation; (right) loci evolving under the SMM model of mutation. The two upper graphs correspond to a sample size of 10 diploid individuals per population, the two center graphs to a sample size of 30, and the two lower graphs to a sample size of 90. Within each graph, the curves on the left were obtained with 5 loci, the central curves with 10 loci, and the curves on the right with 20 loci. Finally, the thin lines correspond to a divergence time of 20 generations (Fst ≈ 0.01), the medium lines to 200 generations (Fst ≈ 0.08), and the bold lines to 2000 generations (Fst ≈ 0.30).
The results are summarized in Figure 2. With 10 populations, we expect ~10% of individuals to be correctly assigned by chance alone; this is the base line corresponding to random assignment. Note that for populations that diverged recently (20 generations ago in our conditions, resulting in Fst ≈ 0.01), curves are close to the base line. When populations have sufficiently diverged, scores can reach 100% even with as few as 10 loci and 10 individuals sampled per population. Loci evolving under the IAM give markedly higher assignment scores than those evolving under the SMM. Increasing the number of loci increases the performance of any method (when possible), and the same is true for the sample size.
For loci evolving under the IAM, the Bayesian method provides the best scores, followed by the frequency method, the chord distance method, and the DAS distance method, in that order. The largest difference between scores (Bayesian vs. DAS method) amounts to 30.7% (with 90 individuals/population, 5 loci, and 200 generations of divergence). For loci evolving under the SMM, Goldstein et al.'s (1995) distance method always has the lowest scores. The performance ranking of the other four methods is unchanged, the largest difference among their scores being 14.6% (90 individuals/population, 10 loci, 200 generations of divergence).
Figure 3.—Relationship between the percentage of individuals correctly assigned and the Fst estimates for different assignment methods and different mutation models. For each combination (method × mutation model), 50 data sets of 10 population samples of 30 diploid individuals characterized at 10 loci were simulated with increasing divergence times, allowing a range of Fst values between 0 and 0.35. The relationship was approximated through a logistic regression (R > 0.995 for all curves except the curve noted G, for which R = 0.93). Thick lines correspond to loci evolving under the IAM and thin lines to loci evolving under the SMM. Curves noted B, F, C, D, and G correspond to the Bayesian, frequency, Cavalli-Sforza and Edwards chord distance, shared allele distance, and Goldstein et al.'s (δμ)2 distance methods, respectively.
In a second analysis, we examined the relationship between the proportion of individuals correctly assigned and the genetic differentiation among populations measured by the Fst coefficient. We simulated 50 data files, each containing 10 populations represented by a sample of 30 diploid individuals scored at 10 loci. The time of divergence varied among data files according to a geometric progression allowing a range of Fst values between 0 and 0.35. Figure 3 summarizes the relationships between the two quantities (percentage of correctly assigned individuals and Fst) for the different assignment methods and for loci evolving under the IAM or the SMM. To keep the figure readable with all nine possible combinations, point values were replaced by logistic regressions that closely fit the data [R > 0.995 for every combination except Goldstein et al.'s (δμ)2 distance method, where R = 0.93].
The relative performance of the methods is the same in Figure 3 as in Figure 2: for any value of Fst, the best score is obtained with the Bayesian method, followed by the frequency method and the distance methods, always in the same order (chord then DAS). For SMM loci, Goldstein et al.'s (1995) (δμ)2 distance method falls far below the other methods. Surprisingly, for any given Fst value and whatever the assignment method, scores are always better when loci evolve under the IAM than under the SMM. This suggests that the differentiation measured by Fst is not sufficient to predict the score of an assignment method, which is also sensitive to the way the loci evolve. Figure 3 also shows that a perfect assignment can be obtained with Fst values as moderate as 0.1 (in the case where all populations have diverged at the same time and are represented by at least 30 individuals scored at 10 loci evolving under the IAM). There are some ranges of Fst values for which the choice of the assignment method can be critical. For instance, under the conditions of Figure 3, ~85% of individuals will be correctly assigned on average with the Bayesian method whereas <50% will be with the DAS distance method, when the Fst is ~0.05 and all 10 scored loci evolve under the IAM.
Comparison of exclusion methods: For assignment methods, a simple parameter such as the proportion of correctly assigned individuals gives a good idea of their performance. For exclusion methods, we need more parameters. First, note that the output of an exclusion method can take different forms, whereas the output of an assignment method is a single population (which is simply the right or a wrong origin for the individual). An exclusion method gives (i) the probability that the individual belongs to a given population (or to any reference population in a data base) and (ii) the list of the populations for which the probability of belonging is at least equal to a given threshold. For this list, a first possible outcome is that all populations are excluded, i.e., the probability of belonging is below the threshold for all populations. A second possible outcome is that only the correct population is listed as the potential origin. A third possibility is that a single population is listed, but not the correct one. In a fourth case, more than one population is listed, including the correct one, and in the fifth case, more than one population is listed but not the correct one. These five possible outcomes can be classified into two overlapping groups of errors, named here A and E. Type A errors are cases in which the correct population is absent from the list (first, third, and fifth cases). Type E errors are cases in which one or more erroneous populations appear in the list (third, fourth, and fifth cases). Depending on the situation, one may want to preferentially minimize either type A or type E errors.
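A minimal sketch of this classification (hypothetical names) is shown below; for one tested individual, kept is the list of populations whose probability of belonging reaches the threshold and true_pop is its actual population of origin.

    # Sketch of scoring type A and type E errors for one tested individual; hypothetical names.

    def exclusion_errors(kept, true_pop):
        """Return (type_A, type_E) flags for one exclusion outcome."""
        type_a = true_pop not in kept                  # correct population excluded (cases 1, 3, 5)
        type_e = any(pop != true_pop for pop in kept)  # a wrong population retained (cases 3, 4, 5)
        return type_a, type_e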
Figure 4, A and B, provides the frequency of both types of errors over a subset of the combinations of parameters used in Figure 2. In Figure 4, we dropped the shortest time of divergence (20 generations), for which assignment scores are very low, keeping only 200 and 2000 generations as divergence times. The threshold for excluding a population was arbitrarily fixed at 0.01 (Figure 4A) or 0.001 (Figure 4B). Each individual was tested using the leave one out procedure as above. The distribution of the criterion (genotype likelihood or genetic distance) was established by simulating 1000 individuals. Because this computation is very time-consuming, errors were estimated over only 10 simulated data files for each data point on the figures. This may appear to be a low number (five times fewer than for the assignment methods), but it still represents a total of 1000, 3000, or 9000 individuals because each file contains 10 populations, each represented by 10, 30, or 90 individuals, respectively.
Considering first the type A errors, results are almost identical for both times of divergence. Thus, the type A error rate does not seem to be influenced by the amount of divergence among populations. While this error rate is always very low for SMM loci, it can reach very high levels for IAM loci. With the IAM, it is very sensitive to the sample sizes but less sensitive to the number of scored loci. At first sight, it may be surprising that, when sample sizes are small, the error increases with the number of loci (Figure 4, A and B; 10 individuals/population). One possible explanation is that, discarding the tested individual (leave one out procedure), frequency estimates are more biased with small samples and combining the information from more loci increases the overall bias in the exclusion criterion and hence raises the type A error. This seems consistent with the observation that increasing the sample size is very efficient in reducing the type A error. Moreover, the likelihood-based methods, which have the best assignment scores, are more efficient in excluding populations but will also more often exclude the correct one because of the aforementioned bias, and hence have the largest type A error rates. Note that large type A error rates correspond most generally to cases where the list of possible populations is empty. The comparison of Figure 4, A and B, shows that lowering the threshold of exclusion also decreases this kind of error, as expected. In summary, to lower type A errors, one can (i) lower the threshold, (ii) employ SMM loci rather than IAM loci, (iii) use a distance method, and (iv) increase the sample sizes. The latter condition, alone, is sufficient to get negligible type A errors.
Type E errors are much influenced by the time of divergence and the mutation model of loci. After 2000 generations of divergence for IAM loci, this type of error becomes negligible even with small sample sizes and a small number of loci. At 200 generations for SMM loci, errors are maximal (>75%). In the other two combinations, (2000 generations and SMM loci) and (200 generations and IAM loci), errors vary more or less widely with the method, but lowest errors are logically obtained with the best method (Bayesian). The difference of errors between the DAS distance method and the Bayesian method can be as high as 85%. The errors logically decrease when the number of individuals and/or the number of loci increase(s) and when the threshold increases.
As expected, type A and type E errors do not respond in the same direction when parameter values (e.g., time of divergence, sample size, number of loci, mutation model) change. However, at least when populations have diverged for enough time, it is possible to jointly minimize both types of errors by sampling at least 50 individuals per population, scoring at least 10 loci, choosing IAM-like loci if possible, and using the Bayesian method. With SMM-like loci, such as microsatellites, a similar result requires more individuals and/or more loci (e.g., 70–90 individuals per population and 15–20 loci).
To get a better idea of the influence of the time of divergence of populations on type E errors, an analysis was performed with varying values of Fst (Figure 5). As in Figure 3, simulations were performed with 10 populations, 10 loci, and 30 diploid individuals per population. Relationships between type E errors and Fst were approximated through logistic regressions (R > 0.995 for all methods except the one based on Goldstein et al.'s distance for which R = 0.93). There appear to be large differences among methods and between the two mutation models, with the same relative performance as in the assignment methods.
Maximizing assignment scores: When populations are not highly differentiated, e.g., when Fst < ~0.1, the performance of assignment methods always improves with larger population samples and larger numbers of loci. But an important practical question is whether it is more efficient to increase the former or the latter. If the total number of analyses (e.g., PCR analyses) is limited by economic constraints, what is the most efficient combination of sample size and number of loci? Figure 6 provides a tentative answer for a scenario in which, e.g., 240 analyses (individual × locus) can be conducted per population, using the Bayesian assignment method. The figure shows that the most efficient combination varies with the degree of population divergence. With a very low Fst (0.01, curve 1 in Figure 6), the best combination is 8 loci and 30 individuals scored per population. With increasing Fst's, one should reduce the sample size and increase the number of loci. For instance, when Fst is near 0.025 (curve 2), the optimal number of loci is in the range of 15–20 (12–16 individuals sampled per population), and when Fst ≥ 0.05 (curves 3, 4, and 5), it is in the range of 20–30 loci (with as few as 8–12 individuals sampled per population). However, whenever Fst is large (e.g., curve 5, Fst = 0.225), many combinations, from (10 loci × 24 individuals per population) to (48 loci × only 5 individuals per population), equally provide a 100% correct assignment.
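The trade-off explored in Figure 6 can be set up as in the short sketch below (hypothetical names and bounds): enumerate the (number of loci, sample size) combinations compatible with a fixed budget of analyses per population, each of which would then be scored by simulation with the Bayesian method.

    # Sketch of enumerating (loci, individuals) combinations under a fixed budget
    # of analyses per population; hypothetical names and bounds.

    def budget_combinations(budget=240, min_loci=5, min_individuals=5):
        combos = []
        for loci in range(min_loci, budget // min_individuals + 1):
            individuals = budget // loci           # largest affordable sample size
            if individuals >= min_individuals:
                combos.append((loci, individuals))
        return combos

    # Each (loci, individuals) pair would then be evaluated by simulating data sets
    # and computing the leave-one-out assignment score, as done for Figure 6.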
DISCUSSION
A first conclusion is that differences, sometimes quite large, exist among the “genetic” assignment/exclusion methods. The distance-based methods performed less well than the likelihood-based methods, among which the Bayesian method was the most efficient in all cases studied. Among the genetic distances, we chose to study only three. Additional classical distances proposed by Nei or Nei et al. (standard, minimum, and DA; Nei 1987) were assayed in preliminary studies. Among them, DA gave the best results; however, none of them performed better than the DAS and the chord distance studied here (results not shown). In addition, distances such as Nei's standard distance, which can reach infinity if the individual has no allele in common with the population, were a priori inappropriate. The superiority of likelihood-based methods over distance methods is clear in our study. But it is worth stressing that the way data were simulated fulfills exactly the two assumptions on which both likelihood-based methods are based, i.e., Hardy-Weinberg proportions at all loci and no linkage disequilibrium. It would be interesting to explore situations in which the two assumptions are not fulfilled. Distance methods might indeed perform better in such situations because their computations rely on neither assumption. However, some preliminary tests were performed in which, at all loci, one allele (drawn at random) was treated as a “null” allele, inducing an excess of homozygotes (and hence a deviation from Hardy-Weinberg equilibrium in most populations). Even in these cases, the likelihood-based methods still produced higher scores than the distance methods.
Figure 4.—Type A and E errors (y-axis) as functions of the assignment/exclusion methods (x-axis) under various conditions. On the x-axis, methods noted B, F, C, D, and G correspond to the Bayesian, frequency, Cavalli-Sforza and Edwards chord distance, shared allele distance, and Goldstein et al.'s (δμ)2 distance methods, respectively. Each individual dot is the average value over 10 data sets, each containing 10 simulated populations (see text for details). A corresponds to an exclusion threshold of 0.01 and B to an exclusion threshold of 0.001. (Left) Loci evolving under the IAM model of mutation; (right) loci evolving under the SMM model of mutation. The two top graphs correspond to a sample size of 10 diploid individuals per population, the two middle graphs to a sample size of 30, and the two bottom graphs to a sample size of 90. Within each graph, the curves on the left were obtained using 5 loci, the central curves with 10 loci, and the curves on the right with 20 loci. The medium lines/empty symbols correspond to a divergence time of 200 generations and the bold lines/filled symbols to 2000 generations. Type A errors are noted with circles and type E errors with squares. Type A errors correspond to cases in which the correct population is absent from the list of populations considered as the possible origin of the tested individual, and type E errors to cases in which at least one erroneous population appears in this list.
In all our simulations, we considered a single value for the mutation rate and the effective population size, resulting in a rather constant gene diversity close to 0.67 [= M/(1 + M), with M = 4Neμ = 4 × 1000 × 0.0005 = 2] for IAM markers and close to 0.55 [= 1 − (1 + 2M)^−0.5] for SMM markers. This somewhat arbitrary choice is justified by the usefulness of microsatellites for conducting assignment methods (Waser and Strobeck 1998) and by the fact that most microsatellites have gene diversity levels of 0.50–0.70 in many natural populations (Estoup and Angers 1998). However, not all microsatellites have a mutation rate of 0.0005, and other types of markers with different gene diversities can also be used. Preliminary simulation studies with different mutation rates indicate that assignment scores are much influenced by the variability of markers, the best scores being obtained with the most variable markers (for an equal value of Fst; J. M. Cornuet, unpublished results). This result agrees with that of Estoup et al. (1998), who observed a much higher assignment score with the frequency method when using highly variable microsatellite markers than when using moderately variable allozymes, although there was no significant difference between the Fst's computed for each class of markers.
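These two expected gene diversities follow directly from the equilibrium formulas quoted above, as in this small check.

    # Worked check of the expected gene diversities quoted above, with M = 4*Ne*mu.
    M = 4 * 1000 * 0.0005                   # = 2
    h_iam = M / (1 + M)                     # infinite allele model: ~0.667
    h_smm = 1 - (1 + 2 * M) ** -0.5         # stepwise mutation model: ~0.553
    print(round(h_iam, 3), round(h_smm, 3))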
When populations with identical Ne's have diverged for a given number of generations, the level of differentiation is higher for loci evolving under the IAM than for those evolving under the SMM (for a given mutation rate) because homoplasy is absent under the IAM and present under the SMM. Because homoplasy reduces differences among taxa, it is logical that, all else being equal, IAM markers provide better assignment scores than SMM markers. However, the large difference in the performance of the assignment/exclusion methods between the two types of loci for the same Fst value was rather unexpected. A possible explanation is that assignment methods are sensitive to the distributions of allele frequencies, which differ according to the mutation model of the locus. For instance, with equal mutation rates and effective population sizes, IAM loci have higher heterozygosities than SMM loci (e.g., 0.66 vs. 0.55 on average in our conditions; cf. previous paragraph). If, as already observed, assignment scores increase with the variability of the markers (measured here by the heterozygosity), then the more variable IAM loci will provide better assignment scores. Note that microsatellite markers are considered to follow a rather SMM-like model of evolution, with possible size constraints that can increase homoplasy (reviewed in Estoup and Cornuet 1999). Additional studies are needed to evaluate the relative influence of the mutation model and the mutation rate on the performance of assignment methods.
Figure 5.—Relationship between type E errors and the parameter Fst for different exclusion (P < 0.01) methods and different mutation models. For each combination (method × mutation model), 50 data sets of 10 population samples of 30 diploid individuals characterized at 10 loci were simulated with increasing divergence times. The relationship was approximated through a logistic regression (R > 0.995 for all curves except the curve noted G, for which R = 0.93). Thick lines correspond to loci evolving under the IAM and thin lines to loci evolving under the SMM. Curves noted B, F, C, D, and G correspond to the Bayesian, frequency, Cavalli-Sforza and Edwards chord distance, shared allele distance, and Goldstein et al.'s (δμ)2 distance methods, respectively. Type E errors correspond to cases in which at least one erroneous population appears in the list of populations considered as the possible origin of the tested individual.
However, even with the imperfections mentioned above, knowledge of the Fst value for a set of populations should provide a useful prediction of the performance of assignment methods. The range of Fst's for which the methods perform well with reasonable sample sizes and numbers of loci (Fst ≥ 0.05; Figure 5) is within the range found among many natural populations (e.g., mountain sheep, Forbes and Hogg 1999; bears, Paetkau et al. 1997; wolves, Roy et al. 1994; fish, Estoup et al. 1998; bees, Estoup et al. 1995; see also Lugon-Moulin et al. 1999 for a review). For example, our simulations suggest that a 100% correct assignment rate can be achieved by scoring 10 loci (with H ≈ 0.6) on 30–50 individuals from each of 10 populations when the Fst is ~0.1. Good assignment scores can also be obtained for lower values of Fst, but they will require larger samples of individuals and loci. However, achieving 100% accuracy (i.e., zero error rates) with the exclusion methods will require >20 loci and 50 individuals, especially when the threshold for excluding a population is very low (e.g., P < 0.001) and when Fst ≈ 0.1 among 10 populations (Figure 4). Such low thresholds provide a very high certainty of correct exclusion and will be necessary for some forensic applications (e.g., convicting poachers). Fewer loci will be sufficient for applications requiring less stringent exclusion thresholds. More research is needed to quantify the power of the exclusion and assignment methods when the number of populations differs from 10.
Figure 6.—Influence of the number of loci on the performance of the Bayesian assignment method when the product (number of loci × population sample size) is constant and equal to 240. As in all other analyses, we considered 10 populations having diverged simultaneously from a common ancestral population. Curves noted 1–5 correspond to different divergence times: 20, 50, 100, 300, and 1000 generations, respectively. Each data point used to construct the curves is the average of 100 simulated data sets. (Left) IAM; (right) SMM.
Assignment/exclusion methods can be conducted in two different ways, according to whether or not the individuals to be assigned belong to the reference set. The two ways generally differ in their objectives, but there is no conceptual difference in the assignment procedure provided the leave one out option is chosen. In the Introduction, we gave some examples of studies performed with assignment methods, but a larger scope of applications is likely to develop. In population genetics, many statistics (e.g., gene diversity, fixation indices, genetic distances, etc.) are computed from allele frequencies that are estimated from population samples. A basic and seldom-tested assumption is that population samples do not include “abnormal” individuals. Assignment methods applied to individuals from a reference population using the leave one out procedure can help detect such abnormalities in samples. These abnormal data can result from errors in individual records. They can also correspond to immigrants or their descendants, in which case the question is whether immigration is artificial or natural. When immigration is natural and sufficiently low, assignment tests can be used to estimate dispersal rates in natural populations via direct methods (i.e., computing the proportion of individuals that are identified as immigrants) while simultaneously estimating dispersal rates indirectly (e.g., by estimating Nm from Fst in the island model of migration). The importance of simultaneously using both the direct and the indirect methods has been thoroughly discussed by several authors (Slatkin 1987; Neigel 1997).
Another potential use of assignment methods in population genetic studies is in measuring population differentiation. The estimation of the level of differentiation is classically performed with the Fst parameter. Figure 3 clearly illustrates the fact that the proportion of correctly assigned individuals varies between 1/p (with p populations) and 1.0 when the Fst takes low values (e.g., 0 < Fst < 0.05). Could the assignment score then be used as a differentiation index, especially useful in cases of low differentiation? There seem to be several obstacles to the direct use of this parameter. First, we showed a large influence of the sample sizes, the number of loci, and the mutation model of markers. Second, we suspect an effect of the level of variability of markers. Third, the range of levels of differentiation for which the proportion of correctly assigned individuals would be useful seems quite narrow.
In addition to purely scientific objectives, assignment/exclusion methods may have various practical applications. Identifying immigrants or their descendants would be useful in conservation biology for detecting both (unwanted) introgression of foreign genes and (desirable) reproduction by transplanted or natural immigrants that can help maintain a population's genetic variation and evolutionary potential. Assignment methods can also be useful in crop pest management to identify the origin of a newly introduced pest, or in forensic science to identify the origin of illegally killed animals or illegally obtained plant and animal parts, and thereby help prosecute poachers and minimize poaching.
Acknowledgments
This research was supported by a grant from the French Bureau des Ressources Génétiques. We thank D. Paetkau and two anonymous referees for critically reading the manuscript and providing helpful suggestions. A computer program (GeneClass), written in Delphi v4 professional (for Windows 95), performs all the computations required to apply the assignment/exclusion methods described in this article. The executable version is available at http://www.ensam.inra.fr/URLB.
Footnotes
- Communicating editor: G. B. Golding
- Received February 8, 1999.
- Accepted August 16, 1999.
- Copyright © 1999 by the Genetics Society of America