## Abstract

Microsatellite DNA loci or short tandem repeats (STRs) are abundant in eukaryotic genomes and are often used for constructing phylogenetic trees of closely related populations or species. These phylogenetic trees are usually constructed by using some genetic distance measure based on allele frequency data, and there are many distance measures that have been proposed for this purpose. In the past the efficiencies of these distance measures in constructing phylogenetic trees have been studied mathematically or by computer simulations. Recently, however, allele frequencies of 783 STR loci have been compiled from various human populations. We have therefore used these empirical data to investigate the relative efficiencies of different distance measures in constructing phylogenetic trees. The results showed that (1) the probability of obtaining the correct branching pattern of a tree (*P*_{C}) is generally highest for *D*_{A} distance; (2) *F*_{ST}*, standard genetic distance (*D*_{S}), and give similar *P*_{C}-values, *F*_{ST}* being slightly better than the other two; and (3) (δμ)^{2} shows *P*_{C}-values much lower than the other distance measures. To have reasonably high *P*_{C}-values for trees similar to ours, at least 30 loci with a minimum of 15 individuals are required when *D*_{A} distance is used.

BECAUSE microsatellite DNA loci or short tandem repeat (STR) loci are highly polymorphic, they are very useful for studying the evolutionary relationships of closely related populations or species. However, a number of statistical problems should be solved before this approach can be used effectively. Some of the major problems are: (1) What kind of distance measures should be used for constructing phylogenetic trees?, (2) How many loci should be used?, and (3) How many individuals per locus should be used? Some of these problems have been studied theoretically or by computer simulation with specific mathematical models (*e.g.*, Zhivotovsky and Feldman 1995; Takezaki and Nei 1996; Zhivotovsky *et al.* 2001; Kalinowski 2002). However, the assumptions for constructing mathematical models are not always realistic, and for this reason these studies may give wrong conclusions. Fortunately, a large number of data about STR loci have recently been accumulated from various human populations (Ramachandran *et al.* 2005; Rosenberg *et al.* 2005). We have therefore decided to study the above problems using these empirical data.

## MATERIALS AND METHODS

#### Microsatellite data:

Genotypic data of 783 STR loci of 1027 individuals from 53 populations that were used by Ramachandran *et al.* (2005) for studying isolation by distance were downloaded from http://rosenberglab.bioinformatics.med.umich.edu/data/ramachandranEtAl2005/combinedmicrosats-1027.stru. Allele frequencies were computed by excluding missing data. Information regarding the STR loci such as the unit of nucleotide repeats and genomic locations of STR loci was obtained from http://rosenberglab.bioinformatics.med.umich.edu/data/rosenbergEtAl2002/diversityloci.txt and by internet search including databases of Marshfield (http://marshfieldclinic.org/research/pages/index.aspx) and UniSTS at NCBI (http://www.ncbi.nlm.nih.gov/). These data were essentially the same as the microsatellite data used by Rosenberg *et al.* (2005), although Rosenberg *et al.*'s data included insertion/deletion polymorphism data and data from the Surui population in Brazil. In our data set of 783 STR loci, there were 45 dinucleotide, 175 trinucleotide, 555 tetranucleotide, and 8 pentanucleotide repeat loci. Examining 1056 individuals from 52 populations, Zhivotovsky *et al.* (2003) studied the extent of variation in 377 STR loci, of which 45 were dinucleotide, 58 were trinucleotide, and 274 were tetranucleotide repeat loci. The same data set was also used by Rosenberg *et al.* (2002). The dinucleotide repeat loci used by Zhivotovsky *et al.* (2003) were the same as those in this study, and their trinucleotide and tetranucleotide repeat loci were a subset of the loci used in this study.

Genomic locations were available for ∼300 loci, but only a few pairs of loci were within a genomic distance of 1 Mbp. Therefore, the extent of linkage disequilibrium between the loci seems to be negligible.

The heterozygosity for a locus was calculated by , where is the frequency of the *i*th allele in the sample, and *n* is the number of diploid individuals examined at the locus. In this study null alleles were excluded when allele frequencies were computed. Therefore, the actual 2*n* value could be smaller than two times the number of individuals examined. The average heterozygosity (*H*) for *r* loci was estimated by , whereas the sampling variance of *H* is computed by , where (Nei 1987). The expected value of heterozygosity [*E*(*H*)] in an equilibrium population is for the infinite-allele model (IAM) (Kimura and Crow 1964) and for the stepwise mutation model (SMM) (Ohta and Kimura 1973), where *M* = 4*Nv*, and *N* and *v* are the effective population size and mutation rate, respectively. We also examined the distribution of the average of the number (*n*_{a}) of alleles per locus within populations.

#### Genetic distance measures:

We examined the statistical properties of several genetic distance measures that are often used in actual data analysis. Most of them were originally developed for classical genetic markers such as blood group and isozyme loci that approximately follow the IAM, in which a new allele is always created by mutation. The *D*_{A} distance (Nei *et al.* 1983) is defined aswhere *x _{ij}* and

*y*are the frequencies of the

_{ij}*i*th allele at the

*j*th locus in populations

*X*and

*Y*, respectively, and

*m*is the number of alleles at the

_{j}*j*th locus. Cavalli-Sforza and Edwards (1967) proposed the chord distance (

*D*

_{C}) defined byIn the computer simulation by Takezaki and Nei (1996), the

*D*

_{A}and

*D*

_{C}distances showed the highest probabilities of obtaining the correct tree. However, since

*D*

_{A}performed slightly better than

*D*

_{C}in constructing phylogenetic trees, we used only

*D*

_{A}in this study.

Nei's (1972) standard genetic distance (*D*_{S}) is given bywhere and are the average homozygosities over the loci for populations *X* and *Y*, respectively, and . Under the IAM with mutation–drift balance, *E*[*D*_{S}], the expectation of *D*_{S}, is given by 2*vt*, where *v* is a mutation rate per locus per generation and *t* is the time measured in generations after the two populations diverged. Therefore, *D*_{S} is expected to increase linearly with time under the IAM.

The *F*_{ST}* distance proposed by Latter (1972) is as follows: (Latter 1972; Reynolds *et al.* 1983) was also proposed as a distance measure. However, our previous studies showed that the probabilities of obtaining correct tree topologies were very similar for *D*_{L} and *F*_{ST}* or slightly lower for *D*_{L} than for *F*_{ST}*. Therefore, *D*_{L} is not considered in this study. In this study we examine the following distance measure related to *F*_{ST}*, which was not included in our previous computer simulation:*F*_{ST} in the above formula is equivalent to Nei's (1973) *G*_{ST} (Slatkin 1995), and *G*_{ST} can be estimated by *F*_{ST}* in the case of two populations. Therefore, *F*_{ST}′ was computed by *F*_{ST}*/(1 − *F*_{ST}*) in this study. *F*_{ST}′ was originally proposed as a measure of isolation by distance (Slatkin 1993; Rousset 1997). It was shown to increase approximately linearly with time after the two populations separated for the case of low mutation rate (Slatkin 1995).

Some distance measures were developed specifically for STR loci in which the number of short nucleotide repeats was assumed to change following the SMM. Distance measure (δμ)^{2} was developed for this model in which the number of nucleotide repeats (*i*) changes to *i* − 1 or *i* + 1 with an equal probability (Goldstein *et al.* 1995). It is defined bywhere and are the average repeat numbers of alleles at the *j*th locus, and *x _{ij}* and

*y*are the frequencies of the allele with repeat number (allele size)

_{ij}*i*at the

*j*th locus in populations

*X*and

*Y*, respectively. Under the SMM with mutation–drift balance, the expected value of (δμ)

^{2}is 2

*vt*, where

*v*is a mutation rate per locus per generation and

*t*is the time measured in generations after the two populations diverged. Therefore, (δμ)

^{2}is expected to increase linearly with time under the SMM.

#### Measures of the efficiency of phylogenetic construction:

In this study, phylogenetic trees were constructed by the neighbor-joining method (Saitou and Nei 1987) with the genetic distance measures considered above. We examined the efficiencies of phylogenetic construction of different genetic distance measures by the nonparametric bootstrap approach. In this approach a certain number of loci (10, 20, 30, 50, 100, 300, 500, and 700) were chosen at random with replacement from the actual STR data set and neighbor-joining trees were constructed with the different distance measures mentioned above. In the examination of the effect of sample size on the efficiency of phylogenic construction, 5, 10, 15, 20, or 25 individuals were chosen at each locus, but in the other cases all the individuals were used. To examine the accuracy of a constructed tree, we need a true tree or a model tree with which the topology of the tree constructed can be compared. In the case of computer simulation, this model tree can be predetermined as in the case of Takezaki and Nei's (1996) study. In a study with empirical data there is no way to know the true tree that would reflect the historical relationship of the populations used. We have therefore decided to use a crude model tree based on the biogeographical distribution of populations and the consensus tree constructed by using all 783 STR loci with all distance measures used. Such a convergent topology is presented in Figure 1, and we used this tree topology as a model tree. We are aware of many arguments that can be raised against this approach. Interestingly, however, the same tree topology as this one was obtained by all distance measures when all STR loci were used. Therefore, this tree is not an arbitrary one but is supported by both biogeographical and genetic data. As far as the STR data are concerned, this is the tree to which all tree estimates are supposed to converge as the number of loci increases.

To measure the accuracy of a reconstructed tree, we computed the average of the topological distances (*d*_{T}) (Robinson and Foulds 1981; Nei and Kumar 2000) of constructed trees from the model tree, the probability (*P*_{C}) of obtaining the correct topology, and the probabilities (*P*_{1}) of obtaining the groupings (partitions) of the populations separated by each interior branch of the model tree with 10,000 replications. It should be noted that *d*_{T} is twice the number of partitions in a constructed tree different from those in the model tree. In this study we did not consider branch length errors, because it was difficult to determine the true or model branch lengths in the present case.

We computed the means and coefficients of variation (CVs) (standard deviation/mean) for the distance measures by the nonparametric bootstrap approach. The averages and the standard deviations of the distance values were computed for the distance values for 30 loci chosen at random in each replication by carrying out 10,000 replications.

## RESULTS

#### Extent of variation in di-, tri-, and tetranucleotide repeat loci:

Because the extent of genetic polymorphism affects the efficiency of phylogenetic construction, we first examined the levels of heterozygosity for three different groups of STR loci with respect to the size of the unit of nucleotide repeats (Table 1). Average heterozygosities (*H*) were similar for three kinds of loci across all populations (0.715 for dinucleotide, 0.696 for trinucleotide, and 0.718 for tetranucletide repeat loci).

Table 1 also shows the average number (*n*_{a}) of alleles per locus for populations in different geographical regions. The average *n*_{a} for the entire set of populations is somewhat higher for dinucleotide repeat loci (6.23) than those for trinucleotide (5.48) and tetranucleotide repeat loci (5.58). The *n*_{a}'s for trinucleotide and tetranucleotide repeat loci are similar, but the peak of the frequency distribution of *n*_{a} for tetranucleotide repeat loci is slightly shifted to the left and wider in comparison to that of trinucleotide repeat loci (Figure 2). Zhivotovsky *et al.* (2003) showed the frequency distributions of the number of alleles per locus for the entire population, not the average of the number of alleles per locus per population (*n*_{a}) that we examined in this study. We also computed the number of alleles per locus for the entire population (see supplemental Figure 1 at http://www.genetics.org/supplemental/). The shapes of the distributions of this quantity for the different repeat sizes are similar to those in Zhivotovsky *et al*. (2003), who used the subset of the loci used in this study, and to those of the averages of *n*_{a} in Figure 2.

Although the distributions of the average of *n*_{a} seem to be somewhat different among different types of STR loci of different repeat sizes, *H* values are similar for all three types of loci. Considering the large standard deviation associated with the *n*_{a} and relatively small numbers of dinucleotide (45) and trinucleotide (175) repeat loci, we decided to use all 783 loci together in the analysis of this study.

The values of *H* are the highest for the African populations and the smallest for the Oceanian and American populations (Table 1; for the heterozygosity values for all 53 populations see supplemental Tables 1–4 at http://www.genetics.org/supplemental/). The values of *H* are likely to reflect the large population size of Africans and the small population size of Oceanians and Americans. *n*_{a} is also smaller in these populations. But the *n*_{a} values of the Middle Eastern and Central/South Asian populations are higher than those of the African population. This could be due to the recent gene admixture that occurred in these regions.

#### Efficiency of constructing phylogenetic trees using different distance measures:

Table 2 shows the average topological distance () (measure of the extent of difference in topologies of constructed trees and the model tree) and the probabilities (*P*_{C}) of obtaining the topology of the tree of 12 populations constructed from the 783 loci (model tree) (Figure 1) for the cases of 10, 20, 30, 50, 100, 300, 500, and 700 loci computed by the nonparametric bootstrap approach (see materials and methods).

Among the distance measures compared, *D*_{A} generally shows the smallest and the highest *P*_{C}-values. Following *D*_{A}, *F*_{ST}* has the second smallest and the second highest *P*_{C}. and *P*_{C}-values of *D*_{S} and *F*_{ST}′ are similar to each other, but *F*_{ST}′ has slightly lower and higher *P*_{C}-values than *D*_{S}. (δμ)^{2} shows much higher and lower *P*_{C}-values than the other distance measures. This is because (δμ)^{2} has low efficiency in phylogeny construction because of its large sampling error (Zhivotovsky and Feldman 1995; Takezaki and Nei 1996; Goldstein and Pollock 1997; Zhivotovsky *et al*. 2001). With 700 loci, while the *P*_{C}-value of *D*_{A} becomes almost 100% and *P*_{C}-values of *F*_{ST}*, *D*_{S}, and *F*_{ST}′ are >70%, the *P*_{C}-value of (δμ)^{2} is only <30%.

Table 3 shows the probabilities (*P*_{1}) of obtaining the grouping of the populations (partitions) separated by each interior branch of the model tree (Figure 1). For example, at branch 1 of the model tree the populations are separated into Yoruba and Mandenka and the rest. The *P*_{1}-value at branch 1 is a frequency that a branch separating populations into Yoruba and Mandenka and the rest appeared in a constructed tree in the nonparametric bootstrap replications. The general tendency of relative efficiencies of the different distance measures at each interior branch is similar to that observed in and *P*_{C} as described above. *D*_{A} has the highest efficiency in phylogeny construction among all the distance measures examined, *F*_{ST}* has the second highest, *D*_{S} and *F*_{ST}′ have similar values, and (δμ)^{2} has much lower efficiency than the other distance measures. However, *F*_{ST}* has the highest *P*_{1}-value and the *P*_{1}-value of *D*_{A} becomes the second highest at branch 6.

*P*_{1}-values at the branches that separate the clusters of the four geographic regions require smaller numbers of loci to reach high values (*e.g.*, 90%) than those at the branches within the major geographic clusters. *P*_{1}-values for *D*_{A}, *D*_{S}, *F*_{ST}*, and *F*_{ST}′ are already ≥98% with 10 loci at branch 2 that separates Africans and non-Africans and ≥97% with 20 loci at branch 8 for the cluster of Americans. However, *P*_{1}-values of the four distance measures for the clusters of Europeans and East Asians require 50 and 100 loci to reach 90%. *P*_{1}-values of (δμ)^{2} are lower than those for the other distances, but they reach 90% with 20 loci at the branch for the African cluster and with 50 loci for the American cluster. *P*_{1}-values of (δμ)^{2} for the cluster of East Asians and for the clusters of East Asians and Americans need 300–500 loci to reach 90%. *P*_{1}-values of *D*_{A}, *D*_{S}, *F*_{ST}*, and *F*_{ST}′ at the branches within the clusters of the four geographic regions become ≥90% with 100 loci at branches 3 and 9 and with 300 loci at branch 1. *P*_{1}-values at branch 5 become 95% for *D*_{A} with 300 loci, but 70–80% for *D*_{S}, *F*_{ST}*, and *F*_{ST}′. The *P*_{1}-value of (δμ)^{2} at branch 5 is <50% even with 700 loci and needs ≥500 loci to reach 90% at the other branches within the clusters of the major geographic regions.

#### Means and sampling errors of distance measures:

The efficiency of constructing phylogenetic trees by distance measures depends on their linearity with time and sampling errors (Goldstein and Pollock 1994; Tajima and Takezaki 1994; Takezaki and Nei 1996). Therefore, we examined the means and CVs of the distance measures. Figure 3 shows the means and CVs of the distance measures of each pair of the 12 populations (Figure 1) in relation to the mean (δμ)^{2} computed for the same pair of populations in the abscissa by choosing 30 loci in the nonparametric bootstrap approach. In the data set used in this study, the assumptions for the linear increase of the expected (δμ)^{2} with time apparently do not hold. The different heterozygosity values for the populations of the different geographic origins (Table 1) indicate that population sizes have changed some times. Further, the mutational pattern of microsatellite loci does not strictly follow the SMM (Di Rienzo *et al.* 1994; Ellegren 2000; Xu *et al.* 2000; Huang *et al.* 2002; Ellegren 2004). The other distance measures increase approximately linearly in short-term evolution, but lose the linearity with time quickly under the SMM (Takezaki and Nei 1996). Therefore, the means and CVs of the distance measures are shown in relation to the mean (δμ)^{2} in the abscissa.

The slope of the mean *D*_{S} is the steepest among those of *D*_{A}, *D*_{S}, *F*_{ST}*, and *F*_{ST}′. When the mean (δμ)^{2} in the abscissa is close to zero, the mean *D*_{S} is smaller than that of *D*_{A}, but slightly larger than those of *F*_{ST}* and *F*_{ST}′ (Figure 3A). However, as the average (δμ)^{2} increases, the mean *D*_{S} becomes larger than the mean *D*_{A} and becomes the largest of the means of the four distance measures. The slopes of mean *D*_{A}, *F*_{ST}*, and *F*_{ST}′ are similar to one another, although the mean *D*_{A} is much larger than those of *F*_{ST}* and *F*_{ST}′ for all the (δμ)^{2} values. Means of *F*_{ST}* and *F*_{ST}′ are similar to each other, but the mean *F*_{ST}′ is larger than that of *F*_{ST}* particularly for the larger abscissa values.

CVs of (δμ)^{2} are much higher than those of the other distance measures (Figure 3B) as expected from theoretical and computer simulation studies (Zhivotovsky and Feldman 1995; Takezaki and Nei 1996; Zhivotovsky *et al.* 2001). By contrast, *D*_{A} has the smallest CVs for all the abscissa values. CVs of *D*_{S}, *F*_{ST}*, and *F*_{ST}′ are intermediate between those of (δμ)^{2} and *D*_{A}, although CVs of these three distance measures are extremely high for small abscissa values (<0.2). CVs of *D*_{S}, *F*_{ST}*, and *F*_{ST}′ are very similar to one another, but *D*_{S} generally has the largest CVs of the three, *F*_{ST}′ has the second largest CVs, and CVs of *F*_{ST}* are the smallest.

In general, CV values are inversely related to the *P*_{C}-values of the distance measures. *D*_{A} with the highest *P*_{C}-values has the smallest CVs, and *F*_{ST}* with the second-highest *P*_{C}-values has the second-smallest CVs. (δμ)^{2} that has the largest CVs shows the lowest *P*_{C}-values. Thus, as far as the data set used in this study is concerned, the relative values of CVs of the distance measures are a main factor that affects their performance in constructing phylogenies, and the effect of the linear increase of the distance measure with time seems minor.

#### Sample size:

To investigate the effect of sample size on the efficiency of phylogeny construction, we constructed phylogenetic trees for the 12 populations by sampling a different number of individuals (*n*) per locus. Average *n* values of the 783 loci are 20–45 for most of the populations (see supplemental Tables 1–4 at http://www.genetics.org/supplemental/). Average *n* values of the 12 populations in the model tree (Figure 1) are ∼25 except for Cambodian and Colombian populations whose average *n* values are 11.5 and 12.5, respectively.

We chose 5–25 individuals per locus for each population at random without replacement and recalculated the allele frequencies and computed *P*_{C} for a different number of loci (10–700). If the number of individuals at a locus was <5–25, we used all the available samples. Table 4 shows *P*_{C}-values for *D*_{A} distance. Note that the result of the other distance measures was essentially the same as that for *D*_{A} (data not shown). As expected for the data of high heterozygosities (0.6–0.8) (Table 1 and supplemental Tables 1–4 at http://www.genetics.org/supplemental/) (Takezaki and Nei 1996), the effect of sample size on *P*_{C}-values is quite large. By increasing *n* from 5 to 15, *P*_{C}-values increase to a great extent. In the case of 300–700 loci *P*_{C}-values are >80–90% for *n* = 15–25, but *P*_{C}-values are 40–60% for *n* = 5. It should be noted, however, that the number of loci generally has a larger effect on *P*_{C}-values than the sample size when *n* ≥ 15.

## DISCUSSION

Using microsatellite data of human populations, this study showed that *D*_{A} distance is the most efficient in obtaining a correct branching pattern, followed by *F*_{ST}*, *F*_{ST}′, and *D*_{S}, and that (δμ)^{2} has much lower efficiency than the other distance measures. This result is consistent with the previous computer simulation study, although efficiency of *F*_{ST}′ was not examined (Takezaki and Nei 1996). *D*_{A}, *D*_{S}, *F*_{ST}*, and *F*_{ST}′ were originally developed for classical genetic markers that the IAM can apply. Mean values for these distance measures approximately increase linearly with time after separation of the two populations for a small 2*vt* value (expected number of mutations) even under the SMM, but the rate of the increase slows down for large 2*vt* values. Theoretically, (δμ)^{2} is expected to be proportional to mutation rate under the SMM. However, (δμ)^{2} has a much larger sampling error than the other distance measures. As far as the data examined in this study are concerned, the extent of sampling error is a major factor that affects the efficiencies of phylogenetic construction of the distance measures.

The *P*_{C}-values for the actual STR loci obtained in this study are comparable to those in the case of *H* = 0.8 and the largest 2*vt* between populations = 0.56 in the computer simulation with the SMM (Takezaki and Nei 1996). They tend to be lower than those in the simulation for all the distance measures. But *P*_{C}-values of (δμ)^{2} are particularly lower in this data analysis than in the simulation in comparison to the other distance measures. The mutational pattern of STR loci is believed to roughly follow the SMM. However, there are irregular changes such as a large number of nucleotide repeat changes or disruption of nucleotide repeats, the constraints for the repeat number, and the asymmetric mutation rate for contraction and expansion of the allele (Di Rienzo *et al*. 1994; Goldstein and Pollock 1997; Kruglyak *et al.* 1998) at STR loci. Furthermore, the mutation rates vary among the STR loci (Ellegren 2004). Such complication of the mutational pattern in these loci may have affected the efficiency of (δμ)^{2} in phylogeny construction. Cooper *et al*. (1999) found that the standard deviation of (δμ)^{2} is much larger than the expected value (Zhivotovsky and Feldman 1995; Goldstein and Pollock 1997) by studying 213 dinucleotide STR loci of four human populations. At any rate, the present study showed that (δμ)^{2} is not efficient in reconstructing phylogenetic trees of populations even if a large number of loci are used.

We already have a DOS version of the computer program (POPTREE) for constructing phylogenetic trees for STR data with bootstrapping. It is available free of charge from http://www.bio.psu.edu/People/Faculty/Nei/Lab/software.htm. We are currently in the process of converting the program to a Windows version.

## Footnotes

Communicating editor: N. Takahata

- Received September 5, 2007.
- Accepted October 26, 2007.

- Copyright © 2008 by the Genetics Society of America