## Abstract

Populations often have very complex hierarchical structure. Therefore, it is crucial in genetic monitoring and conservation biology to have a reliable estimate of the pattern of population subdivision. *F*_{ST}'s for pairs of sampled localities or subpopulations are crucial statistics for the exploratory analysis of population structures, such as cluster analysis and multidimensional scaling. However, the estimation of *F*_{ST} is not precise enough to reliably estimate the population structure and the extent of heterogeneity. This article proposes an empirical Bayes procedure to estimate locus-specific pairwise *F*_{ST}'s. The posterior mean of the pairwise *F*_{ST} can be interpreted as a shrinkage estimator, which reduces the variance of conventional estimators largely at the expense of a small bias. The global *F*_{ST} of a population generally varies among loci in the genome. Our maximum-likelihood estimates of global *F*_{ST}'s can be used as sufficient statistics to estimate the distribution of *F*_{ST} in the genome. We demonstrate the efficacy and robustness of our model by simulation and by an analysis of the microsatellite allele frequencies of the Pacific herring. The heterogeneity of the global *F*_{ST} in the genome is discussed on the basis of the estimated distribution of the global *F*_{ST} for the herring and examples of human single nucleotide polymorphisms (SNPs).

INFERRING genetic population structure has been a major theme in population biology, ecology, and human genetics. The fixation index *F*_{ST}, introduced by Wright (1951), is a key parameter for such studies and is most commonly used to measure genetic divergence among subpopulations (Palsbøll *et al*. 2007). It is defined as the correlation between random gametes drawn from the same subpopulation relative to the total population. Another measure used frequently is Cockerham's (1969, 1973) coancestry coefficient, which is the probability that two random genes from different individuals are identical by descent, and the average overall pairs of individuals within the same subpopulation equal Wright's *F*_{ST} (Excoffier 2003). We use the notation θ_{WC} for the average coancestry coefficient and θ_{WC} = *F*_{ST} as shown by Weir and Cockerham (1984). Nei's (1973) *G*_{ST} is analogous to *F*_{ST} and identical to *F*_{ST} for diploid random-mating populations (Excoffier 2003).

Nei and Chesser (1983) proposed an estimator for *F*_{ST} and *G*_{ST}. The estimation of these parameters accounts only for the sampling error within subpopulations and therefore assumes that all subpopulations have been sampled (Cockerham and Weir 1986; Excoffier 2003). Weir and Cockerham (1984) developed the moment estimator for the coancestry coefficient θ_{WC}, which takes the sampling error for the subpopulations into account. Several moment estimators with different weighting schemes have also been derived (Robertson and Hill 1984; Weir and Cockerham 1984). An alternative estimation has been discussed using the method of ordinary least squares (Reynolds *et al*. 1983). Weir and Hill (2002) extended θ_{WC} to a population-specific parameter to allow different levels of coancestry for different populations. They also derived an estimator for θ_{WC} with confidence intervals using a normal theory approach.

Despite the development of methods for assigning individuals to populations (Paetkau *et al*. 1995; Pritchard *et al*. 2000; Huelsenbeck and Andolfatto 2007), the differentiation estimators remain the most commonly used tools for describing population structure (Balloux and Lugon-Moulin 2002). Weir and Cockerham (1984) showed that their estimator provides the smallest bias among the moment estimators. Goudet *et al*. (1996) confirmed this using simulations and showed that generates the least-biased estimate of *F*_{ST} but has the largest variance when *F*_{ST} is small. Raufaste and Bonhomme (2000) showed that is nearly unbiased, with minimal variance for large *F*_{ST}, and that the estimator of Robertson and Hill (1984) is negatively biased, with minimal variance for small *F*_{ST}. They proposed a correction for the bias of but this cannot be corrected properly in the range of [0.05, 0.1]. Therefore, a precise estimate of *F*_{ST} is crucial, especially for small and moderate levels of genetic differentiation.

In addition to the estimation of *F*_{ST} over all subpopulations in a metapopulation (hereafter, we call this global *F*_{ST}), *F*_{ST}'s for pairs of sampled localities or subpopulations (pairwise *F*_{ST}) are usually estimated in conservation biology and ecology. In fact, the computer programs Arlequin (Excoffier *et al*. 2005), FSTAT (Goudet 1995), and Genepop (Raymond and Rousset 1995) estimate these parameters and are used widely in ecological studies. These three software packages produce the same or similar values for pairwise *F*_{ST} estimates and provide the basic statistics for exploratory analyses of population structure, such as cluster analysis and multidimensional scaling. They are also used as a criterion for population differentiation (Waples and Gaggiotti 2006; Palsbøll *et al*. 2007). However, the estimation of *F*_{ST}'s is not precise enough to reliably estimate the population structure and the extent of heterogeneity, especially for large gene flow species.

Small numbers of individuals taken from each locality should also affect the precision of Populations often have very complex hierarchical structures, and geographical samples are usually taken from many localities to include a wide area. Therefore, the numbers of individuals from each locality are frequently limited by the large number of sampling points. Small sample sizes can result in biased estimates of the allele frequencies of each subpopulation. This bias may be larger for cases with larger numbers of alleles, such as microsatellite DNA. Uncertainty in the estimates of allele frequencies should affect the estimation of *F*_{ST}'s. The Bayesian approach provides better estimates of allele frequencies by taking uncertainty into account (Lange 1995; Lockwood *et al*. 2001). Posterior distributions of global *F*_{ST} were simulated from posterior distributions of allele frequencies, assuming common hyperparameters across all loci (Holsinger 1999; Holsinger *et al*. 2002; Corander *et al*. 2003). However, accurate estimation of pairwise *F*_{ST}, the essential parameter in ecological studies, has not been fully investigated.

In this article, we propose an empirical Bayes procedure to estimate locus-specific pairwise *F*_{ST}'s, taking into account the uncertainty of the allele frequencies of subpopulations. The estimation procedure has two stages. First, the hyperparameters of Dirichlet prior distributions for allele frequencies at each locus are estimated from observed allele counts by a maximum-likelihood method. The global *F*_{ST} is then estimated at each locus. Second, on the basis of the estimates of the hyperparameters, and given the allele counts, posterior distributions of the allele frequencies are generated for each locus, from which the posterior distributions of locus-specific pairwise *F*_{ST}'s are simulated. The posterior mean of our empirical Bayes pairwise *F*_{ST} estimates can be interpreted as a shrinkage estimator (Stein 1956; Maritz and Lewin 1989) anchored to the average of the true values among pairs. It performs better than conventional differentiation estimators and robustly estimates the population structure, even for non-Dirichlet cases, as stepping-stone models. The posterior distribution of pairwise *F*_{ST}'s can be used to calculate a criterion of population differentiation. Our maximum-likelihood estimates of the global *F*_{ST}'s can also be used as sufficient statistics to estimate the distribution of *F*_{ST} among loci in the genome. Our model assumes random mating or random sampling of alleles at each locality and that linkage equilibrium holds between loci. It also assumes that allele counts at each locus, given the true allele frequencies, are independent among populations. Our method can be applied to frequency data for common genetic markers, including isozymes, mitochondrial DNA, microsatellites, and single nucleotide polymorphisms (SNPs). We show the efficacy of our model by simulation and by an analysis of microsatellite allele frequencies of the Pacific herring. The heterogeneity of *F*_{ST} in the genome is discussed on the basis of the estimated distribution of global *F*_{ST} for the herring and examples of human SNPs.

## MODELS AND METHODS

#### The model:

Consider a simple random sampling from multiple localities in a metapopulation. Suppose that *K* random-mating demes or subpopulations are drawn from the metapopulation. Let be a vector of the true allele frequencies at locus *l* in subpopulation *k*, where *J _{l}* is the number of different alleles at the locus, and We assume a Dirichlet distribution as the prior distribution of

*p*. The probability density function iswhere are the hyperparameters and is a scale parameter that is specific for the locus. This model describes well a metapopulation that has a continuous structure and consists of an infinite number of subpopulations or demes (Pannell and Charlesworth 2000; Rousset 2003; Hanski and Gaggiotti 2004). Let be the mean allele frequency for the metapopulation at the locus satisfying Hence, we have the relation β

_{kl}_{lj}= α

_{lj}/θ

_{l}.

Under this model, the global *F*_{ST} (hereafter denoted as ) at each locus is expressed simply by the scale parameter, as(1)as given by Wright (1969), Rannala and Hartigan (1996), Balding and Nichols (1997), Lockwood *et al*. (2001), Balding (2003), and Kitada and Kishino (2004). In this model, the variance of the *j*th allele frequency for the locus, *p _{klj}*, is expressed byas given by Weir (1996), Holsinger

*et al*. (2002), Balding (2003), and Kitakado

*et al*. (2006). The Dirichlet distribution assumes an evolutionary equilibrium and an equal mutation rate for all alleles (Weir and Hill 2002; Ewens 2004). Under this assumption, the scale parameter θ

_{l}refers to the rate of gene flow, as given by Rannala and Hartigan (1996). We use the symbol θ for the scale parameter, following Rannala and Hartigan (1996) and Balding and Nichols (1997). Weir and Cockerham (1984) also used the same symbol θ for the coancestry coefficient (=

*F*

_{ST}), so we use θ

_{WC}for their θ. Our is equivalent to θ

_{WC}(Weir 1996, pp. 47–48) and Holsinger's θ

^{B}(Holsinger 1999; Holsinger

*et al*. 2002).

#### Maximum-likelihood estimation of hyperparameters and global *F*_{ST}:

The maximum-likelihood estimation of the hyperparameters has been discussed by Lange (1995), Kitada *et al*. (2000), and Balding (2003). A pseudo-likelihood approach was also taken by Rannala and Hartigan (1996). In the maximum-likelihood framework, a method for the simultaneous estimation of and the linkage disequilibrium coefficient between two SNPs has been proposed (Kitada and Kishino 2004). Kitakado *et al*. (2006) proposed an integrated-likelihood approach to reduce the negative bias of particularly for cases with few sampling points.

Suppose that alleles of diploid organisms (*N _{k}*/2 individuals) are counted at locus

*l*and denotes a vector of observed allele counts at the locus in subpopulation

*k*. We assume that all individuals are successfully genotyped at all loci, so The marginal likelihood of the observed allele counts at a locus

*n*has a Dirichlet-multinomial distribution (Lange 1995; Rannala and Hartigan 1996; Weir 1996; Balding and Nichols 1997; Kitada

_{kl}*et al*. 2000; Balding 2003; Rousset 2003). The parameters to be estimated are Because we assume the independence of subpopulations, the overall likelihood for these parameters is given by the product of the likelihood functions for

*K*samples, as(2)The hyperparameters α

_{l}are estimated by maximizing this marginal likelihood (Lange 1995; Kitada

*et al*. 2000). Our method can be used for both allele and haplotype counts without modification, but some notations differ slightly. For haploid organisms,

*N*refers to the individuals genotyped; and

_{k}*n*should be

_{kl}*n*and α

_{k}_{lj}should be α

_{j}. Henceforth, for simplicity, we focus on diploid organisms.

The locus-specific can be estimated by substituting for θ_{l} in Equation 1. The variance estimator for is calculated from the Fisher information matrix for α_{l}, as The asymptotic variance for is estimated using the Delta method (Seber 1982) as(3)In our metapopulation model or infinite-island model, the sampled localities are regarded as a sample from all possible demes or subpopulations, including those not sampled. Hence, Equation 2 estimates the locus-specific genetic differentiation under the random-effect model of population sampling (Weir 1996). The average estimate of for all loci is calculated as an arithmetic mean across the loci.

#### Empirical Bayes estimation of pairwise *F*_{ST}:

The posterior distribution of allele frequencies *p _{kl}* at locus

*l*in subpopulation

*k*is again a Dirichlet distribution, with parameters modified by the sample allele counts(4)(Lange 1995; Weir 1996). Given the estimates of the hyperparameters and the sampled allele counts, random numbers of

*p*can be generated through this posterior distribution. The posterior distributions for any parametric functions of

_{kl}*p*can then be simulated by the empirical Bayes procedure (Kitada

_{kl}*et al*. 2000).

When population differentiation between or among specific subpopulations is of interest, the selected populations can be regarded as the entire set of populations. Hence, applying the fixed-effect model of population sampling (Weir 1996) is appropriate. Therefore, we use Nei's *G*_{ST} formula (Nei and Chesser 1983), which defines quantities with respect to fixed extant populations (Cockerham and Weir 1986), to estimate the posterior distributions of pairwise *F*_{ST}'s (hereafter denoted as ), as did Holsinger (1999) and Corander *et al*. (2003) in estimating global *F*_{ST}. Nei's gene diversity analysis compares expected heterozygosities under Hardy–Weinberg equilibrium (HWE), and the *G*_{ST} estimator is expressed as a function of allele frequencies. Therefore the posterior distribution of at each locus can easily be generated on the basis of the *G*_{ST} estimator, without using genotype frequencies. We set the number of each simulation to 10,000, so 10,000 's are calculated at each locus from the 10,000 sets of allele frequencies *p _{kl}* between a set of two populations. From the posterior distribution of the posterior mean and 95% credible interval are calculated. We use the posterior mean as the empirical Bayes estimator of locus-specific We can also calculate the probability that is smaller than an arbitrary value [

*e.g*.,

*P*()], which can be used as the criterion for population differentiation (Waples and Gaggiotti 2006; Palsbøll

*et al*. 2007). The average estimate of for overall loci is calculated as an arithmetic mean across the loci.

Rosenberg *et al*. (2003) proposed a general measure for determining the amount of information on individual ancestry on the basis of the Kullback–Leibler information. The informativeness for assignment *I _{n}* is defined as(5)where The authors showed that

*I*and

_{n}*F*

_{ST}are very closely correlated but that

*I*is more informative than the standard SNP-specific pairwise

_{n}*F*

_{ST}. In an additional analysis, we examine how our empirical Bayes method works to measure this under the same simulation protocol.

#### Inferring heterogeneity of global *F*_{ST} among loci:

We estimate locus-specific 's on the basis of estimated at each locus. Evolutionary forces may differ among sites in the genome. Therefore, it is important to investigate the heterogeneity of *F*_{ST} among loci. One practical analysis is to test the null hypothesis H_{0}, the homogeneity of among *L* loci, ) against the alternative hypothesis H_{1}, the heterogeneity of among loci, on the basis of estimates of When a large number of subpopulations are sampled, the maximum-likelihood estimate follows a normal distribution of The maximum likelihood under H_{0} is then given as where and The maximum likelihood under H_{1} is maximizing by The negative twice-log-likelihood ratio is then which follows the χ^{2}-distribution with (*L* − 1) d.f. under the null hypothesis. We can test the heterogeneity of on the basis of the test statistics.

The other approach to investigate the heterogeneity of is to estimate the distribution of in the genome. In recent years, the number of loci analyzed has been increasing in ecological studies, but is still smaller than those used for human SNPs. For such cases, it would be difficult to directly estimate the specific distribution of from the data. Here, we estimate the distribution of in the genome from estimates of randomly selected loci, When the distribution is expressed by the parametric model the unknown parameter which defines the distribution of is estimated by maximizing the log marginal likelihoodwhere is the distribution of Here, we assume the independence of loci Because the maximum-likelihood estimator is a sufficient statistic, it is possible to estimate the distribution of in the genome on the basis of the estimates for randomly sampled loci, instead of using a direct estimation from the data.

For the preliminary discussion here, we assume that is normally distributed in the genome. When the distribution of is expected to be different from 0, a simple approximation may be a normal distribution. We then assume follows *N*(μ, σ^{2}) as a first step in estimating the distribution of in the genome under the limited number of loci analyzed. In this case, the parameter vector ρ refers to μ and σ^{2}. The general form of the log marginal likelihood given above becomes(6)Here, is the variance of the estimates, We estimate μ and σ^{2} numerically, regarding as The distribution of 's can also be calculated with slight modifications to the above procedure.

## RESULTS

#### Improved precision of our empirical Bayes estimator for pairwise *F*_{ST}:

We investigated the performance of our method of estimating pairwise using numerical simulations. Random vectors of allele frequencies at locus *l* in subpopulation *k*, *p _{kl}*'s, were generated independently from the Dirichlet distribution with the parameter α

_{l}(= θ

_{l}β

_{l}). Here, the number of sampling localities (

*K*) was set at 5, 10, and 50, and the mean allele frequencies at a locus were assumed with

*J*= 50. As the true values of the global

_{l}*F*

_{ST,l}'s (= 1/(1 + θ

_{l})), we chose four different levels: = 0.01, 0.05, 0.1, and 0.2. The sample size (

*N*/2) was deemed to be common to all the localities and was set at 20, 30, and 50 individuals. Then, allele counts

_{k}*n*'s were drawn independently from the multinomial distribution Multi(

_{kl}*N*,

_{k}*p*) for

_{kl}*K*localities. The pairwise values between the first and second localities were evaluated by the conventional Nei's

*G*

_{ST}estimator and the empirical Bayes method. In the latter procedure, 500 's were simulated on the basis of Nei's

*G*

_{ST}formula to save computation time, and the posterior mean was calculated as the estimator of the pairwise

*F*

_{ST}. The point estimate for the conventional

*G*

_{ST}estimator and the posterior mean of the empirical Bayes estimates were compared with the true These procedures were iterated

*R*= 1000 times.

Figure 1 and supplemental Figure S1 at http://www.genetics.org/supplemental/ show the general features of the empirical Bayes estimator compared with those of the conventional *G*_{ST} estimator. The former examines the case of a relatively large number of sampling points and a limited sample size from each sampling point (*K* = 50, *N _{k}*/2 = 20). The latter investigates the case of moderately large samples from a small number of sampling points (

*K*= 10,

*N*/2 = 50), which is common in ecological studies. It is clearly apparent that the conventional

_{k}*G*

_{ST}estimator greatly overestimates the true values and has a large variance, especially for small In contrast, the empirical Bayes estimates of 's shrink toward the average of the true and reduce the positive bias of the conventional estimator. This is reasonable because our empirical Bayes estimator can be interpreted as a shrinkage estimator (Stein 1956; Maritz and Lewin 1989).

The effects of the number of sampling points *K* and sampled individuals *N _{k}*/2 are also shown in the supplemental material at http://www.genetics.org/supplemental/. The numbers of sampling points did not affect the positive bias of the conventional

*G*

_{ST}estimator because

*K*was fixed at 2. The empirical Bayes procedure provided larger variation for smaller numbers of subpopulations, especially for small values, and vice versa (supplemental Figure S2 at http://www.genetics.org/supplemental/). Conversely, larger numbers of sampled individuals reduced the positive bias of the conventional

*G*

_{ST}estimator and the variation of the empirical Bayes estimator (supplemental Figure S3 at http://www.genetics.org/supplemental/).

For a quantitative comparison of the two estimators, we used the following two measures: (relative mean bias) and (root relative mean squared error), where and are an estimate and the true value, respectively, for pairwise in the *r* th iteration. As shown in Table 1, smaller *K* and sample sizes (*N _{k}*/2) resulted in larger positive biases for the conventional estimator for various levels of true values. The bias and variation became smaller as became larger. In contrast, the empirical Bayes estimator provided smaller biases and variations for all cases of although smaller

*K*and sample sizes resulted in a slight negative bias.

The negative bias of the empirical Bayes estimator of pairwise *F*_{ST} is large, especially when gene flow is large and the estimation is based on a sample from a few sampling points. Underestimation of population differences should lead to optimistic management strategies. Therefore, it is recommended in the conservation genetics of birds and fish that samples be collected from as many sampling points as possible. It is noted that the bias of the empirical Bayes estimator is reduced by increasing the number of sampling points, whereas the bias of the conventional *G*_{ST} estimator is not.

We estimated Rosenberg *et al*.'s informativeness of assignment *I _{n}* between the first and second localities using Equation 5 (

*K*= 2) and the empirical Bayes method based on the

*I*estimator for the case of The mean allele frequencies were assumed to be with

_{n}*J*= 50. The number of sampling localities (

*K*) was set at 50 and the sample size (

*N*/2 individuals) was set at 20 individuals for each sampling point. Figure 2 shows that the conventional estimator for

_{k}*I*was positively biased, whereas the empirical Bayes estimator of

_{n}*I*performed much better, consistent with the fact that

_{n}*I*produces upwardly biased estimates in small samples (Rosenberg

_{n}*et al*. 2003).

#### Robustness of our shrinkage estimator for pairwise *F*_{ST}:

The robustness of our estimator was explored using numerical simulations for non-Dirichlet distributions of the allele frequencies. We considered cases in which genetic differentiation becomes larger with geographic distance. Such a stepping-stone model is biologically realistic and may be common (Palsbøll *et al*. 2007). We set the number of sampled subpopulations (*K*) to 15 and the between two adjacent populations to 0.001 (case 1) or 0.0005 (case 2). We considered biallelic cases and set the allele frequencies to (0.5, 0.5) at a locus for the middle population. We then calculated the allele frequencies *p _{kl}*'s at the locus for another 14 subpopulations by numerical optimization. The sample size (

*N*/2) was deemed to be common to all localities and was set at 20 individuals. Then, allele counts

_{k}*n*'s were drawn independently from the multinomial distribution Multi(

_{k}*N*,

_{k}*p*) for 15 localities. A total of 105 pairwise

_{kl}*F*

_{ST}values between all sets of the two localities were evaluated by the conventional

*G*

_{ST}and the empirical Bayes estimator following the simulation protocol described above.

As shown in Figure 3 (case 1) and supplemental Figure S4 at http://www.genetics.org/supplemental/ (case 2, top left), our empirical Bayes procedure provided better estimates than those of the conventional *G*_{ST} estimator for smaller 's (∼ <0.06 for case 1 and 0.04 for case 2). In conservation, management units should be defined among subpopulations with little genetic differentiation. Palsbøll *et al*. (2007) concluded that could be used as the criterion for deciding the separate management units of sockeye salmon spawning sites. For such cases with very small genetic differentiation, our method performs more efficiently, even for stepping-stone models. On the contrary, the conventional *G*_{ST} estimator displayed a positive bias for the whole range of 's in both cases. Our method reduced the positive bias for small 's and the bias became negative for larger 's. Reflecting the characteristics of the shrinkage estimator, the relative mean bias of the empirical Bayes estimator, which was the average of the 105 pairwise *F*_{ST} estimates, was slightly (1.06 times) larger than that of the conventional estimator for case 1 (Table 2). The precision of our estimates was much better for the whole range of [Figure 3, top right (case 1) and supplemental Figure S4 (case 2)].

We also investigated the robustness of estimating population structure on the basis of estimated 's. The results of the multidimensional scaling (MDS) (Torgerson 1952; Young and Hamer 1987) analysis of two data sets in the simulation showed that our method describes the true population structure well. The conventional *G*_{ST} estimator also worked, despite the larger positive bias and variance (Figure 3, bottom).

#### The case of Pacific herring:

We analyzed six geographical samples of the Pacific herring *Clupea pallasii* from spawning areas in Lake Akkeshi (AK), Yudonuma Lake (YD), and Funka Bay (FK), which are located off the east coast of Hokkaido in Japan, and Obuchinuma Lake (OB), Miyako Bay (MY), and Matsushima Bay (MT), located off the northern Pacific coast of Honshu. Hatchery fish, released and recaptured in Lake Akkeshi (AKH) and Miyako Bay (MYH), were also distinguished from wild fish on the basis of otoliths stained with alizarin complexon. A total of 2055 mature individuals were genotyped at five microsatellite loci. Allele frequencies are given in supplemental Tables S1–S5 at http://www.genetics.org/supplemental/. HWE was satisfied in each sample except at four localities for three loci, and the assumption of our metapopulation model was considered to be satisfied (supplemental Figure S5). On the basis of the estimates of the hyperparameters (supplemental Figure S6), the scale parameter θ_{l} and were estimated over all subpopulations (Table 3).

We estimated the posterior distributions of for all sets of subpopulations at all loci. As an example, the posterior distributions between FK and MY at the five loci are shown in Figure 4. The posterior means of varied from 0.0064 to 0.0245, with the 95% credible intervals in parentheses (Table 4). The *P*-values in Figures 4 and 5 are for the homogeneity contingency test performed with Genepop (Raymond and Rousset 1995). At all loci, the allelic differences between FK and MY were highly significant (*P* < 0.0000). We used of 0.01 as a population criterion and defined the probability of ≤ 0.01 as *P**. Here, = 0.01 refers to which means that the effective number of migrants is 25 individuals per generation (Waples and Gaggiotti 2006), where *N*_{e} is the effective population size and *m* is the migration rate. The *P**-value was near 1.0 for *Cha* 63, indicating that the genetic differentiation at the locus was small. In contrast, *P**-values were 0 or near 0 for *Cha* 17, *Cha* 113, and *Cha* 123 and 0.5318 for *Cha* 20 (Figure 4). The posterior distribution of the average over all the loci was calculated as the mean of the posterior distributions at five loci (Figure 4, bottom right). Both the *P-* and *P**-values coincided at 0, showing significant genetic differentiation between Funka Bay and Miyako Bay.

The posterior distributions of average over all loci for all pairs of wild subpopulations were also simulated as simple means of the posterior distributions at five loci (Figure 5). The allelic differences were highly significant with *P* = 0.0000, except between MY and MT. The criterion *P** resulted in a very different evaluation of population differentiation, even for the same *P*-value of 0.0000. This result clearly shows the difficulty in the hypothesis-testing framework in evaluating the genetic differentiation between subpopulations (*e.g*., Dizon *et al*. 1995; Ryman and Jorde 2001; Ryman *et al*. 2006). The *P**-based criterion works well in this case and is recommended for use in hypothesis testing.

The variation in over five loci was not trivial, with a coefficient of variation (CV) of 23% (Table 3). However, the negative twice log-likelihood ratio λ was calculated to be 6.9264 and the hypothesis of constant for all loci was not rejected (*P* = 0.8602, d.f. = 4). We estimated the distribution of in the genome, assuming normality based on Equation 6. The maximum-likelihood estimates (MLEs) for μ and σ were obtained with 90, 95, and 99% confidence intervals, which define the distribution of in the genome via the likelihood profile (Figure 6A). The MLE for μ, with the 95% confidence interval, was 0.0160 (0.0128, 0.0206) [Figure 6A (a, e)] and for σ was 0 (0, 0.00716) [Figure 6A (0, c)]. The weighted mean coincided with the MLE of μ (Table 3) and the distribution of the weighted mean shown as the blue line in Figure 6B, described well the 95% confidence interval of μ [Figure 6A (a, e)]. Here, and were estimated with Equation 3. The log-likelihood profile for σ was monotonic and decreasing, indicating that the point estimate of σ was 0 (supplemental Figure S7 at http://www.genetics.org/supplemental/). Hence, the point estimate of the distribution of for the Pacific herring is constant, supporting the hypothesis of constant throughout the genome. However, the distribution of has uncertainty, which accounts for the confidence regions of μ and σ (Figure 6B).

## DISCUSSION

Our empirical Bayes estimator of performed better than the conventional *G*_{ST} estimator for various levels of and various sampling conditions under the infinite-island model. Even for non-Dirichlet distributions of allele frequencies, such as stepping-stone models, our method provided better estimates of than did the conventional *G*_{ST}, especially for cases with large gene flow.

#### Integrated likelihood method:

The empirical Bayes estimator of 's is negatively biased, especially when the population has large gene flow and the estimation is based on a sample from only a few sampling points (Table 1, supplemental Figure S2 at http://www.genetics.org/supplemental/). With a small number of sampling points, the MLE of θ is not precise. Therefore, it is recommended in the conservation genetic analysis of such a population that the samples be collected from as many sampling points as possible. When sampling from many localities is not feasible, the integrated-likelihood method (Kitakado *et al*. 2006) can reduce the negative bias of We estimated pairwise 's by the empirical Bayes method based on integrated-likelihood estimates (ILEs) of θ for cases with = 0.01, *K* = 5, and *N _{k}*/2 = 20, 30, or 50 with the same simulation protocol used for Table 1 (supplemental Figure S8 at http://www.genetics.org/supplemental/). The relative mean biases of the estimates, with root relative mean squared errors in parentheses, were −0.238 (0.506), −0.132 (0.365), and −0.060 (0.272), respectively. These values were much smaller than those estimated on the basis of the MLE of θ, given in Table 1 (top three rows). The negative bias of 's based on the MLE was reduced to 32.8, 24.4, and 18.2% for

*N*/2 = 20, 30, and 50, respectively. The integrated-likelihood method uses a uniform prior for the mean allele frequency β

_{k}_{l}and eliminates β

_{l}by integration regarding it as a nuisance parameter. By using the ILE of θ instead of the MLE, the empirical Bayes method proposed here provides nearly unbiased estimates of 's when the sample sizes (

*N*/2 individuals) are large and works more efficiently, even for cases with a small number of sampling points (supplemental Figure S8).

_{k}#### Weir and Cockerham's θ̂_{WC}:

When we estimate genetic differentiation between two specific subpopulations, selected subpopulations can be regarded as the entire set of populations. Nei's *G*_{ST} formula defines quantities with respect to fixed extant populations (Cockerham and Weir 1986). In addition, *G*_{ST} is a function of allele frequencies under HWE, and the posterior distribution can easily be simulated from only allele frequencies. Therefore, we used the *G*_{ST} estimator to estimate the posterior distribution of pairwise *F*_{ST}.

The citation record suggests that the most widely used estimator for Wright's *F*_{ST} is Weir and Cockerham's (Weir and Hill 2002). This moment estimator takes the sampling error for subpopulations into account and essentially estimates the global *F*_{ST}, The estimator is also widely used to estimate pairwise *F*_{ST}'s among fixed pairs of populations. With the assumption of no local inbreeding, θ_{WC} is estimated only from sample allele frequencies, but these need to be inferred from sample genotype frequencies (Weir and Hill 2002). With *J* alleles at a locus, the number of possible genotypes is *J*(*J* + 1)/2. In microsatellite DNA analyses, *J* is generally large. If *J* = 50, the number of genotypes is 1225. Such a situation makes our simulation more complicated and the uncertainty of the genotype counts becomes large under small or moderate sample sizes.

Here, we investigated the properties of on estimating pairwise *F*_{ST} on the basis of the relationship between *G*_{ST} and θ_{WC}. Weir and Cockerham's estimates can be approximated as a function of *G*_{ST} by Equation 2 in Weicker *et al*. (2001): Using this equation, we calculated the conventional estimates of pairwise Weir and Cockerham's θ_{WC} from *G*_{ST} estimates with the simulation protocol described in results. We examined cases with a global *F*_{ST} of = 0.01, 0.05, 0.1, and 0.2. The mean allele frequencies were assumed with *J* = 50. The number of sampling localities *K* was set at 10 and the sample size *N _{k}*/2 (individuals) was deemed to be common to all the localities and was set at 50 individuals. As shown in supplemental Figure S9 at http://www.genetics.org/supplemental/, the two estimators have linear relationships. Hence, our simulation results on the conventional

*G*

_{ST}estimator can be extended straightforwardly to pairwise θ

_{WC}. In fact, the pairwise -values calculated for the Pacific herring with Genepop (Raymond and Rousset 1995) were 1.93 ± 0.47 times larger than the posterior means of (supplemental Figure S10 at http://www.genetics.org/supplemental/), which coincides with our simulations for small 's (Figure 1, supplemental Figures S1 and S11).

Weir and Cockerham's estimator is nearly unbiased (Raufaste and Bonhomme 2000), although it has a negative bias for the two-allele case (Weir and Hill 2002). Nevertheless, when estimating the pairwise *F*_{ST}, is considered to have a large positive bias, especially for species with large gene flows. Nei's *G*_{ST} and Rosenberg *et al*.'s informativeness of assignment *I _{n}* also showed the same phenomenon, suggesting the positive bias is irrelevant to the estimators. This positive bias of the conventional estimators was larger for smaller genetic differentiation. This might be caused by the large variation in the sample allele frequencies, which is larger for smaller sample sizes (individuals) and largely exceeded the real variation between subpopulations. The shrinkage estimator stabilizes such variation and provides better estimates.

#### Weir and Hill's normal theory:

Weir and Hill's (2002) normal theory approach has the same variance as a Dirichlet distribution when *i* ≠ *i*′ in their notation. Hence, their estimator is equivalent to our is the variance of the allele frequencies among subpopulations relative to the total population. Hence, refers to a sample variance of allele frequencies, and therefore follows a χ^{2}-distribution when the number of sampling points *K* is sufficiently large, as shown by Weir and Hill (2002). The shape of the posterior distribution of at each locus was unimodal and slightly right tailed, which reminded us of χ^{2}-distributions (Figure 4). We estimated Weir and Hill's confidence intervals by substituting the posterior mean of with as where d.f. = (*K* − 1)(*J* − 1) is the degrees of freedom. The confidence intervals of obtained with the χ^{2}-approximation coincided well with the credible intervals calculated from the posterior distributions, although our credible intervals were narrower than the confidence intervals of the χ^{2}-approximation, which were slightly right tailed (Table 4). The slight difference in the intervals might have been the effect of the small *K*(= 8) on the χ^{2}-approximation, although it was not substantial. On the contrary, the confidence interval for the sample mean over all loci was narrower than our credible interval (Table 4). The distribution of a sample mean is normal when the sample size is large with the variance reduced by the central limit theorem. A χ^{2}-distribution approaches a normal distribution as the degrees of freedom become larger. For our case of 1309 d.f. [ as given in Weir and Hill 2002], the two distributions are equal. The property of the sample mean should cause the narrower confidence interval of the normal theory approach. The result shows that the posterior distribution of describes the distribution of well, both for each locus and for the average over all loci.

#### LD among loci:

The case study of the Pacific herring, based on a few microsatellite markers, did not detect significant variation in among loci in the genome (Figure 6). The assumption of normality for the MLE of is valid when more than a few sampling points are surveyed. This assumption might be violated when the data are collected from only a few sampling points or is close to 0 or 1.0. We also assumed the independence of loci However, it is necessary to take into account the linkage disequilibrium (LD) among loci, when the molecular markers are tightly linked.

Recent progress in whole-genome analysis of human populations provides a new perspective on the inference of *F*_{ST} and its distribution in genomes (Garte 2003; Hinds *et al*. 2005; Weir *et al*. 2005; Walsh *et al*. 2006). In their Figure 1, Weir *et al*. (2005) showed that values for the single-locus marker over the whole human genome for three (Perlegen) or four (HapMap) populations had a distribution very much like the χ^{2}-distributions with 2 or 3 d.f. and suggested that values of are genome-region specific. However, follows a χ^{2}-distribution under constant as reported by Weir and Hill (2002) and as demonstrated in our analysis of the Pacific herring. Therefore, the distributions of the values for the single-locus marker in Weir *et al*. (2005) do not necessarily support the genome-region-specific *F*_{ST} hypothesis in the human genome.

Weir *et al*.'s 5-Mb window average values for were close to normal because of the property of the sample mean. The standard deviations (SDs) of the 5-Mb window average values of decreased substantially from single-locus estimates of 0.12 to 0.02 for HapMap and from 0.11 to 0.02 for Perlegen (Tables 2 and 3 in Weir *et al*. 2005). The average number of markers in a 5-Mb window was 1114 for HapMap and 1834 for Perlegen (calculated from Table 1 in Weir *et al*. 2005). Hence, these SDs are expected to be and if all SNPs are independent and distributed identically in the genome. The effective sample size (Kish 1965, p. 259) could be much smaller than the real number of SNPs in the 5-Mb windows if the 's are correlated, because of LD between the SNPs. A coalescent simulation of human population history implies that linkage equilibrium holds for SNPs separated by >10–100 kb (Figure 1 in Kruglyak 1999). Therefore, we can estimate the effective number of human SNPs per 100-kb window to be ∼1–20. If we assumed it to be ∼10, the effective number of SNPs per 5-Mb window becomes 500. Therefore, the SDs of the 5-Mb window are expected to be for HapMap and for Perlegen. The actual value (0.02) is much larger than these, even when the LD among the SNPs is taken into account. This discrepancy can be explained by the large-scale heterogeneity of *F*_{ST} between the 5-Mb windows. New data show that LD is highly structured into discrete blocks of sequences separated by hot spots of recombination (Goldstein 2001; McVean *et al*. 2004) and differs among species (Hernandez *et al*. 2007). The simultaneous estimation of *F*_{ST}'s of SNPs and the LD between the SNPs should give us an accurate picture of the distribution of *F*_{ST} in genomes.

## Acknowledgments

We are grateful to Bruce Weir for valuable comments on the manuscript and encouragement with this work. We also thank the anonymous reviewers for their constructive comments, which improved the earlier version of the manuscript. This work was supported by grants from the Japan Society for the Promotion of Science.

## Footnotes

Communicating editor: R. W. Doerge

- Received June 11, 2007.
- Accepted July 17, 2007.

- Copyright © 2007 by the Genetics Society of America