# An Integrated-Likelihood Method for Estimating Genetic Differentiation Between Populations

- Toshihide Kitakado
^{*},^{1}, - Shuichi Kitada
^{*}, - Hirohisa Kishino
^{†}and - Hans Julius Skaug
^{‡}

^{*}Faculty of Marine Science, Tokyo University of Marine Science and Technology, Minato, Tokyo 108-8477, Japan,^{†}Graduate School of Agriculture and Life Sciences, University of Tokyo, Bunkyo, Tokyo 113-8657, Japan and^{‡}Department of Mathematics, University of Bergen, 5008 Bergen, Norway

- 1
*Corresponding author:*Tokyo University of Marine Science and Technology, 5-7, Konan 4, Minato-ku, Tokyo, 108-8477, Japan. E-mail: kitakado{at}s.kaiyodai.ac.jp

## Abstract

The aim of this article is to develop an integrated-likelihood (IL) approach to estimate the genetic differentiation between populations. The conventional maximum-likelihood (ML) and pseudolikelihood (PL) methods that use sample counts of alleles may cause severe underestimations of *F*_{ST}, which means overestimations of θ = 4*Nm*, when the number of sampling localities is small. To reduce such bias in the estimation of genetic differentiation, we propose an IL method in which the mean allele frequencies over populations are regarded as nuisance parameters and are eliminated by integration. To maximize the IL function, we have developed two algorithms, a Monte Carlo EM algorithm and a Laplace approximation. Our simulation studies show that the method proposed here outperforms the conventional ML and PL methods in terms of unbiasedness and precision. The IL method was applied to real data for Pacific herring and African elephants.

THE study of population structures arises in many contexts in population genetics, and a wide variety of patterns of population structures have been modeled, including the stepping-stone, island, and mixed population models (*e.g.*, Kimura and Weiss 1964; Wright 1969; Millar 1987; Rannala and Hartigan 1995). Statistical methods have also advanced, reflecting the discovery of fine-scale markers and the development of modeling for population structures (*e.g.*, Weir 1996; Balding *et al.* 2003).

To study the genetic structures of natural populations, samples are usually taken from several localities. However, in many situations, there are no specific boundaries or obvious regional units, and therefore subpopulations cannot be well defined beforehand. In the case of mixing populations, it may be possible to apply individual assignment methods based on multilocus genotypes (Pritchard *et al.* 2000; Falush *et al.* 2003; Manel *et al.* 2005) to identify a finite number of subpopulations. However, if the population has a continuous structure and consists of a large number of subpopulations, assuming a metapopulation or an infinite-island model is a natural way to estimate the genetic differentiation between subpopulations (Pannell and Charlesworth 2000; Rousset 2003; Hanski and Gaggiotti 2004).

The metapopulation model considers a universe of demes (subpopulations) that have peculiar genetic structures and introduces a distribution of allele frequencies among these demes. When the allele frequencies are distributed as a Dirichlet distribution, the probability distribution for the sampled counts of alleles at each locus can be explicitly expressed as a Dirichlet-multinomial distribution (Rannala and Hartigan 1996; Weir 1996; Holsinger 1999; Kitada *et al.* 2000; Corander *et al.* 2003; Rousset 2003). In this model, the variance of the allele frequencies among sampling localities is closely related to *F*_{ST} or θ = 4*Nm* (Wright 1969; Rannala and Hartigan 1996; Kitada *et al.* 2000; Balding 2003; Kitada and Kishino 2004). The parameter *F*_{ST} or θ is explicitly included in the statistical model and can therefore be estimated using likelihood methods. The maximum-likelihood (ML) estimation has been discussed by Lange (1995), Kitada *et al.* (2000), and Balding (2003). Furthermore, a pseudolikelihood (PL) estimation was proposed by Rannala and Hartigan (1996) as a variant of the ML method.

Because the parameters that express population differentiation, *F*_{ST} and θ, are related to the variance, as discussed above, the difference in the sampled counts of alleles among sampling localities contains information about these parameters. A larger number of sampling localities improve the precision of the estimates with these parameters (Kitada and Kishino 2004). However, sampling from many localities is often difficult in genetic studies of wildlife populations. In such cases, the ML and PL methods cause difficulties, as in the estimation of the variance in a normal distribution. The ML estimator of σ^{2} based on a sample of size *n*, , from a normal distribution *N*(μ, σ^{2}) is given by , whereas an unbiased estimator is given by . The only difference is in the denominator. However, if the sample size *n* is small, the amount of bias in the ML estimator cannot be ignored. This might be the case for the estimation of *F*_{ST} or θ in the metapopulation model.

An unbiased estimation of σ^{2} described above can be achieved using several different likelihood adjustments, known as conditioning, marginalization, and integration (Lindsey 1996; Berger *et al.* 1999). Unfortunately, the conditional- and marginal-likelihood methods cannot be applied to a Dirichlet-multinomial distribution because no appropriate statistics that separate the likelihood can be obtained. On the other hand, the integrated-likelihood (IL) method does not require such separation. Instead, integration with respect to nuisance parameters must be undertaken. Typically, the IL function does not have a closed form. However, the recent development of computational methods allows us to deal with it: for example, the EM algorithm (McLachlan and Krishnan 1997), the Markov chain Monte Carlo (MCMC) method (Robert and Casella 2004), and the Laplace approximation (Pawitan 2001).

In this article, we use an IL approach to the metapopulation model and develop a new method to precisely estimate important parameters of interest, *F*_{ST} and θ. Two algorithms, a Monte Carlo EM algorithm with an MCMC and the Laplace approximation, are applied to maximize the IL function. The performance of the IL method is evaluated with numerical simulations and compared with the ML and PL estimation methods. Microsatellite allele frequencies in Pacific herring and mtDNA restriction fragment length polymorphism (RFLP) haplotypes of African elephants are analyzed. The impact of bias on the inference of genetic differentiation is also discussed.

## MODELS AND METHODS

### Statistical modeling:

Consider a sample taken from multiple localities in a metapopulation having an infinite-island model. Suppose that *K* demes or subpopulations are drawn from the metapopulation and alleles of individuals are counted for *L* loci. Let be a vector of the true allele frequencies at the *l*th locus in the *k*th subpopulation, where is the number of different alleles at the *l*th locus, and . Let denote a vector of observed allele frequencies at the *l*th locus at the *k*th subpopulation. Under the assumption of random sampling at each locality, the distribution of *n _{kl}*, given

*p*, can be assumed to be multinomial with the probability functionHere, . Note that the allele counts at each locus, given the true allele frequencies, are independent among subpopulations. We also assume linkage equilibrium. Therefore, the allele counts are also independent over all loci within each subpopulation. We assume that the distribution of

_{kl}*p*at the

_{kl}*l*th locus in the

*k*th subpopulation follows a Dirichlet distribution, with the probability density functionwhere θ is a scale parameter that is a vector common to all loci, and are the mean allele frequencies at the

*l*th locus satisfying . This distribution was originally derived by Wright (1945, 1951) as a diffusion approximation to the discrete generation Wright's model. A justification for the continuous generation model has been given by Rannala and Hartigan (1996).

In this model, the variance of the *j*th allele frequency for the *l*th locus, *p _{klj}*, is given by(1)The larger the value of θ, the smaller the genetic differentiation among subpopulations, and vice versa. In fact, genetic differentiation is expressed as(2)which was given by Wright (1969), Rannala and Hartigan (1996), Balding (2003), and Kitada and Kishino (2004). Hence, θ is a parameter that controls the degree of genetic differentiation among subpopulations.

### Estimation by the ML method:

The marginal distribution of the observed random vector *n _{kl}* is given by a Dirichlet-multinomial distribution as follows:(3)The parameters to be estimated are θ and . Because the mutual independence of data is assumed, the overall likelihood for these parameters based on Equation 3 is defined asThe ML estimator of θ is given by maximizing the likelihood function

*L*(θ, β) (Lange 1995; Weir 1996; Kitada

*et al.*2000). The parameter

*F*

_{ST}can be estimated with Equation 2.

### Estimation by the PL method:

Rannala and Hartigan (1996) used a PL approach to reduce computational time in optimizing the original likelihood *L*(θ, β). In their method, the vector of the nuisance parameter is replaced with the average values for observed allele frequencies throughout the population,and is maximized as if were the true parameters.

### Estimation by the IL method:

As an alternative to the two likelihood estimations described above, we propose an IL approach to estimate θ. This means that the free parameters in are regarded as nuisance parameters and are eliminated from *L*(θ, β) by integration. We use an integrated-likelihood function for θ, defined as(4)where *D* is the parameter space of β. This treatment can be regarded as a kind of Bayesian estimation using a noninformative prior for β. In fact, when the prior is assumed, the density function isand therefore holds. Unfortunately, an explicit form for *L*_{I}(θ) cannot be obtained in the metapopulation model.

We consider two alternative algorithms for the maximization of *L*_{I}(θ). One is the Monte Carlo EM (MCEM) algorithm (Wei and Tanner 1990). The EM algorithm converges on the maximum value of *L*_{I}(θ). However, as is well known, the convergence of the EM algorithm is slow. For this reason, we use mainly the Laplace approximation (Pawitan 2001). The program for the latter method is available from the authors upon request. Detailed descriptions of both of the methods are given in appendixes a and b.

## SIMULATION STUDIES

### Simulation scenarios:

We evaluated the estimation performance of the ML, PL, and IL methods with numerical simulations. For simplicity, we considered only cases in which the same number of individuals are collected from each of the sampled subpopulations and the different loci share the same mean allele frequencies.

Simulation data were generated as follows. A simulation scenario was specified by the number of subpopulations sampled (*K*), sample size (*N _{k}*), the number of loci (

*L*), the number of alleles (

*J*), mean allele frequencies (β

_{l}_{l}), and the degree of genetic differentiation (

*F*

_{ST}). The parameters used in this study are as follows:

Number of subpopulations sampled:

*K*= 2, 3, 4, 5, 7, 10, 15Sample size:

*N*≡_{k}*N*= 50, 100, 200Number of loci:

*L*= 1, 3, 5, 10, 20Number of alleles:

*J*≡_{l}*J*= 2, 5, 10Mean allele frequencies: and

Genetic differentiation:

*F*_{ST}= 0.01, 0.05, 0.1, 0.4.

Once specifying a simulation scenario, the observed allele frequencies, , were generated from a Dirichlet-multinomial distribution in Equation 3 for each locus. After the data were generated, we estimated the parameter θ using the ML, PL, and IL methods, and then we estimated *F*_{ST} with the relationship *F*_{ST} = 1/(1 + θ). The number of simulation replicas was fixed at 1000 throughout each simulation scenario. Then, the mean and standard deviation of 1000 estimates of *F*_{ST} were assessed for each method. In the IL method, we used the Laplace approximation because its computation cost is lower than that of the MCEM algorithm. However, before undertaking more comprehensive simulations, we checked the consistency of the two algorithms by using some of the simulation data sets.

### Simulation results:

We first examined the efficacy of the Laplace approximation for our IL approach. Figure 1 compares the estimates based on the two different algorithms used to find the maximum value of the IL function. Simulation data were generated for the case where *K* = 5, *N* = 50, *L* = 10, *J* = 5, and *F*_{ST} = 0.1. As shown in Figure 1, estimates made with the Laplace approximation tended to be close to those made with the MCEM sequence. The difference of computation times between the two algorithms was noteworthy. For the situation above, the estimation with the Laplace approximation took ∼6 sec, which was comparable to those with the ML and PL methods (∼3 and 1 sec, respectively), while MCEM spent ∼40 min up to the 40th step on a Pentium-IV 3.2-GHz PC. Note that the computing time varied little among data sets utilized within a same scenario, but depended much on the number of subpopulations sampled, loci, and alleles.

Figure 2 shows the estimation performance of the ML, PL, and IL methods for *F*_{ST} = 0.05 for various values of *K*, the number of loci *L*, the number of alleles *J*, and the sample size *N*. We used the set of parameters *K* = 5, *N* = 50, *L* = 10, and *J* = 5 as baseline values and changed each of them. Only the results for the case of are shown here, because differences in mean allele frequencies between scenarios had little impact on estimation performance. As expected, severe underestimation of *F*_{ST} was observed with both the ML and PL methods, especially for small values of K. In contrast, less bias was observed with the IL method. These results demonstrate that the IL method outperforms the ML and PL procedures in terms of minimizing bias in all cases.

With all three methods, increasing the number of subpopulations sampled, *K*, reduced both bias and variance. On the other hand, increasing the number of loci, the number of alleles per locus, and the sample size had little effect on bias, but led to a reduction in variance. The latter phenomenon has been reported by Balding (2003) for the ML method. These results demonstrate that the number *K* is very important for the precise estimation of *F*_{ST}.

More detailed results for *N* = 50, *L* = 10, and *J* = 5 are shown in Table 1. All the estimates were negatively biased. The ML and PL estimates were very similar in all cases and had large negative biases, especially when *F*_{ST} was <0.1 and the number of sampling localities was less than five. On the other hand, the PL estimates had a much smaller bias under these conditions. When *F*_{ST} was as large as 0.4, the bias of the ML and PL estimates was moderate and similar to that of the IL estimate.

## APPLICATIONS TO REAL DATA

### Pacific herring:

The first example, a marine fish species, shows a case with a low *F*_{ST}-value. A total of 1263 fish were taken from three localities, Akkeshi, Yudounuma Lake, and Funkawan Bay, which are located on the east coast of Hokkaido in Japan. Table 2 shows the observed allele frequencies at five microsatellite loci, which were in Hardy–Weinberg equilibrium in each sample (Sato 2004). A large value for the gene flow rate, which means a small value for *F*_{ST}, was observed (Table 3). In this case, the estimates made with the ML and PL methods were much smaller than that made with the IL method. This result is consistent with a phenomenon observed in the simulation studies; that is, there is a clear difference between the IL and ML or PL estimates when the *F*_{ST}-value is small.

### African elephants:

Next, we used the mtDNA RFLP haplotype data given in Table 3 of Rannala and Hartigan (1996), which are a modification of the original data of Georgiadis *et al.* (1994). In this study, 10 haplotypes were evaluated in 270 elephants from 10 populations in Kenya, Zimbabwe, Botswana, and South Africa. As shown in Table 3, the relative differences in the estimates of θ and *F*_{ST} were smaller than those observed in the Pacific herring. This is due to the greater number of sampling localities and the higher value of *F*_{ST} compared with those of the Pacific herring population.

## DISCUSSION

In this article, we have proposed a new method for the estimation of genetic differentiation between populations, based on the IL function, to reduce the bias observed in the ML and PL methods, which have been used previously in this field. Our simulation results demonstrate that the estimation performance of the IL method is much better than that of the conventional ML and PL methods in terms of unbiasedness and precision. Differences in estimation performance were observed, especially when the number of subpopulations sampled, *K*, was small or the level of genetic differentiation, *F*_{ST}, was low. In fact, the number of subpopulations sampled is often restricted in studies of the population genetics of wildlife. Furthermore, *F*_{ST} could be small for species that have no specific boundaries, such as birds and fishes, as shown in the data analysis for Pacific herring. In this sense, the IL method is strongly recommended for the estimation of *F*_{ST}.

We focused on the estimation of *F*_{ST} and θ for the metapopulation models. Thus, we regarded the mean allele frequencies as nuisance parameters. The elimination of nuisance parameters has been one of the central problems in statistics. As shown in our simulation studies, the large degree of bias in the ML and PL estimates of *F*_{ST} did not decrease even when the number of loci was increased. These phenomena are typical cases of so-called Neyman–Scott problems (Neyman and Scott 1948), in which the dimension of the nuisance parameter increases with the number of observations. There are several possible ways to eliminate nuisance parameters: the conditional, marginal, and IL methods. Of these, we used the IL method because no suitable statistics for conditional- or marginal-likelihood methods were available in our model. Although a small amount of bias was still observed in the IL estimates, the improvement in estimation performance using the IL method is considerable.

The resources for sampling surveys in the field are always limited. Given the total sample size, it is recommended that the number of sampling localities be as large as possible by sacrificing the number of individuals sampled at each locality (Figure 2). Increased sample size at each locality yielded little reduction in bias in the estimation of *F*_{ST}, although it could decrease the estimation variance. To better understand the role played by the sampling scheme of this study, we conducted further simulation studies, in which the total sample size over all sampling localities was fixed. The simulation scenario was set at *F*_{ST} = 0.05, *L* = 10, and *J* = 5, with . The total sample size, *K* · *N*, was fixed at 600. As shown in Figure 3, an increase in *K* improved estimation performance. Furthermore, for small *K*, the variance was small because of the large sample size. However, the bias was still large when the values of *K* were small, although the sample size was considerably larger than that in the case of a large *K*. In the field, it is often much more difficult to visit many sites than to study many individuals at each site. Therefore, our IL method may be especially valuable in such circumstances.

Our statistical estimation assumes the Dirichlet distribution for allele frequencies. This assumption is well suited to the metapopulation with an island structure. On the other hand, examination of robustness of the IL estimation based on such an assumption is of interest. For this purpose, we conducted a small simulation study assuming a simple scenario for non-Dirichlet with biallelic loci. Now consider sampling from a metapopulation that consists of a large number of subpopulations, and suppose that allele frequencies at a locus in a subpopulation are either (0.6, 0.4) or (0.4, 0.6). The proportion of those two kinds of subpopulations in the metapopulation is assumed to be 1:1, which means that allele frequencies at a locus of a sampled subpopulation are (0.6, 0.4) with the probability . In this case, the mean allele frequencies over the metapopulations are (0.5, 0.5), and the variance is calculated as 0.5(0.4 − 0.5)^{2} + 0.5(0.6 − 0.5)^{2} = 0.01. Hence, *F*_{ST} is 0.04 (= 0.01/0.25). Under this scenario with *K* = 5, *N* = 50, and *L* = 10, we estimated *F*_{ST} using the IL method. The results are summarized in Table 4. Although the estimation performances are slightly different between the cases of Dirichlet and non-Dirichlet, the IL method based on the Dirichlet assumption still performed well in both cases. Note that the ML and PL methods were still heavily biased in this case (not shown here). This simulation is not comprehensive, but it suggests that our method may be useful for other genetic models.

In this article, we assumed that the parameter θ is common throughout all loci. However, as discussed by Balding (2003), θ can vary across loci because of different mutation rates and/or selection effects. Even if the mutation rates are incorporated into the model and assumed to be different among loci, the distributional form of the Dirichlet holds through θ_{l} = 4*Nm* + 4*N*μ_{l} or *F*_{ST,l} = 1/(1 + 4*Nm* + 4*N*μ_{l}), where *m* is a migration rate and μ_{l} is a mutation rate at the *l*th locus (Wright 1969). In this case, it is possible to estimate the value of θ locus by locus. However, assuming a random effect model for θ over all loci is also a promising alternative method. This assumption can make the estimation not only easier but also more efficient. The first advantage is the reduction in the dimension of the parameter. The second advantage arises from the concept of “borrowing strength” across loci in the context of an empirical Bayes procedure (Beaumont and Rannala 2004). Incorporating such random effects leads to further hierarchical modeling. Our IL method, however, can be extended to such modeling. Furthermore, Bayesian computational methods and algorithms are also possible candidate methods (Beaumont and Rannala 2004). Meanwhile, the presence of the selection effects unfortunately breaks down the Dirichlet assumption for the allele frequencies, and even the variance of allele frequency does not have a simple form as in Equation 1 (Wright 1969). Hence, the extension of our method may not be straightforward. These topics are problems to be investigated in the future.

Linkage disequilibrium is an important factor in practice. Under linkage disequilibrium, a probability distribution of composite genotypes for biallelic cases is given in Kitada and Kishino (2004). In this article, a simulation study was conducted for assessing the estimation performance of *F*_{ST} under various levels of linkage disequilibrium. The results (shown in Table 3 in Kitada and Kishino 2004) suggested that the maximum-likelihood estimation based on smaller numbers of sampling localities tended to cause underestimation of *F*_{ST}, which is a similar phenomenon observed in our article under linkage equilibrium. Meanwhile, they showed that the level of linkage disequilibrium had little impact on the amount of estimation bias although it slightly affected the estimation variance for *F*_{ST}. Therefore, we expect that our integrated-likelihood approach is also effective to improve the estimation performance in the linkage disequilibrium model, and the amount of reduction of bias is not related to the level of linkage disequilibrium. Although the likelihood function is complicated in the model, this topic also warrants further investigation.

## APPENDIX A: ESTIMATION WITH THE MCEM ALGORITHM

At first, we briefly illustrate a typical case of the EM algorithm (McLachlan and Krishnan 1997). Let *Y* and *Z* be the observed and latent/missing random vectors, respectively. Let *f*(*y*, *z*; ω) denote the joint probability density function, where ω is a vector of parameters. The log-likelihood function for ω, based on the complete data (*Y*, *Z*), is given by *l*_{c}(ω) = log *f*(*y*, *z*; ω). To find the maximum-likelihood estimator, the likelihood based on the observed data *y*, must be maximized. However, some models do not have closed forms for *l*_{o}(ω). Instead of directly maximizing *l*_{o}(ω), the EM algorithm alternates between the following two steps:

E step: compute

*Q*(ω|ω^{[t]}) =*E*[*l*_{c}(ω)|*Y*, ω^{[t]}].M step: define ω

^{[t+1]}as a value of ω that maximizes*Q*(ω|ω^{[t]}).

In our metapopulation model, the observed data are the set , and the latent variables are . Although β_{l} is originally a parameter vector, it can be regarded as a random vector with a flat prior in the context of Bayesian estimation. The complete log-likelihood function is given byNote that the second term is ignorable because π(β_{l}) ∝ 1. The original EM algorithm expects an explicit formula for the conditional expectation *Q* in the E-step. However, *Q* in our model does not have a closed form because the log-likelihood is complicated by the Dirichlet-multinomial distribution. To evaluate the function *Q*, we use a Monte Carlo integration via the MCMC method.

We calculate . To generate the random variables from the conditional distribution of β_{l}, given and θ^{[t]},the following Metropolis–Hasting sampling method (*e.g.*, Robert and Casella 2004) is used.

Sample a candidate vector β

_{l}* from a proposal distribution .Compute an acceptance ratio defined by

Set as

Let *m _{t}* be the sample size of the Monte Carlo integration at the

*t*th step in the EM algorithm. Precision of the Monte Carlo integration depends on the sample size. To obtain better precisions at later steps in the E-step, which affect convergence of θ

^{[t]}, we increase

*m*by

_{t}*t*asThe first

*m*/5 samples are discarded as constituting the burn-in period to make the MCMC sequences independent of initial values of β. The value

_{t}*k*is the length of thinning; that is, every

*k*th simulation draw is kept as an output of the Monte Carlo data. This is to reduce the autocorrelation in each sequence. We set

*k*= 10. The function

*Q*(θ|θ

^{[t]}) is approximately assessed aswhere

*m*′

_{t}is the length of the effective MCMC sequence. If the standard error of the parameter estimate is required, it can be evaluated using the observed information as follows:

## APPENDIX B: ESTIMATION WITH THE LAPLACE APPROXIMATION

Our integrated likelihood can be expressed as(B1)

We illustrate the approximation by considering the contribution made by the *l*th locus [say ] to *L*_{I}(θ). The Laplace approximation is often used in statistics. The approximation depends on a second-order Taylor expansion of . Let be the value of β_{l} that maximizes log *f*(*n _{l}* | θ, β

_{l}) for fixed θ. Then, can be approximated (ignoring factors not depending on θ) aswhereand det{

*H*(θ)} denotes the determinant of

*H*(θ). Taking the logarithm of , we get(B2)The first term on the right-hand side of Equation B2 is the profile log-likelihood of θ. In fact, Equation B2 can be regarded as an approximate modified profile log-likelihood of θ to reduce the estimation bias (Lindsey 1996; Pawitan 2001). The estimates have of course uncertainty due to their own nature. However, if only the first term of the right-hand side of Equation B2 is present, acts as if the true value, and therefore the uncertainty in is not taken into account. This is a reason why the IL method outperforms the conventional ML and PL methods.

In the metapopulation model, it is not possible to obtain an exact formula for *H*(θ) or . To maximize , we use an iterative method based on automatic differentiation (Skaug and Fournier 2006), which is implemented with the statistical software ADMB-RE (http://otter-rsch.com/admbre/admbre.html). To improve the accuracy of the Laplace approximation, we changed the variables in the integral of Equation B1 using the logistic transformation with the constraint . The computation time is much less than that required for the MCEM algorithm.

## Acknowledgments

The authors are grateful to two anonymous reviewers for their constructive comments on the original version of this manuscript.

## Footnotes

Communicating editor: R. W. Doerge

- Received January 3, 2006.
- Accepted May 18, 2006.

- Copyright © 2006 by the Genetics Society of America