## Abstract

In this article, we develop an admixture F model (AFM) for the estimation of population-level coancestry coefficients from neutral molecular markers. In contrast to the previously published F model, the AFM enables disentangling small population size and lack of migration as causes of genetic differentiation behind a given level of *F*_{ST}. We develop a Bayesian estimation scheme for fitting the AFM to multiallelic data acquired from a number of local populations. We demonstrate the performance of the AFM, using simulated data sets and real data on ninespine sticklebacks (*Pungitius pungitius*) and common shrews (*Sorex araneus*). The results show that the parameterization of the AFM conveys more information about the evolutionary history than a simple summary parameter such as *F*_{ST}. The methods are implemented in the R package RAFM.

IN the fields of animal and plant breeding, coancestry coefficients are often used as measures of relatedness between individuals (Bink *et al.* 2008). For example, in a noninbred population the coancestry between full-sibs or between a parent and an offspring is

Individual-level coancestry coefficients (or probabilities of IBD) are useful in gene mapping, because they tell how much the genomes of two individuals are expected to resemble each other; *i.e.*, they summarize the expected level of genetic similarity. In analogy, population-level coancestry coefficients can be used as measures of relatedness between local populations, and they can be combined with phenotypic data to detect signals of selection in quantitative traits, as opposed to those caused by random drift (Merilä and Crnokrak 2001; Mckay and Latta 2002; Ovaskainen *et al.* 2011).

Coancestry coefficients can be calculated directly, if pedigree information is available, but their estimation for natural populations is often challenging. One approach for doing so is to use the link between coancestry coefficients and coalescence times (Rousset 2004). Coalescence time distributions can be solved, at least numerically, for a population that is in a stationary state, assuming that the demographic parameters are known (Bahlo and Griffiths 2001). However, in the context of evolutionary ecology of natural populations, this is rarely the case, as there is often limited direct information on demographic history, and it can be unrealistic to assume any kind of stationarity. Instead, a common approach is to infer the demographic history using neutral molecular markers genotyped from the present generation. One statistical framework for estimating coancestry coefficients in this way is given by the F model (Falush *et al.* 2003; Gaggiotti and Foll 2010). However, this approach suffers from the structural limitation that the subpopulations are assumed to have radiated independently from the ancestral population, so that there has been no recent gene flow. Consequently, the F model cannot account for limited gene flow and small population size as alternative sources of genetic differentiation (Gaggiotti and Foll 2010).

In animal and plant breeding, a number of alternative methods have been developed for estimating coancestry coefficients from molecular marker data for pairs of individuals. Bink *et al.* (2008) survey seven such methods, concluding that the surveyed estimators have poor statistical properties, except in the special case that the allele frequencies are known for a hypothetical reference population. Furthermore, as Fernandez and Toro (2006) point out, many of these estimators have undesired mathematical properties; *e.g.*, they may yield logically incompatible estimates for different pairs of individuals. Software by Maenhout *et al.* (2009) removes some of these flaws by *post hoc* modification of the parameter estimates.

In this article, we focus on the case where neutral genotypic data are available for a set of subpopulations, and the problem is to infer the matrix of coancestry coefficients among these local populations. We model the demographic histories of the subpopulations by an admixture of evolutionary independent lineages, thus extending the F model in a way that relaxes the structural assumption noted above. We use an admixture of independent lineages as a phenomenological model for the evolutionary history of a metapopulation where local populations experience a limited level of gene flow. Apart from Gaggiotti and Foll (2010), our method is also a generalization of that of Fu *et al.* (2005), because we consider multiallelic loci and a more general population structure than the case of clustered subpopulations. With these extensions, our model contains both gene flow and pure random drift as factors influencing the level of differentiation. Contrary to the “pairwise methods” used in animal and plant breeding, both the original F model and our model permit writing the likelihood of individual-level data directly as a function of population-level coancestry coefficients. In the following, we first introduce the modeling approach and then its Bayesian parameterization that we have implemented in the R-package RAFM, and finally we illustrate the modeling approach with the help of simulated and real data.

## The Modeling Approach

### Coefficients of coancestry

Our main interest is in the estimation of population-level coefficients of coancestry, denoted by *A*, *B*). We define *i* and *i*′, and *n _{A}* is the number of individuals in population

*A*. We note that the definition of Equation 1 allows for the possibility that the level of coancestry is not identical for all pairs of individuals

*i*ε

*A*and

*i*′ ε

*B*.

*A priori*, in lack of this information,

*A*and

*B*, and thus it can be used interchangeably with

We follow Rousset (2004) and call two gene copies IBD if they originate from the same ancestral copy and are identical by state; *i.e.*, they have not mutated since their divergence. The coancestry coefficients and the probabilities of IBD for neutral loci are often used interchangeably, but they have a slight difference (we denote the latter by *A* and *B* as (Rousset 2004), *e.g.*, for a model with discrete generations:*C _{AB}*

_{,}

*is the probability that the two gene copies coalesce exactly*

_{t}*t*generations before present, and

*μ*is the per-locus per-generation probability of mutation. Bahlo and Griffiths (2001) derive formulas that allow the numerical computation of

*et al.*2007).

Sometimes the biological context is such that there has been a major perturbation, such as the last ice age, after which the subpopulations have diverged from a common ancestral pool. In this case, instead of assuming stationarity, it is more natural to consider a finite population history of *T* generations. In this case,

### The relationship between coancestry and *F*_{ST}

*F*_{ST} is one of the most widely used statistics in population genetics, and it is routinely used as a measure of genetic differentiation (Rousset 2002, 2004; Whitlock 2011). Depending on the definition of *θ*, *F*_{ST} can be defined through coancestry, probability of IBD, or probability of identity by state as*F*_{ST} through population-level coancestry. In Equation 5, *F*_{ST} (Rousset 2004), we do not weight the averages, *e.g.*, by the sizes of the local populations. We are chiefly interested in estimating the coancestry coefficients and investigating the properties of the admixture F model (AFM), but we also report *F*_{ST} (defined through the coancestry-based variant of Equations 4 and 5) estimates because of the centrality of *F*_{ST} in the literature.

### The AFM

In this section, we extend the F model (Falush *et al.* 2003; Gaggiotti and Foll 2010) to an AFM that allows for gene flow among the local populations. As is the case with the original F model, we assume that the local populations are derived from a common ancestral population and consider the limit of a small mutation rate, *i.e.*, the situation that relates to Equation 3.

Denoting the frequency of allele *u* at locus *j* in the ancestral generation by *q _{ju}*, the expectation and variance of the allele frequency in population

*A*can be written as

*φ*is a factor that depends on the demographic model (Lynch and Walsh 1998). For an isolated population of a constant effective size,

*a*corresponds to a small effective population size or a large number of generations

*T*, both of which imply a high amount of random genetic drift. The Dirichlet distribution is just a convenient approximation for the distribution of allele frequencies under pure random drift, as their true distribution is difficult to implement in a statistical model (see File S2). Also the truncated normal distribution is often used to approximate this distribution (Nicholson

*et al.*2002; Balding 2003; Coop

*et al.*2010). However, the truncated normal distribution is more difficult to adapt to the multiallelic case than the Dirichlet distribution as the frequency distribution is constrained by the condition

To extend the model for *n _{ε}* evolutionary independent lineages (Figure 1). The allele frequencies in each lineage are distributed as in Equation 8;

*i.e.*, we assume for locus

*j*and lineage

*k*,

*a*measures the amount of drift experienced by this lineage. The allele frequencies in locus

_{k}*j*in local population

*A*are defined as a mixture the lineage-specific frequencies, namely

*k*to sum up to unity over the lineages,

_{Ak}**p**

*is a proper frequency distribution. Setting the lineage-loading matrix to the identity matrix yields the special case of fully independent demes (the F model of Falush*

_{Aj}*et al.*2003). Technically, our construction is analogous to factor analysis (Gorsuch 1983), with lineages as factors and lineage loadings

*k*as factor loadings.

_{Ak}A convenient property of the AFM is that the subpopulation-level coancestry coefficients depend on the model parameters in a very simple way (Table 1). As shown in File S1,

Thus, after fitting the AFM to data it is straightforward to obtain an estimate of the matrix of population-to-population coancestry coefficients. By construction, this matrix will be always positive definite, avoiding the logical problems from which some of the earlier methods suffered (Fernandez and Toro 2006).

Assuming no genetic structure within subpopulations, *i.e.*, a random distribution of alleles among and within individuals, the genotype of each individual in subpopulation *A* is a multinomial random variable, *x _{ij}* ∼ Multinomial(2,

**p**

*). Notably, inbreeding due to a small population size is represented by a high intrapopulation coancestry*

_{Aj}### Parameter estimation with Bayesian inference

To parameterize the AFM with Bayesian inference, prior distributions need to be defined for the primary parameters *q*, *α*, and *k*. We assume the distributional forms*j*, *k*, and *A* refer to loci, lineages, and subpopulations, respectively. In the case studies below, we assume the values *μ _{a}* = 2,

*A*makes the dominant contribution to subpopulation

*A*,

*i.e.*, that the matrix

**k**is diagonally dominant. To do so, we let

*k*≠

*A*. This specification links each population with a particular lineage by assuming that lineage

*A*makes a dominating contribution to population

*A*. It also ensures that label switching is not possible, thus improving the mixing of the Markov chain Monte Carlo (MCMC) algorithm (Gelman and Carlin 2004).

The number of alleles (*n _{j}*) in locus

*j*in the ancestral generation is generally unknown, as some alleles may have disappeared after the lineages have diverged or are not present in the sampled individuals. Due to the aggregation property of the Dirichlet distribution, all of the unobserved alleles can be binned into a single unobserved class. Thus, we define

*n*as the number of distinct alleles observed in locus

_{j}*j*plus one.

The directed acyclic graph that illustrates the link from the primary parameters (**k**, **α**, **q**) through the derived parameters (**z**, **p**) to the data **x** is shown in Figure 2. Given the data **x**, the posterior density can be decomposed as**k**, **α**) (Equation 12). We use the adaptive random-walk Metropolis–Hastings algorithm of Ovaskainen *et al.* (2008) to sample the posterior density π(**k**, **α**, **q** | **x**). More details of the algorithm can be found in File S3, and it is implemented in the R package RAFM.

## Numerical Examples

We tested the performance of the method described above with two kinds of simulated data: data generated by the AFM itself and data generated through individual-based pedigrees that we in turn generated by a demographic model with continuous migration among subpopulations. The first type of data was used to investigate the performance of the estimation scheme in the ideal case that the data follow the structural assumptions of the model. The second type of data was used to examine whether a mixture of independent lineages can yield a good approximation of a more realistic demography in the sense of providing an accurate estimate of the matrix **α** and **k** correlate with the demographic parameters in an intuitive way.

### Case studies with data generated by the AFM

First, we considered *A* and *B* and assumed the parameter values **k** = (0.9, 0.1; 0.1, 0.9) and **α** = (2.7, 2.7), which leads to *F*_{ST} = 0.18. As a default case, we assumed that *n _{A}* =

*n*= 100 individuals from each population were genotyped for

_{B}*n*= 4 allelic variants that were equally common in the ancestral generation. To test the dependency of parameter estimates on sample size, we varied each of these parameters in turn, considering

_{j}*n*=

_{A}*n*= 10, 100, 1000;

_{B}*n*= 2, 4, 8. Figure 3 shows how the accuracy of the estimated

_{j}*F*

_{ST}value increases with sample size. As expected from earlier research (Gaggiotti and Foll 2010; Wang and Hey 2010), increasing the number of loci improves the accuracy much more rapidly than increasing the number of individuals. Analogously, increasing the number of alleles per each locus,

*i.e.*, increasing the level of polymorphism, brings more resolution to the data, and thus it also rapidly improves parameter estimates. Contrary to the case studies of Jost (2008), but consistent with the fact that

*F*

_{ST}is defined through coancestry, the estimates of

*F*

_{ST}do not decrease when the polymorphism of marker loci increases (Figure 3A).

To test whether local drift and lack of gene flow could be separated as alternative causes of genetic differentiation, we repeated the above (with the default sample size) with the off-diagonal value of **k** set to 0.05, 0.15, 0.25 and the value of **α** adjusted so that *F*_{ST} = 0.18 in all cases (Figure 4). Note that gene flow sets an upper limit to population differentiation: given a value of gene flow (*i.e.*, off-diagonal of **k**), there is an upper limit to *F*_{ST}, namely the one produced by **α** = (0, 0). While the separation of gene flow and migration is not possible in the standard F model (Gaggiotti and Foll 2010), Figure 4A shows that the parameters **k** and **α** are identifiable in the AFM, if sufficient data are available. As a consequence, it is possible to estimate a full matrix *F*_{ST}.

### Case studies with an individual-based model

We constructed pedigrees for *q _{ju}* = 0.25. The two parents of each individual in the subsequent generations were randomized (independently of each other) with probability 1 −

*m*among the individuals of the same subpopulation and with probability

*m*among the individuals of the other subpopulation (thus implying a per-capita migration rate

*m*). We modeled diploidic inheritance for 32 unlinked loci. To vary the level of gene flow and genetic drift, we considered three scenarios, in each of which the two subpopulations had diverged 50 generations ago. In the baseline scenario 1, we assumed 200 individuals per population and

*m*= 0.001. In scenario 2, we increased the amount of drift (and thus also

*F*

_{ST}) by assuming 50 individuals per subpopulation. Finally, scenario 3 differed from the baseline scenario 1 by having a higher amount of gene flow,

*m*= 0.02. As the purpose of this simulation study was to examine whether the AFM is able to approximate individual-based pedigrees rather than to test its statistical power (which we demonstrate in Figures 3 and 4), we assumed that large data sets were available,

*i.e.*, 100 individuals per subpopulation genotyped for 32 loci (even for the smaller subpopulations), each having four allelic variants in the ancestral generation. We created four replicate data sets for each of the scenarios 1–3.

Figure 5 shows that the AFM can mimic individual-based pedigrees in the sense that the parameters that measure gene flow (**k**) and genetic drift (**α**) vary in line with the individual-level parameters of the three demographic scenarios. Increasing local population size decreases **α**, and increasing gene flow increases the off-diagonal elements of **k**. Figure 5B shows that our approach performs well also for estimating *F*_{ST} from the individual-based data, although there is a slight bias upward for scenario 2 with a high amount of drift. Here the true values of the coancestry coefficients were computed from the simulated pedigree, using first the standard recursive relationships (File S1) and then averaging the individual-level coancestries over the natural subpopulations (not the genotyped individuals), according to Equation 1. For comparison, the Weir–Cockerham estimator (Weir and Cockerham 1984), implemented in FSTAT (Goudet 1995), gives very similar results (Figure 5B). Thus, the novelty of our approach is not in estimation of *F*_{ST}, but in separating gene flow and genetic drift as causal factors behind the observed level of differentiation. This separation is needed to estimate the full coancestry matrix *e.g.*, for detecting signals of natural selection in quantitative-genetic studies (Ovaskainen *et al.* 2011).

### Case studies with real data

Here we illustrate our model’s output with two natural data sets. Both of these data sets are included in the R package RAFM (Karhunen 2012). The first data set consists of 183 ninespine sticklebacks genotyped for 12 microsatellite markers (a subset of data used by Shikano *et al.* 2010), and it comprises four populations: Baltic Sea (60°13′N, 25°11′E), White Sea (66°18′N, 33°25′E), pond Bynästjärnen in Sweden (64°27′N, 19°26′E), and pond Pyöreälampi in Finland (66°15′N, 29°26′E). The pond populations are likely to have experienced a very high amount of drift, and all populations are likely to have remained reproductively isolated from each other since the last ice age (Shikano *et al.* 2010). Thus, the demographic assumptions of Equation 3 and the AFM are at least approximately in line with the biological context.

For the ninespine sticklebacks, the median (95% credibility interval) of *F*_{ST} given by the AFM was *F*_{ST} = 0.34 (0.31–0.37). The Weir–Cockerham estimator yielded a higher estimate, the point estimate (95% confidence interval) being *F*_{ST} = 0.50 (0.44–0.55). The median estimates of the within-population coancestries *F*_{ST} values, *i.e.*, *θ _{i}* of Weir and Hill (2002), calculated from pairwise

*F*

_{ST}values given by FSTAT (Goudet 1995): 0.13, 0.09, 0.77, and 0.98 in the same order. Thus, as expected intuitively, the pond populations have experienced much more drift than the sea populations. In our analysis, the White Sea population is more diverse than the Baltic Sea population, which may reflect a higher effective population size in the White Sea that is in direct contact with the Arctic Ocean. In line with the expectation of no recent gene flow due to geographic barriers, the level of between-population relatedness was very low in our analysis (median estimates of all off-diagonal terms of the matrix

^{−5}−10

^{−3}, attributable to numerical noise from the MCMC).

The second data set originates from a much smaller spatial setting, containing samples of the common shrew (*Sorex araneus*) on islands on the lake Sysmä (62°40′N, 31°20′E) and the surrounding mainland in Finland (Hanski and Kuitunen 1986). Here we utilize data from the mainland, two large islands (L1 and L3, areas 3.8 and 4.4 ha), and two small islands (S5 and S10, areas 0.7 and 0.4 ha). The islands form two pairs, each consisting of a large and a small island, so that the distance between L1 and S5, as well as the distance between L3 and S10, is <500 meters, but the distance between any other pair of islands is at least 1300 meters. The diameter of the lake is ∼3 km, and thus the size of the study system is comparable to the potential migration distances of shrews (Hanski and Kuitunen 1986).

The small spatial scale is reflected by the low overall degree of population differentiation, the AFM yielding the estimate *F*_{ST} = 0.08 (0.06–0.09) and the Weir–Cockerham estimator giving *F*_{ST} = 0.05 (0.04–0.07). As expected from variation in population size, the within-subpopulation relatedness *F*_{ST} estimates (0.01, 0.12, 0.09, 0.09, and 0.06 in the same order). The only off-diagonal terms that are ≥0.01 in the median estimate are between the mainland and the island L1 (0.01) and between the islands L3 and S10 (0.01) that are located close to each other, but it is hard to draw conclusions on a more general pattern based on this observation. This is in line with the discriminant function analysis based on metrical traits by Hanski and Kuitunen (1986), which also revealed little indication of isolation by distance.

## Discussion

The AFM can be used to infer population-level coancestry coefficients *et al.* (2005) for multiallelic data and a more general population structure. As discussed above, the estimates of *F*_{ST} by Rousset (2004). Using the AFM for estimating *F*_{ST} is justified subject to two conditions: First, we have assumed that the subpopulations have diverged from a common ancestral population at some time in the past. Second, we have assumed that the mutation rate is low compared to the time elapsed since divergence or at least compared to the influence of potential gene flow after time since divergence. If these two conditions are met, *F*_{ST} (Slatkin 1991, 1995; Rousset 2004). The AFM models the allele frequencies by an admixture of evolutionary independent lineages, but this assumption is less restrictive. As the simulations show, it can also be used to mimic the effects of continuous gene flow (Figure 5).

The parameters of the AFM convey information about the demographic history of the local populations, as we have demonstrated with the simulated data and the two natural data sets. Using the AFM, it is possible to analyze the level of connectivity between the subpopulations (as characterized by **k**) and the relative effective population sizes of the underlying evolutionary lineages. However, it is not possible to disentangle the absolute effective population sizes and the number of generations after divergence (as they are not identifiable on basis of **a** alone), nor it is possible to deduce per-capita rates of migration.

Apart from demography, the AFM also makes a number of assumptions regarding the type of genetic data. As discussed above, the mutation rate is assumed to be low, suggesting that using microsatellite markers should be avoided. As usual in population-genetic studies, we have also assumed that the markers used are selectively neutral. Thus, markers subject to diversifying (stabilizing) selection are likely to cause an upward (downward) bias in the estimate of *F*_{ST}, as is the case of *F*_{ST} estimates obtained by other methods (Excoffier *et al.* 2009). Third, we have ignored genotyping error, which is known to increase the sampling variation of *F*_{ST} estimates (Bonin *et al.* 2004; Herrmann *et al.* 2010). The implementation of these features to the present framework would be an important extension that we hope to be addressed by future work. Finally, we have used the Dirichlet distribution to model random genetic drift within each of the independent lineages. This approximation should be taken with some criticism (Nicholson *et al.* 2002; Balding 2003). Some authors have used truncated normal distribution in place of Dirichlet for estimating *F*_{ST} (Nicholson *et al.* 2002; Weir and Hill 2002; Coop *et al.* 2010). However, both of these statistical models are approximations of the true model, and both of them have their limitations, which we discuss in File S2.

For the molecular ecologists and population geneticists, *F*_{ST} is probably a more familiar variable than the matrix *F*_{ST} as a parameter, some consider it as an estimator or a point estimate of this parameter. For different types of data and different mutation models, a full “alphabet soup of related indices have been developed” (Whitlock 2011, p. 1083), which may cause part of the confusion. There has also been recent discussion concerning the aptitude of *F*_{ST} for measuring genetic differentiation (Jost 2008; Whitlock 2011). Some authors have reported that locus-specific values correlate with the polymorphism of the marker loci (Hedrick 2005; Carreras-Carbonell *et al.* 2006; Jost 2008). By the canonical definition (Equation 4), *F*_{ST} is fully determined by the coalescent, so that it is logically independent of ancestral polymorphism. On the other hand, a high rate of mutation of course shows both in *F*_{ST} and in the present level of polymorphism. At the limit of a low mutation rate, *F*_{ST} reduces into a function of expected coalescence times (Slatkin 1991, 1995; Rousset 2002, 2004; Whitlock 2011) that are independent of polymorphism. In line with this, our coancestry-based *F*_{ST} is a function of coancestry coefficients and the pedigree that do not depend on the ancestral polymorphism.

Jost (2008) pointed out that *F*_{ST} can have low values even if the subpopulations do not share any alleles. In terms of coancestry coefficients, this implies *y*-axis of Figure 4, the value of *F*_{ST} can range anywhere between zero and one also in this case. However, unlike Jost (2008), we do not consider this as a problematic feature of *F*_{ST}. From the viewpoint of Equation 4, *F*_{ST} is just a summary statistic of the subpopulation-to-subpopulation coancestry matrix *F*_{ST} to be a very useful quantity in population genetics, *e.g.*, for the reason that it is the relevant statistic for *F _{ST}*−

*Q*comparisons that attempt to find signals of stabilizing and disruptive selection in quantitative traits (Merilä and Crnokrak 2001; Mckay and Latta 2002), although we note that also this analysis can be done more effectively using the full matrix of population-level coancestries

_{ST}*et al.*2011).

## Acknowledgments

We thank Christopher Wheat, Juha Merilä, and Michael Whitlock for helpful comments; Takahito Shikano for providing the ninespine stickleback data; and Ilkka Hanski for providing the common shrew data. Our research was supported by the Academy of Finland (grants 124242, 129636, and 250444 to O.O. and the work by M.K. partly covered by grants 213491, 129662, 118673, and 134728 to Juha Merilä) and the European Research Council (starting grant 205905 to O.O.).

## Footnotes

*Communicating editor: M. A. Beaumont*

- Received March 31, 2012.
- Accepted June 26, 2012.

- Copyright © 2012 by the Genetics Society of America