Inference of Population Structure Using Multilocus Genotype Data: Linked Loci and Correlated Allele Frequencies
Daniel Falush (a), Matthew Stephens (b), and Jonathan K. Pritchard (c)
a Department of Molecular Biology, Max-Planck Institut für Infektionsbiologie, 10117 Berlin, Germany
b Department of Statistics, University of Washington, Seattle, Washington 98195
c Department of Human Genetics, University of Chicago, Chicago, Illinois 60637
Corresponding author: Daniel Falush, Schumann Strasse 21/22, 10117 Berlin, Germany. E-mail: falush{at}mpiib-berlin.mpg.de
Communicating editor: M. K. UYENOYAMA
| ABSTRACT |
|---|
We describe extensions to the method of Pritchard et al. for inferring population structure from multilocus genotype data. Most importantly, we develop methods that allow for linkage between loci. The new model accounts for the correlations between linked loci that arise in admixed populations ("admixture linkage disequilibrium"). This modification has several advantages, allowing (1) detection of admixture events farther back into the past, (2) inference of the population of origin of chromosomal regions, and (3) more accurate estimates of statistical uncertainty when linked loci are used. It is also of potential use for admixture mapping. In addition, we describe a new prior model for the allele frequencies within each population, which allows identification of subtle population subdivisions that were not detectable using the existing method. We present results applying the new methods to study admixture in African-Americans, recombination in Helicobacter pylori, and drift in populations of Drosophila melanogaster. The methods are implemented in a program, structure, version 2.0, which is available at http://pritch.bsd.uchicago.edu.
THE study of admixed populations arises in many contexts in population genetics: for example, in the study of hybrid zones (![]()
In this article, we develop methods for studying the ancestry of both individuals and specific loci within admixed populations. Much of the previous work on population admixture has aimed to estimate average admixture proportions in an entire population (e.g., ![]()
We consider a situation in which we have multilocus genotype data from a sample of individuals collected from a population with (possibly) unknown structure. ![]()
Σk qk = 1). Both of those models assume that all the markers are unlinked and provide independent information on an individual's ancestry. In this article we introduce a third model, the "linkage model," which extends the admixture model to account for the correlations between linked markers that arise as the result of admixture ("admixture linkage disequilibrium"; ![]()
We also discuss a new prior model for the allele frequencies within each population, which can be used in conjunction with any of the three ancestry models. This model, while still relatively simple, is more accurate in many situations and sometimes allows much more information to be extracted from the data. These and a number of other extensions to the original model described by Pritchard et al. have been implemented in a computer program, structure version 2.0, available at http://pritch.bsd.uchicago.edu.
| SUMMARY OF OLD AND NEW MODELS |
|---|
Consider a sample of N individuals, each genotyped at L loci. ![]()
In the no-admixture model, each individual comes from one of the K populations. We let z(i) denote the population of origin of individual i and Z denote the vector (z(1) ... z(N)). Each of the K populations is characterized by a set of allele frequencies at each locus. Let pklj refer to the frequency of allele j at locus l in population k, and let P denote the full multidimensional vector of allele frequencies for all k, l, and j. A key modeling assumption is that there is linkage equilibrium and Hardy-Weinberg equilibrium (HWE) within populations. Hence, the likelihood of the genotype of individual i, conditional on its population-of-origin z(i), is simply a product of the frequencies of its alleles in that population.
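The likelihood described above can be illustrated with a short sketch. It is not taken from structure itself; the function name and the toy frequencies are hypothetical, and the sketch simply multiplies (on the log scale) the frequencies of an individual's alleles in its assumed population of origin.

```python
import math


def log_likelihood_no_admixture(genotype, freqs_k):
    """Log-likelihood of one genotype given its population of origin.

    genotype: list over loci; each entry is a tuple of allele indices
              (two entries per locus for a diploid).
    freqs_k:  list over loci; freqs_k[l][j] is the frequency of allele j
              at locus l in the assumed population of origin.
    """
    total = 0.0
    for locus, alleles in enumerate(genotype):
        for allele in alleles:
            # HWE and linkage equilibrium: allele copies contribute independently.
            total += math.log(freqs_k[locus][allele])
    return total


# Toy example (hypothetical numbers): two loci, two alleles each.
freqs_pop = [[0.9, 0.1], [0.5, 0.5]]
genotype = [(0, 0), (0, 1)]
loglik = log_likelihood_no_admixture(genotype, freqs_pop)
```

Working on the log scale avoids numerical underflow when the number of loci is large.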
An obvious limitation of the no-admixture model is that in practice individuals may have recent ancestors in more than one population. To model this, Pritchard et al. introduced an admixture model, in which each individual is assumed to have inherited some proportion of its ancestry from each population. Let q(i)k denote the proportion of individual i's genome that is derived from population k (where Σk q(i)k = 1), and let Q be the multidimensional vector of ancestry proportions for all the members of the sample. It is now possible for the different allele copies in an individual to come from different populations. (We use the term "allele copy" to refer to an allele carried at a particular locus by a particular individual.) To reflect this, the vector Z now records the population of origin of every allele copy in each individual, with z(i,a)l denoting the origin of the ath allele copy at locus l in individual i.
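Under the admixture model, the origin of each allele copy is drawn independently from that individual's ancestry proportions:

$$\Pr(z_l^{(i,a)} = k \mid P, Q) = q_k^{(i)} \qquad (1)$$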
This admixture model also assumes linkage equilibrium and HWE within populations.
Inference is performed in a Bayesian framework, which offers a number of practical advantages in this context. Among these, it allows a straightforward assessment of the statistical uncertainty in each estimate of interest. It also allows us to make use of any prior information that we might have regarding population membership for some members of the sample. See ![]()
The Bayesian approach requires priors for P and Q. Following Pritchard et al., the allele frequencies at each locus were modeled as draws from a symmetric Dirichlet distribution, independently for each k. Some modifications of this prior are described below. The admixture proportions q(i) for individual i were also modeled as draws from a symmetric Dirichlet distribution, in this case with a hyperparameter α. The assumption of symmetry in the prior for the q's corresponds intuitively to an assumption that the K populations contribute roughly equal amounts of genetic material to the sample. To better model situations where this is not the case, the updated implementation of structure allows different values of α to be estimated for each population (so α becomes a vector of K values, with αk representing the relative contribution of population k to the genetic material in the sample). Otherwise the prior for q is unchanged. Alternative models for q are considered by ![]()
In practice we may not know either the allele frequencies P or the populations of origin Z in advance. Pritchard et al. described a Markov chain Monte Carlo (MCMC) scheme that estimates these jointly. This procedure clusters individuals into populations and estimates the probability of membership (or, for the admixture model, the proportion of membership) in each population for each individual.
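To make the clustering step more concrete, the following sketch shows the kind of conditional update such a sampler might use for a single allele copy under the admixture model. It is an illustration under stated assumptions, not a transcription of the structure source: it assumes that, given q and P, the origin of an allele copy has probability proportional to the individual's ancestry proportion times the allele's frequency in that population. The function name and numbers are hypothetical.

```python
def origin_posterior(q_i, freqs, locus, allele):
    """Conditional distribution of the population of origin of one allele copy.

    q_i:    ancestry proportions of the individual (sums to 1).
    freqs:  freqs[k][l][j] = frequency of allele j at locus l in population k.
    """
    # Assumed form: Pr(z = k | allele, q, P) is proportional to q_k * p_klj.
    weights = [q_i[k] * freqs[k][locus][allele] for k in range(len(q_i))]
    total = sum(weights)
    return [w / total for w in weights]


# Toy example (hypothetical numbers): two populations, one locus.
freqs = [
    [[0.9, 0.1]],  # population 0
    [[0.2, 0.8]],  # population 1
]
posterior = origin_posterior([0.75, 0.25], freqs, locus=0, allele=0)
```

In a full sampler, a draw from this distribution would alternate with updates of P and Q given the current assignments.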
A number of related population genetic methods have been described, including ![]()
The linkage model:
A deficiency of the admixture model is that by assuming that the z's within each individual are independent, it ignores the correlations in ancestry that one would expect to see along each chromosome. In this context, it is helpful to distinguish between three sources of linkage disequilibrium (LD). The first source is variation in ancestry (q) among the sampled individuals. Variation in q leads to correlations among markers across the genome, even if they are unlinked, because individuals with a large component of ancestry in population k have an excess of alleles that are common in k. We call this LD "mixture LD." The second source is correlations in ancestry along each chromosome, which cause additional LD between linked markers. We visualize this LD as occurring because each chromosome is composed of a set of "chunks" that are derived, as unbroken units, from one or another of the ancestral populations. In our terminology, this second source is "admixture LD." The third source is "background LD" within populations, which usually decays on a much shorter scale (tens of kilobases in humans). The admixture model in ![]()
To make inference computationally tractable, we use a simple model that incorporates the notion of discrete chromosomal chunks inherited from ancestral populations. Whereas in the "admixture" model of ![]()
Formally, the above assumptions translate into replacing the admixture model assumption that the z's along each chromosome are independent with the assumption that z's along each chromosome are dependent, forming a Markov chain. Specifically, for haploid data, independently for each individual i,
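$$\Pr(z_1^{(i)} = k \mid r, Q) = q_k^{(i)} \qquad (2)$$

and

$$\Pr(z_{l+1}^{(i)} = k' \mid z_l^{(i)} = k, r, Q) = \begin{cases} \exp(-d_l r) + (1 - \exp(-d_l r))\, q_{k'}^{(i)} & \text{if } k' = k, \\ (1 - \exp(-d_l r))\, q_{k'}^{(i)} & \text{otherwise,} \end{cases} \qquad (3)$$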
where dl denotes the genetic distance from locus l to locus l + 1, assumed known. For diploid (or polyploid) data, independently for each individual i, the z's along each of i's two (or more) chromosomes form independent Markov chains satisfying Equation 2 and Equation 3.
Note that the linkage model includes the admixture model as a limiting case: as r tends to infinity in (3), all loci become independent, returning us to the original admixture model (Equation 1). Note also that we assume that r is the same for all individuals, although this assumption could be relaxed at the cost of an increase in the number of parameters.
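The limiting behavior noted above can be checked numerically. The sketch below assumes a transition rule consistent with the description in the text: with probability exp(-d r) no chunk boundary falls between two adjacent loci and the origin is copied, and otherwise the origin is redrawn from q. The function name and parameter values are hypothetical.

```python
import math


def transition_matrix(q, r, d):
    """Transition probabilities between populations of origin at adjacent loci.

    q: ancestry proportions of the individual (sums to 1).
    r: rate of chunk boundaries per unit of genetic distance.
    d: genetic distance between the two loci.
    """
    stay = math.exp(-d * r)  # probability of no boundary between the loci
    K = len(q)
    return [
        [(stay if k == k2 else 0.0) + (1.0 - stay) * q[k2] for k2 in range(K)]
        for k in range(K)
    ]


q = [0.3, 0.7]
linked = transition_matrix(q, r=10.0, d=0.05)      # moderate linkage
unlinked = transition_matrix(q, r=1e6, d=0.1)      # r very large
identical = transition_matrix(q, r=0.0, d=0.5)     # no breakpoints at all
```

With r very large, every row equals q, which corresponds to independent draws as in the admixture model; with r = 0 the matrix is the identity, so a whole chromosome would share one origin.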
Interpretation of the linkage model:
To provide some motivation for the linkage model, consider the following idealized scenario. Suppose that our sample comes from a diploid population that experienced a single "admixture event" followed by t2 generations of subsequent random mating within postadmixture populations. In the generation of the admixture event, individuals are formed by mating of individuals between two or more ancestral populations. These individuals inherit their DNA intact (i.e., without intervening recombination) from the ancestral populations. In the subsequent generation, the boundaries delineating these intact chunks will correspond to crossover events in a single meiosis and so (assuming no interference) will form a Poisson process of rate 1 per morgan. Chromosomes in each subsequent generation will inherit chunks of DNA from chromosomes in the previous generation in a similar manner, and it follows from standard results on the superposition of Poisson processes that, in chromosomes in the current generation, the boundaries between the chunks of DNA inherited intact since the admixture event will form a Poisson process of rate t2 per morgan.
This reasoning provides some justification for the form of the transition rates in Equation 3. However, it falls short of providing a complete justification for all assumptions of the linkage model and in particular for the assumption that the ancestral populations of origin of the chunks are independent draws from some (individual-specific) vector q. Furthermore, in real populations, biological details such as crossover interference and gene conversion events (or transformation in bacteria) will cause deviations from the assumed model. Nevertheless, the linkage model captures, in a parsimonious and computationally convenient way, the correlations in ancestry between linked loci that we would expect to see in admixed individuals from real populations.
The discussion above also suggests an interpretation of the parameter r in terms of the number of generations since admixture first occurred. Specifically, if the genetic distances dl between adjacent markers are measured in morgans, then r can be interpreted as an estimate of t2, the number of generations since the admixture event (although inevitable deviations from the model assumptions mean that it would be wise to treat any such estimate with a degree of caution). Similarly, if the genetic distances are measured in centimorgans, then r can be interpreted as an estimate of t2/100, because 1 centimorgan is 0.01 morgans. In some situations the genetic distances between loci may not be known, but a proxy such as physical distance may be available. If the physical distance between loci, measured in nucleotides, is used in place of the genetic distance for dl, then r can instead be interpreted as an estimate of the product of t2 and the recombination rate (expected number of crossovers per base pair per meiosis). If there is no information on map positions, then the linkage model is not applicable.
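The unit conversion can be written out as a small helper. This is only an arithmetic sketch with a hypothetical function name: because 1 morgan equals 100 centimorgans, a breakpoint rate expressed per centimorgan is 1/100 of the rate per morgan, so the implied number of generations is 100 times the per-centimorgan value of r.

```python
def generations_since_admixture(r, unit="morgan"):
    """Convert an estimate of r into an approximate number of generations.

    unit: unit of the map distances used when r was estimated,
          either "morgan" or "centimorgan".
    """
    if unit == "morgan":
        return r
    if unit == "centimorgan":
        return 100.0 * r  # rate per cM -> rate per morgan
    raise ValueError("unit must be 'morgan' or 'centimorgan'")


# Example: r = 0.098 breakpoints per centimorgan.
t_estimate = generations_since_admixture(0.098, unit="centimorgan")
```

As the text cautions, such a value is best read as a rough indication, since it relies on the single-admixture-event idealization.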
For many data sets, we will have little prior knowledge concerning the time since admixture (and perhaps also the recombination rate). We have therefore implemented a uniform prior for log r. The bounds of the prior should generally be set to include all biologically plausible values of r, which may range over several orders of magnitude (partly explaining the attraction of working with log r).
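A uniform prior on log r can be sketched as follows. The bounds used here are hypothetical placeholders, not defaults of structure; the point is only that equal prior mass falls on each order of magnitude between the bounds.

```python
import math
import random


def sample_log_uniform(r_min, r_max, rng):
    """Draw r so that log r is uniform on [log r_min, log r_max]."""
    return math.exp(rng.uniform(math.log(r_min), math.log(r_max)))


rng = random.Random(0)
# Hypothetical bounds spanning four orders of magnitude.
draws = [sample_log_uniform(0.001, 10.0, rng) for _ in range(1000)]
```

Roughly a quarter of these draws fall in each decade (0.001 to 0.01, 0.01 to 0.1, and so on), which would not be the case for a prior uniform on r itself.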
Computations for the linkage model:
Because in practice the z's for each chromosome are not observed, the Markov model for the z's used by the linkage model (Equation 2 and Equation 3) results in a hidden Markov model (HMM) for the observed genotype data. Standard HMM methods (see ![]()
Although the linkage model was developed with computational tractability in mind, it is nevertheless more computationally intensive than the admixture model. This can make the linkage model less convenient for particularly large or complicated data sets. For the African-American data set described below (626 diploid individuals and 252 loci) and K = 2 populations, a run consisting of 10,000 burn-in iterations followed by 50,000 further iterations took 3 hr using the admixture model, 7 hr for the linkage model if it was (incorrectly) assumed that the data were fully phased, and 11 hr for the linkage model assuming (correctly) that the data were unphased (calculations were performed on a DEC Alpha of 2001 vintage). Performance differentials will increase for larger K: the computation scales linearly with K for the admixture model and for the linkage model with phased data, but scales with $K^2$ for the linkage model with unphased or partially phased data.
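The linear scaling in K for phased data can be illustrated with a forward recursion for a single phased (haploid) chromosome. This is a sketch under assumptions, not the structure implementation: it assumes the transition rule in which the origin is copied with probability exp(-d r) and otherwise redrawn from q, and all names and numbers are hypothetical. Because the "redraw" term does not depend on the previous state, each locus costs O(K) rather than O(K^2).

```python
import math


def forward_log_likelihood(alleles, distances, q, r, freqs):
    """Log-likelihood of one phased chromosome under a Markov model of ancestry.

    alleles:   alleles[l] is the allele index observed at locus l.
    distances: distances[l] is the genetic distance from locus l to l + 1.
    q:         ancestry proportions of the individual.
    r:         rate of chunk boundaries per unit distance.
    freqs:     freqs[k][l][j] = frequency of allele j at locus l in population k.
    """
    K = len(q)
    f = [q[k] * freqs[k][0][alleles[0]] for k in range(K)]
    scale = sum(f)
    loglik = math.log(scale)
    f = [x / scale for x in f]  # normalized forward probabilities
    for l in range(1, len(alleles)):
        stay = math.exp(-distances[l - 1] * r)
        # f sums to 1, so the total mass that redraws its origin is (1 - stay).
        f = [
            (stay * f[k] + (1.0 - stay) * q[k]) * freqs[k][l][alleles[l]]
            for k in range(K)
        ]
        scale = sum(f)
        loglik += math.log(scale)
        f = [x / scale for x in f]
    return loglik


# Toy example (hypothetical numbers): two populations, two loci.
freqs = [
    [[0.9, 0.1], [0.2, 0.8]],  # population 0
    [[0.1, 0.9], [0.7, 0.3]],  # population 1
]
q = [0.5, 0.5]
ll_unlinked = forward_log_likelihood([0, 1], [0.1], q, 1e9, freqs)
ll_one_chunk = forward_log_likelihood([0, 1], [0.1], q, 0.0, freqs)
```

For unphased diploid data the hidden state is a pair of origins, which is one way to see why the cost grows with the square of K in that case.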
Models of allele frequencies:
As described above, ![]()
The new model is based on ideas in ![]()
The new model for correlated allele frequencies that we describe here is based on the same implicit assumptions as the model of ![]()
![]() |
(4) |
independently for each l. Here, λ may be fixed or estimated within the MCMC scheme. Conditional on PA, the frequencies in each population k have a prior distribution
![]() |
(5) |
independently for each k and l. The size of Fk tells us about the effective population size of population k during the time since divergence, with large values of Fk indicating a smaller effective population size (![]()
We refer to this new model for correlated allele frequencies as the "F" model. The name is chosen to reflect the fact that there are close connections between the model and the classical measure of correlations between populations, Wright's FST (![]()). In the classical model, the variance of pk is FST p̄(1 - p̄), where pk is the frequency of an allele in population k, and p̄ is the overall frequency of that allele across all subpopulations (![]()). In our model, the ancestral allele frequency plays the role of p̄, and Fk plays a role like that of FST in the classical model, except that we use a generalized model with different drift rates for each population. Using a different value of F for each population, rather than a single common value for all populations, introduces a considerable amount of extra flexibility into the model at the expense of only a few additional parameters.
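The connection between Fk and a variance of the FST type can be checked by simulation. The sketch below rests on an assumption that is not spelled out in the surviving text: that, conditional on the ancestral frequencies, the frequencies in population k follow a Dirichlet distribution with parameters equal to the ancestral frequencies multiplied by (1 - Fk)/Fk. Under that assumption the mean of each frequency is the ancestral value p and its variance is Fk p(1 - p). All names and numbers are hypothetical.

```python
import random


def sample_population_freqs(ancestral, f_k, rng):
    """Draw allele frequencies for one population given ancestral frequencies.

    Assumed form: Dirichlet with parameters ancestral[j] * (1 - f_k) / f_k,
    sampled by normalizing independent gamma variates.
    """
    scale = (1.0 - f_k) / f_k
    draws = [rng.gammavariate(p * scale, 1.0) for p in ancestral]
    total = sum(draws)
    return [g / total for g in draws]


rng = random.Random(1)
ancestral = [0.3, 0.7]
f_k = 0.1
samples = [sample_population_freqs(ancestral, f_k, rng)[0] for _ in range(20000)]
mean = sum(samples) / len(samples)
variance = sum((x - mean) ** 2 for x in samples) / len(samples)
```

Under the stated assumption, the sample variance should be close to 0.1 × 0.3 × 0.7 = 0.021, and larger values of Fk spread the population frequencies further from the ancestral ones.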
The prior distribution that we have implemented for F assumes that the Fk are a priori independent, with a density proportional to a gamma distribution truncated at 1 (so that Pr[0 < Fk < 1] = 1). Depending on the parameters of the distribution, the prior can be "harsh" (putting most of its weight on low values of F) or "permissive" (not discriminating strongly against any value of Fk). A harsh prior on low values of F corresponds to strong prior information that the allele frequencies in the different populations are similar to one another, and this seems generally to give the best performance in detecting subtle admixture in problems that are difficult for the independent frequencies model. However, if the values of Fk are being used to make evolutionary inferences, a permissive prior is more appropriate. In the Appendix, we present Metropolis-Hastings updates for PA, Pk, and Fk.
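The contrast between a harsh and a permissive prior can be visualized by sampling from a gamma distribution truncated at 1. The shape and scale values below are hypothetical and are not the settings used in structure; the sampler uses simple rejection.

```python
import random


def sample_truncated_gamma(shape, scale, rng):
    """Draw from a gamma(shape, scale) distribution restricted to (0, 1)."""
    while True:
        x = rng.gammavariate(shape, scale)
        if 0.0 < x < 1.0:
            return x


rng = random.Random(7)
# Hypothetical "harsh" prior: mass concentrated near zero.
harsh = [sample_truncated_gamma(2.0, 0.01, rng) for _ in range(2000)]
# Hypothetical "permissive" prior: nearly flat over (0, 1).
permissive = [sample_truncated_gamma(1.0, 10.0, rng) for _ in range(2000)]
```

With these illustrative settings the harsh draws cluster around a few percent, while the permissive draws spread almost evenly across the unit interval.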
| MODEL RESULTS USING SIMULATED DATA SETS |
|---|
To assess the uses and limitations of the new structure features, we have performed Wright-Fisher simulations, based on the seven demographic scenarios (I–VII) shown in Fig 1. Mutation parameters differ among the simulations and are specified separately for each one. Under each scenario, the goal is first to identify the current populations and second to reconstruct elements of their history: for example, the amount of genetic drift, the degree of admixture, and the time since admixture occurred.
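For readers unfamiliar with Wright-Fisher simulation, the following minimal sketch shows the drift step for a population that splits into two. It is a simplification and not the simulation code used for the scenarios here: it tracks biallelic loci only, ignores mutation and recombination, and uses hypothetical population sizes and names.

```python
import random


def drift(p, two_n, generations, rng):
    """Evolve one allele frequency by binomial sampling of 2N gene copies."""
    for _ in range(generations):
        count = sum(1 for _ in range(two_n) if rng.random() < p)
        p = count / two_n
    return p


def simulate_split(ancestral_freqs, two_n, generations, rng):
    """Let two daughter populations drift independently from shared frequencies."""
    pop_a = [drift(p, two_n, generations, rng) for p in ancestral_freqs]
    pop_b = [drift(p, two_n, generations, rng) for p in ancestral_freqs]
    return pop_a, pop_b


rng = random.Random(42)
ancestral = [rng.random() for _ in range(50)]  # 50 unlinked biallelic loci
pop_a, pop_b = simulate_split(ancestral, two_n=200, generations=8, rng=rng)
```

Genotypes for a sample can then be drawn from the resulting frequencies, giving data in which the two populations differ only by a few generations of drift.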
Differentiating between closely related populations (the F model):
One advantage of the new F model is that it can sometimes detect population subdivision that is invisible to structure when the gene frequencies of the populations are modeled without correlations. An example is shown in Fig 2, where a single random-mating population splits into two (scenario II). Eight generations after the split, the uncorrelated model is unable to distinguish between the two populations, while the F model distinguishes them quite accurately, with the exception of a few individuals that are not assigned with high probability to either population. After 16 generations of separate evolution, the uncorrelated model becomes able to distinguish reasonably accurately between the two populations and the F model provides little improvement in clustering. Further simulations (not shown) indicate that the F model is less likely to improve performance when the number of loci is small; rather, it can allow accurate clustering of individuals from extremely closely related populations when large numbers of markers are used (e.g., ![]()
Estimation of K:
Simulations presented by ![]()
For some data sets, higher estimates of K obtained using the F model may reflect deviations from random assortment that are not caused by genuine population subdivision. Table 1A shows model likelihoods estimated for a single panmictic population (scenario I). Whether or not the F model is used, the highest value of P(X|K) is given by K = 1. In Table 1B the evolutionary parameters are identical but there is a 50% selfing rate. In this case, the F model gives higher probabilities for K = 2, while the original model continues to give the highest model likelihood for K = 1. Other situations that might cause additional populations to be inferred by structure (with or without the F model) include a significant frequency of inbreeding, cryptic relatedness within the sample, or the presence of null alleles.
Inference of demographic history:
The F model can also be used to estimate the amount of genetic drift undergone by the different populations under study. In Fig 3A, estimates of F are shown for a population that trifurcated (scenario III). For a substantial time period after the trifurcation, the estimated values of F are approximately proportional to the time since the split and inversely proportional to the population sizes. When the values of F start to exceed approximately 0.2, F no longer increases linearly but the ranking of the values of F continues to reflect the relative degrees of drift that the populations have undergone.
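The pattern of near-linear growth followed by saturation matches the textbook expectation for drift in an isolated Wright-Fisher population of N diploid individuals, F = 1 - (1 - 1/(2N))^t, which is close to t/(2N) while F is small. The sketch below evaluates that standard formula; it is offered as background and is not the estimator used by structure. The function name is hypothetical.

```python
def expected_f(generations, n_diploid):
    """Textbook expectation of F after drift in a population of N diploids."""
    return 1.0 - (1.0 - 1.0 / (2.0 * n_diploid)) ** generations


early = expected_f(10, 1000)    # close to the linear value 10 / 2000
late = expected_f(500, 1000)    # below the linear value 500 / 2000
```

Doubling the population size roughly halves the expected F at a given time, which is the inverse proportionality described above.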
The use of the F model to estimate drift is subject to a caveat, which is that contrary to the model assumption, drift may not have occurred independently in each population. For example, Fig 3B shows results based on scenario IV, in which a single population divides into two and one of the populations subsequently subdivides. The structure algorithm interprets the similarity of the two subpopulations as evidence that their gene frequencies are close to those of the ancestor of all three populations and estimates lower values of F for them than for their common ancestor prior to subdivision.
In principle it should be possible to generalize the model to allow for the possibility of hierarchical subdivision but we do not attempt this here. Rather, we suggest testing for deviations from the model by estimating values of F while excluding one or more of the populations in turn. If the assumption that all of the populations evolved independently from a single common ancestral population is correct, then this should leave the F values estimated for the other populations approximately unchanged. If the F values decrease, then this suggests that one of the excluded populations diverged first, so that the remaining populations share a more recent common ancestor than shared by the whole sample considered together. If F values of one or more of the populations increase, then it may indicate that the original F values were artificially reduced by the presence of closely related subpopulations in the sample. Other diagnostics are discussed by ![]()
Inference in admixed populations (the linkage model):
Inference of demographic history becomes more difficult if admixture has occurred subsequent to population divergence (e.g., ![]()
Linkage information can help to resolve the ambiguity. Informally, admixed individuals contain chromosomal chunks that derive from one population or another. Using closely linked markers, the linkage model aims to detect the chromosomal chunks and can potentially reconstruct the ancestral populations accurately even if no pure members exist.
To explore the properties of the new method, we have performed extensive simulations. We consider individuals genotyped at L* loci on each of C chromosomes (i.e., typed at a total of CL* loci). The loci are equidistant, with a recombination rate R per generation between adjacent genotyped sites. The genetic map is assumed known. We analyzed the simulated data using the uncorrelated model for allele frequencies.
Estimation of allele frequencies:
One measure of whether structure is performing well is if it can accurately estimate the population allele frequencies in the ancestral populations. To visualize this, we have constructed neighbor-joining trees based on the posterior mean allele frequencies. When the allele frequencies are accurately estimated, the branch tips lie close to the large black dots (which represent the "correct" frequencies).
We focus on scenario VI, "unidirectional" admixture. In three out of the four cases shown (Fig 4, A–C), structure is highly consistent in its inference of gene frequencies for ancestral population 1, reflecting the continuous presence of pure individuals in the sample. The accuracy of this inference provides a baseline from which to judge the performance of structure in disentangling the gene frequencies of ancestral population 2, which ceases to have any pure descendants a few generations after admixture.
In the first simulated example (Table 2A, Fig 4A), as the number of generations after admixture (t2) increases, the admixture model becomes increasingly biased, underestimating the divergence between the populations (shown by the intermediate position of the inferred populations in the gene frequency tree) and underestimating the amount of admixture (H2 in Table 2A). In contrast, the linkage model estimates gene frequencies and the degree of admixture accurately for many generations after the admixture event.
The performance of the admixture model is improved by increasing the number of chromosomal regions studied (Table 2B, Fig 4B) but the linkage model continues to prolong the number of generations after admixture for which accurate ancestry estimates can be obtained.
In another example (Table 2C, Fig 4C), the admixture model shows the opposite bias for a number of generations, overestimating rather than underestimating admixture and the degree of divergence between ancestral populations. The linkage model again uses linkage information to resolve the ambiguity and it performs well up to eight generations after the admixture event. However, in this example, the marker density is low enough that in later generations, the linkage information is lost, and the admixture and linkage models produce almost identical (and similarly biased) results.
By contrast, Fig 4D (see also Table 2D) shows the situation where the marker density is very high. In this case, background LD is substantial, leading the linkage model to consistently overestimate the divergence between the two populations. Further, substantial admixture is estimated for population 1, which is in fact pure. The admixture model actually does rather better in the populations it infers, but a few generations after the admixture event it also produces misleading results. These results illustrate the problems that can arise when there is substantial background LD, a point to which we return in the DISCUSSION.
Estimating the time since admixture:
In addition to improving estimates of the degree of admixture, the linkage model also provides an indication of the time since admixture. For examples A–C the value of r (Table 2) provides good estimates of the number of generations since admixture, except immediately after the admixture event (when little admixture has occurred, so that there is not yet much admixture LD and the posterior for r is uninformative) and >100 generations after admixture, when the number of generations is underestimated. The time of admixture may be considerably overestimated by r if there is substantial background LD in the sample (Table 2D).
Population-of-origin assignments for chromosomal regions:
A further advantage of the linkage model is that if the marker density is high enough, it can provide accurate population-of-origin assignments for chromosomal regions, as required in applications such as admixture mapping. For example, Fig 5 shows population-of-origin assignments for the two allele copies at each locus of a single diploid individual. The Markov structure of the data is clearly evident from the structure output, in that nearby loci typically have similar assignment probabilities. When the data are phased, individual loci are often assigned with very high probability to the correct ancestral population, especially in the middle of a large chunk inherited from one population. At boundaries between chunks, the assignment probabilities typically change rapidly, giving a good indication of the position of the recombination event that brought the chunks from different populations together.
For unphased diploids, the data contain somewhat less information about the population of origin of individual allele copies, particularly in regions of the genome where the two homologous chromosomes are spanned by chunks inherited from different ancestral populations. In these regions, neighboring loci do not provide information concerning which of the two allele copies at a particular locus comes from one population and which comes from the other. For many problems (such as in admixture mapping) we are mainly interested in inferring the number of allele copies from each population, and this information can be extracted from the data, given sufficiently dense markers (as in Fig 5).
Coverage properties:
A final advantage of the linkage model is that it gives more accurate estimates of the statistical uncertainty of admixture proportions. This property is illustrated in Fig 6, which shows 90% credibility regions for q for a sample of individuals from two populations. The two populations partially admixed with each other in an admixture event (scenario VII) 32 generations before the sample was taken. After 32 generations of random mating within each postadmixture population, the ancestry coefficients of each individual are almost identical (differing by <0.001) and are shown by the red horizontal lines in the figure. Ancestry estimates were made by both the admixture and linkage models, for markers at a variety of genetic distances. Tightly linked markers give nonindependent information and are therefore less informative about the value of q for each individual than are the same number of unlinked markers, leading to higher variation in estimates between individuals. The admixture model does not take these correlations into account. Consequently, the sizes of the estimated credibility regions are approximately independent of the actual degree of linkage and are much too narrow for tightly linked markers. Under the linkage model, the credibility regions for q become wider as linkage between markers increases and continue to reflect the true degree of statistical uncertainty even for tightly linked markers.
| APPLICATIONS TO DATA |
|---|
Recombination between distinct populations of Helicobacter pylori:
The bacterium Helicobacter pylori colonizes the human stomach lining. When multiple strains infect the same stomach, they recombine rapidly through the import of fragments of DNA that are typically a few hundred base pairs in length (![]()
Fig 7 shows results for a typical isolate from South Africa. South Africa contained isolates from hpAfrica1, hpAfrica2, and hpEurope populations, reflecting the ethnic diversity of the region. The particular isolate we consider here was assigned to the Africa1 (blue) population by the no-admixture model. The top plot (Fig 7A) shows the posterior assignment probabilities for each individual nucleotide based purely on the estimated population allele frequencies (i.e., not using information from q). The plot shows that most sites provide little information about ancestry, with assignment probabilities to all four populations being approximately 0.25. The remaining nucleotides are mostly assigned with high probability to Africa1 or, less frequently, Africa2 (red). The Africa2 nucleotides appear to come in runs, suggesting import of specific DNA fragments into a bacterium from the Africa1 population. This conclusion was confirmed by further exploratory analysis (Fig 7B). For each population, the sum of the log of the assignment probabilities was computed within a 100-nucleotide moving window. For most of the sequence, the value of the sum was positive for the Africa1 population (indicating higher probabilities than those under random assignment) and negative for the other three populations. However, in three stretches the sum for the Africa2 population gives positive values, suggesting DNA import into those regions.
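A moving-window score of this kind can be sketched as follows. One detail is an assumption on our part: for the sum to be positive when a population is favored over random assignment, each probability is taken relative to a baseline of 1/K (0.25 for four populations) before the logarithm is applied. The function name and inputs are hypothetical.

```python
import math


def windowed_log_score(probs, window, baseline):
    """Sum of log(prob / baseline) over each full window along the sequence.

    probs:    per-site assignment probabilities for one population.
    window:   number of consecutive sites per window.
    baseline: probability under random assignment (assumed to be 1 / K).
    """
    if len(probs) < window:
        return []
    logs = [math.log(p / baseline) for p in probs]
    running = sum(logs[:window])
    scores = [running]
    for i in range(window, len(logs)):
        # Slide the window by one site: add the new site, drop the oldest.
        running += logs[i] - logs[i - window]
        scores.append(running)
    return scores


# Toy example: two favored sites followed by two disfavored sites.
scores = windowed_log_score([0.5, 0.5, 0.125, 0.125], window=2, baseline=0.25)
flat = windowed_log_score([0.25] * 5, window=3, baseline=0.25)
```

Runs of positive scores for one population then mark candidate imported regions, while uninformative sites contribute zero.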
The linkage model provides a formal method to make population-of-origin assignments that take the linkage relationships into account (Fig 7C). Nucleotides in the three regions identified by the exploratory analysis were assigned to the Africa2 population with probabilities close to 1.0, providing statistical support for the conclusion that there have been (at least) three imports of Africa2 DNA into these fragments. This example shows that given highly differentiated populations and enough informative sites, it can be possible in practice to make accurate population-of-origin assignments for individual loci. Further, because of the large amount of information provided by linkage, it is also possible to reconstruct ancestral populations in the absence of pure individuals, using the linkage model. See ![]()
Admixture LD in African-Americans:
We used the new linkage model to study the extent of admixture linkage disequilibrium in a Chicago-based population of African-Americans. Previous work on African-Americans has shown significant levels of European admixture in the range of approximately 5–25%, with substantial variation across studies and across study populations (summarized by ![]()
The data set that we used consists of 247 microsatellites genotyped in samples of unrelated individuals including 210 African-Americans (from Maywood, Illinois), 158 European-Americans (from Michigan), and 308 Nigerians (Yoruba).
When run with K = 2, the admixture and linkage models gave very similar ancestry estimates, and both suggested that Nigerians and American whites were almost pure representatives of the respective preadmixture populations, with average estimated admixture rates of 1.4% for both populations. The African-Americans were substantially admixed, having a mean of 17.8% European ancestry (the range of point estimates for individuals' values of q was 2–59%, using the admixture model). Similar results were obtained if we used the USEPOPINFO = 1 option to specify the population of origin of the American whites and Nigerians. Our estimate of 17.8% European ancestry is very similar to the estimate of 18.8% obtained in earlier work.
The posterior distribution for the parameter r under the linkage model is shown in Fig 8A. The posterior mean of r was 0.098 chromosome chunk breakpoints per centimorgan, with a 90% credible region of 0.07–0.13. Under the simplifying assumption that the African-American population was created by a single hypothetical admixture event (scenario V), this event is estimated to have taken place 7–13 generations ago. This is consistent with what one might expect, on the basis of the history of African-Americans, who were mostly exported from Africa during the late eighteenth century.
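The step from r to a number of generations is a unit conversion, sketched below. The sketch assumes that, under a single admixture event, ancestry breakpoints accumulate at about one per Morgan per generation, so that r expressed per Morgan approximates the number of generations since admixture; this reading is consistent with the figures quoted above (a credible region of 0.07 to 0.13 per centimorgan corresponding to 7 to 13 generations). The function name is hypothetical.

```python
CM_PER_MORGAN = 100.0

def generations_since_admixture(r_per_cm):
    """Convert a breakpoint rate per centimorgan into generations.

    Assumes a single admixture event with roughly one breakpoint
    per Morgan per generation (an illustrative simplification).
    """
    return r_per_cm * CM_PER_MORGAN

for label, r in [("posterior mean", 0.098),
                 ("lower credible bound", 0.07),
                 ("upper credible bound", 0.13)]:
    gens = generations_since_admixture(r)
    print(f"{label}: r = {r} per cM -> about {gens:.1f} generations")
```

With the posterior mean of 0.098 breakpoints per centimorgan, this gives a point estimate of about 9.8 generations.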
We repeated our analysis using information provided by map distances from the recently published deCODE map.
The posterior distribution of r (Fig 8A) clearly excludes large values of r, indicating that we are detecting a significant signal of admixture LD. Recall that as r gets large the linkage model becomes equivalent to the admixture model, so the fact that the posterior for r excludes large values shows that the linkage model provides a better fit to the data than does the admixture model in this case. For comparison, Fig 8B shows the posterior for r for the same data and map distances, but with the order of the loci randomized. The posterior for r has considerable support all the way up to the maximum value of r permitted by the prior and would clearly have extended to still larger values had the prior allowed this. Three further randomizations produced similar results, supporting the effectiveness of the posterior for r in summarizing the extent of admixture LD.
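The randomization control can be illustrated as a data-preparation step: the genotype columns are permuted while the map positions stay in place, so any association between physical proximity and ancestry is broken. This is a minimal sketch with hypothetical names; whether the original randomization was done genome-wide or within chromosomes is not stated here, and the sketch permutes all loci together.

```python
import numpy as np

def shuffle_locus_order(genotypes, rng):
    """Permute genotype columns while leaving the map positions unchanged.

    genotypes: array of shape (n_individuals, n_loci).
    Returns the permuted array and the permutation that was applied.
    """
    genotypes = np.asarray(genotypes)
    perm = rng.permutation(genotypes.shape[1])
    return genotypes[:, perm], perm

rng = np.random.default_rng(2003)
genotypes = rng.integers(0, 10, size=(5, 8))             # toy allele codes
map_cm = np.array([0, 4, 9, 15, 22, 30, 37, 45], float)  # toy map, left untouched

shuffled, perm = shuffle_locus_order(genotypes, rng)
print(perm)      # new column order
print(shuffled)  # same data, now paired with the original map positions
```

The shuffled genotypes would then be analyzed with the same map distances as the real data, and the resulting posterior for r compared with the one obtained from the true locus order.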
Although we detected a definite signal of admixture LD in our sample, most of the LD present in the African-Americans is actually due to variation in q: i.e., "mixture LD" in the terminology we introduced earlier. To differentiate between mixture and admixture LD, we examined the correlations of ancestry estimates (from the admixture model) between adjacent loci. The first measure that we used (Fig 10A) measures the correlation between the estimated probability of African ancestry (averaged over the two allele copies at each locus) for pairs of neighboring loci. The correlations were positive on average (mean 0.041 with standard error 0.009), albeit with a great deal of variation between different locus pairs. These correlations reflect the total LD in the sample. The second measure (Fig 10B) shows the correlations that remain when variation in q among African-Americans is accounted for. For each individual, at each locus, we computed a "residual" by subtracting the individual's estimated q(i) from the estimated probability of African ancestry (averaged over the two allele copies at each locus). The figure shows the correlations of these residuals. The correlations are slightly but not significantly negative on average (mean -0.001 with standard error 0.007), implying that most of the LD in the sample can be accounted for by variation in q (i.e., mixture LD). Further, a regression of the correlations with the genetic distance between the loci does not have a significant slope under either measure, presumably because the trend has been obscured by the high degree of variation in correlation values at each genetic distance. The fact that the linkage model obtains plausible estimates of r and rejects large values of r indicates that the linkage model extracts much more information from the data than the pairwise comparisons do.
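The two pairwise measures can be sketched as follows. This is an illustration on simulated values, not the analysis behind Fig 10: the function name is hypothetical, the toy "ancestry estimates" are q plus independent noise (so they contain mixture LD only and are not clipped to the unit interval), and the standard error is a naive one that ignores the overlap between neighboring locus pairs.

```python
import numpy as np

def adjacent_locus_correlations(ancestry, q=None):
    """Correlation across individuals for each pair of neighboring loci.

    ancestry: (n_individuals, n_loci) estimated probability of African
              ancestry, averaged over the two allele copies at each locus.
    q:        optional (n_individuals,) individual ancestry proportions;
              when given, it is subtracted from each row to form residuals.
    """
    values = np.asarray(ancestry, dtype=float)
    if q is not None:
        values = values - np.asarray(q, dtype=float)[:, None]
    centered = values - values.mean(axis=0)
    left, right = centered[:, :-1], centered[:, 1:]
    cov = (left * right).sum(axis=0)
    norm = np.sqrt((left ** 2).sum(axis=0) * (right ** 2).sum(axis=0))
    return cov / norm

# Toy data: ancestry varies among individuals, loci are otherwise independent.
rng = np.random.default_rng(1)
n_individuals, n_loci = 200, 50
q = rng.uniform(0.5, 1.0, size=n_individuals)
ancestry = q[:, None] + rng.normal(0.0, 0.2, size=(n_individuals, n_loci))

raw = adjacent_locus_correlations(ancestry)         # total LD
resid = adjacent_locus_correlations(ancestry, q=q)  # after removing variation in q

for name, r in [("raw", raw), ("residual", resid)]:
    se = r.std(ddof=1) / np.sqrt(r.size)            # naive standard error
    print(f"{name}: mean = {r.mean():.3f}, SE = {se:.3f}")
```

Because the simulated loci share nothing except each individual's q, the raw correlations are positive on average while the residual correlations scatter around zero, mirroring the distinction between mixture LD and admixture LD drawn in the text.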
Our results also highlight an important feature of human genetic data, which is that there is a great deal of noise in raw LD estimates for individual locus pairs, even when the admixture involves populations that, by human standards, are relatively highly differentiated. Thus, our description of admixture LD in this population would be enhanced by using a denser set of markers, and for applications such as admixture mapping where one needs to estimate the population of origin for the sampled chromosomes, a denser marker set would be critical. In admixed populations where loci have been chosen specifically to have large frequency differences between the putative parental populations …
Genetic drift in Drosophila melanogaster:
To illustrate some of the possible ways of using the F model in historical inference, we have reanalyzed the data set of …
We started the analysis without making any assumption about geographical clustering. Using the no-admixture and F models, K = 3 gave the highest model likelihood. The three inferred populations correlated well with the land masses Israel, Tasmania, and Australia, although many flies were not clearly allocated to one population and 50 Australian flies, 2 Tasmanian flies, and 7 Israeli flies were assigned to their home population with <50% probability. These inconsistencies are due to limited statistical power rather than to identifiable admixture events because an analysis under the migration model with the USEPOPINFO option …
Where did the Tasmanian flies come from? We estimated F values when analyzing Tasmanian and Israeli flies together (Fig 11C) and Tasmanian and Australian flies together (Fig 11D). The value of F inferred for the Australian population was close to zero, much lower than that for the Israeli population, while a high value of F was estimated for the Tasmanian population in both cases. This analysis suggests that Tasmanian and Australian flies share a more recent common ancestor with each other than with the Israeli flies. Further, the amount of drift that the Australian flies have undergone since splitting with the Tasmanian flies is very low, implying that Tasmania was colonized from Australia and underwent a bottleneck in the process. A possible technical objection to our analysis is that flies were sampled in several locations in Australia and that this might somehow account for the particularly low estimated value of F. We tested for this possibility using the five sampling locations in Australia that had >25 genotypes. We ran structure separately for each one, using all of the Tasmanian genotypes in every case. The analysis gave a consistently high value of F for the Tasmanian population (0.100–0.125) and a low value for the Australian population (0.004–0.023), lower than that for the Israeli population in the equivalent analysis (0.039). Our inference therefore appears to be robust to the exact combination of populations chosen for analysis.
The approach we have taken is similar to that used by …
| DISCUSSION |
|---|









PiPj between allele frequency vectors Pi and Pj as
, where
and pklj is the frequency of allele j at locus l in population k.





