We propose a multilocus version of FST and a measure of haplotype diversity using localized haplotype clusters. Specifically, we use haplotype clusters identified with BEAGLE, which is a program implementing a hidden Markov model for localized haplotype clustering and performing several functions including inference of haplotype phase. We apply this methodology to HapMap phase 3 data. With this haplotype-cluster approach, African populations have highest diversity and lowest divergence from the ancestral population, East Asian populations have lowest diversity and highest divergence, and other populations (European, Indian, and Mexican) have intermediate levels of diversity and divergence. These relationships accord with expectation based on other studies and accepted models of human history. In contrast, the population-specific FST estimates obtained directly from single-nucleotide polymorphisms (SNPs) do not reflect such expected relationships. We show that ascertainment bias of SNPs has less impact on the proposed haplotype-cluster-based FST than on the SNP-based version, which provides a potential explanation for these results. Thus, these new measures of FST and haplotype-cluster diversity provide an important new tool for population genetic analysis of high-density SNP data.
GENOME-WIDE data sets from worldwide panels of individuals provide an outstanding opportunity to investigate the genetic structure of human populations (Conrad et al. 2006; International Hapmap Consortium 2007; Jakobsson et al. 2008; Auton et al. 2009). Populations around the globe form a continuum rather than discrete units (Serre and Paabo 2004; Weiss and Long 2009). However, notions of discrete populations can be appropriate when, for example, ancestral populations were separated by geographic distance or barriers such that little gene flow occurred.
FST (Wright 1951; Weir and Cockerham 1984; Holsinger and Weir 2009) is a measure of population divergence. It measures variation between populations vs. within populations. One can calculate a global measure, assuming that all populations are equally diverged from an ancestral population, or one can calculate FST for specific populations or for pairs of populations while utilizing data from all populations (Weir and Hill 2002). One use of FST is to test for signatures of selection (reviewed in Oleksyk et al. 2010).
FST may be calculated for single genetic markers. For multiallelic markers, such as microsatellites, this is useful, but single-nucleotide polymorphisms (SNPs) contain much less information when taken one at a time, and thus it is advantageous to calculate averages over windows of markers (Weir et al. 2005) or even over the whole genome. The advantage of windowed FST is that it can be used to find regions of the genome that show different patterns of divergence, indicative of selective forces at work during human history.
Another measure of human evolutionary history is haplotype diversity. Haplotype diversity may be measured using a count of the number of observed haplotypes in a region or by the expected haplotype heterozygosity based on haplotype frequencies in a region. Application of this regional measure to chromosomal data can be achieved by a haplotype block strategy (Patil et al. 2001) or by windowing (Conrad et al. 2006; Auton et al. 2009).
One problem with the analysis of population structure based on genome-wide panels of SNPs is that a large proportion of the SNPs were ascertained in Caucasians, potentially biasing the results of the analyses. Analysis based on haplotypes is less susceptible to such bias (Conrad et al. 2006). This is because haplotypes can be represented by multiple patterns of SNPs; thus lack of ascertainment of a particular SNP does not usually prevent observation of the haplotype. On a chromosome-wide scale, one cannot directly use entire haplotypes, because all the haplotypes in the sample will almost certainly be unique, thus providing no information on population structure. Instead one can use haplotypes on a local basis, either by using windows of adjacent markers or by using localized haplotype clusters, for example those obtained from fastPHASE (Scheet and Stephens 2006) or BEAGLE (Browning 2006; Browning and Browning 2007a).
Localized haplotype clusters are a clustering of haplotypes on a localized basis. At the position of each genetic marker, haplotypes are clustered according to their similarity in the vicinity of the position. Both fastPHASE and BEAGLE use hidden Markov modeling to perform the clustering, although the specific models used by the two programs differ.
Localized haplotype clusters derived from fastPHASE have been used to investigate haplotype diversity, to create neighbor-joining trees of populations, and to create multidimensional scaling (MDS) plots (Jakobsson et al. 2008). It was found that haplotype clusters showed different patterns of diversity to SNPs, while the neighbor-joining and MDS plots were similar between haplotype clusters and SNPs.
In this work, we apply windowed FST methods to localized haplotype clusters derived from the BEAGLE program (Browning and Browning 2007a,b, 2009). We consider population-average, population-specific, and pairwise FST estimates (Weir and Hill 2002). Population-average FST's either assume that all the populations are equally diverged from a common ancestor, which is not realistic, or represent the average of a set of population-specific values. This can be convenient in that the results are summarized by a single statistic; however, information is lost. A common procedure is to calculate FST for each pair of populations, and these values reflect the degree of divergence between the two populations. Different levels of divergence are allowed for each pair of populations but each estimate uses data from only that pair of populations. On the other hand, population-specific FST's allow unequal levels of divergence in a single analysis that makes use of all the data.
We compare results from the localized haplotype clusters to those using SNPs directly. The results of applying localized haplotype clusters to population-specific FST estimation are very striking, showing better separation of populations and a more realistic pattern of divergence than for population-specific FST estimation using SNPs directly. We also use BEAGLE's haplotype clusters in a haplotype diversity measure and investigate the relationship between this measure of haplotype-cluster diversity and the recombination rate.
We analyzed data from phase 3 of the International HapMap Project (International Hapmap Consortium 2005, 2007). As the full analysis of these data had not been published at the time this study was performed, we restricted our attention to chromosome 22 and to two regions of interest: the lactase gene (LCT) and the 8p23 inversion (Antonacci et al. 2009). Samples from 11 populations are represented in the HapMap3 data; population labels and descriptions are given in Table 1.
We downloaded chromosomes 2, 8, and 22 of the draft 2 phase 3 HapMap in nonredundant HapMap format from www.hapmap.org in January 2009. We used only those SNPs that were genotyped in all populations (13,875 SNPs on chromosome 22). In any trio or parent–offspring pair with Mendelian inconsistency at a SNP, we set as missing all the genotypes for all individuals in the trio or pair for that marker. Genetic (centimorgan) positions were also obtained from phase II HapMap (International Hapmap Consortium 2007).
Imputation of missing genotypes and haplotype phase:
We used BEAGLE version 3.0 (Browning and Browning 2007b, 2009) to impute missing genotypes and estimate haplotype phase. For related individuals (mother–father–child trios or parent–offspring pairs), we used the pedigree information to increase the accuracy of the imputation/phasing (Browning and Browning 2009). All individuals were combined in the imputation and phasing analysis. This had two purposes. First, BEAGLE performs better with a larger sample size, although its performance in simultaneous analysis of individuals from multiple populations has not been examined in detail. Second, if the imputation and haplotype phase inference were performed separately for each population, this could bias the inferred results to further separate the populations, potentially leading to inference of differences between populations when none exist. In contrast, when all populations are analyzed simultaneously, the bias is toward the average of all populations, which is of less serious consequence. See, for example, Balding (2006) for a brief discussion of separate vs. combined phasing of cases and controls in case–control studies. On the other hand, pooling of populations for phasing could be problematic in that patterns of linkage disequilibrium (LD) differ substantially between populations, and the pooled population would not be in Hardy–Weinberg equilibrium. Thus, we also performed supplementary analyses in which each population was phased separately using BEAGLE. It will be seen that phasing the data together or separately does not have any notable effect on the results. After haplotyping, parent–offspring trios were reduced to the four parental haplotypes, parent–offspring pairs were reduced to three haplotypes (the child's two haplotypes plus the parent's nontransmitted haplotype), and unrelated individuals each contributed two haplotypes.
Beginning with phased haplotypes with no missing data, obtained using the methods described above, we built a localized haplotype cluster model (Browning 2006) using BEAGLE version 3.0. We do not dwell on the construction of the cluster model here; details can be found in previous work (Browning and Browning 2007a). We used the default option that gives the version of the model that we have used for multilocus association testing (Browning and Browning 2007a).
After fitting the localized haplotype cluster model, as described above, each haplotype is a member of one “localized haplotype cluster” state (Browning 2006; Browning and Browning 2007a) at each marker position. Figure 1 depicts the clusters obtained for the LCT gene. Note that the haplotypes are clustered at each marker position. Figure 2 shows the specific haplotypes contained in four of the clusters. Each haplotype within a haplotype cluster with marker position at SNP x has the same allele at SNP x; however, two haplotypes within the cluster may have differing alleles at other SNPs. In general, haplotypes within the same cluster will tend to have the same allele at SNPs near the marker position of the cluster. For example, in Figure 2, SNP 12 (rs12988076) is the SNP defining the cluster location, and all haplotypes within a cluster share the same allele at this SNP. Moreover, SNPs around SNP 12 also tend to be shared within each cluster. In cluster 1 we see that SNPs 8–19 are identical for all haplotypes, while some differences are seen at SNPs that are farther away.
We estimated population-average FST using the method of Weir and Cockerham (1984), as follows. Let r be the number of populations sampled, and let ni be the number of haplotypes in the sample from population i (note that in our data all missing data are imputed, so there is no dependence of this value on genetic marker l). For genetic marker l, let be the number of copies of allele j in individuals sampled from population i, and let be the sample allele frequency of allele j in population i. Similarly, for haplotype clusters, at the position corresponding to genetic marker l, let be the number of sampled haplotypes from population i that are in haplotype cluster j, and let be the sample frequency of haplotype cluster j in population i.
We define the adjusted population sample sizewhich has adjusted averageWe also obtain the weighted average frequency for each allele or haplotype cluster,We form mean squares among (MSA) and within (MSW) populationsThe population-average FST is estimated bywhere the sums are over loci and alleles or haplotype clusters within the current window.
For SNP-based analysis, windows are defined by the chromosomal positions of the SNPs. For haplotype clusters, each haplotype cluster has a corresponding marker (SNP), the chromosomal position of which we use as the position of the haplotype cluster.
We calculated paired FST for a pair of populations by considering only the samples from the two populations and applying the population-average FST estimator given above.
We calculated population-specific FST using the method of Weir and Hill (2002). Following the notation above, the estimated population-specific FST for population i is
When calculating FST estimates for SNPs, we made use of the phased haplotype data rather than the raw genotype data. We did not utilize the phase information directly, but did make use of the imputed sporadic missing genotypes and the reduction to unrelated haplotypes from the trio and parent–offspring pair data.
Figure 3 shows population-specific FST values for each HapMap3 population in sliding windows of 5 Mb along chromosome 22, while the third column of Table 1 shows the chromosome-averaged values of population-specific haplotype-cluster FST. The corresponding results for data phased separately by population are given in supporting information, Figure S2 and Table S1, but are virtually indistinguishable from the results for data phased together shown in Figure 3 and Table 1. The striking feature of the results is that the haplotype-cluster approach (Table 1 and left panel of Figure 3) separates the ethnicities into broad geographic origins, with Africans (Yoruba in Ibadan, Nigeria, YRI; Luhya in Webuye, Kenya, LWK; Maasai in Kinyawa, Kenya, MKK; and African ancestry in southwest United States, ASW) least diverged from the average or ancestral population; East Asians (Japanese in Tokyo, JPT; Chinese in Denver, CHD; and Han Chinese in Beijing, CHB) most diverged; and Europeans [Utah residents (CEPH) with northern and western European ancestry, CEU; Toscans in Italy, TSI], Mexicans (MEX), and Gujarati (GIH) intermediate, as we would expect from other studies of human genetic demography (Prugnolle et al. 2005). In contrast, the SNP-based estimates (right panel of Figure 3) are mixed together with no obvious meaningful pattern. The YRI, JPT/CHB, and CEU data from the right panel of Figure 3 have a similar pattern to the chromosome 22 panel of Figure 4 in Weir et al. (2005), which was based on HapMap phase I data. The East Asian populations show an increase in divergence ∼37–41 Mb. We are not aware of any known targets of selection in this region; however, a cluster of extreme integrated haplotype score (iHS) values was found at 38.0 Mb in HapMap phase I data (Voight et al. 2006).
FST analysis of genetic markers can be affected by the marker ascertainment scheme. A high proportion of SNPs in the HapMap3 data were ascertained from Caucasian individuals. Thus, SNPs with very low minor allele frequency (MAF) in Caucasians, but high MAF in some other populations will be underrepresented. To investigate the effect that this type of ascertainment has on the FST estimates, we ran additional analyses in which we first removed all markers with MAF < 0.05 in the CHB HapMap3 sample data. On chromosome 22, this reduced the number of SNPs from 13,875 down to 12,480. We then recalculated the population-specific FST estimates, averaging over the whole of chromosome 22, and looked at the difference between these values and the original estimates. For the SNP-based results, the differences ranged from −0.021 to 0.049, with mean absolute difference (over the 11 populations) of 0.022. For the haplotype cluster results, the differences ranged from −0.012 to 0.021, with mean absolute difference of 0.009. Thus the haplotype cluster results were less changed by the added ascertainment than were the SNP-based results. Not surprisingly, for both the SNP-based and the haplotype cluster results, the African populations had the largest decreases in estimated FST resulting from the ascertainment, while the East Asian populations had the largest increases. Since the ascertainment was performed in an East Asian population (the CHB), it is biased toward SNPs that differ more between East Asians and other populations (those SNPs that have reached high frequency in East Asians but may not have reached high frequency in other populations), thus making East Asians look more different from other populations. Other populations lose some of their most differentiated SNPs in the ascertainment, particularly African populations that have the most low-frequency SNPs not shared by other populations, resulting in decreases in FST.
Variants in LCT conferring lactase persistence have been selected for in pastoralist populations, including northwestern Europeans (represented by the CEU) and Kenyans (represented by the MKK) (Swallow 2003; Tishkoff et al. 2007). The region around LCT has a signature of selection in CEU from HapMap phase I in population-specific FST analysis (Weir et al. 2005). Figure 4 shows population-specific FST analysis of the region surrounding this gene in the HapMap3 data. In both the SNP-based and the haplotype-cluster analyses, the CEU and MKK populations have increased within-population homogeneity (equivalently, increased divergence from the ancestral population) around the gene, indicating selection for a favored allele. Although the peak for MKK is less pronounced in the haplotype cluster analysis, it has better localization than in the SNP-based analysis. Figure 1 shows BEAGLE's localized haplotype clusters in this gene. It is clear that a large proportion of CEU individuals share one haplotype (cluster 10 in Figure 2), while a large proportion of MKK individuals share a different haplotype (cluster 11 in Figure 2). This is consistent with the results of a study by Tishkoff et al. (2007), which found that African and European lactase persistence arose independently on differing haplotypic backgrounds.
The 8p23 inversion covers the approximate region 8.1–12.3 Mb on chromosome 8 (Antonacci et al. 2009). An analysis of pairwise population FST estimates from HapMap2 data showed that East Asians (CHB and JPT) have a signature consistent with positive selection at this location. We also find this signal in the three HapMap3 East Asian populations (CHB, CHD, and JPT), as shown in Figure 5. As in Deng et al. (2008), we see this signal over the first and the last third of the region. Deng et al. conclude that the signal is likely the result of positive selection in East Asians in XKR6, located at 10.8–11.1 Mb.
We calculated paired FST estimates (see methods) for each pair of populations, averaged these over chromosome 22, and converted to distances using the formula (Reynolds et al. 1983). Neighbor-joining trees (Saitou and Nei 1987) based on these distances are shown in Figure S3. Whereas Jakobsson et al. (2008) also constructed neighbor-joining trees based on haplotype clusters, their trees were constructed using allele-sharing distances rather than FST estimates.
We calculated haplotype-cluster diversity, which we define to be the number of haplotype clusters at each location, in each population sample. Results for diversity by chromosomal location are shown in Figure 6, while chromosome-wide average diversity values are given in the right-hand column of Table 1. There is a clear ordering of populations, very similar to that seen with the haplotype-cluster-based FST, with African populations showing greatest diversity; East Asians showing least diversity; and Europeans, Mexicans, and Gujarati showing intermediate diversity.
We calculated correlations between our measure of haplotype-cluster diversity, the inverse recombination rate [megabases per centimorgan; obtained from HapMap phase II estimates of genetic distance (International Hapmap Consortium 2007)], and SNP density (SNPs per kilobase) for sliding windows of 100 markers over chromosome 22. The sliding windows of diversity and inverse recombination rate have correlations between 0.45 and 0.60, depending on the population (lowest correlations for non-African populations, highest for the African populations). Figure 7 shows sliding windows of diversity and inverse recombination rate for the YRI sample (the correlation for YRI is 0.56). At positions with high recombination rate, LD is low and haplotype clusters represent small numbers of SNPs, resulting in fewer haplotype clusters and lower diversity. On the other hand, where recombination rate is low, LD is high and haplotype clusters represent large numbers of SNPs, which tends to result in more haplotype clusters and higher diversity. We also checked the correlation between diversity and the marker density. Although marker density (SNPs per kilobase) is highly correlated with recombination rate (centimorgans per megabase), with correlation 0.42, the correlation between diversity and marker density is low (range −0.004 to −0.16). Thus the number of genotyped SNPs in a region does not greatly affect haplotype-cluster diversity.
We also investigated the use of haplotype-cluster data for visualizing individual ancestry via MDS plots. As in Jakobsson et al. (2008), we saw no obvious difference between the haplotype-cluster and standard SNP-based MDS plots (results not shown). This suggests that the use of haplotype clusters in analysis of population structure does not provide large amounts of new information not already evident in the SNP data. Rather, it seems that the use of haplotype clusters partially corrects for SNP ascertainment bias, which is important for analysis methods such as FST that assume an unbiased sample of genetic markers and are sensitive to violations of this assumption.
We have shown that haplotype-cluster FST has useful properties compared to SNP-based FST. Haplotype-cluster population-specific FST provides results that have better interpretability than results from the SNP-based version. We showed that haplotype-cluster FST is less influenced by SNP ascertainment than is SNP-based FST, which provides an explanation for the improved interpretability of the results from the haplotype-cluster-based estimator. In this work, we used haplotype clusters from BEAGLE in calculating the FST values. One could also use haplotype clusters from fastPHASE, which would likely give similar results. We did not apply fastPHASE to these data, as fastPHASE (version 1.2.3) takes unphased data from unrelated individuals and hence would not have been able to take full advantage of the HapMap3 data, which include parent–offspring pairs and trios.
The use of a haplotype-cluster-based approach avoids problems with appropriate choice of windowing when investigating properties of haplotypes (Jakobsson et al. 2008). We used haplotype clusters to measure haplotype diversity. Jakobsson et al. (2008) also used haplotype clusters to investigate diversity, but whereas they used a fixed number of clusters (from fastPHASE), we used BEAGLE's variable number of clusters. BEAGLE's variable-cluster approach has the advantage of greater flexibility for modeling regions of the chromosome with different LD patterns, compared to using a fixed number of clusters as in fastPHASE. Jakobsson et al. measured diversity by displaying sample population frequencies for each cluster and by looking for haplotype clusters that were not shared by all populations. In contrast, we measured diversity by counting numbers of observed haplotype clusters within each population sample. We found that our measure of haplotype-cluster diversity varies between populations in a manner that is consistent with the population histories and that our measure of haplotype-cluster diversity is strongly correlated with recombination rate.
The authors thank Brian Browning for helpful comments on a draft manuscript. This work was supported in part by National Institutes of Health grant GM075091.
Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.110.116681/DC1.
Communicating editor: N. Takahata
- Received March 15, 2010.
- Accepted April 28, 2010.
- Copyright © 2010 by the Genetics Society of America