Despite recent advances in population genomics, much remains to be elucidated with regard to East Asian population history. The Ainu, a hunter–gatherer population of northern Japan and Sakhalin island of Russia, are thought to be key to elucidating the prehistory of Japan and the peopling of East Asia. Here, we study the genetic relationship of the Ainu with other East Asian and Siberian populations outside the Japanese archipelago using genome-wide genotyping data. We find that the Ainu represent a deep branch of East Asian diversity more basal than all present-day East Asian farmers. However, we did not find a genetic connection between the Ainu and populations of the Tibetan plateau, rejecting their long-held hypothetical connection based on Y chromosome data. Unlike all other East Asian populations investigated, the Ainu have a closer genetic relationship with northeast Siberians than with central Siberians, suggesting ancient connections among populations around the Sea of Okhotsk. We also detect a recent genetic contribution of the Ainu to nearby populations, but no evidence for reciprocal recent gene flow is observed. Whole genome sequencing of contemporary and ancient Ainu individuals will be helpful to understand the details of the deep history of East Asians.
THE rapid development of genomic technologies has greatly enhanced our understanding of the history of modern human dispersal out of Africa and the peopling of different continents (Li et al. 2008; Green et al. 2010; Reich et al. 2010). However, the history of populations in East Asia, including Siberia, remains poorly understood even though they account for a large fraction of the human population (Stoneking and Delfin 2010; Oota and Stoneking 2011). There is still no clear consensus regarding basic questions such as when, where, and how many times modern humans migrated into East Asia and Siberia. For example, several previous studies based on contemporary samples concluded that one migration from south to north could explain the genetic structure of East Asians (Li et al. 2008; HUGO Pan-Asian SNP Consortium 2009). However, recent studies indicate that this scenario is too simplistic: a recent study detected western Eurasian ancestry in an individual in southern Siberia 24,000 years ago as well as a substantial contribution of this ancestry to the gene pool of Native Americans (Raghavan et al. 2014), and several studies detected a clear genetic differentiation between northeast Siberians and central-south Siberians, with the latter being more closely related to northeast Asians (Rasmussen et al. 2010; Fedorova et al. 2013). Indigenous high-altitude populations of the Tibetan plateau provide another example of East Asian populations that do not fit into the simple one migration hypothesis. Tibetans and Sherpa show a divergent history from lowland East Asian populations such as Han Chinese (Jeong et al. 2014), and their adaptive haplotype spanning the EPAS1 (endothelial PAS domain-containing protein 1) gene shares its ancestry with that of an archaic hominin (Huerta-Sánchez et al. 2014).
Considering robust evidence of human habitation in Arctic Siberia before the last glacial maximum (LGM) (Pitulko et al. 2004), it is possible that there were multiple expansions (from south to north) and contractions (from north to south) of human populations in mainland East Asia and Siberia over a long period of time, generating a complex pattern of genetic relationships among contemporary populations. Population isolates frequently provide critical information to understand the genetic structure of surrounding populations with more complex histories. For example, it has been proposed that Sardinians are key to understanding the changes in population structure in mainland Europe with the arrival of Neolithic farmers (Keller et al. 2012; Skoglund et al. 2012). The Onge people from the Andaman Islands have also been proposed as the best contemporary representatives of the “Ancestral South Indian” ancestry (Reich et al. 2009; Moorjani et al. 2013). Therefore, unraveling the genetic history of population isolates in East Asia and Siberia may provide new insights into the initial colonization of these regions.
The Ainu people are an indigenous population of Hokkaido, a northern island in the Japanese archipelago, and of the southern part of Sakhalin islands (Figure 1). They have been proposed by archaeologists, linguists, and geneticists to be the direct descendants of prehistoric Japanese hunter–gatherers, associated with the Jomon pottery culture, dating back to 16,500 years before the present (Hanihara 1991; Habu 2004). The dual structure model for the Japanese population (Hanihara 1991) envisions that the contemporary Japanese are a mixture of two distinct genetic sources, one from the indigenous Jomon hunter–gatherers and the other from East Asian rice farmers who first migrated into the archipelago ∼2,300 years ago (“Yayoi culture”). Genetic data clearly support this model in two main regards. First, the genetic profile of the mainland Japanese reveals a strong signature of admixture, best modeled as a mixture of Ainu-related ancestry and continental East Asians (Jinam et al. 2012; Nakagome et al. 2015). Second, the Ainu are genetically closer to the Ryukyuan, who live in the southernmost islands of the Japanese archipelago, than to the mainland Japanese (Tajima et al. 2002; Matsukusa et al. 2010; Jinam et al. 2012; Koganebuchi et al. 2012). This suggests that inhabitants of the northern and southernmost parts of the archipelago were genetically most isolated from the incoming farmers, who first arrived in the central part of the archipelago. However, the origin of the Ainu people in the context of eastern Eurasian population history has not been thoroughly investigated using genome-scale variation data.
In this study, we investigated the genetic relationship of the Ainu with surrounding East Asian and Siberian populations, using genome-wide genotype data, and found that the Ainu gene pool is basal to all other East Asian populations. In addition, the Ainu show unusual patterns of excess genetic affinity with low-altitude East Asians and northeast Siberians, as well as signatures of genetic adaptations to their local environments and maritime hunter–gatherer life style.
Materials and Methods
We obtained genome-wide genotyping data of worldwide populations from several previous publications. First, we obtained genotype data of 36 Ainu individuals from a previous study, typed on the Affymetrix Genome-Wide Human SNP 6.0 array (Jinam et al. 2012). We estimated genetic relatedness for all pairs of Ainu individuals using PLINK v1.07 (Purcell et al. 2007), with 396,552 autosomal SNPs with minor allele frequency ≥0.05. We randomly removed one individual from each pair with coefficient of relationship (r) > 0.1875, which corresponds to relatedness between first cousins (r = 0.125) and half siblings (r = 0.25). This step removed 11 individuals, involved in 17 of 630 pairs, some of which correspond to first degree relatives (Supporting Information, Figure S1). Second, we used genotype data of 1963 individuals from 183 worldwide populations, “Affymetrix Human Origins Fully Public Dataset” (Lazaridis et al. 2014), typed on the Affymetrix Axiom Genome-Wide Human Origins 1 array (Patterson et al. 2012). We removed three individuals with genotype missing rate >5% and kept SNPs with missing rate not exceeding our criteria in all 152 populations with ≥5 individuals. Specifically, we allowed one missing genotype in populations with <20 individuals, and two missing genotypes in populations with ≥20 individuals. After this filtering, an overlap of 103,218 SNPs between the Ainu and Human Origins datasets was used for the majority of analyses (“WA” dataset, for worldwide and Ainu data). Third, we overlapped the “WA” dataset with genotype data for the Sherpa and Tibetans genotyped on Illumina arrays. Specifically, we took 21 Sherpa individuals described as the “high-altitude proxy” samples in a previous study (Jeong et al. 2014) and 30 Tibetans from near Lhasa, Tibet Autonomous Region in China (Wang et al. 2011). We used the overlapping 45,513 SNPs (“WHA” dataset, for worldwide, high-altitude and Ainu data) for most of the analyses along with the WA dataset to investigate the genetic relationships of the Ainu and the high-altitude East Asians. Fourth, we overlapped the WHA dataset with the genotype data of two Nivkh individuals (Fedorova et al. 2013) for additional genetic clustering analysis. Fifth, for genome scans of positive selection in the Ainu, we overlapped the Ainu genotype data with the 1000 Genomes Project phase 3 dataset, downloaded from https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#reference. This includes 540,304 SNPs (“1KG-Ainu” dataset). Finally, we overlapped the 1KG-Ainu dataset with available high-coverage Illumina whole genome sequences of contemporary humans and archaic hominins. For this, we retrieved genotype calls of high-coverage Denisovan (Meyer et al. 2012) and Altai Neandertal (Prüfer et al. 2014), in variant call format (VCF). Chimpanzee alleles were extracted from the chimpanzee genome assembly Pan_troglodytes-2.1.4 (Pantro4), using LiftOver tool (http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOver) to convert coordinates between the human reference sequence (GRCh37) and Pantro4. We also obtained short read data of 13 individuals (1–2 individuals from Han, Dai, Sherpa, Yoruba, Karitiana, Sardinian, Papuan, and Australian aborigine) from previous studies (Meyer et al. 2012; Jeong et al. 2014; Prüfer et al. 2014). Short reads were aligned to the human reference (GRCh37) using BWA backtrack 0.7.4-r385 (Li and Durbin 2009) with “-q 15” option, duplicate removed with Picard tool v1.98 (http://broadinstitute.github.io/picard/), locally realigned around indels, and base quality score recalibrated using the Genome Analysis Toolkit (GATK) v2.8-1 (McKenna et al. 2010; Depristo et al. 2011) following the “best practice workflows” from GATK (Auwera et al. 2013). We kept only properly paired nonduplicate reads with phred-scaled mapping quality ≥30 using Samtools v1.2 (Li et al. 2009). For each sample, we called genotypes across all sites using the GATK UnifiedGenotyper module, based on bases with phred-scaled quality score ≥30, and kept sites only if phred-scaled quality score was ≥50. We combined genotype calls of all modern and archaic individuals using bcftools v1.1 (Li et al. 2009), removed sites if they had any missing genotype, showed more than one alternative allele (including 1KG data and chimpanzee alleles), or located within either human CpG islands (Wu et al. 2010) or human/chimpanzee repeat regions. Finally, we intersected these data with the Ainu genotype data and removed strand ambiguous (A/T or G/C) SNPs, leaving a total of 322,011 SNPs (“CND-1KG-Ainu” dataset, CND for Chimpanzee-Neandertal-Denisova).
Assessment of genetic homogeneity within the Ainu individuals
To determine if 25 unrelated Ainu individuals represent a homogenous gene pool, we performed a model-based genetic clustering analysis using ADMIXTURE v1.22 (Alexander et al. 2009). For this analysis, we included 25 unrelated Ainu individuals and sets of 30 randomly chosen individuals per each 1KG East Asian population from the 1KG-Ainu dataset. We removed SNPs with minor allele frequency of <0.01 and randomly removed one from each pair of SNPs with r2 > 0.2 (“–indep-pairwise 200 25 0.2” option in PLINK), leaving 84,462 SNPs for the analysis. We ran 50 replicates with random seeds for the number of clusters (K) from 2 to 6 and chose a run with the maximum log likelihood for each K. The optimal value K = 2 was chosen based on its lowest fivefold cross validation error. We also performed a principal component analysis (PCA) of 1KG East Asians and the Ainu individuals, as implemented in the smartpca program in the EIGENSOFT package v4.1 (Patterson et al. 2006; Price et al. 2006).
We further estimated admixture time in the Ainu, using a decay of weighted admixture linkage disequilibrium (LD) as implemented in ALDER v1.03 (Loh et al. 2013). For this, we performed a two-reference ALDER analysis, using 1KG JPT (Japanese in Tokyo, Japan) and 12 Ainu with 100% Ainu ancestry in the ADMIXTURE analysis (“Ainu” in Figure S2A) as references and the other 10 Ainu as the target (“Ainu2” in Figure S2A). Three individuals who clustered together with mainland Japanese (“Ainu3” in Figure S2) were excluded from the analysis. We also ran ALDER with all 22 Ainu individuals as the target population, with SNP loadings for the Ainu–Japanese cline in PCA (PC1 in Figure S2B) as a weight function instead of specifying reference populations, to check if our split of Ainu individuals into two groups generates a bias in estimation. For both analyses, we applied bin size of 0.025 cM.
Because both analyses above suggested a recent admixture (12.3 and 11.6 generations, respectively) and we are interested in investigating the ancient history of the Ainu (Figure S3), we focused on the 12 individuals with no mainland Japanese ancestry (Ainu in Figure S2).
Population clustering and TreeMix analyses
We conducted a genetic clustering analysis of worldwide populations, with a subset of the WHA dataset, including all East Asian and Siberian individuals, using ADMIXTURE v1.22. For each dataset, we ran 50 replicates with random seeds for the numbers of clusters (K) from 2 to 9 and chose a run with the maximum log likelihood for each K. In all analyses, 5–10 best runs for each K had log likelihoods ranging within a difference of 1, supporting a convergence of the best runs. We chose the optimal K for each dataset by taking one with the lowest fivefold cross-validation error.
Then, we built a consensus tree of 15 worldwide populations representing major ancestry components inferred from the ADMIXTURE analysis. For this, we first removed all populations showing a negative three-population (f3) statistic (Reich et al. 2009; Patterson et al. 2012) to exclude populations that experienced recent admixture. Two additional populations were excluded because they showed significant evidence for admixture using ALDER: the Ulchi (using Korean and Itelmen as references; Z = 4.65 and P = 3.3 × 10−6) and the Eskimos (using Surui and the Chukchi as references; Z = 4.19 and P = 2.8 × 10−5). Second, for populations outside East Asia and Siberia, we arbitrarily chose Ju_hoan_North and Mandenka for Africans, Sardinian and Basque for Europeans, Papuan for Oceania, and Karitiana and Surui for Native Americans. These populations represent major branches of human continental diversity and have been used as such representatives in many population genetic studies (Li et al. 2008; Reich et al. 2011; Keller et al. 2012; Pickrell et al. 2012; Skoglund et al. 2015). Last, we removed the She from the analysis because of its unstable position in the population trees generated by TreeMix (Pickrell and Pritchard 2012). The resulting set of 15 populations covers well all ancestry components inferred from the ADMIXTURE analysis (Figure 2). Five hundred bootstrap replicates of the maximum-likelihood tree were generated for the final 15 populations by TreeMix, with 50 SNPs per block (“-k 50”). A majority consensus tree was inferred using the R package “ape” (Paradis et al. 2004). We also performed TreeMix analysis with 1 to 5 migration edges allowed (“-m 1” to “-m 5”) to detect major patterns of extra population affinity not captured by the tree in Figure 3. One hundred bootstrap replicates were conducted for each m value.
Formal tests of admixture
We calculated three-population (f3) and Patterson’s D statistics (Green et al. 2010) for all combinations of 71 worldwide populations (Table S1), using the qp3Pop and qpDstat programs in the ADMIXTOOLS v1.1 package (Patterson et al. 2012). All contemporary populations from the human genome diversity panel were included. In addition, all the other East Asian and Siberian populations were included if they had sample size ≥5. Four Yukagir individuals were removed because they show a large proportion of European-related ancestry, likely due to recent admixture.
We used the admixture LD decay-based method implemented in ALDER to provide additional evidence for admixture as well as an estimate of time since admixture, assuming a single pulse of admixture. We ran ALDER with two reference populations chosen based on the three-population test results. We applied bin size of 0.025 cM and required a minimum genetic distance of 0.5 cM between bins.
Frequency of derived alleles with selection signals in East Asians
We interrogated three variants, EDAR V370A (rs3827760), OCA2 H615R (rs1800414), and a noncoding SNP rs3811801 in the ADH gene cluster, which carry a selective sweep signal in East Asians. Because rs3827760 and rs3811801 were not included in the Ainu data, we imputed them using IMPUTE2 (Howie et al. 2009) with 1KG phase 3 dataset as a reference. Rs3827760 was imputed with high confidence (genotype posterior probability >0.98) for all 12 individuals. Except for two individuals who were omitted from further analysis, all imputed genotypes at rs3811801 had posterior probability ≥0.89. Therefore, although imputation in an isolated population may have reduced accuracy, genotypes were imputed with high confidence in our samples.
Genome scans of recent positive selection in the Ainu
We used autosomal SNPs from the 1KG-Ainu dataset to detect genomic regions showing signatures of recent positive selection in the Ainu. For this, we calculated the cross-population extended haplotype homozygosity (XP-EHH) statistic (Sabeti et al. 2007) against 1KG phase 3 CHB (Han Chinese in Beijing, China) and the population branch statistic (PBS) (Yi et al. 2010) using CHB as a comparison group and 1KG phase 3 CEU (CEPH Utah residents with northern and western European ancestry) as an outgroup. We removed strand ambiguous (G/C and A/T) SNPs from the analysis. To perform the XP-EHH analysis, we first phased the Ainu genotype data using SHAPEIT2 v2.r790 (Delaneau et al. 2013) with 1KG phase 3 data as a reference. The most likely haplotypes were chosen after running SHAPEIT2 with default parameter values. We summarized signals for each of 500-kb windows sliding by 50 kb. First, we counted the number of total SNPs (ntotal) and top 1% signal SNPs (ntop) for each window and each test. Second, we calculated a simple binomial probability P(X ≥ ntop) assuming a binomial distribution with success probability 1%, i.e., X ∼ B(ntotal, 0.01). Probability of 1 was assigned to windows with ntotal < 20. Then, we prioritized windows with binomial P ≤ 0.01 for both XP-EHH and PBS and merged adjacent windows. Finally, we narrowed down the peaks by removing 50-kb windows that do not harbor any top 1% SNP, resulting in 66 signal peaks for further analysis.
We used a simulation-based approach to test if our selection scans have enough power to distinguish loci under positive selection from neutrally evolving ones, given the small effective population size of the Ainu and our small sample size. For this purpose, we focused on the top 10 regions among the above 66, which have the highest PBS statistics. In our procedure, we first simulated neutral trajectories of derived alleles under the Wright–Fisher model with a constant size of Ne = 2,219, which we estimated from LD decay. Specifically, we fit a nonlinear regression model with the equation E(r2) = 1/(1 + 4Nec) + 1/n, where c is a genetic distance in morgans and n is the number of sampled chromosomes (Tenesa et al. 2007). We also repeated our simulations with a more conservative estimate Ne = 1000. Although it is hard to model accurately demographic history using array genotyping data, the range of Ne values we used is likely to span a range of relevant demographic scenarios. The time of divergence between Ainu and the other East Asians was assumed to be 800 or 1000 generations ago, and the frequency in Ainu at the time of the split was taken from the current frequency in the other East Asians. At the end of each simulation, we took the frequency (fpresent) and sampled 24 alleles following a binomial distribution of probability fpresent. We repeated this process 10,000 times and calculated an empirical probability of our data by taking the fraction of simulations where the counts of simulated-derived alleles are greater than those observed in the Ainu sample.
A subset of the Ainu samples is the result of recent admixture
We first investigated if the Ainu samples in our study represent a homogenous gene pool. Both a PCA and a genetic clustering analysis showed that the Ainu samples are genetically heterogeneous and form a few distinct clusters (Figure S2). Based on these analyses, three individuals labeled as Ainu were indistinguishable from the mainland Japanese samples (Figure S2); therefore, they were removed from the analysis. The same analyses also identified 10 Ainu individuals with substantial non-Ainu ancestry (>11%). Using ALDER, which is based on weighted admixture LD decay, we estimated the admixture time for these individuals to be 12.3 generations ago (Figure S3), further supporting a recent mixture (P = 1.1 × 10−6). If we included the entire sample of unrelated Ainu, a similar analysis with SNP loading on PC1 as a weight vector, without specifying reference populations, returned a similarly recent estimate (11.6 generations ago). Furthermore, to explore more complex admixture scenarios, we also fit data to a two-pulse admixture model, resulting in a combination of a younger (5.2 ± 2.1 generations ago) and an older (40.7 ± 8.9 generations ago) admixture event (Figure S3). When using a more stringent cutoff of 2.0 cM for the minimum genetic distance between markers, estimates were younger: 8.8 ± 2.1 for a single-pulse model and 4.6 ± 2.0 and 30.4 ± 10.1 for a two-pulse model. Even the older estimates of 30–40 generations ago from the two-pulse model, which set an upper bound of a continued gene flow, are too recent to be consistent with the initial expansion of the Yayoi culture into the Japanese archipelago (Jinam et al. 2012; Nakagome et al. 2015) or with the hypothesized gene flow from the so called “Okhotsk culture,” which spread throughout the Sakhalin and Hokkaido islands between the 5th and 11th centuries (Befu and Chard 1964; Ohyi 1975). Because we are primarily interested in the ancient history of the Ainu gene pool and its relationship with worldwide populations, we decided to primarily use 12 individuals with 100% Ainu ancestry (Ainu in Figure S2) as representative of the original Ainu gene pool in the following analyses. However, we also performed several analyses with all 22 Ainu individuals in parallel and obtained comparable results, as presented below.
The Ainu form an outgroup to all East Asian farmers including Tibetan populations
To infer the genetic relationship of the Ainu gene pool with worldwide populations, we compiled genotype data of the Ainu, the Sherpa, Tibetans, and 183 populations around the world as described in the Materials and Methods section (WHA dataset). With this dataset, we first characterized the Ainu ancestry in the context of East Asian genetic diversity, by performing a model-based unsupervised genetic clustering as implemented in the program ADMIXTURE. With the optimal number of ancestral components (K = 8), the Ainu individuals are assigned to a distinct ancestry (Figure 2). In suboptimal runs with fewer ancestral components (K ≤ 6), the Ainu, as most other East Asians, are modeled as a mixture of Siberian and East Asian ancestries, with limited contributions from other populations (Figure S4). Interestingly, Ainu-related ancestry is present in the Japanese and the Ulchi people from the lower basin of the Amur River (17.8 and 13.5% mean ancestry in Figure 2, respectively), as well as in two Nivkh individuals, an indigenous population from the Sakhalin island, in a similar analysis with additional samples (27.2% mean ancestry; Figure S5). This suggests a potential gene flow from an Ainu-related gene pool into these surrounding populations.
Then, we further explored the genetic relationships among the ancestry components inferred from the ADMIXTURE analysis. To do this, we aimed to choose population samples that were good representatives of the global population structure (Figure 2). First, we removed populations with signatures of recent admixture not involving the Ainu ancestry, suggested either by negative values of the three-population (f3) statistic or by significant decay of weighted admixture LD (see Materials and Methods). To simplify the analysis, we arbitrarily chose one or two populations to represent each of the major continental groups outside of East Asia, but kept all East Asian and Siberian populations with no signal of recent admixture. This process resulted in a total of 15 populations (Figure 3), which cover all ancestry components identified by ADMIXTURE.
Using a maximum likelihood-based algorithm implemented in TreeMix, we found that the Ainu can be modeled as an outgroup to all East Asian farmers (Ami, Atayal, Dai, Lahu, and the Sherpa; Figure 3) in all 500 bootstrap replicates. The long terminal branch leading to the Ainu (Figure 3), as well as slow decay of LD (Figure S6), suggests a strong genetic drift, expected for a small population isolate. The Ainu’s position as an East Asian outgroup in the tree is unlikely due to gene flow from outside East Asia, as no significant results were found with Patterson’s D statistic in the form of D(African, non-African outgroup; Ainu, East Asian farmer) (Table S2). We further investigated this point by using an additional dataset of 320K SNPs in a smaller set of populations including also the Neandertal and Denisova data (“CND-1KG-Ainu” dataset). Consistent with the TreeMix analysis, the Ainu form a clade with East Asian populations relative to Europeans or South Asians (Table S3 and Figure S7). Additionally, at the resolution provided by this data set, the Ainu are not inferred to contain more archaic ancestry than other East Asian populations. The TreeMix result is unlikely to be an artifact of either genetic drift or variant ascertainment bias: this algorithm was shown to work well even if population-specific variants are common (extreme genetic drift) or if all variants are ascertained in a single population (Pickrell and Pritchard 2012). In addition, the allele frequency distribution of the SNPs included in the analysis was similar across the Ainu and the other East Asian, Siberian, and Native American populations (Figure S8) and it did not vary substantially between different intersected sets of SNPs (Figure S9). The Sherpa formed an outgroup to the lowland East Asian farmers, consistent with our previous study showing a deep split between high- and low-altitude East Asians (Jeong et al. 2014). Siberian populations (Nganasan and Itelmen) were modeled either as a sister group of all East Asians including the Ainu (76.8%; Figure 3) or as a sister group of Native Americans, Karitiana, and Surui (23.2%; Figure S10). When we allowed for migration edges in the TreeMix analysis, gene flow events between Europeans and Native Americans or Siberians were robustly inferred in all bootstrap replicates (Figure S11 and Table S4). This pattern is in agreement with the findings of previous studies on the genetic history of Native Americans and Siberian populations (Rasmussen et al. 2010; Fedorova et al. 2013; Raghavan et al. 2014).
The Ainu share more ancestry with low-altitude than with high-altitude East Asians
Human populations often have a complicated genetic history, which cannot be fully captured by a simple tree-based model. The distribution of residual covariances from the maximum likelihood trees indeed suggests that the consensus tree cannot fully explain the data for many of the populations, including the Ainu (Figure S12). Therefore, as a next step, we investigated if the Ainu have extra affinity with other populations beyond what could be inferred in a bifurcating tree. First, we tested if East Asian farmer populations are symmetric to each other in terms of their relationship with the Ainu, as suggested by the Ainu position as an outgroup to these populations in the consensus population tree (Figure 3 and Figure S13A). If the population relationships in the consensus tree hold, two East Asian farmer populations should be equally close to the Ainu. However, the D statistics in the form of D(outgroup, Ainu; Sherpa, other East Asian) consistently showed significantly positive values (D > +2.9 SD; Table S5), suggesting gene flow between the Ainu and lowland East Asian populations after their split from the high-altitude populations. Unsurprisingly, the inclusion of the 10 recently admixed Ainu individuals further strengthened this pattern (D = +3.0 to +8.6 SD for the same set of tests). Genetic affinity tests of East Asian populations with the Ainu, assessed by the outgroup f3 statistic (Raghavan et al. 2014), also supported a closer relationship between the Ainu and lowland East Asians than between the Ainu and the Sherpa (Figure S14). TreeMix analysis allowing migration edges also detected a similar relationship: an edge between the Ainu and lowland East Asians was inferred for 59–88% of bootstrap replicates, when three or more migration edges were allowed (Figure S11 and Table S4).
The Ainu share ancestry with northeast Siberians but not with central Siberians
Previous genetic studies of Siberian populations clearly demonstrated genetic differentiation between northeast Siberians and the rest of the Siberian populations (Volodko et al. 2008; Rasmussen et al. 2010; Fedorova et al. 2013; Raghavan et al. 2014). Our genetic clustering analysis recapitulates this observation, by separating the northeast Siberian ancestry (skyblue in Figure 2; concentrated in Eskimo, Chukchi, Itelmen, and Koryak) from central Siberian ancestry (orange in Figure 2; most prevalent in the Nganasan and present in southern Siberians and northeast Asians). Even though the Itelmen and the Nganasan cluster together in our population tree (Figure 3 and Figure S10, Figure S13B), most East Asian populations are genetically closer to the Nganasan than to the Itelmen, as shown by their negative D(African, East Asian; Nganasan, Itelmen) statistic (−2.4 to −12.0 SD; Figure 4 and Figure S15). Interestingly, the Ainu were the only East Asian population showing a closer affinity to the Itelmen than to the Nganasan, although the observed negative D statistic was within statistical noise (+1.5 SD; Figure 4). This pattern was robust to the inclusion of the 10 recently admixed Ainu individuals, which, as expected, dampens the signal due to the presence of mainland Japanese ancestry (+1.0 SD; Figure S16). Consistent with this result, a D test in the form D(African, Siberian, Ainu, East Asian) showed that the Itelmen are genetically closer to the Ainu than to East Asian farmer populations, but the Nganasan are symmetric in their relationship to the Ainu and the East Asian farmers (Table S6). Northeast Asian or southern Siberian populations could not be directly compared in this way because of the shared ancestry between the Itelmen and the Nganasan. We obtained qualitatively similar results when we replaced the Itelmen with the Chukchi (Figure S17 and Table S6). The TreeMix analysis also detected migration edges between the Itelmen and the Ainu when four or five migration edges were allowed, but only in 13–23% of bootstrap replicates (Table S4).
An Ainu-related ancestry was introduced Into nearby populations
Our genetic clustering analysis strongly suggests that the Ainu-related ancestry substantially contributed to the gene pools of nearby populations, such as the Japanese or the Ulchi (Figure 2 and Figure S4, Figure S5). However, strong genetic drift in the Ainu may artificially generate such a signal. Therefore, we applied two formal tests of admixture, which use different aspects of genetic variation data, to test for Ainu-related admixture in these populations. First, three-population test statistics were significantly negative when the Ainu were used as a reference: −22.2 SD for the Japanese (using the Ainu and Han as references) and −3.9 SD for the Ulchi (using the Ainu and Nganasan as references). Second, admixture LD decay was clear in both populations (P = 3.7 × 10−26 and P = 8.7 × 10−9 for the Japanese and the Ulchi, respectively), with estimates of admixture time ∼70 and 22 generations ago for the Japanese and the Ulchi, respectively (Figure S18, Figure S19).
The Ainu genome harbors shared and unique signatures of adaptations
To investigate adaptive evolution occurring in the Ainu, we analyzed the allele frequencies of variants known to be swept to high frequency in East Asians and performed genomic scans of recent positive selection across the Ainu genome.
A nonsynonymous V370A (rs3827760) mutation in the EDAR (ectodysplasin A receptor) gene harbors a strong selective sweep signal shared among low- and high-altitude East Asians and Native Americans, which is not present in contemporary western Eurasian populations (Kamberov et al. 2013). In addition, the derived allele is associated with “East Asian phenotypes,” such as shovel shaped incisors (Kimura et al. 2009). In sharp contrast to surrounding populations, this allele occurs at only 25% (6 of 24) frequency in the Ainu (Table S7). Consistent with this finding, the Ainu are also reported to have the sundadont dental pattern, even though the sinodont pattern, which is associated with shovel shaped incisors, is the dominant one in northeast Asia (Howells 1997). This suggests that the Ainu ancestors may not have shared the selective pressures for EDAR V370A with other ancestral East Asian and Native American populations.
In contrast to EDAR V370A, two variants occurring at high frequency in East Asia and virtually absent elsewhere, rs1800414 (H615R) in the OCA2 gene (Hider et al. 2013) and rs3811801 in the ADH gene cluster (Li et al. 2011), have high frequency in the Ainu (Table S7). The onset of positive selection on these two variants was estimated to have occurred <11,000 years ago (Peng et al. 2010; Li et al. 2011; Chen et al. 2015). Therefore, the Ainu ancestors may have shared Holocene environmental factors favoring these variants with other East Asians, although gene flow between the Ainu and other East Asians, as we inferred from genome-wide SNP data, may also have contributed to their high frequencies in the Ainu. Interestingly, they occur at low frequency, ∼10% in the Sherpa and Tibetans, raising the possibility that selective pressure on these variants was different in the high-altitude environments.
To find genomic loci involved in local adaptations in the Ainu, we performed a LD-based test of recent positive selection (XP-EHH), and an allele frequency-based test (PBS), in the Ainu against CHB from the 1KG phase 3 dataset. We found a total of 66 genomic regions containing excess SNPs with extreme values (top 1%), defined by a binomial probability ≤0.01 of having the number of extreme SNPs equal to or greater than the observed one, of both XP-EHH and PBS statistics (Table S8). One such region on chromosome 11 spans the Apolipoprotein (APO) gene cluster, including the APOA5, APOA4, APOC3, and APOA1 genes. This region contains a noncoding SNP rs964184 associated with levels of blood triglycerides and LDL cholesterol, HDL cholesterol, and risk of coronary artery disease and ischemic stroke (Global Lipid Genetics Consortium 2013; Dichgans et al. 2014). Interestingly, the risk allele has much higher frequency in the Ainu, reaching 75% in comparison to 22% in 1KG CHB, and is found on a haplotype with extended LD (Figure S20).
In addition, multiple loci among the above 66 regions include SNPs showing extreme allele frequency differentiation between the Ainu and 1KG East Asians excluding JPT (Table S9). Several genes in these loci have been reported to be associated with other metabolic traits, such as body mass index, glomerular filtration rate, and serum metabolite levels (Table S9). Because the Ainu sample size is small (n = 12) and the Ainu have had historically small effective population size (Figure S6), high allele frequency divergence may not necessarily be due to local adaptations. Therefore, we relied on simulations to test if the observed allele frequency difference is greater than expected by chance for a small population and a sample size as small as ours. We found that for all of the 10 top PBS regions, the difference in allele frequency is unlikely to be due to chance (empirical P-value ≤0.01), based on neutral simulations of population of constant effective population size (Ne) of 2219, as estimated using an LD decay method (Table S9). We obtained similar results (9 of 10 top PBS regions with empirical P-value ≤0.05) (Table S9), when we used a smaller, more conservative estimate of Ne = 1000.
Our genome-wide analysis shows that the Ainu are one of the deepest branches of East Asian diversity, forming an outgroup to all present-day East Asian farmers, including high-altitude populations (Figure 3). The deep history of the Ainu is consistent with the archaeological record for the Jomon culture in Japan starting 16,500 years ago (Habu 2004) as well as their hunter–gatherer life style. Therefore, the ancestors of the Ainu are likely to have reached the Japanese archipelago in an early migration event distinct from the spread of farmer populations across East Asia.
Interestingly, we find evidence for extra genetic affinity between the Ainu and northeast Siberians (Itelmen and Chukchi), who share ancestry with Native Americans. This finding coupled with the ancient origin of the Ainu raises the possibility that the same migration event led to the settlement of Jomon hunter–gatherers and to the initial dispersal of Native American ancestors. If this is the case, this first northward migration took place before the LGM (“scenario 1” in Figure 5A). This proposal is consistent with previous studies that suggested a connection between Jomon or Ainu people and Native Americans based on morphological and genetic evidence (Tokunaga et al. 2001; Adachi et al. 2009; Owsley and Jantz 2014) (Figure 4 and Table S6). Under this scenario, the split between the Ainu and Native American ancestors is likely to have occurred earlier than the gene flow of western Eurasian ancestry into the Native American ancestors (Raghavan et al. 2014), because the Ainu and other East Asians are symmetrically related to contemporary Europeans and to the ancient MA1 sample (Table S2). However, a recent study reported no genetic affinity between the Ainu and the Kennewick man (Rasmussen et al. 2015). An alternative to this scenario is that there may have been more recent gene flow between the Ainu and northeast Siberian populations (“scenario 2” in Figure 5B). Our TreeMix analysis with migration edges suggests a gene flow from the northeast Siberians to the Ainu, although both the pattern itself and its direction are not robust (Table S4). As explained above, the spread of the Okhotsk culture is unlikely to account for this finding, although such a contact across the Sea of Okhotsk may have happened earlier than the Okhotsk culture. Whole genome sequence data of the Ainu, ancient Jomon samples, and northeast Siberians will shed more light on the details of this history (Li and Durbin 2011; Schiffels and Durbin 2014).
Surprisingly, we also find extra genetic affinity between the Ainu and lowland farmer populations in comparison to the Sherpa (Figure S14 and Table S5), indicating gene flow between these two groups of populations. A long-standing hypothesis posits that Ainu and Tibetans share a part of their ancestry that is not present in other East Asian populations based on patterns of Y chromosome variation. The Y chromosome haplogroup D-M174 shows a striking pattern of geographic distribution: it is highly prevalent in Tibetans and Japanese (especially in the Ainu) and virtually absent everywhere else in Eurasia (Hammer et al. 2006; Shi et al. 2008; Chiaroni et al. 2009; Stoneking and Delfin 2010). A possible explanation for the Y chromosome data is that the Tibetan and the Japanese branches of this haplogroup have deep coalescence times, i.e., older than 30,000 years before the present (Shi et al. 2008). Therefore, even if the presence of the D-M174 haplogroup in the Ainu and Tibetans is due to shared ancestry, the shared history of these populations was short and left only a weak genome-wide signature of shared variation in their gene pools.
Even though we find strong evidence of gene flow between the Ainu and lowland East Asian farmers, it is hard to establish whether migrations were mainly unidirectional and, if so, which direction was predominant. One possibility is that Ainu-related populations, probably hunter–gatherers, once occupied mainland East Asia preceding the expansion of farmers and that they contributed to the gene pool of the latter. The observation of Ainu-related ancestry in the Ulchi from the lower Amur River basin (Figure 2 and Figure S4, Figure S5) is consistent with the presence of such an Ainu-related population in mainland northeast Asia. This model of gene flow is not expected to generate a signature of admixture in the Ainu. Consistent with this scenario, we fail to detect an admixture signal in the Ainu beyond the 10 recently admixed individuals (Figure S2, Figure S3): three-population statistics are strongly positive for all combinations of reference populations listed in Table S1 (> +25 SD). Extended LD in the Ainu and the lack of a reference population representing the Jomon ancestry made it difficult to test for admixture in the Ainu using ALDER: indeed, decay constant and amplitude parameter estimates from one-reference ALDER analysis did not change regardless of our choice of reference population (decay constant = 9.0–9.8 generations ago, amplitude = 1.6 to 2.3 × 10−4 across all 26 1KG populations as a reference). This pattern was reported to be a likely false positive in the ALDER analysis, which occurs when the target population experienced strong genetic drift (Loh et al. 2013). Therefore, we probably did not have enough power to accurately infer the timing and direction of ancient gene flow events, such as those we found between the Ainu and lowland East Asians or northeast Siberians. Genetic analysis of ancient Jomon and Ainu samples over a range of time periods will be critical to distinguish among these hypotheses.
While we confirm the evidence first reported by Jinam et al. (2012) for an admixture event between Yayoi farmers and Ainu ancestors, we do not find evidence supporting a claim for a gene flow into the Ainu from an unknown population. This claim was based on a group of five Ainu individuals clustering away from the rest in PCA plots (Jinam et al. 2012). We think this is an artifact of including close relatives in PCA, a well-known phenomenon. Indeed, a recent reanalysis by Jinam et al. (2015) also found that exclusion of close relatives from the analysis removed this clustering pattern (Jinam et al. 2015). We discuss this issue in more detail in Text S1.
We took a cautious approach in performing and interpreting genome scans of recent positive selection in the Ainu, because of their small effective population size and our small sample size. For example, we decided not to use the integrated haplotype score approach (Voight et al. 2006) because it requires substantial numbers of haplotypes harboring both ancestral and derived alleles at the focal variant. Considering this limitation, it is particularly encouraging that we find several loci harboring extreme allele frequency differentiation in the Ainu (greater than expected under neutrality based on simulations) in comparison to 1KG CHB data (Table S9). In particular, lipid metabolism may have been a key process for local adaptation in the Ainu, as suggested by the selection signature around the APO gene cluster and their historical heavy dependence on a marine subsistence. Archaeological evidence for heavy reliance of the prehistoric Jomon culture, particularly in the northeastern part of Japan, on marine mammals and fish (Chisholm and Koike 1999; Yoneda et al. 2002) may provide a plausible link between our findings and local adaptations of the Ainu and Jomon people.
We thank the Asian DNA Repository Consortium for sharing the Ainu genotype data for this study. We are grateful to Hiroki Oota, Shuhei Mano, John Novembre, Matthew Stephens, and David Witonsky for helpful discussions and advice on data analysis methods. Computational resources for data analysis were provided by Tadashi Imanishi in the Biomedical Informatics Laboratory, at Tokai University School of Medicine, the Center for Research Informatics, at the University of Chicago, and the Beagle supercomputer provided by the Computation Institute and the Biological Science Division of the University of Chicago and Argonne National Laboratory, under grant 1S10OD018495-01. This work was supported in part by the National Institutes of Health grant R01HL119577 to A.D. C.J. was supported by a Samsung scholarship. S.N. was supported by a postdoctoral fellowship for research abroad from the Japan Society for the Promotion of Science (26-541).
Communicating editor: S. Ramachandran
Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178673/-/DC1.
- Received May 27, 2015.
- Accepted October 19, 2015.
- Copyright © 2016 by the Genetics Society of America