Male Demography in East Asia: A North–South Contrast in Human Population Expansion Times
Yali Xue, Tatiana Zerjal, Weidong Bao, Suling Zhu, Qunfang Shu, Jiujin Xu, Ruofu Du, Songbin Fu, Pu Li, Matthew E. Hurles, Huanming Yang, Chris Tyler-Smith


The human population has increased greatly in size in the last 100,000 years, but the initial stimuli to growth, the times when expansion started, and their variation between different parts of the world are poorly understood. We have investigated male demography in East Asia, applying a Bayesian full-likelihood analysis to data from 988 men representing 27 populations from China, Mongolia, Korea, and Japan typed with 45 binary and 16 STR markers from the Y chromosome. According to our analysis, the northern populations examined all started to expand in number between 34 (18–68) and 22 (12–39) thousand years ago (KYA), before the last glacial maximum at 21–18 KYA, while the southern populations all started to expand between 18 (6–47) and 12 (1–45) KYA, but then grew faster. We suggest that the northern populations expanded earlier because they could exploit the abundant megafauna of the “Mammoth Steppe,” while the southern populations could increase in number only when a warmer and more stable climate led to more plentiful plant resources such as tubers.

HUMANS have expanded enormously in geographical range and numbers in the last 100,000 years, starting as a rare species confined to parts of Africa and ending with the current population of >6 billion distributed all over the world, but the details of these changes are poorly understood (Jobling et al. 2004). Historical records document a substantial demographic expansion within historical times, and also the complexity of the changes that have occurred, but are available only for the last few thousand years. Before this time, the archaeological record indicates that humans increased substantially in number when Neolithic transitions led to greater and more reliable food production after ∼10 thousand years ago (KYA), but provides only limited quantitative information. Genetic variation can also provide insights into past demography. Standard neutral models of evolution predict the extent and pattern of variation expected in a constant-sized population, but experimental data from the human population are often not consistent with such a model. For example, an overall excess of rare variants, reflected by negative values for Tajima's D (e.g., Akey et al. 2004), is commonly interpreted as a signal of demographic expansion, although the details of such an expansion remain unclear (Wall and Przeworski 2000; Ptak and Przeworski 2002). Genomewide analyses of short tandem repeats (STRs) have been interpreted as revealing an early expansion in Africa 49–640 KYA with no expansion outside Africa (Reich and Goldstein 1998), or alternatively a constant population size in Africa compared with expansions in Europeans and Asians (Kimmel et al. 1998). A larger-scale study using 377 loci in 52 populations suggested expansion in African farmers starting ∼35 KYA, in Eurasians ∼25 KYA, and in East Asians ∼18 KYA, but found no significant signal of growth in African hunter-gatherers or populations from Oceania and America (Zhivotovsky et al. 2003).
The conflicting conclusions may reflect, in part, the complexity of the real events, so that it may not be useful to compare descriptions of demographic change summed over large geographical regions. Instead, studies at higher spatial resolution may be necessary to understand how the demography has changed in different ways at local levels.

Single loci can be influenced by stochastic variation and locus-specific selection, but nevertheless two of them, mitochondrial DNA (mtDNA) and the Y chromosome, are of particular interest because of the insights they can provide into female-specific and male-specific evolutionary patterns, respectively. Mismatch distributions of mtDNA sequences from populations around the world have suggested expansion, on average, ∼40 KYA (Sherry et al. 1994), while a phylogenetic star contraction method indicated expansion of the major Asian clades M and N ∼30 KYA (Forster et al. 2001). Studies of the Y chromosome have shown a strong signal of expansion beginning in the Paleolithic ∼18 KYA (7–41 KYA, Pritchard et al. 1999) or ∼22 KYA (8.5–50 KYA, Macpherson et al. 2004) worldwide, with limited variation between continents. In contrast, a detailed study of one country, Armenia, suggested a start of expansion in the Neolithic ∼4.8 KYA (2.0–11.1 KYA, Weale et al. 2001).

We want to understand the history of East Asia, including its male demographic history. Modern humans were present in Australia at ∼50 KYA and, despite a lack of direct archaeological evidence, may have reached the southern part of East Asia at about the same time (Jobling et al. 2004). Classical marker studies reveal a genetic distinction between northern and southern China, with a boundary corresponding approximately to the Yangtze River (Xiao et al. 2000). Some authors have suggested that modern East Asian populations are derived largely from a northward expansion of southern populations after the last glacial maximum (LGM) ∼18–21 KYA (Jin and Su 2000), while others have suggested a significant male contribution from Central Asia (Karafet et al. 2001). Despite these and other (Su et al. 1999; Deng et al. 2004) surveys of Y-chromosomal haplogroup distributions, we know little about the detailed demography of the region and how it compares between north and south. We now show that male demographic history differs substantially between the northern and southern parts of East Asia and link this to ecological differences between the regions in the Paleolithic period.


Data set:

Nine hundred eighty-eight males belonging to 27 populations from China, Mongolia, Korea, and Japan were included in this analysis. The samples, and their typing with 16 Y-specific binary and 16 short tandem repeat (STR) markers, have been described previously (Zerjal et al. 2003; Xue et al. 2005). For this study, we typed hierarchically an additional 29 binary markers (M89, M8, M38, P33, M217, M93, M48, M61, M76, M147, M27, M214, M5, M128, M178, M119, M101, M50, M175, P31, M95, M88, M122, M121, M134, M164, M159, M113, and M7) using multiplexed primer-extension reactions (Paracchini et al. 2002) adapted for the ABI (Columbia, MD) Prism SNaPshot system (Hurles et al. 2005) according to the manufacturer's guidelines. As before, DYS19 was excluded from most analyses because it is duplicated in some individuals.

Data analyses:

Haplotypes for this haploid locus could be constructed simply from the combination of STRs and/or binary markers present in the same individual and their frequencies determined by counting. Analysis of molecular variance (AMOVA) was performed using Arlequin 2.0 (Schneider et al. 2000) and spatial AMOVA (SAMOVA) analysis using SAMOVA1.0 (Dupanloup et al. 2002). Spatial autocorrelation was carried out using autocorrelation index for DNA analysis (AIDA) (Bertorelle and Barbujani 1995). Inferences about Y-chromosomal lineage histories and demographies were made using the Bayesian analysis of trees with internal node generation program (BATWING) (Wilson et al. 2003). Populations (represented by 25–65 individuals) were analyzed individually using weakly informative prior distributions for N, the effective population size before expansion [gamma(1, 0.0001): mean = 10,000, SD = 10,000]; α, the rate of growth per generation [gamma(2, 400): mean = 0.005, SD = 0.0035]; and β, the time in coalescent units when exponential growth began [gamma(2, 1): mean = 2, SD = 1.41] (Wilson et al. 2003). A calibrated “evolutionary” mutation rate for Y-STRs (Zhivotovsky et al. 2004) was used as the basis for a per-locus mutation rate prior of gamma(1.47, 2130) (mean = 0.00069, SD = 0.00057) and was allowed to vary independently for each locus. This mutation rate was calibrated against two historical events (the divergence of the Maoris and Cook Islanders in the Pacific and the migration of the Bulgarian Roma from India to Europe), and thus our time estimates are also calibrated against these events and do not depend on assumptions about generation time. Binary markers (unique event polymorphisms, UEPs) were included under option 2, in which they condition only the tree structures possible. 
A total of 10⁴ samples of the program's output representing 10⁶ Markov chain Monte Carlo (MCMC) cycles were taken after discarding the first 3 × 10³ samples as “burn-in,” and convergence was confirmed by examining longer runs of 10⁸ MCMC cycles for four populations and finding the same posterior distributions. The influence of population sample size in the range 25–65 was investigated by randomly subsampling 25, 30, 40, or 50 individuals from the Outer Mongolian population with size 65 and found to be negligible. The 0.025, median, and 0.975 percentiles of the output samples were recorded. Regression analyses were carried out using SPSS 14; the stepwise criteria in multiple linear regression were the defaults, probability of F to enter ≤0.05 and probability of F to remove ≥0.10. A contour plot of expansion times was drawn using SigmaPlot version 9 with inverse square smoothing and a sampling proportion 0.5.
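The means and standard deviations quoted for the gamma priors can be checked directly. The following is a minimal sketch, assuming the gamma(a, b) notation denotes shape a and rate b, so that the mean is a/b and the SD is √a/b; the function name and the dictionary labels are illustrative and are not part of BATWING.

```python
import math

def gamma_mean_sd(shape, rate):
    """Mean and SD of a gamma distribution in the shape/rate parameterization."""
    return shape / rate, math.sqrt(shape) / rate

# Prior parameters as quoted in the text; the labels are descriptive only.
priors = {
    "N (effective size before expansion)": (1, 0.0001),
    "alpha (growth rate per generation)": (2, 400),
    "beta (start of growth, coalescent units)": (2, 1),
    "mu (per-locus STR mutation rate)": (1.47, 2130),
}

for name, (shape, rate) in priors.items():
    mean, sd = gamma_mean_sd(shape, rate)
    print(f"{name}: mean = {mean:.3g}, SD = {sd:.3g}")
```

Under this parameterization the computed values agree with the means and SDs given for all four priors.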

“Expansion” always refers to an increase in numbers rather than area and “expansion time” to the time when the increase started, unless otherwise stated.


Approximately 1000 males from 27 East Asian populations were typed with 61 Y-chromosomal markers, and we first describe the basic properties of this data set. The 45 binary markers identified 31 haplogroups (including paragroups) in the sample, while the 15 STRs defined 730 different haplotypes (Figure 1, Table 1; see also supplemental Table 1). Population diversities ranged from 0.60 to 0.94 for binary markers and from 0.84 to 1.00 for STRs (Table 2). There was considerable variation in the distribution of lineages between populations, but this did not correspond to the major ethnic distinction in the area, which is between the Han Chinese (>80% of the combined populations of China, Mongolia, Korea, and Japan) and the other populations. AMOVA analysis showed that only 1.8 and 0.5% of variation lay between Han and non-Han populations using binary and STR markers, respectively, and neither of these values was significantly greater than zero. There were, however, major geographical differences. Figure 2 shows that, despite the overall predominance of haplogroup O (56%), specific haplogroups were concentrated in each geographical region: C and N in the north; P and J in the west; O2b in the east; and O1*, O2*, and O3d in the south. We therefore wished to identify the most important elements of the geographical pattern in an objective way.
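As an illustration of how a diversity value can be obtained from haplotype counts, the sketch below assumes the commonly used unbiased estimator h = n/(n − 1) × (1 − Σpᵢ²), where pᵢ is the frequency of haplotype i among n chromosomes. The text does not state which estimator was used, so this choice is an assumption, and the haplotypes shown are invented for the example.

```python
from collections import Counter

def haplotype_diversity(haplotypes):
    """Unbiased diversity: n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two chromosomes")
    counts = Counter(haplotypes)
    sum_sq_freq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq_freq)

# Hypothetical haplotypes: each tuple holds repeat counts at three STRs.
sample = [(14, 12, 23), (14, 12, 23), (15, 12, 24), (13, 11, 23)]
print(round(haplotype_diversity(sample), 3))
```

With this estimator a sample in which every chromosome carries a different haplotype has a diversity of 1, which is how values near 1.00 arise when many STRs are combined.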

Figure 1.—

Phylogeny of Y-chromosomal haplogroups detected in this study.

Figure 2.—

Geographical distributions of Y-chromosomal haplogroups. (A) Populations sampled. (B–F) Haplogroup frequencies: circle area is proportional to sample size and sector area to haplogroup frequency. (B–E) Haplogroups are sorted into those showing predominantly northern (B), western (C), southern (D), and eastern (E) distributions. (F) The overall frequency of the most common haplogroup, O.

Table 1.—

Haplogroup frequencies in East Asian populations

Table 2.—

Population statistics: observed variation and BATWING prior and posterior estimates

We based the subsequent analyses on the STR data unless otherwise indicated because of the problems in interpreting data from preascertained binary markers. SAMOVA analysis (Dupanloup et al. 2002) identifies, for a prespecified number of groups of populations, the geographical groups that are most differentiated from one another. Application of this method to the East Asian Y-STR data set using two or three groups distinguished small numbers of unusual populations, a finding that is readily understood from the high frequencies of the “star cluster” (Zerjal et al. 2003) and “Manchu cluster” (Xue et al. 2005) lineages in some northern populations, and reflects extreme expansions of individual patrilines within historical times. The use of four groups provided the most informative subdivision, with a cluster of six southern populations distinguished in addition to some of the northern ones (Figure 3A). This pattern corresponds well to the north–south distinction seen with classical markers and shows that, in this respect, the Y-chromosomal variation is typical of that on other chromosomes. The division of the sample into more groups led to further subdivisions in the south (e.g., Figure 3B). Spatial autocorrelation analysis (Bertorelle and Barbujani 1995), based on the binary marker variation, produced correlograms that indicated significant clinal patterns or long-distance differentiation (not shown). The north–south haplogroup structure is therefore a continuum rather than a sharp bipartite division. To understand it further, we have explored the characteristics of the populations in more detail, concentrating on the 22 non-Han populations because of the spread of the Han during historical times (Wen et al. 2004).

Figure 3.—

SAMOVA analysis illustrating the geographical divisions identified when four (A) or six (B) groups are specified.

A simple property of a population is the variation it contains, and this can be expressed in a number of ways. A widely used measure, diversity, is so high when 15 STRs are used that the differences between populations are small (Table 2) and difficult to interpret. Reducing the number of STRs to an arbitrary four or three (Table 2, supplemental Figure 1) produces a wider range of diversity values, and these are notably higher in the north than in the south. An alternative measure of variation within a population, average squared distance (ASD), shows a similar pattern. BATWING analysis allows demographic parameters of the populations to be explored. Using a model where the population size remains constant for a period and then begins to expand exponentially, we estimated, for each population, posterior values of (1) the effective population size during the constant period, Nposterior; (2) the time at which growth began; (3) the rate of growth per generation, α; and (4) the time to the most recent common ancestor (TMRCA) of the population (Table 2). We again noted substantial variation with latitude. Median Nposterior was higher in the north, the expansion began earlier, the rate of growth was slower, and the TMRCA was longer. Although all of these variables correlated significantly with latitude when examined individually in regression analyses (Table 3), the highest was with expansion time (adjusted R² = 0.68), compared with 0.40 for the next highest, ASD. Unsurprisingly, a stepwise multiple regression analysis identified expansion time as the best predictor of north–south distance, and only α increased this significantly to reach an adjusted R²-value of 0.75. Thus earlier expansion time in the north and, to a lesser extent, more rapid expansion in the south, account best for the observed north–south differences. We display the expansion times as a contour plot in Figure 4, where the consistent difference between north and south is apparent.
Figure 4 suggests, however, that the highest correlation of population expansion may not be with distance due north–south, but with distance along an axis tilted slightly northwest–southeast, and further examination showed that a tilt of ∼10° in fact gave the highest R²-value (0.71 compared with 0.69).
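The search for the axis giving the highest R² can be sketched as follows. This is an illustration only, not the SPSS procedure used here: population positions are projected onto an axis rotated away from due north, expansion time is regressed on that projected distance, and the tilt with the largest R² is kept. The coordinate system (kilometers east and north), the sign convention for the tilt, and all data values are hypothetical.

```python
import math

def project_on_axis(east_km, north_km, tilt_deg):
    """Position along an axis rotated tilt_deg from due north toward the west."""
    t = math.radians(tilt_deg)
    return -east_km * math.sin(t) + north_km * math.cos(t)

def r_squared(x, y):
    """R-squared of a simple linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)

def adjusted_r_squared(r2, n_obs, n_predictors=1):
    """Standard adjustment of R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

def best_tilt(coords, times, candidate_tilts):
    """Return the candidate tilt (degrees) whose axis gives the highest R-squared."""
    def score(tilt):
        axis_pos = [project_on_axis(e, n, tilt) for e, n in coords]
        return r_squared(axis_pos, times)
    return max(candidate_tilts, key=score)

# Hypothetical positions (km east, km north) and synthetic expansion times (KYA)
# constructed to vary exactly along an axis tilted by 10 degrees.
coords = [(0, 0), (100, 0), (0, 100), (100, 100), (50, 200), (-50, 150)]
times = [12 + 0.05 * project_on_axis(e, n, 10) for e, n in coords]
print(best_tilt(coords, times, range(-30, 31, 5)))
```

Because the synthetic times were built along a 10° axis, the search recovers that tilt; with real data the differences in R² between neighboring tilts would be much smaller, as the values 0.71 and 0.69 indicate.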

Figure 4.—

Contour plot showing the distribution of expansion times. Demographic expansion began earlier in the north than in the south.

Table 3.—

Regression analysis

The demographic model used is simple: it assumes that each population is independent and that a constant phase is followed by exponential growth. The other demographic models available in BATWING, constant population size or continuous expansion, are not informative about the expansion time. To explore one consequence of departure from the model used, we investigated artificial population mixtures constructed from combinations of the populations showing the earliest expansion (Inner Mongolians) and those showing the most recent [Yao (Bama) or Li]. The artificial population mixtures showed an early expansion time equivalent to that of the Inner Mongolians (Figure 5), demonstrating that the signature of early expansion is not obliterated by admixture.
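The shape of the demographic model described above, a constant phase followed by exponential growth, can be written as a short function. This is a sketch of the population-size trajectory only, not of BATWING's inference; it assumes time is measured in generations before the present, and the parameter values in the example are hypothetical rather than estimates from this study.

```python
import math

def population_size(t_ago, n_ancestral, alpha, t_start):
    """Effective size t_ago generations before present.

    The size is constant at n_ancestral until growth begins t_start
    generations ago, then increases exponentially at rate alpha per generation.
    """
    if t_ago < 0:
        raise ValueError("t_ago must be non-negative")
    if t_ago >= t_start:
        return n_ancestral
    return n_ancestral * math.exp(alpha * (t_start - t_ago))

# Hypothetical values: N = 1000, alpha = 0.005, growth starting 800 generations ago.
for t in (1000, 800, 400, 0):
    print(t, round(population_size(t, 1000, 0.005, 800)))
```

In this form an earlier start of growth combined with a smaller α, or a later start combined with a larger α, describe the two kinds of trajectory contrasted between north and south.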

Figure 5.—

Effect of artificial mixing of population data on estimated expansion time. Median values are plotted, together with their 95% confidence intervals.
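The medians and 95% intervals reported for the BATWING output are the 0.025, 0.5, and 0.975 percentiles of the retained MCMC samples. A minimal sketch of such a summary is given below, assuming the samples are available as a plain list of numbers; the function name, the burn-in handling, and the example values are hypothetical, and the interpolation rule is that of Python's `statistics.quantiles` rather than anything specified here.

```python
import statistics

def summarize_posterior(samples, burn_in=0):
    """Return the 0.025, 0.5, and 0.975 percentiles after discarding burn-in."""
    kept = sorted(samples[burn_in:])
    if len(kept) < 2:
        raise ValueError("need at least two retained samples")
    # n=40 gives cut points at multiples of 1/40, i.e. 0.025, 0.05, ..., 0.975.
    cuts = statistics.quantiles(kept, n=40, method="inclusive")
    return cuts[0], statistics.median(kept), cuts[-1]

# Hypothetical chain: 100 burn-in values followed by 1001 retained values.
chain = [999.0] * 100 + [float(v) for v in range(1001)]
low, median, high = summarize_posterior(chain, burn_in=100)
print(low, median, high)
```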


We consider how our findings on East Asian male variation compare with previous studies and the implications of our work for the understanding of the demographic history of the region.

The distribution of Y-chromosomal haplogroups in East Asia has been extensively documented (e.g., Jin and Su 2000; Karafet et al. 2001; Deng et al. 2004), but these observations have raised questions about the relationship of northern and southern populations that remain unanswered. Su et al. (1999) typed 19 binary markers, 12 of which were chosen because they were already known to be variable in East Asia, and found higher diversity in the south than in the north and that the northern lineages were a subset of the southern ones, leading them to suggest that the northern populations were derived from the south by northward migrations. In contrast, Karafet et al. (2001) used a larger set of 52 binary markers ascertained mainly because of their variation in worldwide populations and discovered higher diversity (mean pairwise differences) in the north and that the northern lineages were not a subset of the southern ones. They concluded that a contribution to the northern populations from Central Asia was likely. The use of preascertained binary markers introduces a bias into estimates of diversity, but STRs are essentially free of this bias because they are variable in all populations. In our samples, STR diversity and ASD measurements were higher in the north than in the south (Table 2), a finding that is not easily reconciled with a largely or exclusively southern origin for the northern populations. It has been suggested that some populations, such as Hui, Uygurs, and Mongolians, have recent admixture with Central Asia and so reliance on them may give a false impression (Shi et al. 2005), but our findings are common to most populations from the north (Table 2).

Our most striking observation was the demographic contrast between north and south, which was explained largely by the variation in the start of population expansion (Tables 2 and 3; Figure 4). Despite the simplified demographic model and wide confidence intervals in the BATWING estimates (Table 2), the median values exhibit a simple and striking pattern: all of the northern estimates lie between 22 and 34 KYA, while all of the southern estimates are between 12 and 18 KYA. These estimates suggest that the northern populations started to expand before the LGM (∼18–21 calendar KYA), while the southern populations started to expand after it. These time estimates are calibrated against historical events (Zhivotovsky et al. 2004) and so do not depend on the assumption of a particular male generation time, but nevertheless are uncertain, and so any interpretation based on them must be regarded with caution. Importantly, however, they are affected little by extensive admixture (Figure 5) and in such a case reflect the earlier expansion time. While extreme northern latitudes were inhospitable to early humans, Siberia has an extensive Upper Paleolithic archaeological record (Kuzmin and Orlova 1998) and a highly productive environment stretched across Asia. This environment supported an abundance of large animals and has been called the “Mammoth Steppe” (Guthrie 1990). Expansion times calculated in the same way for the Central Asian populations described by Zerjal et al. (2002), excluding those showing recent severe bottlenecks, lay between 24 (13–45) and 36 (16–74) KYA, like those of the northern populations from East Asia. We therefore propose that this cold but rich environment allowed the demographic expansion of populations who learned to exploit the profuse animal resources, and these people contributed in sufficient numbers to the ancestry of the northern populations we have tested to leave a signature in their paternal lineages.
In contrast, this environment did not extend to the southern region, and the populations based there expanded only after the end of the LGM as the climate became warmer and more stable. The large-scale use of underground tubers is thought to have begun in the south as early as 15 KYA (Tong 2004), and it is notable that population expansion was subsequently more rapid there. The survival of this distinct demographic signature provides further evidence for the genetic differentiation between north and south and lack of extensive gene flow, leading to a genetic boundary seen initially in classical marker studies (Xiao et al. 2000).

Our conclusions, of course, refer only to the time when expansion began and do not conflict with the notion that population numbers increased much further during Neolithic and historical times. They do, however, illustrate the value of demographic studies at high spatial resolution: a similar analysis of a combined East Asian sample would lead to the conclusion that population growth began at ∼30 KYA [in remarkable agreement with the mtDNA estimate (Forster et al. 2001)] and would miss an important distinction. Further detailed genetic studies of demography in other parts of the world are now needed.


We thank all sample donors for their contributions to this work and all those who helped with sample collection and Andrew Flint, Tim Cutts, and Mike Shield for setting up and maintaining BATWING. This work was supported by a Joint Project from the National Natural Science Foundation in China and The Royal Society in the United Kingdom, by Key Teacher support from the Education Office of Heilongjiang Province, and by The Wellcome Trust.


  • Communicating editor: M. Nordborg

  • Received December 4, 2005.
  • Accepted February 9, 2006.

