# Detecting Recent Positive Selection with a Single Locus Test Bipartitioning the Coalescent Tree

^{*}Key Laboratory of Computational Biology, Chinese Academy of Sciences-Max Planck Gesellschaft (CAS-MPG) Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China^{†}University of Chinese Academy of Sciences, Beijing 100049, China^{‡}Howard Hughes Medical Institute, University of California, San Francisco, California 94143^{§}Institut für Genetik, Universität zu Köln, 50674 Germany

- 1Corresponding author: Laboratory of Evolutionary Genomics, CAS-MPG Partner Institute for Computational Biology, Yue Yang Rd. 320, Shanghai 200031, China. E-mail: lihaipeng{at}picb.ac.cn

## Abstract

Many population genomic studies have been conducted in the past to search for traces of recent events of positive selection. These traces, however, can be obscured by temporal variation of population size or other demographic factors. To reduce the confounding impact of demography, the coalescent tree topology has been used as an additional source of information for detecting recent positive selection in a population or a species. Based on the branching pattern at the root, we partition the hypothetical coalescent tree, inferred from a sequence sample, into two subtrees. The reasoning is that positive selection could impose a strong impact on branch length in one of the two subtrees while demography has the same effect on average on both subtrees. Thus, positive selection should be detectable by comparing statistics calculated for the two subtrees. Simulations demonstrate that the proposed test based on these principles has high power to detect recent positive selection even when DNA polymorphism data from only one locus is available, and that it is robust to the confounding effect of demography. One feature is that all components in the summary statistics () can be computed analytically. Moreover, misinference of derived and ancestral alleles is seen to have only a limited effect on the test, and it therefore avoids a notorious problem when searching for traces of recent positive selection.

- positive selection
- adaptation
- demography

DETECTING recent events of positive selection and identifying beneficial alleles is of continued high interest in evolutionary biology. During the last two decades, many methods for this purpose have been proposed. Recent positive selection can alter the polymorphism pattern of a neutral locus which is partially or completely linked to a selected locus (*i.e.*, the locus carrying the beneficial allele). The altered site frequency spectrum can be detected if it differs significantly from the expectation under the standard neutral model (Tajima 1989b; Fu and Li 1993; Fay and Wu 2000; Zeng *et al.* 2006; Achaz 2009; Ferretti *et al.* 2010). Recent positive selection may also reduce genetic diversity around the selected locus due to genetic hitchhiking (Maynard Smith and Haigh 1974; Kim and Stephan 2002; Li and Stephan 2005; Nielsen *et al.* 2005), and result in an elongation of haplotype blocks (Sabeti *et al.* 2002). Several excellent reviews (Sabeti *et al.* 2006; Jensen *et al.* 2007; Fu and Akey 2013) provide more insights into the details of selection detection algorithms. Recently, also machine learning techniques have been proposed (Pavlidis *et al.* 2010; Lin *et al.* 2011; Ronen *et al.* 2013; Pybus *et al.* 2015; Schrider and Kern 2016). However, it still remains a challenging task to distinguish positive selection from demographic events since the amount and pattern of DNA polymorphism in samples from a population is typically affected not only by recent positive selection but also by demography (Nei *et al.* 1975; Watterson 1986; Tajima 1989a; Jensen *et al.* 2005; Teshima *et al.* 2006). The confounding effect of demography cannot be neglected when scanning genomes for candidate loci of recent positive selection (MacCallum and Hill 2006; Akey 2009; Hermisson 2009; Scheinfeldt and Tishkoff 2013; Vitti *et al.* 2013; Adrion *et al.* 2015).

Recent positive selection may produce a coalescent tree which is different from that under neutral evolution (Kaplan *et al.* 1989; Barton 1998; Fay and Wu 2000). Let us consider a locus partially linked to the selected locus, which means that a few recombination events happened during the selected phase. While selection operates, a lineage could escape from the selective sweep by recombination (Kaplan *et al.* 1989; Fay and Wu 2000; Kim and Nielsen 2004). In the tree topology this is reflected by one particularly long branch emerging from the root of the tree (Figure 1). Such tree topology is often called *unbalanced* (Sibert *et al.* 2002; Blum *et al.* 2006b; Li 2011; Li and Wiehe 2013). The unbalancedness of trees has been well studied (Kirkpatrick and Slatkin 1993; Blum *et al.* 2006a), and can also be measured as the distribution of the average length of the base branches that emanate from the root (Uyenoyama 1997).

Let us consider the tree of a sample collected from a random mating population. It is known that the probability for a tree to be unbalanced is independent of changes in population size (Tajima 1983; Hudson 1991). However, the probability of an unbalanced tree increases substantially when the neutral locus is partially linked to a positively selected site (Kaplan *et al.* 1989; Fay and Wu 2000). In this case, the majority of lineages fail to escape the selective sweep, so they will coalesce as a star-like subtree, and the internal branches are very short (Galtier *et al.* 2000) (Figure 1C). We found that the ratio between the lengths of two subtrees under the selective sweep model departs from the expectation under the standard neutral model (Figure 1). Therefore, hitchhiking also biases the average number of mutations in the two subtrees.

Recently, several groups proposed to detect selection using the information of coalescent tree, including topology of the tree (Li 2011; Disanto *et al.* 2013; Li and Wiehe 2013; Ferretti *et al.* 2017), and the distribution of coalescent times (Hunter-Zinck and Clark 2015; Ronen *et al.* 2015). Following those studies, we proposed a new test statistic, , to detect recent positive selection in a varying size population. The tree can be partitioned into two parts, denoted as major and minor branches (Figure 1). Based on the infinite-sites model and the number of mutations occurred on the major and minor branches, we can calculate and Under the standard neutral model, we expect the balanced tree, and where is Watterson’s (Watterson 1975), and estimated as the number of segregating sites divided by the tree length. However, under the selective sweep model, we expect an unbalanced tree, and since positive selection has a stronger impact on the major branches than on the minor branches. Thus, the unbalanced tree and a normalized difference between and (denoted as ) is proposed as a statistic to test for recent positive selection.

To measure the robustness of this statistic with respect to different confounding factors, we examined its false positive rate. Here, the false positive rate is defined as the probability to reject the standard neutral model (the null hypothesis) when actually no positive selection happened (Przeworski 2002; Jensen *et al.* 2005; Li 2011). A high false positive rate indicates that confounding factors should be taken into consideration when searching for loci under positive selection. We also compared our new method with other single locus tests for selection [see reviews by Sabeti *et al.* (2006), Fu and Akey (2013)]. Many of the available methods consider DNA polymorphism data from multiple loci, to first obtain an empirical background distribution, and then to perform genome scans for selective sweeps (Akey *et al.* 2002; Kim and Stephan 2002; Li and Stephan 2005, 2006; Nielsen *et al.* 2005; Sabeti *et al.* 2007). Generally, those methods require much more information than a single locus. Therefore, to conduct a fair comparison, we consider here only single locus based methods.

## Methods

### Statistical test

Unbalanced bifurcating trees have been introduced before (Purvis and Agapow 2002; Holman 2005; Li 2011; Li and Wiehe 2013). Here, we make a modification so that the unbalancedness can be defined for small samples. Let and be the number of left- and right-descendants of the two branches that originate from the root node of the tree, respectively, where and *n* is the number of sampled chromosomes. Furthermore, let such that may take integer values from to where means the smallest integer Under neutrality, it is known that, when is odd for and, when is even, (Tajima 1983).To bound the number of false positives produced by random topologies, we set We call a tree unbalanced if and 10% probability that the tree is more unbalanced (*i.e.*, ) under the standard Wright-Fisher neutral model. For smaller sample size, one can only take violating the 10% probability criterion.

Then let us consider a sample from a diploid population (although the test would also hold for haploid populations). Without loss of generality, we assume and denote as the node that is the most recent common ancestor (MRCA) of the lineages that are the left-descendants of the root. Thus, the whole sample is partitioned into two subsamples by the node (Figure 1). The branches of the subsample rooted by the node are defined as the major branches, and the rest of the tree is defined as the minor branches. Note that the minor branches include the two basal branches immediately descended from the root. Let be the expected total length of the minor branches under neutrality, scaled so that 1 unit corresponds to 2*N* generations, where *N* is the effective size of a diploid population. is the expected total length of the major branches under neutrality. and are calculated under the standard neutral model, and where is the tree length. Moreover, it is well-known that () can be estimated from the number of mutations on the tree since (Watterson 1975; Hudson 1991), where is the mutation rate per generation, the number of segregating sites (*i.e.*, the number of mutations occurred on the tree). Similarly, and can also be estimated from the number of mutations on the major branches () and minor branches (), respectively, and we have To calculate and details are given as follows:

Let * () be the time duration required for **k* sequences to coalesce to (*k*−1) sequences, *i.e.*, the so-called *k*-coalescent time. Following the standard coalescent process (Hudson 1991), we have(1)(2)(3)Furthermore, we can derive the expectation and variance of the branch lengths of the tree (Hudson 1991) from Equations (1) and (2):(4)(5)We first take as an example to derive the statistical properties of an unbalanced tree. More general results for unbalanced trees with are given in the Appendix. When the expectation and variance of and can be obtained from Equations (1)–(3) as follows:(6)(7)(8)(9)Equation (6) holds because is equal to the sum of the tree height and the branch length between the root and the node (Figure 1). Since(10)we can get the covariance of and (11)For a general value of we have where means the largest integer Let so we have If the sample consists of *n* sequences, there are coalescent events excluding the last one (from present to past). If we denote coalescent events that happen on major branches with “1,” and those that happen on minor branches with “0,” the sequence of coalescent events, from present to past, is a binary vector with “1”s and “0”s. Thus, there are possible binary vectors As an example, Figure 2 illustrates all the possible coalescent event vectors when and We use ** B** to denote the set of binary vectors. Each element in

**represents a set of trees having the same sequence of coalescent events. Each set of trees occur with the same probability (), since tree topology is independent of coalescent time, and trees with the same topology have the same probability. In other word for all**

*B*For convenience, we introduced a series of indices …, Each index corresponds to a coalescent event on major branches. The subscript in denotes the number of lineages in major branches right before this coalescent event (from present to past), and the value represents the number of lineages in the whole tree at the same time (Figure 2). With these indices, the expectation of can be calculated following the law of total expectation (details are given in the Appendix):(12)whereand Since we can easily obtain the expectation of from Equations (4) and (12),(13)Similar to the derivation of we derived the variance of and (details are presented in the Appendix),(14)where is determined by: if ( and ).(15)where is determined by where The covariance of and is(16)where and are given by Equations (5), (14), and (15), respectively.

Define where is the mutation rate of the locus, and let be the number of mutations on the minor branches. Then, we have Similarly, we have where is the number of mutations on the major branches. Thus, under the infinite-sites model, the population parameter can be estimated from and separately as(17)(18)Similar to Tajima’s *D* (Tajima 1989b), we denote the normalized difference between and as That is,(19)Then, we have where and can be expressed as functions of expectation, variance, and covariance of and (Equations 12–16). From Equation (16), the variance of is obtained as(20)where is estimated by Watterson’s method (Watterson 1975), that is, Similarly, we have the variance of (21)Calculation of is given in the Appendix. The result is(22)Following previous studies (Tajima 1989b; Fu and Li 1993), was estimated by Under the standard neutral model, we have thus However, it is expected that recent positive selection has stronger impact on than (Figure 1, C and D), which reduces This results in a stronger reduction of than of Thus and

To get the inferred value of for the locus, which is denoted by (*i.e.*, the branching pattern of the root), an unweighted pair group method with arithmetic mean (UPGMA) tree (Sneath and Sokal 1973) was reconstructed based on nucleotide differences. Here, only the information on the root of the UPGMA tree was collected. UPGMA was chosen in this study because it is an effective tree-construction algorithm, and can naturally determine the root. There are also other algorithms based on nucleotide differences that has similar efficiency, for instance neighbor-joining (Saitou and Nei 1987), but the latter is unable to infer the root without the information of outgroup. Moreover, in contrast to between-species phylogenetics, rate-variation of the mutation rate along branches is not a major concern in population genetic genealogies. A similar tree-bipartition algorithm was developed to estimate coalescent times (Tang *et al.* 2002; King and Wakeley 2016).

To calculate let the sampled sequences be numbered from 1 to *n*, assuming that the sample consists of *n* sequences. Thus, represents the whole sample. Without loss of generality, denote the descendants of the left branch originating from the root as and the descendants of the right branch originating from the root as Then, under the infinite sites model, is the number of segregating sites in the subsample (Supplemental Material, Figure S1 in File S1), and where and are the observed number of mutations on minor branches and major branches, respectively. Therefore, we do not infer the ancestral and derived status of alleles to estimate and and the outgroup is only used to root the tree. Following Equations (18) and (19), and can be calculated.

Finally, to test whether the data differ significantly from the expectation under the standard neutral model (one tail test), the *p*-value of the test is given by where is the estimated value of the statistic of the studied locus, and is the observed compact site frequency spectrum of the locus (Li and Stephan 2005). When an outgroup is available, the compact site frequency spectrum is polarized, and is the number of singletons observed in the sample, the number of doubletons, and (Li and Stephan 2005). When an outgroup is not available, the compact site frequency spectrum is unpolarized.

The test can only be performed if the sample size is not extremely small (*i.e.*, at least 10 diploid individuals, or the number of chromosomes ) since it requires to infer an unbalanced tree (*i.e.*, ). This restriction reduces the power because positive selection may not always produce an unbalanced tree. However, for the sake of keeping the false positive rate under control, we accept somewhat less sensitivity. We estimate the *P*-value of the test by generating 10^{4} samples of polymorphism data under the standard neutral model, and choosing the significance level of 0.05.

### Simulation data

To examine the performance of the test, single-locus DNA polymorphism data were simulated in different demographic scenarios. Then, the root branching pattern of tree () was estimated by the UPGMA method to partition each simulated sample into major and minor parts. Simulations for neutrality were done according to the procedures described previously (Hudson 1991). Positive selection was assumed to be directional with codominant alleles: denoting *b* as the wild-type allele and *B* as the favored allele on the selected locus, the three genotypes, *bb*, *bB*, and *BB*, have the relative fitness 1, 1+*s*, and 1+2*s*, respectively. The dynamics of the selected locus follow a deterministic approximation for the selective stage, where the frequency of the beneficial allele increases from to where (Kim and Stephan 2002; Li and Stephan 2006). We did not use time-forward simulations to generate the trajectory of the selected allele (Ewing and Hermisson 2010), and it is beyond the topic of this study to compare with the two trajectories.

### Data availability

Maize HapMapV2 data were downloaded from the Panzea website (www.panzea.org), which consisted of 84 maize sequences (Chia *et al.* 2012). The core region (84.63–84.67 Mb) on chromosome 10 was further analyzed by the new method. This region was previously identified as a candidate of positive selection on the maize genome (Tian *et al.* 2009; Lin *et al.* 2011). We also randomly selected and analyzed three chromosomal regions as neutral controls.

The software to perform the test is platform independent. It requires a preinstalled standard Java environment (version 1.5 or higher). The software for calculating can be freely downloaded from Zenodo (http://doi.org/10.5281/zenodo.835226) and our institutional website (http://www.picb.ac.cn/evolgen/softwares/).

## Results

### The distribution of

We first examined the distribution of under the standard neutral model without recombination (Figure S2 in File S1). The shape of distribution is slightly asymmetric, and, as expected, we have (Table S1 in File S1). With recombination, distribution shifts to the right, while the variance becomes smaller, so tends to be more positive (Figure S2 and Table S1 in File S1). This makes our test conservative. This is due to the fact that recombination breaks the region into several nonrecombined fragments. These fragments can have different genealogies. When there is no recombination, and under the infinite sites model, a genetic variant can either segregate in one of the major and minor subsample, or be monomorphic but distinct in the two subsamples. With recombination present, it could be segregating in both subsamples (Figure S1 in File S1). The genetic variant may then be counted as Thus, tends to be overestimated and tends to be positive when there is recombination within the locus. This agrees with the previous findings that recombination affects the distribution of summary statistics for detecting selection (Wall 1999).

Moreover, the infinite-sites model could be invalid due to multiple hits and homoplasy, although it is popularly used when detecting recent positive selection. We expect that the effects should be similar to recombination (Hudson 2001; McVean *et al.* 2002) and make the test more conservative, as discussed above.

### False positive rates under neutrality

Population size expansion and bottleneck may result in the incorrect inference of positive selection when there is none (Nei *et al.* 1975; Watterson 1986; Tajima 1989a; Jensen *et al.* 2005). As many researchers pointed out (Akey 2009; Hermisson 2009), this problem cannot be underestimated, in particular in genome-wide scans. To investigate the robustness of the test, the false positive rate is the probability that the null hypothesis (*i.e.*, the standard neutral model) is rejected when the neutral data are simulated under different demographic scenarios (Przeworski 2002; Jensen *et al.* 2005; Li 2011).

#### Population size expansion:

We examined the robustness of the tests under recent and old population size expansions (Figure 3E). The maximum false positive rate of Tajima’s *D* test, Fu and Li’s *D* test is ∼60–85% in the cases examined (Figure 3A). It agrees with the previous findings that Tajima’s *D* and Fu and Li’s *D* test are sensitive to population size expansion (Tajima 1989a; Fu and Li 1993). In contrast to these tests, but similar to the MFDM test, the test remains robust under all the examined expansion scenarios, no matter whether the outgroup is available or not. Fay and Wu’s *H* test performs comparably to the test in this model. We shall also point out that the false positive refers only to the case that the tests falsely identify the signal of positive selection in a neutral evolved sample. If we are interested with detecting population size expansion, the false positive rate would be indicated as power (Tajima 1989a; Fu 1997; Pluzhnikov *et al.* 2002).

#### Bottleneck:

We examined the population bottleneck model (Figure 3F), including recent and old bottleneck scenarios. Note that the bottleneck model becomes the expansion model when the duration of the bottleneck is large. The test remains robust under all the examined bottleneck scenarios, independent of the availability of an outgroup (Figure 3B). The robustness of Tajima’s *D* test, Fu and Li’s *D* test, and Fay and Wu’s *H* test depends on the age of bottleneck. Generally, Fay and Wu’s test is not robust for the most recent events and Tajima’s D tests, Fu and Li’s D for the intermediate age events. Additional simulations suggest that the robustness of the test holds when the number of sampled chromosomes is very small (*i.e.*, 10 diploids or ) (Figure S3 in File S1).

Moreover, the examined bottleneck scenarios are particularly severe in so far as they produce particularly large false positive rate of Tajima’s *D* test (Tajima 1989b) and the selective sweep based approach (Li and Stephan 2005), since they enlarge the variance of different summary statistics, for example and Watterson’s As has been noted before (Wall *et al.* 2002; Wiehe *et al.* 2007), bottlenecks with an extreme effect on the false positive rate have a severity of where is the duration of the bottleneck, the current effective population size, and the effective population size during the bottleneck. For such range of parameter values, in some loci, all the lineages coalesce rapidly during the bottleneck, while, in other loci, some lineages survive the bottleneck and have much older MRCA with other lineages (Depaulis *et al.* 2003). Bottlenecks with such an order of severity should be carefully examined whenever determining the false positive rate of a test for recent positive selection (Xiang-Yu *et al.* 2016).

#### Population structure:

We also examined the effects of population subdivision under the symmetric island model (Figure 3G). Population subdivision has a profound effect on tree topology (Wakeley and Aliacar 2001). It was known that when the migration rate is intermediate, a lineage may migrate from one subpopulation to another, and the migrated lineage may not coalesce with any others before the most recent common ancestor. In such a case, the tree is unbalanced. In some cases, the migrated lineage causing unbalancedness can be detected by a simple phylogenetic method with sampling of an additional individual (known as migrant-detector, MD) (Li 2011) since the MD would coalesce with the migrated lineage first before coalescing with any others. Overall, the examined population structure scenario does not affect the test while Tajima’s *D* test and Fay and Wu’s *H* test are sensitive to this confounding factor (Figure 3C). Fu and Li’s *D* test is also robust to most migration parameters.

Then, we examined the population divergence model (Figure 3H), which is similar to the finite island model, but the subpopulations have a recent common ancestral population. All the tests are insensitive to the examined population divergence event for different migration parameters (Figure 3D). Moreover, an incomplete MD-sampling problem may occur in practice, since population structure often remains unknown. We investigated this by assuming that MDs are only sampled from several randomly chosen demes (Figure S4 in File S1). The test is not completely robust to the effect of population structure, although the false positive rate of the test is generally lower than that of Fay and Wu’s *H* test (max 0.25 *vs.* 0.46). When there are no MDs available, the false positive rate of the test varies with migration rate. It ranges between 0.05 and 0.25, and reaches the maximized value when The false positive rate of the test decreases when the number of MDs increases. The false positive rate of the test remains below 0.05 when MDs are available from about one-half of the demes. The hidden population structure may also lead to an unequal sampling problem. Following the previous studies (Przeworski 2002; Li 2011), we explored this problem with a wide range of migration parameters. We found that the test performed better than Fay and Wu’s *H* test, and is robust with respect to the unequal sampling problem in the most of cases (Figure S5 in File S1).

### Power

We examined the sensitivity of the test under a selective sweep. Our simulations suggested that the power of the test can be above 60% when (or the number of sampled diploid individuals is ≥50) (Figure 4A). As expected, the power increases when sample size and/or the number of observed polymorphic sites increase (Figure 4). Generally, the power (slightly) increases when an outgroup is available. Interestingly, when recombination occurs within the locus, the power of the test remains similar to that without recombination within the locus (Figure 4 and Figure S6 in File S1). This suggests that recombination may not have an adverse effect on the power of much, because selection reduces the total length of the coalescent tree, and, hence, reduces the opportunity for recombination (Sabeti *et al.* 2002).

We also investigated the effect of the population size expansion and bottleneck models on the power of test, including old and recent demographic events. It has sufficiently high power to identify a selective sweep even when population size varied in the past (Figure S7 in File S1). We also surveyed the effect of different in detecting positive selection under the constant population size model and the bottleneck model, where is the time back to the completion of the selective substitution (in units of generations, where is the effective population size) (Figure S8 in File S1). It is expected that the power is lowered when increases. Simulations suggest that the test is able to detect positive selection when

### Robustness to model misspecification

Next, we checked the dependence of the false positive rate on background selection (Charlesworth *et al.* 1993). Similar to the previous analysis (Fu 1997; Li 2011), we considered here a partially linked gene pair that, as a pair, is subject to purifying selection, with a neutral DNA segment placed in between. Background selection typically leads to an excess of singletons, but does not affect the high frequency derived mutations (Fu 1997). As expected, the false positive rate of the test under background selection remains below the significance level of the test (Figure S9 in File S1). Among the three classical statistical tests, Fay and Wu’s *H* test is also robust to background selection.

### Application to experimental data from maize

Previous studies (Tian *et al.* 2009; Lin *et al.* 2011) identified a selective sweep in a core region 83.5–86 Mb on chromosome 10 of the maize genome in a data set containing 28 sequences (Gore *et al.* 2009). It is a subset of data that we used in this study (*n* = 84) (Chia *et al.* 2012). Following these studies we conducted an analysis on the core region in the latest maize data set using the test. Ten windows were analyzed, and the *P*-values of the test were obtained by analyzing 10^{4} simulated data (Table 1). After false discovery rate (FDR) correction for multiple testing (Benjamini and Hochberg 1995), test results in all windows were significant (FDR < 0.05). The examined consecutive windows are largely overlapped, thus, the FDR correction is rather conservative. It indicates that a putative beneficial allele may have occurred near, or within, the core region, which agrees with previous studies (Tian *et al.* 2009; Lin *et al.* 2011). We also randomly selected five 20-kb regions as controls and found no inferred unbalanced trees (Table S2 in File S1). This indicates that the significant results of the core region may not be due to hidden population structure.

## Discussion

Unlike the demography-adjusted tests of neutrality (Li and Stephan 2006; Rafajlović *et al.* 2014), the test does not need prior information on the demography of the species. The demography can be inferred (Li and Stephan 2006; Gutenkunst *et al.* 2009; Li and Durbin 2011; Liu and Fu 2015), and used as the null hypothesis when detecting recent positive selection (Li and Stephan 2006). Or an empirical background frequency spectrum can be used for identifying sites in the genome that were affected by recent positive selection (Nielsen *et al.* 2005; DeGiorgio *et al.* 2016). This is indeed an efficient means to reduce the false positive rate, due to the confounding effect of demography, and to increase the power (Pavlidis *et al.* 2008); however, these methods usually require genome-wide polymorphism data. Thus, we have developed the single-locus approach, which is robust against the most confounding effects of demography.

Employing a tree-based approach (Li 2011; Disanto *et al.* 2013; Li and Wiehe 2013; Hunter-Zinck and Clark 2015; Ronen *et al.* 2015), we propose the test for recent positive selection at a single locus. It requires DNA polymorphism data obtained from a single locus (∼1–50 kb chromosomal region) when randomly selected sequences (at least 10 diploid individuals) are available. It may not be suitable for genome-wide scans for positive selection because of the multiple-testing problem. Our analysis has demonstrated that it is robust to varying population size models, and most population structure scenarios examined. To further reduce the false positive rate of the population structure, we recommend to implement the MD approach (but it requires more sampling) (Li 2011), and to analyze a few randomly selected control loci, since population structure is detectable on a genome-wide scale.

Moreover, the varying read mapping coverage among lineages may affect the test. If some lineages have higher coverage than others, those lineages may falsely appear to be longer branches than others. Thus, we recommend further confirming the DNA polymorphism data within candidate regions, or that randomly chosen controls could be applied to examine this confounding issue.

### Tree unbalancedness

Previously, tree unbalancedness has usually been examined for species tree (Uyenoyama 1997; Purvis and Agapow 2002; Holman 2005; Ford *et al.* 2009). Here, we focus on tree unbalancedness within a species. The main forces to produce unbalanced trees are random drift (Tajima 1983), selection (Kaplan *et al.* 1989; Fay and Wu 2000; Sibert *et al.* 2002; Kim and Nielsen 2004; Blum *et al.* 2006b), and population structure (Wakeley and Aliacar 2001; Przeworski 2002). Population structure affects the whole genome, whereas the effects of selection are always restricted to individual loci located near the selected allele.

To measure tree unbalancedness, we used the maximum value of the left and right subtree sizes of the inferred UPGMA tree, denoted by Here, represents the branching pattern of the root, thus, no complete tree needs to be inferred in this study. is also different with other summary statistics of tree unbalancedness, such as Colless’ index (Colless 1982), the total cophenetic index (Mir *et al.* 2013), and Matsen’s index (Matsen 2006), mainly because contains only limited information of the tree.

### Overview of the test

Our method aims at comparing the summary statistics ( and ) calculated for the two subtrees of the unbalanced tree. It is expected that the mutation rate on the two subtrees is the same, and that and are comparable. We have shown that positive selection typically imposes a strong impact on branch length in one of the two subtrees, while population size change affects the branch lengths of both subtrees. Thus, tends to be more negative under positive selection, while under neutral scenarios. Especially, an unbalanced tree is expected to occur when a neutral locus is partially linked to a beneficial allele. Therefore, a significant *P* of the test indicates that the locus is at close genetic distance from the beneficial allele (Figure 3). Moreover, our results show that the test has high power to detect positive selection in an expanded or bottlenecked population (Figure S8 in File S1).

In this study, and are estimated based on the average tree, similar with Watterson’s It has been shown that is more precise than since the former integrates the phylogenetic information (Fu 1994). Thus, the estimates of and could be improved further in the future.

Moreover, to infer the branching pattern of the root (), the UPGMA method may a poor estimator under certain conditions. We recommend to root the tree by an outgroup whenever it is possible. However, we also note that the performance of the test depends only slightly on the availability of an outgroup in all the examined cases. This is of practical importance since an outgroup may not easily be available.

It is known that tests such as Fay and Wu’s *H*, are sensitive to misinference of derived and ancestral variants of segregating sites by using outgroup (Baudry and Depaulis 2003; Hernandez *et al.* 2007a), and such misinference is an annoying problem when searching for positive selection (Hernandez *et al.* 2007b). In contrast, the misinference of derived and ancestral alleles does not affect the test, since and remain unchanged (Figure 5).

A previous study has shown that, based on simulated data across 400 kb, the selective sweep (multiple-loci) approach has power to detect positive selection when (Li and Stephan 2005). Another independent study demonstrated that long-haplotype based methods can detect much younger positive selection than do selective sweep approaches (Sabeti *et al.* 2006). Therefore, it is a useful and reasonable time scale () that is suitable for our approach.

We applied the test to a well characterized genomic region on the maize genome. The data set we used contains different populations from domesticated improved maize lines and traditional landraces (Chia *et al.* 2012; Hufford *et al.* 2012). It was proposed that maize experienced a bottleneck (Wright *et al.* 2005; Tian *et al.* 2009). Despite the complex history and population structure of the considered sample, the test identified the selective sweep in the core region, but not in the randomly selected control regions.

To summarize, we have demonstrated that the test not only remains robust to a wide range of parameters of demography, but also has high power to detect positive selection. It possesses improvements compared to the existing single locus methods. Nicely, all the components in the summary statistics () can be computed analytically.

## Acknowledgments

We would thank Martin Lascoux for his comments. J.L., Z.Y., and H.L. were supported financially by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB13040800) and the National Natural Science Foundation of China (nos. 91531306, 91731304). T.W. acknowledges financial support by the German Research Foundation (DFG-SFB680).

## Appendix

### Statistical Properties of and other Components of

For a general tree topology, where the expectations and variances of major and minor branch lengths can be analytically computed. Let thus, there are possible binary vectors Each represents an array of coalescent events of the tree. The expectation of is as follows:Following this logic, can then be rewritten as follows:where and denotes the number of lineages (counting both the major and the minor subtrees) during the coalescent time period and the number of lineages (counting both the major and the minor subtrees) at a time right before the most recent common ancestor for the major subsample, the number of lineages in the major subtree during the coalescent time period (Figure 2). As one of components in means that there are coalescent events before the time period among which events occur in the major subtree, and indicates that there are coalescent events between the time period and the time of the most recent common ancestor for the major subsample, among which events occur in the major subtree.

Similar to the derivation of we can derive the variance of and as follows:where and where is determined by:where

Sincewe have

## Footnotes

*Communicating editor: N. Rosenberg*Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.117.300401/-/DC1.

- Received October 14, 2017.
- Accepted December 1, 2017.

- Copyright © 2018 by the Genetics Society of America