## Abstract

The multispecies coalescent (MSC) model has emerged as a powerful framework for inferring species phylogenies while accounting for ancestral polymorphism and gene tree-species tree conflict. A number of methods have been developed in the past few years to estimate the species tree under the MSC. The full likelihood methods (including maximum likelihood and Bayesian inference) average over the unknown gene trees and accommodate their uncertainties properly but involve intensive computation. The approximate or summary coalescent methods are computationally fast and are applicable to genomic datasets with thousands of loci, but do not make an efficient use of information in the multilocus data. Most of them take the two-step approach of reconstructing the gene trees for multiple loci by phylogenetic methods and then treating the estimated gene trees as observed data, without accounting for their uncertainties appropriately. In this article we review the statistical nature of the species tree estimation problem under the MSC, and explore the conceptual issues and challenges of species tree estimation by focusing mainly on simple cases of three or four closely related species. We use mathematical analysis and computer simulation to demonstrate that large differences in statistical performance may exist between the two classes of methods. We illustrate that several counterintuitive behaviors may occur with the summary methods but they are due to inefficient use of information in the data by summary methods and vanish when the data are analyzed using full-likelihood methods. These include (i) unidentifiability of parameters in the model, (ii) inconsistency in the so-called anomaly zone, (iii) singularity on the likelihood surface, and (iv) deterioration of performance upon addition of more data. We discuss the challenges and strategies of species tree inference for distantly related species when the molecular clock is violated, and highlight the need for improving the computational efficiency and model realism of the likelihood methods as well as the statistical efficiency of the summary methods.

- anomaly zone
- BPP
- concatenation
- gene trees
- incomplete lineage sorting
- maximum likelihood
- multispecies coalescent
- species trees

In comparisons of genomic sequences from multiple species, it is often observed that different genes or genomic regions may produce conflicting phylogenetic trees. A number of biological processes may cause such gene tree-species tree discordance, including (i) gene duplications and losses combined with misidentification of orthologs, (ii) hybridization, introgression or horizontal gene transfer across species boundaries, and (iii) incomplete lineage sorting (ILS) due to polymorphism in ancestral species (Maddison 1997; Nichols 2001; Degnan and Rosenberg 2009; Szollosi *et al.* 2014). There is increasing empirical evidence that gene flow (introgression or hybridization) occurs commonly between sister or even non-sister species, especially during radiative speciations (Turelli *et al.* 2014; Mallet *et al.* 2016), and indeed many studies have highlighted gene flow and ILS as two major challenges to inference of shallow species phylogenies (Fontaine *et al.* 2015; Pease *et al.* 2016). Currently full likelihood methods that deal with both gene flow and ILS are lacking (Dalquen *et al.* 2016). In this article, we focus on ILS only. ILS occurs when the coalescent process in ancestral species causes the gene tree to differ from the species tree. This is important whenever species divergences are close in time (as occurs in radiative speciations) and the population sizes of the ancestral species are large. The significance of ILS to species tree inference was highlighted by the characterization of the anomaly zone, regions of the parameter space (species tree with associated parameters) with short internal branches and large population sizes, in which the majority-vote approach of using the most common gene tree as an estimate of the species tree is statistically inconsistent (Degnan and Salter 2005; Degnan and Rosenberg 2006). Intuitively one might think that coalescent is a population genetics process and irrelevant to species tree estimation. However, the issue lies with the *length* rather than the *depth* of the internal branches on the species phylogeny (Edwards *et al.* 2005). For example, the species tree of Figure 1A is hard to reconstruct due to the short internal branches (this will be discussed in detail later), but the task is not made any easier if the tip branches (*A*, *B*, *C*, and *D*) are extended by 300 MY of evolution.

The multispecies coalescent (MSC) model provides a natural framework for estimating the species tree in presence of ILS, and indeed the use of the coalescent model in species tree inference has been described as a paradigm shift in molecular phylogenetics (Edwards 2009). Nearly all coalescent-based species tree estimation methods were developed within the last 10 years. Two classes of methods have been developed side by side: the full likelihood methods (including maximum likelihood, ML, and Bayesian inference, BI) and approximate or summary methods. Full likelihood methods involve averaging over the gene trees and are computationally intensive. The ML method integrates over the coalescent times numerically so that the computation is possible for three species only but tens of thousands of loci can be handled (Yang 2002; Dalquen *et al.* 2016). Bayesian methods such as best (Liu and Pearl 2007; Edwards *et al.* 2007; Liu 2008), *beast (Heled and Drummond 2010), and bpp (Yang and Rannala 2014; Rannala and Yang 2016) use Markov chain Monte Carlo algorithms to average over gene trees (topologies and branch lengths) and parameters. They can deal with more species but are currently impractical for large datasets with >1000 loci. Furthermore, current Bayesian implementations do not deal with the violation of the molecular clock adequately and do not appear to work very well for deep species phylogenies (Ogilvie *et al.* 2016). At the same time, over a dozen approximate methods have been proposed for estimating the species tree despite gene tree conflicts. Most of them take a two-step approach of estimating the gene trees at the individual loci using phylogenetic methods and then treating the estimated gene trees as observed data. These are also called summary methods, as gene trees may be viewed as summary statistics derived from the original sequence data. However traditional summary statistics are features or observations of the data, while gene trees are estimates and may differ from the true unobserved gene trees. Approximate or summary methods are fast and applicable to genomic datasets with hundreds or thousands of loci and have been more commonly used than likelihood methods. For recent reviews on species tree methods, the reader may consult Yang (2014, Chap. 9), Liu *et al.* (2015), Edwards *et al.* (2016), and Mallo and Pasada (2016).

In this article we review the statistical nature and conceptual issues of species tree estimation under the MSC. We focus mainly on simple cases involving three or four species, because they are more likely to be tractable and because most summary methods are based on insights from small species trees (*e.g.*, triplet and quartet trees). We assume that the species are closely related so that the molecular clock approximately holds, but will discuss the additional challenges for inferring deep species trees. We demonstrate that counterintuitive behaviors of species tree methods discussed in the literature, such as the existence of the anomaly zone, the singularity of the likelihood surface, and worse performance with more data, apply to the two-step summary methods, but not to the full-likelihood methods. We first study the case of three species and three sequences per locus, which is the simplest species tree problem. Then we discuss the anomaly zone, regions of the parameter space within which the most frequent gene tree differs from the species tree, so that the simple majority-vote method of species tree estimation is inconsistent. Lastly we use simulation to confirm that more data can indeed cause summary methods such as mp-est (Liu *et al.* 2010a) to perform worse, but show that this is due to inefficient use of information in the data by summary methods and does not occur with the full likelihood methods (ML and BI).

## The Multispecies Coalescent

The coalescent describes the stochastic process of lineage joining when one traces the genealogical history of a sample of sequences from a population backward in time until their most recent common ancestor (MRCA) (Kingman 1982a,b; Hudson 1983; Tajima 1983). The theory provides a powerful framework for inference of population genetic processes using genetic sequence data (Hudson 1990; Hein *et al.* 2005; Nordborg 2007; Wakeley 2009). The coalescent waiting time for two randomly sampled sequences from a diploid species with population size *N* has an exponential distribution with the mean of 2*N* generations, or equivalently coalescent between two sequences occurs at the Poisson rate of per generation. In analysis of sequence data, it is convenient to measure time by the mutational distance, so that one time unit is defined as the amount of time taken to accumulate one mutation per site in the sequence. Then coalescent between two sequences occurs at the rate of per time unit, and the average coalescent waiting time is where *θ* = 4*Nμ* is the expected number of mutations per site between two randomly drawn sequences and *μ* is the mutation rate per site per generation. For example, *θ*_{H} ≈ 0.0006 for the human means that two human genomes have on average ∼0.6 differences per kilobase.

For a sample of *n* sequences, the genealogical history is described by a succession of coalescent lineage-joining events. With *j* sequences in the sample, each of the pairs coalesces at the rate so that the total rate is and the waiting time until the next coalescent event (which reduces the number of lineages from *j* to *j*−1) is an exponential random variable with mean This process generates *H* = possible genealogical trees, depending on the order of pairs of sequences to coalesce, and each of them has the same probability (). Here the genealogical tree is a rooted tree with the internal nodes ordered by time, called a *labeled history* (LH) by Edwards (1970). Figure 2 shows all the 18 LHs for four sequences. The joint probability distribution of the genealogical tree (*G*) and the coalescent times, ** t** = {

*t*}, is thusThe multispecies coalescent (MSC) is a simple extension of the single-population coalescent to multiple species (Figure 3). The different species are related by a phylogeny (species tree), and they may have different population sizes. There are two sets of parameters in the MSC model: the species divergence times (

_{j}*τ*s) and the population size parameters (

*θ*s) on the species tree, with both

*τ*s and

*θ*s measured by the number of mutations or substitutions per site. For example, for the species tree of Figure 3A, Θ = {

*τ*,

_{AB}*τ*,

_{ABC}*θ*,

_{AB}*θ*}. Within each species (either modern or ancestral), sequences coalesce at random, at the rate for each pair in population

_{ABC}*i*, as in the standard coalescent, independently of other populations. However at the time of species divergence, two or more lineages may leave the population and enter the ancestral population. Different aspects of the MSC model have been discussed by a number of authors, including Gillespie and Langley (1979), Hudson (1983), Pamilo and Nei (1988), Felsenstein (1988), and Takahata (1989). The probability density of the gene tree and branch lengths (coalescent times) for an arbitrary species tree and arbitrary sampling of sequences at a locus is given by Rannala and Yang (2003). An important feature of the process is that

*the sequence divergence time is greater than the species divergence time*, or more intuitively, that

*gene trees must “fit inside” the species tree*.

## Full Likelihood Methods of Species Tree Estimation Under the MSC

Full likelihood methods of species tree estimation under the MSC follow standard statistical theory. Let *S* be the species tree topology, and Θ be the parameters in the MSC model on the species tree (*τ*s and *θ*s). *S* is a statistical model while Θ are its parameters. The data consist of sequence alignments at *L* loci. The ideal loci for this kind of analysis are loosely linked short genomic segments that are far apart from each other so that recombination within a locus is rare while different loci are nearly independent (*e.g.*, Takahata 1986; Burgess and Yang 2008; Lohse *et al.* 2011). For distantly related species, a locus may correspond to a gene or exon. The MSC model assumes that the gene trees at different loci are independent while all sites at the same locus share the same history (gene tree topology and coalescent times). The assumption of shared history at the same locus is commonly considered to be equivalent to the assumption of no recombination throughout the gene tree. For distantly related species, a locus without recombination throughout the gene tree (called a coalescent-gene or c-gene by Springer and Gatesy 2016) must be very short. For example, for a eutherian mammal dataset, Springer and Gatesy (2016) used empirical estimates of primate recombination rates to calculate the *c*-gene size to be ∼12 bp. However, this calculation is unnecessarily stringent. All sites at the locus will have the same gene tree topology and coalescent times so that the MSC density of Rannala and Yang (2003) will be valid as long as there is no recombination during the parts of the gene tree where coalescent events occur (Lanier and Knowles 2012; Edwards *et al.* 2016). See Figure 4 for an illustration of the assumption using the human and chimpanzee as an example.

Let *X _{i}* be the alignment of

*n*sequences at locus

_{i}*i*, with

*X*= {

*X*}. The sequences are assumed to be neutrally evolving. Let

_{i}*G*be the gene tree (or more precisely, the labeled histories, Edwards 1970) and coalescent times (

_{i}

*t**) at the locus. The gene trees (*

_{i}*G*and

_{i}

*t**) are unobserved random variables (latent variables) with well-specified distributions given the species tree and parameters. The log likelihood function for estimating*

_{i}*S*and Θ averages over the gene trees(2)where

*f*(

*G*,

_{i}

*t**|*

_{i}*S*, Θ) is the MSC density for the gene tree and coalescent times at locus

*i*(Rannala and Yang 2003) and

*f*(

*X*|

_{i}*G*,

_{i}

*t**) is the probability of the sequence data given the gene tree, known as the phylogenetic likelihood (Felsenstein 1981). If the mutation/substitution model assumed in the likelihood calculation involves parameters (such as the transition/transversion rate ratio*

_{i}*κ*), they should be included in Θ as well; in this paper, we use the simple JC mutation model (Jukes and Cantor 1969). The ML method estimates

*S*and Θ by maximizing ℓ(

*S*, Θ). The difficulty with ML lies in the sum over all possible gene trees and the integral over the coalescent times, because the number of possible gene trees is huge and the integral over

*t**for each gene tree is (*

_{i}*n*−1)-dimensional. The only ML implementation appears to be that of Yang (2002), which is limited to only three species and three sequences per locus (one sequence from each species).

_{i}With the Bayesian approach, we assign a prior on the species trees, *f*(*S*), and a prior on the parameters, *f*(Θ | *S*). The posterior of *S* and Θ averages over the gene tree topologies (*G _{i}*) and coalescent times (

*t**):(3)where*

_{i}*C*is the normalizing constant, to ensure that the posterior integrates to 1. Equation 3 is not practical because both the normalizing constant

*C*and the marginal likelihood

*f*(

*X*|

_{i}*S*, Θ) involve huge sums and high-dimensional integrals. Instead, Bayesian programs such as best (Liu and Pearl 2007), *beast (Heled and Drummond 2010) and bpp (Yang and Rannala 2014; Rannala and Yang 2016) use Markov chain Monte Carlo (MCMC) algorithms to generate a sample from the joint posterior distribution of species tree (

*S*and Θ) and gene trees (

*G*and

_{i}

*t**for each locus*

_{i}*i*):(4)Then by ignoring the gene trees {

*G*,

_{i}

*t**} in the MCMC sample, we obtain the marginal posterior of the species tree:*

_{i}*f*(

*S*|

*X*) as well as the within-tree parameter posterior

*f*(Θ |

*X*,

*S*). While the Bayesian method is commonly described as a joint estimation of both the species tree and the gene trees, it should be noted that the maximum

*a posteriori*(MAP) species tree, which is the Bayesian point estimate, maximizes the marginal posterior

*f*(

*S*|

*X*), instead of the joint posterior

*f*(

*S*, Θ, {

*G*,

_{i}

*t**} |*

_{i}*X*) or

*f*(

*S*, Θ |

*X*). Recall that given a table of counts representing a joint distribution,

*f*(

*X*,

*Y*), with rows for

*X*and columns for

*Y*, the “marginal” distribution for

*X*is generated by summing the counts along each row and writing the sum in the right margin, with the columns ignored. In effect, the summation and integration over the gene trees of Equation 3 is achieved numerically in the MCMC algorithm.

The statistical properties of the ML and BI methods when the number of loci (*L*) and/or the number of sites per locus (*n*) increases have yet to be carefully investigated. Under the MSC model, the data (sequence alignments) at the different loci are independently and identically distributed. Statistical consistency of the ML and Bayesian estimates of the species tree when *L* increases then follows automatically as long as the MSC model is identifiable (*e.g.*, Dawid 2011). In other words, when *L* → ∞, the estimate will converge in probability to the true species tree and true parameter values. Identifiability under commonly used substitution models has been discussed by Steel (2013) and Chifman and Kubatko (2015). The efficiency of the methods, while very important, appears intractable analytically.

## Approximate Methods of Species Tree Estimation

The MSC makes simple predictions about different aspects of the gene trees. If the gene trees are known or estimated, those properties can be used to devise methods for species tree estimation. Almost all approximate methods take this two-step approach and treat gene trees inferred using phylogenetic methods as observed data. We review some of them here to illustrate the strategies taken, but do not attempt an exhaustive list.

### Methods that use gene tree topologies

A number of authors have studied the probabilities of gene tree topologies under the MSC. The case of three species (*A*, *B*, and *C*), with the species tree ((*AB*)*C*), and of three sequences per locus (*a*, *b*, and *c*) was considered by Hudson (1983), who derived the mismatch probability that the species tree (*S*) and the gene tree (*G*) differ to be the probability that sequences *a* and *b* do not coalesce in the ancestral species *AB* (Figure 3)(5)Note that for a Poisson process with rate *λ* (or with exponential waiting time of mean 1/*λ*), the probability of no event over time interval *t* is e^{–}* ^{λt}*. Here the exponent, (

*T*−

_{ABC}*T*)/(2

_{AB}*N*), is known as the internal branch length of the species tree in coalescent time units, since a coalescent time unit is 2

_{AB}*N*generations. Equation 5 was used to estimate the ancestral population size of the human and chimpanzee in the so-called “trichotomy” or “tree-mismatch” method (Takahata

*et al.*1995; Chen and Li 2001; Yang 2002). Equation 5 also gives the most common gene tree as the estimate of the species tree topology. This underlies the rooted triples (Ewing

*et al.*2008) and mp-est (Liu

*et al.*2010a) methods of species tree estimation.

Gene tree probabilities for four or more species have been studied by Pamilo and Nei (1988). Degnan and Salter (2005) and Degnan and Rosenberg (2006) designed general algorithms that apply to arbitrary numbers of species. Given a collection of gene trees, the likelihood function — the probability of the gene tree topologies given the species tree *S* and internal branch lengths in coalescent units (** τ**’) — will simply be the product of the probabilities for the gene tree topologies at the multiple loci.(6)where the probability of gene tree

*G*at locus

_{i}*i*can be calculated using the algorithms of Degnan and Salter (2005) and Degnan and Rosenberg (2006) (see also Wu 2012). This likelihood method treating gene tree topologies as data are implemented in the stells program (Wu 2012) and is a special case of the network method of Wen

*et al.*(2016), with gene flow disallowed. The multinomial likelihood leads to a statistically consistent method for species tree estimation, as the distribution of rooted gene tree topologies determines the rooted species tree topology and internal branch lengths (in coalescent units) (Allman

*et al.*2011).

The method of minimizing deep coalescence (mdc) (Maddison 1997) is based on gene tree topologies but does not use all information in them. Maddison (1997) defines *deep coalescence* as the phenomenon that two or more lineages pass through an ancestral species when one traces the gene genealogy backward in time. If ancestral populations are very small, all lineages should coalesce as soon as they enter an ancestral species, and deep coalescence does not occur. Then the gene trees will track the species tree faithfully. A parsimony-like argument suggests that given a collection of gene trees, the species tree that minimizes the number of deep coalescence events (Figure 5) is likely to be the true species tree (Maddison 1997; Than and Nakhleh 2009). Than and Rosenberg (2011) demonstrate that anomaly zones or inconsistency zones exist for mdc for asymmetric species trees for four species, and for all species trees with five or more species.

Many gene tree-based summary methods make use of properties for small gene trees with three or four sequences and then assemble the triplet or quartet trees into a big tree, which is the estimate of the species tree for all species. The rooted triples method (Ewing *et al.* 2008) and the mp-est method (Liu *et al.* 2010a) use different strategies to assemble triplet trees. In the rooted triples method, rooted triplet trees are extracted from all gene trees and their consensus tree is constructed as an estimate of the species tree for all species. In the mp-est method, the probability of the three alternative gene trees given a triplet species tree can be calculated from the trinomial distribution (see Equation 5 and our discussion below on the three-species case). For each species tree (for all species), a likelihood is then calculated by multiplying the trinomial probabilities across all species triplets induced by the specie tree. As the species triplets are not independent, this likelihood function is not the correct one and the method is thus a pseudo-likelihood approach. Both the rooted triples and the mp-est methods are statistically consistent, as the probability distribution of rooted triplet gene trees determines the species tree topology and internal branch lengths (in coalescent units) (Allman *et al.* 2011). The program astral (Mirarab *et al.* 2014; Mirarab and Warnow 2015) assembles quartet trees. It takes a collection of unrooted gene trees, collects all their quartets into a set, which may have many identical quartet trees, and then judges each species tree by how well its quartet trees match those in the set. Thus astral produces the maximum quartet support species tree (MQSST) (Mirarab *et al.* 2014).

Some gene tree-based summary methods use gene tree topologies to define a distance between species, such that the distance tracks the species tree under the MSC. Liu *et al.* (2009) used the rank of the interior nodes on the gene trees as a measure of distance between species, and developed the method of species tree estimation using average ranks of coalescences (star) (Figure 6A). This is based on the observation that the expected ranks of the coalescences among sequences tracks the order of species divergences in the species tree. The method is consistent when the correct gene trees are given as data. For unrooted gene trees, Liu and Yu (2011) defined the distance between two species as the number of internodes between the two species on the gene tree, and use the average gene-tree internode distance to construct the NJ tree (Saitou and Nei 1987) (Figure 6B). This is the NJ_{st} method of species tree estimation. Liu and Yu (2011) show that for any four species, the expected gene-tree internode distance satisfies the four-point condition, and thus NJ_{st} is statistically consistent in estimating the unrooted species tree.

In theory if the data are the gene tree topologies, the ML method (Equation 6) uses all information in the data about the species tree, and summary methods such as mdc, star, astral, and NJ_{st} should suffer from an information loss and reduced efficiency. Not much effort has been made to quantify the information loss.

### Methods that use gene tree branch lengths

A class of summary methods use branch lengths (or coalescent times) in gene trees, perhaps in addition to the gene tree topologies. A major feature of the MSC model is that *genes split before species* or *gene trees run inside the species tree*: the sequence divergence between any two species must be greater than the species divergence, or *t _{ab}* >

*τ*for any two species

_{AB}*A*and

*B*and any two sequences

*a*and

*b*from the two species. As a result, the expected coalescent time between species tracks the species phylogeny. If species

*A*and

*C*diverged earlier than species

*A*and

*B*, with

*τ*<

_{AB}*τ*(Figure 3), the expected sequence divergences will track that relationship:

_{ABC}*E*(

*t*) <

_{ab}*E*(

*t*). Indeed, if all species have the same population size (

_{ac}*θ*), the expected sequence divergences are simply

*E*(

*t*) =

_{ab}*τ*+ and

_{AB}*E*(

*t*) =

_{ac}*τ*+ so that

_{ABC}*E*(

*t*) <

_{ab}*E*(

*t*) follows from

_{ac}*τ*<

_{AB}*τ*. If we define the distance between two species as twice the average coalescent time (node age) between the species, the resulting distance matrix can be used to construct a NJ tree, which will be a consistent estimate of the species tree. This is species tree estimation using average coalescence times (STEAC) of Liu

_{ABC}*et al.*(2009).

Takahata (1989) considered the case of three species with multiple sequences sampled per species, and argued that the *minimum* sequence divergence between species (over all sequence pairs and over all loci) may be more informative about the species tree than the *average* sequence divergence between species. Maddison and Knowles (2006) proposed to cluster species by the shallowest (minimum) coalescences occurring between two species. The method is formalized as the ML method of species tree estimation when the data are rooted gene trees with branch lengths (node ages). This is maximum tree (MT) method of Liu *et al.* (2010b), implemented in the stem program (Kubatko *et al.* 2009), or the glass algorithm (for Global LAteSt Split) (Mossel and Roch 2010). Under the assumption of equal population size, MT, stem, and glass are equivalent. It is remarkable that the ML solution is analytically tractable in this case, and here we use the case of two species to illustrate the result.

In the case of two species with one sequence from each species at every locus, the MSC model involves two parameters: Θ = {*τ _{AB}*,

*θ*}. Suppose the sequence divergence times at

_{AB}*L*loci are given:

*X*= {

*x*},

_{i}*i*= 1, …,

*L*. The likelihood is then the product of the exponential densities:(7)The MLEs are easily found to be = min(

*x*) and where is the mean sequence divergence across loci. The MLE of the species divergence is the minimum sequence divergence.

_{i}This holds true in general with an arbitrary number of species. Given a collection of rooted gene trees with node ages, the ML species tree under the assumption of equal population size is the one in which the species divergence between any two species is equal to the minimum sequence divergence across all gene trees for the two species (Liu *et al.* 2010b). The ML species tree or the maximum tree achieves the maximum species divergences allowed by the gene trees. However, if the different species are allowed to have independent population sizes (*θ*s), the ML method is noted to encounter singularities on the likelihood surface (Liu *et al.* 2010a). This will be discussed later.

While in theory the MT method uses information in both the branch lengths and the gene tree topologies, it did not perform particularly well in simulations (Liu *et al.* 2009; Leaché and Rannala 2011), apparently because of its sensitivity to estimation errors in branch lengths. Estimation of branch lengths or node ages in the ultrametric rooted gene tree may be affected by the mutation model, by rate variation among loci and among sites of the same locus, and by violation of the molecular clock. Thus methods that use branch lengths may not necessarily perform better than topology-only methods. Similarly, Liu *et al.* (2009) found that star (which ignores branch lengths) consistently outperforms steac (which uses branch lengths) when the molecular clock is seriously violated. Nevertheless, for closely related species, the clock should be adequate, and the impact of the mutation model should be minor. In this case, the information gain from the use of branch lengths relative to the sensitivity to sampling errors in branch length estimation may be an interesting topic for further research, as are possible strategies for accommodating sampling errors in branch length estimation.

The two-step summary methods suffer from a few weaknesses. First, they use only part of the data such as the gene tree topology, resulting in information loss. Second, they ignore the phylogenetic errors in gene tree reconstruction due to the finite number of sites at each locus (called the mutation variance by Huang and Knowles 2009). For closely related species, the sequences may be highly similar, with very limited phylogenetic information so that gene tree topologies may be unresolved or highly uncertain. Simulations using both the true and estimated gene trees often suggest large differences, highlighting the importance of phylogenetic reconstruction errors (Mirarab and Warnow 2015). However, accounting for uncertainties in the gene trees using multilocus bootstrap did not lead to consistent improvement to species tree estimation and sometimes made things worse (Mirarab *et al.* 2014). This appears to be due to the fact that the bootstrap proportions are not appropriate weights for alternative gene trees at the same locus. Even the Bayesian posterior probabilities may not be the appropriate weights, if they are not calculated under the MSC model with a species tree underlying all gene trees. From Equation 2, the gene tree density *f*(*G _{i}*,

*t**|*

_{i}*S*, Θ) depends on the species tree and parameters, and the gene trees should be correlated among loci.

### The invariance method SVDquartets

SVDquartets (for Singular Value Decomposition for quartets) is a quartet-based method recently developed by Chifman and Kubatko (2014), and uses a quartet assembly algorithm to generate a species tree estimate. However, it differs from the two-step summary methods in that it generates quartet trees from the 256 observed site pattern counts, not from estimated gene trees. SVDquartets assumes that different sites in the sequence data have independent histories given the species tree, and is thus similar to the snapp method for single nucleotide polymorphism data (SNPs) (Bryant *et al.* 2012).

Let the observed site pattern counts be *o _{k}*, with

*k*= 1, …, 256. The standard statistical approach to analysis of such data is to use a goodness-of-fit criterion such as ℓ = or

*X*

^{2}= to estimate the parameters for each species tree, where the expected site pattern probability

*p*is calculated as a function of the species tree and parameters by summing over the 18 labeled histories and integrating over three coalescent times in each (Figure 2). In other words,

_{k}*p*is

_{k}*f*(

*X*|

_{i}*S*, Θ) of Equation 2 with

*X*being an alignment of only one site. This calculation can be achieved analytically under the JC model (Chifman and Kubatko 2014). Maximization of ℓ to obtain the MLE of Θ or minimization of

_{i}*X*

^{2}to obtain the minimum

*X*

^{2}estimate of Θ also leads to the optimized ℓ or

*X*

^{2}as a score for species tree estimation.

Chifman and Kubatko (2014) did not use such optimization, and instead relied on phylogenetic invariance to generate the quartet tree. The expected site pattern probabilities (*p _{k}*), when arranged into a 16 × 16 matrix according to the true species tree, has rank ≤10, while the rank is >10 if the matrix is arranged according to an incorrect species tree. Note that the rank of a matrix is equal to the number of nonzero singular values or eigenvalues, and that a nonsingular 16 × 16 matrix has rank 16, but linear relationships among rows or columns reduce its rank. In other words, the site pattern probabilities generated by a species tree should satisfy a number of linear relationships, depending on the assumed mutation or substitution model. The criterion used by the method examines whether the 11th–16th eigenvalues are close to 0. The SVDquartets method for generating the quartet tree is thus similar to the evolutionary parsimony method by Lake (1987), which is also an invariance-based method for quartet data. Evolutionary parsimony makes use of only a few site patterns and is known to be inefficient and sensitive to the details of the substitution model (Jin and Nei 1990). In contrast, SVDquartets uses all 256 site patterns. Its statistical performance has yet to be carefully evaluated. In one simulation study, SVDquartets did not show the expected advantage over competing methods for very short alignments (Chou

*et al.*2015).

## The Case of Three Species

In the case of three species (*A*, *B*, and *C*) and three sequences per locus (*a*, *b*, and *c*) the mismatch probability of Equation 5 can be used to estimate the species tree and the parameters using a collection of rooted gene trees. Let (*x*_{1}, *x*_{2}, *x*_{3}) be the numbers of gene trees having topologies *G*_{1}, *G*_{2}, and *G*_{3}. Their probabilities given the species tree are (*p*_{1}, *p*_{2}, *p*_{3}) = (1 – *P _{SG}*, }, with

*p*

_{1}>

*p*

_{2}=

*p*

_{3}, and with

*G*

_{1}to be the gene tree that matches the species tree. The likelihood is then the trinomial probability Maximizing this likelihood simply gives the most common gene tree as the estimate of the species tree, with the internal branch length (in coalescent units) given by Equation 5. While the MSC model involves at least four parameters (

*τ*,

_{AB}*τ*,

_{ABC}*θ*,

_{AB}*θ*) (Figure 3), use of the gene tree topologies identifies only one. With the gene tree topologies given as data, this appears to be the only sensible solution, and it is the solution by the rooted triples method of Ewing

_{ABC}*et al.*(2008) and the mp-est method of Liu

*et al.*(2010a) in this case.

When the gene trees are unknown and reconstructed using phylogenetic methods, two issues arise. First, if two or more gene trees are equally best at a locus, that locus is discarded. Second, phylogenetic errors of gene tree reconstruction inflate the mismatch probability, so that the probability that the species tree differs from the estimated gene tree is > *P _{SG}* (Yang 2002). For example, for the human-chimpanzee-gorilla trio, ≈ 0.4 for loci of 500 bp, while

*P*≈ 0.3 (Burgess and Yang 2008; Scally

_{SG}*et al.*2012). The inflated species tree-gene tree mismatch erodes our chance of inferring the true species tree, and furthermore, the internal branch length in the species tree (in coalescent units) will be inconsistently estimated. We thus have a highly unusual estimation problem, in which the species tree is identifiable and consistently estimated, the internal branch length is identifiable but inconsistently estimated, while the other three parameters of the MSC model are unidentifiable and inestimable.

Given that the species tree (*S*) is consistently estimated by the method, it is of interest to know how fast the estimation error approaches 0 when the amount of data (the number of loci) increases. Liu *et al.* (2010a) provided a proof that the error drops to 0 at least at the rate of 1/*L*. This result may be strengthened. For the case of three species, the estimated species tree is correct if and only if *x*_{1} > *x*_{2} and *x*_{1} > *x*_{3}, where *x*_{1}, *x*_{2}, *x*_{3} are the counts of gene trees *G*_{1}, *G*_{2}, and *G*_{3}, with *G*_{1} being the gene tree that matches the species tree. Let (*p*_{1}, *p*_{2}, *p*_{3}) be the probabilities for those gene trees, with *p*_{2} = *p*_{3}. We also consider *x*_{0} as the number of loci with ties and *p*_{0} the probability of ties. From the fact that (*x*_{0}, *x*_{1}, *x*_{2}, *x*_{3}) have a multinomial distribution, the species tree estimation error may be approximated by(8)where Φ(⋅) is the cumulative distribution function (CDF) for the standard normal distribution *N*(0, 1) (Yang 1996). The approximation assumes *L* ≫ 1/*p*_{2}. For very large *L*, Equation 8 is approximately which approaches 0 much faster than 1/*L*. For example, with parameters *τ _{ABC}* = 0.06,

*τ*= 0.05,

_{AB}*θ*=

_{ABC}*θ*= 0.02 for the species tree of Figure 3A, and with

_{AB}*n*= 1000 sites per locus, we have, through simulation,

*p*

_{0}= 0.0197 for the proportion of loci with ties,

*p*

_{1}= 0.6915 for the matching gene tree, and

*p*

_{2}=

*p*

_{3}= 0.1444 for the mismatching gene trees. Then the error probability

*P*

_{error}is 0.0296, 0.0023, and 0.000034 for 10, 20, and 40 loci, respectively, according to simulation, while Equation 8 gives 0.0398, 0.0030, and 0.000020. The discrepancy appears quite large, relative to

*P*

_{error}itself. Better approximations will be desirable. Note also that for those parameter values, Equation 5 gives the species tree-gene tree mismatch probability as

*P*= = 0.24525, while with 1000 sites per locus, the mismatch probability for the estimated gene tree is = 0.3010 by simulation (with ties for gene trees at the locus broken evenly). Phylogenetic errors cause considerable inflation of the mismatch probability.

_{SG}## The Anomaly Zone

Degnan and Salter (2005) and Degnan and Rosenberg (2006) derived the probabilities for gene tree topologies under the MSC model, and highlighted the fact that the most probable rooted gene tree topology may not match the species tree when the species tree has short internal branches and large ancestral populations. Thus the simple majority-vote method of using the most commonly observed gene tree as the estimate of the species tree will be statistically inconsistent and will converge in probability to a wrong species tree when the number of loci increases. The species tree and parameters that lead to such *anomalous gene trees* are said to be in the *anomaly zone* (Degnan and Rosenberg 2006). In other words, the anomaly zone is the inconsistency zone for the majority-vote method. Computer simulation (Kubatko and Degnan 2007) and mathematical analysis (Roch and Steel 2015) suggest that concatenation may similarly be inconsistent in certain regions of the parameter space; concatenation has its own anomaly or inconsistency zone.

The case of four species is simple and illuminating (Figure 1). Given the asymmetrical species tree *S*_{1} = (((*AB*)*C*)*D*), the probabilities of the gene trees *G*_{1} = (((*ab*)*c*)*d*) and *G*_{2} = ((*ab*)(*cd*)) for four sequences (*a*, *b*, *c*, *d*) under the MSC model are(9)where *x* = and *y* = are the lengths of the deeper and shallower internal branches (in coalescent units) in the species tree (Rosenberg 2002; Degnan and Rosenberg 2006). Now if *x* → 0 and *y* → 0, we have *P*(*G*_{1}) → 1/18 and *P*(*G*_{2}) → 2/18. When the internal branches on the species tree disappear and the species tree becomes a star tree, all three coalescent events for the four sequences will occur in the ancestral species *ABCD*, and the process is simply the single-population coalescent (Figure 2). As the asymmetrical gene tree *G*_{1} is compatible with only one labeled history while the symmetrical gene tree *G*_{2} is compatible with two labeled histories (*G*_{2}* _{a}* and

*G*

_{2}

*), we have*

_{b}*P*(

*G*

_{1}) = 1/18 and

*P*(

*G*

_{2}) = 2/18. Now if

*x*and

*y*are nonzero but very small,

*P*(

*G*

_{1}) <

*P*(

*G*

_{2}) may still hold even if

*P*(

*G*

_{1}) >

*P*(

*G*

_{2})/2. Degnan and Salter (2005) and Degnan and Rosenberg (2006) designed algorithms for determining such boundaries in the parameter space (values of

*x*and

*y*). When

*P*(

*G*

_{1}) <

*P*(

*G*

_{2}), the majority-vote estimate of the species tree will be inconsistent and will converge to the mismatching topology of

*G*

_{2}.

Thus the coalescent process favors symmetrical trees, and the effect is greater for larger trees. The number of compatible labeled histories is 1 for a completely asymmetrical rooted tree. For any other rooted tree, this number is given as the product of over the internal nodes on the rooted tree, where *x* and *y* are the numbers of descendent internal nodes on the left and right part of the internal node, respectively (Yang and Rannala 2014). For example, for the rooted tree *G*_{2} of Figure 1, *x* = *y* = 1 for the root node and *x* = *y* = 0 for the two other internal nodes, so that the number of compatible labeled histories is = 2, with the convention = 1. An alternative method to calculate this number, suggested by one of the referees, is to divide (*n*−1)! by the product of *n*(*k*)−1 over all the interior nodes *k* of the tree, where *n* is the number of tips on the tree and *n*(*k*) is the number of tips below node *k*. If there are eight species (and sequences) and if the true species tree is completely asymmetrical with the six internal branch lengths nearly 0, the completely asymmetrical matching gene tree has only about 1/80 the probability of the completely symmetrical mismatching gene tree (Table 1). It may thus be easier to find anomaly with larger species trees, or, as Rosenberg and Tao (2008) put it, the anomaly zone seems to expand when the number of species increases. There is no anomaly zone for the case of three species. For four species, the anomaly zone exists for the asymmetrical species tree only. For five or more species, it exists for any species tree (Degnan and Rosenberg 2006).

Figure 7 shows the behavior of species tree estimation in the anomaly zone for the case of four species. Gene trees and sequence alignments at multiple loci are simulated using MCcoal, which is part of the bpp package (Yang 2015), using the species tree, (((*A*, *B*), *C*), *D*), with *τ _{AB}* = 0.01,

*τ*= 0.011, and

_{ABC}*τ*= 0.012, and with

_{ABCD}*θ*= 0.05 for all species. Those parameter values give

*P*(

*G*

_{1}) = 0.0740 and

*P*(

*G*

_{2}) = 0.1191 (Equation 9), so the species tree is in the anomaly zone. The JC model (Jukes and Cantor 1969) is used for both simulation and analysis. Each replicate dataset consists of

*L*loci, of

*n*= 1000 sites, and the number of replicates is

*R*= 100. We consider three methods: the majority vote, concatenation, and BI (bpp) (Yang and Rannala 2014). With the majority vote, baseml (which is part of the paml package, Yang 2007) is used to construct the rooted gene tree at each locus under the molecular clock, and the most common gene tree is taken as the estimate of the species tree. Loci with ties (that is, with two or more binary rooted trees to be equally best) are discarded. With concatenation, the concatenated sequences are analyzed as one supergene using baseml under the clock and the resulting rooted gene tree is taken as the estimate. Both majority-vote and concatenation are inconsistent so that when the number of loci (

*L*) increases, the probability of recovering the true species tree approaches 0 (Figure 7). Both methods converge to the incorrect symmetrical species tree, ((

*A*,

*B*), (

*C*,

*D*)).

In the bpp analysis (Analysis A01 in Yang 2015), the uniform prior is assigned on the rooted trees (Prior 1 in Yang and Rannala 2014), while gamma priors with shape parameter 2 and with the means to be the true values are assigned to the parameters: *τ _{ABCD}* ∼

*G*(2, 2/0.012) for the age of the root and

*θ*∼

*G*(2, 40). The shape parameter of 2 means the gamma priors are fairly diffuse. As the Bayesian method of model selection is consistent, the probability of recovering the true species tree approaches 1 when the number of loci increases.

The issue of anomalous gene trees has been discussed extensively. Degnan and Rosenberg (2006) wrote that “the use of multiple genomic regions for species tree inference is subject to a surprising new difficulty, the problem of “anomalous gene trees.”” We emphasize that the anomaly zone is not an intrinsic difficulty of the estimation problem. It exists because of the heuristic nature of the majority-vote method, and vanishes if the data are analyzed using full likelihood methods. If the data consist of a collection of gene tree topologies, the likelihood function is given in Equation 6. As discussed earlier, this ML method is consistent. Intuitively if the true species tree is asymmetrical and if the two internal branches are very short, one expects the symmetrical mismatching gene tree *G*_{2} to be nearly twice as frequent as the asymmetrical matching gene tree *G*_{1} (Figure 1). Thus the slightly higher proportion of *G*_{2} than *G*_{1} may be seen to be compatible with an asymmetrical species tree (with short internal branch lengths) and may not be evidence against it.

The analysis of Degnan and Salter (2005); Degnan and Rosenberg (2006) assumes that the gene trees are known without error. In real data analysis, gene trees reconstructed using phylogenetic methods may involve substantial errors and uncertainties. Huang and Knowles (2009) evaluated the impact of phylogenetic reconstruction errors on the anomaly zone and observed an expansion of the anomaly zone due to phylogenetic errors: there is a larger space of species tree parameters within which anomalous gene trees are observed when the gene trees are estimated than when they are given. Nevertheless, Huang and Knowles (2009) suggest that in the anomaly zone, phylogenetic analysis of typical datasets tends to return unresolved gene trees rather than resolved incorrect trees, so that the difficulty posed by anomalous gene trees to practical phylogenetic analysis may be limited. At any rate, ML and BI methods for analyzing real datasets are based on the sequence likelihood (Equations 2 and 3), so that reliable inference of the species tree is possible even if the estimated gene tree at every locus is unreliable and involves substantial uncertainties.

## Can More Data Make Things Worse?

Liu *et al.* (2015, Figure 4) reported a simulation study showing that adding “weak” genes (short genes with few sites) to a set of “strong” genes (long genes with many sites) might cause mp-est to perform worse. Here we confirm this counterintuitive behavior by conducting two simulation experiments, using species trees of five and three species, respectively. We find that the effect is subtle. We then discuss two analogous simple examples to explain the result, offering an argument that it does not occur with full likelihood methods.

Our first simulation largely follows Liu *et al.* (2015). The species tree is ((((*A*: 0.002, *B*: 0.002): 0.002, (*C*: 0.002, *D*: 0.002): 0.002): 0.002, *E*: 0.006): 0.01, *F*: 0.016), with *θ* = 0.008 for all populations (Figure 8A). Species *F* is used as the outgroup. There are two kinds of loci: strong genes with 1000 bp and weak genes with 100 bp. Gene trees and sequence alignments were generated using MCcoal (Yang 2015). The mutation model used in data generation and analysis is JC (Jukes and Cantor 1969). For the mp-est analysis, PhyML (Guindon and Gascuel 2003) was used to infer the unrooted ML tree for each locus, with the sequence from species *F* used to root the tree. The rooted gene trees for the five ingroup species are then processed using mp-est (Liu *et al.* 2010a) to infer the species tree. We consider the use of strong genes only, weak genes only, and a mixture of 20 strong genes followed by addition of weak genes (Figure 8). mp-est performed much worse in the simulation of Liu *et al.* (2015, Figure 4) than here, as those authors used bootstrap gene trees while we used ML gene trees (L. Liu, personal communication). At any rate, Figure 8 shows a similar pattern to that in Liu *et al.* (2015, Figure 4): adding weak genes to the dataset of 20 strong genes led to deteriorated performance by mp-est. The effect, however, is much weaker than in Liu *et al.* (2015). As the method is consistent even with weak genes only, the probability of correct species tree will eventually approach 1 when the number of loci *L* → ∞.

We also included bpp for comparison. As bpp assumes the clock and works with rooted trees, we removed the sequence from species F at every locus. We assigned a uniform prior on rooted trees (Yang and Rannala 2014), and gamma priors on parameters in the MSC: *τ _{ABCDE}* ∼ G(2, 2/0.006) with mean 0.006 for the root age, and

*θ*∼ G(2, 250) with mean 0.008 for

*θ*. bpp performed much better than mp-est, with large differences when the data include only weak genes or only a few strong genes. bpp did not show a deterioration of performance upon adding weak genes, but this is hard to discern as the probability of recovering the true species tree is already 98% with 20 strong genes alone.

Our second simulation uses the species tree ((*AB*)*C*) for three species (Figure 3). Note that for three species, many summary methods such as mp-est (Liu *et al.* 2010a), mdc (Maddison 1997), and star (Liu *et al.* 2009) are equivalent, so that our results for mp-est apply to the other equivalent methods as well. We used *τ _{AB}* = 0.05,

*τ*= 0.06, and

_{ABC}*θ*=

_{AB}*θ*= 0.02. We again simulated strong genes (1000 sites) and weak genes (50 sites). Gene tree reconstruction under the clock and the JC model for three sequences is tractable analytically (Yang 2000), so that we can analyze a huge number of replicates for the mp-est method without the need to use the mp-est program. Each alignment of three sequences (

_{ABC}*a*,

*b*,

*c*) can be summarized as counts (

*n*

_{0},

*n*

_{1},

*n*

_{2},

*n*

_{3},

*n*

_{4}) of five site patterns:

*xxx*,

*xxy*,

*yxx*,

*xyx*, and

*xyz*, where

*x*,

*y*, and

*z*are any distinct nucleotides. We take the ML tree as ((

*ab*)

*c*) if and only if

*n*

_{1}> max(

*n*

_{2},

*n*

_{3}). If two trees are equally best, we discard the locus. The most common gene tree topology is then the estimated species tree. Ties for species trees are broken evenly: if two species trees have the same maximum number of matching gene trees, each is given 50%, and if all three species trees are equally good, each is given 1/3. The performance deterioration with mp-est upon addition of weak genes is real but the effect is so small that a huge number of replicates are necessary to demonstrate its presence. With five strong genes, the probability of recovering the correct species tree by mp-est is 0.8883, and this probability drops to 0.8867–0.8875 when 1–3 weak genes are added to the dataset, and rises back to 0.8883 when four weak genes are included. With 10 strong genes, the probability is 0.9705, and it drops to 0.9690–0.9699 when 1–7 weak genes are added to the dataset. The effect exists but is too small to have any biological significance.

For comparison, we included ML (3s) (Yang 2002) and BI (bpp). Note that 3s optimizes the parameters (Θ) while bpp averages over them, so that the two methods are not equivalent even for this simple case of three species. For bpp, we used the uniform prior on the three species trees, and gamma priors on parameters: *τ*_{0} = G(2, 2/0.06) for the root age, and *θ* = G(2, 100) for *θ*s. The two methods had very similar performance (Figure 9). As expected neither 3s nor bpp shows the counterintuitive behavior. When only weak genes or a few strong genes are analyzed, 3s and bpp recovered the true species tree with higher probabilities than mp-est.

We note that the performance differences between the summary method (mp-est) and the full-likelihood methods (3s and bpp) are smaller for the three-species case (Figure 9) than for the five-species case (Figure 8). This may be expected from the simplicity of the three-species case. Indeed as far as the point estimate of the species tree is concerned, mp-est, ML (3s) and BI (bpp) are expected to be equivalent if the data consist of only one locus. Let *p*_{0} be the probability that a locus shows a tie (with two or more gene trees being equally best), *p*_{1} be the probability of the matching gene tree, and *p*_{2} and *p*_{3} be the probabilities of the two mismatching gene trees (Figure 3). For the case of Figure 9A, these are *p*_{0} = 0.1989, *p*_{1} = 0.4013, *p*_{2} = *p*_{3} = 0.2000 for the weak genes (50 bp) and *p*_{0} = 0.0197, *p*_{1} = 0.6915, *p*_{2} = *p*_{3} = 0.01444 for the strong genes (1000 bp). The probability of recovering the correct species tree for one locus is *p*_{1} + *p*_{0}/3. The three methods are not the same when two or more loci are analyzed. For mp-est, the probability of recovering the species tree with two loci can be calculated, by considering the different outcomes of gene tree reconstruction at the two loci, as + 2*p*_{0}*p*_{1} + + = If the outcome is 11 (both loci producing the matching gene tree *G*_{1}), or 01 or 10, the correct species tree is recovered. If the outcome is 12, 21, 13, or 31, the species tree is half correct, and so on. If *p*_{0} ≈ 0 (as in the case of strong genes), mp-est recovers the correct species tree with nearly the same probability (≈ *p*_{1}) for one locus and two loci (Figure 9A). This is not the case for 3s and bpp.

To gain insights into the counterintuitive behavior of mp-est in the cases of three species (Figure 9A), we study two small analogous problems. The first is a binomial example of coin toss in which a coin is tossed *n* = 100 times to estimate the probability of heads, *p*. If the observation is *x* heads, the estimate has variance Now we add a second experiment, with *x*′ heads out of *n*′ = 10 tosses. The second experiment alone would give us the estimate with variance A simple method of combining the two experiments is to take the average which has the variance and effective sample size = 36.4, which is even smaller than *n* = 100. Thus adding data (from the second experiment) made things worse, because we ignore the fact that the two experiments have very different precisions. A better method is to use the likelihood function to combine the two experiments:(10)The MLE is then with an effective sample size of *n* + *n*′, as expected. The MLE is a weighted average of and with the precision (the reciprocal of the variance) used as weights.

The second analogous example is similar to the mp-est estimation of the species tree, but with two instead of three possible trees to consider. The gene tree that matches the species tree will be called a “good” gene tree, while the mismatching gene tree a “bad” one. Suppose a strong gene and a weak gene produce a good gene tree with probability *p* and *p*′, respectively, with *p* > *p*′ > With *n* strong genes only, the number of good gene trees *x* ∼ Bin(*n*, *p*), so that the correct species tree is recovered with probability(11)if *n* is large. Now consider adding *n*′ weak genes, which will produce *x*′ good gene trees, with *x*′ ∼ Bin(*n*′, *p*′). A simple method for combining the data recovers the correct species tree if and only if more than half of the gene trees are good. This occurs with probability(12)if both *n* and *n*′ are large. Note that *x* and *x*′ are independent binomial variables, so that *x* + *x*′ has mean *np* + *n*′*p*′ and variance *np*(1−*np*) + *n*′*p*′ (1−*p*′).

Is it possible for *P*_{dumb}(*n*, *n*′) < *P*(*n*) for certain values of *n* and *n*′? The answer is Yes. Consider for example *p* = 0.6 for strong genes and *p*′ = 0.525 for weak genes. Thus *P*(50) = 0.9255 with *n* = 50 strong genes only, and *P*_{dumb}(*n*, *n*′) < *P*(*n*) if 1 ≤ *n*′ < 432 (Figure 10). Adding weak genes made things worse, and performance did not recover to the level of *n* = 50 strong genes until 432 weak genes were added. The simple method (Equation 12) does not account for the fact that the gene trees from the weak genes involve larger sampling errors than those from the strong genes.

Liu *et al.* (2015) suggest that “adding weak genes may actually reduce the performance of species tree estimation methods, negating the old adage that “more data is always better.”” The authors further suggest that “[a]n important rule of thumb that has emerged from both simulation and empirical studies is that species trees are only as good as the gene trees on which they are built. This maxim applies both to two-step species tree methods, in which gene trees are used as input data, and to single-step approaches, such as Bayesian methods, in which gene and species trees are estimated simultaneously.” Those statements need to be qualified. Our analysis above suggests that the counterintuitive result of performance deterioration upon addition of weak genes may be more easily explained by the inefficient use of information in the data by the summary methods. For full likelihood methods, the likelihood calculation on the sequence alignment should automatically accommodate the fact that a strong gene with more sites is more informative about the gene tree than a weak gene with fewer sites. The counterintuitive behavior will then not occur. We expect similar patterns if the different loci have different mutation rates and different information contents, if the rate variation among loci is accommodated appropriately in the likelihood model. Thus species trees can be good even if all gene trees are bad. In the extreme, adding weak genes which have only two sequences (and thus no phylogenetic information) should also help, by providing information on parameters in the model.

The performance deterioration for mp-est upon adding the weak genes is so small that the effect does not have any biological impact. Thus the conclusions to be drawn from this analysis are largely conceptual, in that one should not be surprised at counterintuitive behaviors when heuristic methods are used. Note that large performance differences are possible between likelihood methods and summary methods when the data are not very informative (Figure 8).

## Singularity on the Likelihood Surface

As discussed earlier, given that the data are a collection of rooted gene trees with node ages (branch lengths on ultrametric gene trees), the ML estimate of the species tree under the assumption of equal population size for all species is given by the maximum tree algorithm (Liu *et al.* 2010b; see also Mossel and Roch 2010). For the more general problem with different species having independent population size parameters (*θ*s), however, Liu *et al.* (2010a) pointed out that the likelihood may become infinite for certain species tree and parameter values. The likelihood function in this case is the MSC density of gene trees and coalescent times given the species tree, or *f*(*G _{i}*,

*t**|*

_{i}*S*, Θ) in Equation 2 (Rannala and Yang 2003). The observation motivated the development of the summary method mp-est, which ignores branch lengths and uses gene tree topologies only. Here we illustrate how such singularity can occur, and point out that it is not a problem if the sequence data (rather than the estimated gene trees) are analyzed using full likelihood methods. We also discuss ways of avoiding the singularity in the summary methods (Yang 2014, p.338–9).

Suppose that the data consist of one “observed” gene tree, which is *G*_{1}* _{a}* of Figure 3, with branch lengths (node ages)

*t*

_{0}and

*t*

_{1}, and consider the calculation of the likelihood for the species tree

*S*of Figure 3a, with parameters Θ = {

*θ*,

_{AB}*θ*,

_{ABC}*τ*,

_{AB}*τ*}. The likelihood is given as the product of the two exponential densities for the two coalescent events (Rannala and Yang 2003):Note that this is a function of

_{ABC}*θ*,

_{AB}*θ*,

_{ABC}*τ*, and

_{AB}*τ*, with

_{ABC}*t*

_{0}and

*t*

_{1}fixed. Now let

*τ*→

_{AB}*t*

_{1},

*τ*→

_{ABC}*t*

_{1}, and let

*θ*=

_{AB}*c*(

*t*

_{1}–

*τ*) → 0, with

_{AB}*c*> 0 to be a constant. Then the first term = → ∞, and the likelihood function becomes infinite. In other words, when we shrink the internal branch length (

*τ*–

_{ABC}*τ*) in the species tree to zero, collapse the two internal nodes onto the coalescent event, and increase the coalescent rate (2/

_{AB}*θ*) to infinity, the likelihood becomes infinite, with the parameter estimates =

_{AB}*t*

_{1}and = 0 (with indefinite). There are thus rays of singularity on the likelihood surface. In this case, there is no singularity if both coalescent events are assumed to occur in the common ancestor

*ABC*. In other words, the species tree compatible with the given gene tree has infinite likelihood while the mismatching species trees have finite likelihood. In the general case of more than three species and sequences, multiple species trees can have infinite likelihood (Liu

*et al.*2010a). Singularity can occur for an arbitrary species tree and multiple gene trees as long as one can collapse an internal branch on the species tree onto a coalescent event on a gene tree.

Here we point out that such singularity does not occur when full likelihood methods (ML or BI) are applied to sequence data, with the likelihood calculated using the sequence alignments (Equations 2 and 3). Approaches using sequences also have the advantage of accommodating uncertainties in the gene trees and branch lengths. If the gene trees with branch lengths are treated as observed data, singularity disappears if one uses the Bayesian method, as the prior shrinks the parameters away from extreme values (such as *θ _{AB}* = 0, and

*τ*=

_{AB}*τ*), and if one uses the ML method under the assumption that all populations have the same size (same

_{ABC}*θ*) (Liu

*et al.*2010a), as in the maximum tree method or the stem program (Kubatko

*et al.*2009).

## Perspectives and Practical Data Analysis

We have discussed a suite of counterintuitive behaviors of summary methods for species tree estimation, such as unidentifiability, inconsistency (anomaly zone), singularity on the likelihood surface, and deteriorated performance upon inclusion of more data. Those behaviors are due to the information loss when the summary methods use estimated gene tree topologies only, ignoring information in the branch lengths, and due to their failure to account for the uncertainties in the estimated gene trees. They do not occur when the sequence data are analyzed using full likelihood methods. Our analyses demonstrate that large performance differences may exist between full likelihood methods and summary methods even if both are based on the MSC.

While other factors may also cause gene tree-species tree conflicts (Maddison 1997; Nichols 2001; Szollosi *et al.* 2014), we have taken an idealized viewpoint assuming that the MSC is the true data-generating model. A number of challenges exist with current implementations of the Bayesian inference methods. A major problem is the intensive computation involved and the inefficient mixing of the transmodel MCMC algorithms used by Bayesian programs, although improvements are being made (*e.g.*, Rannala and Yang 2016).

We suggest that the utility of summary *vs.* full-likelihood methods will depend on the nature of the species tree estimation problem. For easy problems with long internal branches in the species tree and little incomplete lineage sorting, different methods are likely to produce the same results, and simple methods such as concatenation may have even higher statistical efficiency than coalescent-based full likelihood methods. For shallow species phylogenies, characterized by recent divergences and short internal branches (as occurs in radiative speciation), full likelihood methods may have a big advantage over summary methods or simple methods such as concatenation. Genomic datasets are being generated from a variety of species, such as mosquitos (Fontaine *et al.* 2015), butterflies (Martin *et al.* 2013), hares (Melo-Ferreira *et al.* 2012), bears (Liu *et al.* 2014), and gibbons (Carbone *et al.* 2014). As the species are closely related, the molecular clock holds approximately and can be used to root the tree and to provide information about the node ages (coalescent times). The high sequence similarity and low phylogenetic information content at each locus may be problematic to summary methods that are sensitive to phylogenetic reconstruction errors but should be ideal for full-likelihood methods (Ogilvie *et al.* 2016; Rannala and Yang 2016).

Deep species phylogenies characterized by ancient rapid divergences pose the greatest challenge. Two strategies are possible to account for the violation of the molecular clock in deep phylogenies: (i) the use of relaxed-clock models in Bayesian analysis to root the tree and to extract information about coalescent times in gene-tree branch lengths, and (ii) the use of outgroups to root the tree, ignoring information in branch lengths (or coalescent times) in gene trees, as in astral (Mirarab *et al.* 2014; Mirarab and Warnow 2015) and NJ_{st} (Liu and Yu 2011). In theory relaxed-clock models allow the inference of the root and the node ages in both the species tree and the gene trees. However, several difficulties may arise. First, current relaxed-clock models of rate drift, especially when applied to data of multiple genetic loci, may be highly unrealistic (dos Reis *et al.* 2016). Relaxed-clock models developed for dating species divergences (Thorne *et al.* 1998; Drummond *et al.* 2006; Rannala and Yang 2007) allow the substitution rate to drift over time or among branches, but they do not appear appropriate for analysis under the MSC. A simple modification may be to assign rates at a locus to branches of the species tree (rather than to branches of the gene tree), so that gene-tree branches residing in the same species share the same rate. A gene tree branch may be broken into several segments with different rates, and the branch length or the total amount of evolution along the branch is calculated by summing over the segments. More worrying are the impacts of among-loci heterogeneity in the substitution rate and in the process of substitution rate drift among branches (Zhu *et al.* 2015). Current models assume that rates drift over time independently among loci. This assumption is unrealistic but allows rates at different loci to be treated as independent realizations of the same process and be teased apart from the species divergence times. Allowing for rate correlation among loci or lineage effects may render the model unidentifiable, since both the lineage rates and the species divergence times have genome-wide effects and are thus seriously confounded. Relaxed-clock rooting appears to be less reliable than outgroup rooting when the clock is seriously violated. It is an open question how Bayesian relaxed-clock models compare with summary methods that use unrooted gene trees with outgroup rooting in inferring deep phylogenies.

## Acknowledgments

We thank Liang Liu for extensive discussions, especially concerning mp-est. We are grateful to Scott Edwards, Laura Kubatko, Liang Liu, Mike Steel, and three anonymous referees for many constructive comments on earlier versions of the manuscript.

## Footnotes

*Communicating editor: M. Turelli*

- Received April 6, 2016.
- Accepted September 25, 2016.

- Copyright © 2016 by the Genetics Society of America