Abstract
In general when a phylogeny is reconstructed from DNA or protein sequence data, it makes use only of the probabilities of obtaining some phylogeny given a collection of data. It is also possible to determine the prior probabilities of different phylogenies. This information can be of use in analyzing the biological causes for the observed divergence of sampled taxa. Unusually “rare” topologies for a given data set may be indicative of different biological forces acting. A recursive algorithm is presented that calculates the prior probabilities of a phylogeny for different allelic samples and for different phylogenies. This method is a straightforward extension of Ewens' sample distribution. The probability of obtaining each possible sample according to Ewens' distribution is further subdivided into each of the possible phylogenetic topologies. These probabilities depend not only on the identity of the alleles and on 4Nμ (four times the effective population size times the neutral mutation rate) but also on the phylogenetic relationships among the alleles. Illustrations of the algorithm are given to demonstrate how different phylogenies are favored under different conditions.
HOMOLOGOUS genes that are not identical in sequence are called alleles. These allelic sequences are often used as markers to provide inferences about the biology of the individuals that harbor them. In 1972 Ewens described the probability of obtaining different samples of alleles when each allele is chosen randomly from a large population. This later became known as Ewens' distribution. The probability of different samples is influenced by the population size, by the mutation rates among alleles, and by the history of the population.
In its simplest form this distribution makes the following assumptions. Let N represent the effective number of diploid, randomly mating individuals in a population of constant size. The population has discrete nonoverlapping generations. Consider samples of a locus taken randomly from these individuals and assume that an infinite allele model (Kimura 1983) holds at this locus. Let μ be the number of selectively neutral mutations per generation at this locus. If evolution has proceeded long enough for an equilibrium to be reached (the probabilities at equilibrium are designated by a ∧), then the probability of sampling n_{1} alleles of one sort, n_{2} of another,... up to n_{k} alleles of another sort is
Here, for simplicity of notation, I calculate the same probability but do not multiply by the number of ways that a sample can be obtained. This probability is denoted by P followed by the vector describing the alleles sampled, measures only the probability of a particular sample, but ignores the number of different ways in which this probability could be obtained. This probability can be written as
For samples that consist of just two genes these probabilities are well known and have been shown to be
We follow standard practice and designate these genealogies using the Newick format of nested parentheses. In our case, however, we designate only the allelic genealogy for which we have evidence. This does not consider that genes that are the same allele might have diverged and differ elsewhere in their DNA. For the purposes of the allelic phylogeny, they are identical. Hence, in the case of a sample of two genes both with identical alleles, there is no structured genealogy, only the trivial genealogy that both genes stem from a common ancestor. If the two genes carry different alleles, then the allelic genealogy states only that they are different. Similarly for samples of three genes, the possible topologies are (3), (2, 1) and (1, (1, 1)). The first topology is a trivial one, just stating the genes are allelically identical. Note also, that with these samples, the samples correspond to just one genealogy [the sample (1, (1, 1)) represents just one topology and not the three possible rooted topologies because the alleles are unlabeled].
With samples of two or three genes no partitions within a single sample are required. However, with a sample of four genes there is a partitioning imposed by the genealogy. For samples of four genes (n = 4), there may be anywhere from one to four allelic types present (k = 1,..., 4). If k = 1, again, there is evidence for only one possible topology, e.g., (4). If k = 2 then the topologies are (3, 1) or (2, 2). If k = 3, however, there are two distinct topologies for the same sample—(2, (1, 1)) and (1, (2, 1)); the former indicates that the two unique alleles are more closely related, while the latter indicates a closer relationship between the allele sampled twice and one of the unique alleles. Thus, more than one allelic genealogy is possible for a single sample. When k = 4 there are again two distinct topologies possible—(1, (1, (1, 1))) and ((1, 1), (1, 1)) both from the same sample of four unique alleles. Ewens' distribution provides us the ability to determine the probability of obtaining the sample (1, 1, 1, 1) but not how it is further subdivided into these two possibilities.
A recursive algorithm to subdivide Ewens' sample probabilities into the different topologies possible and to calculate how the probability of a single arbitrary sample is distributed among these different topologies is developed below. This information can be of use in analyzing the biological causes for the observed divergence of sampled taxa. Unusually “rare” topologies for a given data set may be indicative of different biological forces acting.
RESULTS
Counts: To determine the probabilities of different samples it is necessary first to consider the number of ways that branching can occur, the number of tree topologies possible, and the number of samples possible within a tree topology.
Not all topologies are equally likely to occur. This was demonstrated by Tajima (1983) in a calculation based on the assumption that any one branch is as likely to split into two new branches as another. Given this assumption, it is possible to determine the a priori probabilities of different topologies. A recursive algorithm is used starting at the root of the tree. Beginning at the root, the initial branch point in the tree will create two branches subtending m_{1} taxa along one lineage and m_{2} taxa along another lineage. Tajima calculated the probability of obtaining a particular topology for a phylogeny as
In addition to these numbers, it is necessary to ensure that all possible topologies are considered and that all numbers of possibilities within these topologies are considered. The number of tree topologies was determined by CavalliSforza and Edwards (1967). If only the topologies are considered without regard to the order of taxa at the terminal branches, then the number of possible tree topologies is
We also require the number of ways that alleles can be sampled within each branching pattern. The number of combinations of alleles in this case is given by the appropriate multinomial divided by 2^{i}, where i is the number of nodal points in the tree that subtend symmetric branches. For example,
Recursion: The probability of each sample, with consideration for its topological structure, can be determined by evaluating all possible origins of a sample. This methodology follows from Golding (1984). It is assumed that a new population of N diploid individuals is produced each generation by a random sampling (with replacement) of the gametes of the previous generation. Generations are distinct and nonoverlapping. The probability of obtaining particular samples of gametes is used to describe the genetic structure of such a population over time. As an example, consider the probability, P(2), of obtaining a sample of two identical genes (sampled without replacement). It was shown by Malécot (1969) that a recursion relationship can be built for this probability that relates the value in generation t + 1 to the value in generation t. This is done by determining how the mating scheme, mutation, and other factors influence the probability. Along with this information, the probability in generation t is sufficient to describe enough of the structure of the population in generation t + 1 to calculate the new probability. Hence, designating probabilities in the next generation with a prime,
For larger samples of n genes, the multinomial term is ½n(n – 1)1/2N (again ignoring small terms). This is a description of the fraction of the number of ways to sample n genes from a population that involve two genes having a common ancestor from among the 2N gametes in the entire population. For the sample (2, 1), it is possible that genes 1 and 2, 1 and 3, or 2 and 3 were samples of the same gene from the previous generation chosen twice (probability 3/2N). But only one of these events (1 and 2) would contribute to the probability of the sample (2, 1) in the next generation and only if the sample otherwise had the pattern (1, 1). Thus,
It is quite possible to extend these results to add allelic phylogenetic structure to the samples. To illustrate this, consider a sample of five genes with allelic topology (3, (1, 1)). These five genes might have their origins as duplicates of four or fewer preexisting alleles in the sample, they might be new mutations, or they might be neither. These possibilities reflect the actions of random drift, mutation, or of neither event occurring.
For the sample (3, (1, 1)) only the three identical alleles can be duplicates (via random drift) of other alleles in the sample from the previous generation as the remaining two alleles are distinct. Because the other alleles are distinct, they cannot be an identical copy of another allele in the sample from the previous generation. The probability that all three alleles are copies of a single allele in the sample is (1/2N)^{2} [corresponding sample probability P(1, (1, 1))], the probability that two of the three are copies of a single allele in the sample is approximately ½(3 × 2) 1/2N [corresponding sample probability P(2, (1, 1))], and the probability that two of the five are copies of a single allele in the sample is approximately ½(5 × 4) 1/2N (Malécot 1969). Again, many other samples might arise but they do not necessarily contribute to the probability of this sample.
The probability that either of the two unique alleles arose via mutation from the previous generation is 2μ(1 – μ) [arising from the sample (3, 2) to yield (3, (1, 1))]. The probability that both of the unique alleles arose via mutation is μ^{2}.
Again, for simplicity all terms with second order probabilities are ignored. These events and their probabilities do not significantly alter the first order terms (Golding 1984). Therefore, probabilities multiplied by terms such as μ^{2} and (1/2N)^{2} can be ignored and the above terms can be simplified to 3/2NP(2, (1, 1)) and 2μP(3, 2).
The probability that none of these events occurred and that the sample has originated as an identical sample from the previous generation is ∼(1 – 10/2N – 3μ)(Malécot 1969) and hence the complete recursion equation for P(3, (1, 1)) is
These equations can then be solved to show that at equilibrium
Slightly more complicated examples of these recursion equations are
The general format of these equations is
At equilibrium the general solution is given by the recursive formula
These equations cannot be generally solved in closed form. As an example of their intractability, consider an example with just six alleles sampled. The equilibrium solution is
As θ → ∞ these probabilities are
The relationship of these samples as θ → ∞ can be used as a rapid estimate of the probability for simulation and likelihood methods. These results show, however, that while Tajima's number is quite adequate for samples where all genes are unique (i.e., k = n), it is not when some alleles are sampled more than once (k < n). In the latter case, the differences in combinatorics again become prevalent.
As θ → 0 the probabilities are also simplified and more easily calculated. For example, in the case of any genealogy with a sample of three alleles, the probabilities are
DISCUSSION
The above methodology provides a rapid way to calculate prior probabilities of different allelic samples including differences in their genealogical relationships. This does not consider the effects of different branch lengths within a genealogy. Since the branch lengths are continuous, there are an infinite number of possible genealogies all within the same sample structure and topology. All of these possibilities are summed within the probabilities as given here and only the probability of the overall topological structure is considered. This methodology is intimately connected with the coalescent process. Individual coalescent paths are summed within the probabilities presented here to yield only those coalescents that are distinguishable via allelic differences.
One of the biggest influences on the relative probabilities of different allelic samples is simply the combinatorics of the number of ways to obtain a particular sample. The sample (1, (3, 1)) is twice as likely as (3, (1, 1)) due simply to this combinatorial difference. There is still an influence of mutation rate on the relative likelihoods of these samples; the frequency of sample (3, (1, 1)) is higher if θ is larger, but this effect is comparatively small.
The genealogical relationships among samples of (2, 2, 1, 1), (2, 1, 1, 1, 1), and (1, 1, 1, 1, 1, 1) are shown in Figures 1, 2, and 3, respectively. These are plotted from the exact numerical solution of the equilibrium recursion equations. The figures all have n = 6 and illustrate several features of these probabilities. First, the dependence of the probabilities on the value of θ can be almost nonexistent or, alternatively, can become very noticeable for the same overall allelic sample. In Figure 1, (2, (1, (2, 1))) changes from 0.2614 to 0.2234% of the total sample (2, 2, 1, 1) as θ changes from 0.1 to 100. But for (2, (2, (1, 1))) the relative probability changes from 0.1163 to 0.1116%. The same is true in Figure 2 for sample (1, ((2, 1), (1, 1))) (0.1137–0.1332%) vs. sample (2, ((1, 1), (1, 1))) (0.0330–0.0333%).
Some of the numerical relationships among genealogies are initially surprising. In Figure 2, the probabilities of samples ((1, (2, 1)), (1, 1)) and ((1, (1, 1), (2, 1))) are numerically nearly identical. For these samples, moving the “2” to the inner or outer cluster of individuals does not matter to the probability. However, moving the 2 for any of the other samples can make a large difference. The samples (1, (1, (1, (2, 1)))) and (1, (1, (2, (1, 1)))) have the same overall branching pattern (third from the top and second from the bottom, respectively). The 2 has simply been moved inside or outside of the innermost cluster of samples. But the former sample is almost three times more likely.
In Figure 3, the samples ((1, 1), ((1, 1), (1, 1))) and (1, (1, ((1, 1), (1, 1)))) (second from the bottom and bottom, respectively) have the sample allelic sample with the same topological structure. They differ only in the placement of the root from which these samples originated and yet this difference is sufficient to cause the former sample to be twice as likely as the latter.
Differences among genealogical structures are at the heart of many modern genetic methods. This work demonstrates that these genealogies can have very different probabilities even in the absence of external forces. This makes a knowledge of their prior probabilities a necessary precondition to the interpretation of genealogies. External forces such as migration will greatly alter these probabilities (results not shown) but without a knowledge of their values in the absence of migration this change cannot be utilized.
Changes to the probabilities of different samples can also be done via the effects of natural selection. Selection studies have shown that unlabeled genealogies do not change in length or shape as selection is present (Golding 1996). This unusual feature was later confirmed by Neuhauser and Krone (1997) and Przeworski et al. (1999). However, this feature is also present only when the genealogy is unlabeled. Figure 4 shows changes in the frequency of the four genealogies that involve three distinct selectively neutral alleles and one selectively deleterious allele. These genealogies are (a, (A, (A, A))), ((A, a), (A, A)), (A, (a, (A, A))), and (A, (A, (a, A))) (top to bottom, respectively, on the left of the graph), where a is a selectively deleterious allele. All three samples with the same topology are now present since the alleles have been effectively “labeled” by the distinction of their selective states.
Note that when selection is weak, the three selectively neutral alleles are more likely to cluster together and form a high frequency class for this sample but when selection is strong this is the least likely topology for the same sample. This appears to be a reflection of the phenomena that selectively deleterious alleles are less likely to leave selectively neutral descendants than are neutral alleles to leave deleterious descendants (Goldinget al. 1986). This graph is based on 10,000 samples of four alleles from a Monte Carlo simulation. The difference in the relative orders of these genealogies occurs only when the count of these samples is very small. Hence the utility of this difference to detect selection is probably limited.
Acknowledgments
The author thanks Dr. R. A. Morton for his helpful comments on this work. This work was supported by a Natural Sciences and Engineering Research Council of Canada grant to G.B.G.
Footnotes

Communicating editor: Y.X. Fu
 Received April 25, 2001.
 Accepted March 4, 2002.
 Copyright © 2002 by the Genetics Society of America