- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Kuhner, M. K.
- Articles by Felsenstein, J.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Kuhner, M. K.
- Articles by Felsenstein, J.
Usefulness of Single Nucleotide Polymorphism Data for Estimating Population Parameters
Mary K. Kuhnera, Peter Beerlia, Jon Yamatoa, and Joseph Felsensteinaa Department of Genetics, University of Washington, Seattle, Washington 98195-7360
Corresponding author: Mary K. Kuhner, Department of Genetics, University of Washington, Box 357360, Seattle, WA 98195-7360., mkkuhner{at}genetics.washington.edu (E-mail)
Communicating editor: G. A. CHURCHILL
| ABSTRACT |
|---|
Single nucleotide polymorphism (SNP) data can be used for parameter estimation via maximum likelihood methods as long as the way in which the SNPs were determined is known, so that an appropriate likelihood formula can be constructed. We present such likelihoods for several sampling methods. As a test of these approaches, we consider use of SNPs to estimate the parameter
= 4Neµ (the scaled product of effective population size and per-site mutation rate), which is related to the branch lengths of the reconstructed genealogy. With infinite amounts of data, ML models using SNP data are expected to produce consistent estimates of
. With finite amounts of data the estimates are accurate when
is high, but tend to be biased upward when
is low. If recombination is present and not allowed for in the analysis, the results are additionally biased upward, but this effect can be removed by incorporating recombination into the analysis. SNPs defined as sites that are polymorphic in the actual sample under consideration (sample SNPs) are somewhat more accurate for estimation of
than SNPs defined by their polymorphism in a panel chosen from the same population (panel SNPs). Misrepresenting panel SNPs as sample SNPs leads to large errors in the maximum likelihood estimate of
. Researchers collecting SNPs should collect and preserve information about the method of ascertainment so that the data can be accurately analyzed.
MODERN population genetics methods require large samples of population level genetic information to answer questions about population size, migration, selection, and other factors. Many researchers have recently begun collecting single nucleotide polymorphism (SNP) data in the hope of addressing these questions, as well as for their applications in gene mapping (for an overview, see ![]()
Appropriate analysis of SNPs depends on knowing how they were collected. Critical pieces of information include the following:
- Were candidate sites chosen from completely linked, partially linked, or unlinked regions of the genome?
- Were sites defined as SNPs on the basis of their polymorphism in the sample at hand, a panel drawn from the sample population, or a panel drawn from a different population?
As one possible measure of the usefulness of SNP data, we consider the estimation of the parameter
= 4Neµ, four times the product of the effective population size and the neutral mutation rate. We look at the simple case of a single random-mating diploid population of constant effective population size Ne. At each site, selectively neutral mutations occur with probability µ per generation.
This parameter is estimated using the Metropolis-Hastings method of ![]()
is found specifically in the branch lengths of the sampled trees: therefore, estimation of
indirectly tests the ability of a SNP likelihood model to accurately estimate branch lengths.
We test the effect of both unacknowledged and acknowledged recombination on the accuracy of these estimates, since SNP data often come from a context where recombination is possible.
The use of SNP data to estimate parameters not closely related to branch length, such as tree topology and map order, will be considered in a future article. It seems likely that SNP data will be more powerful for such parameters than they are for branch lengths; however, the pitfalls found here will probably be found in those areas as well.
| METHODS |
|---|
Likelihood framework for SNPs:
The estimation of phylogenies by maximum likelihood from DNA sequences (![]()
To analyze SNPs using this approach, several aspects of the way the SNPs were collected must be taken into consideration: this is analogous to considering ascertainment of individuals in a case/control disease study.
In the first section, we consider the question of whether the SNPs were gathered from a single completely linked region, from various unlinked locations, or from a linked region with some internal recombination. In the following section, we consider the question of whether the SNPs were defined on the basis of polymorphism in the sample or in a panel. These sections define a toolkit from which SNP corrections for many specific cases can be constructed. Table 1 and Table 2 give an overview of the issues and the appropriate means of analysis for each set of conditions.
|
|
Linkage considerations:
At one extreme, candidate sites could be drawn from a single, nonrecombining stretch of DNA and evaluated to find SNPs. In this case, all of the SNPs, as well as the unobserved sites around them, would have the same underlying coalescent genealogy. We call this case "fully linked SNPs." At the other extreme, candidate sites could be chosen at random from a recombining genome in such a way that successive draws represent completely independent coalescent genealogies ("unlinked SNPs"). Between these extremes, successive candidates represent correlated, but not always identical, coalescent genealogies: this is the case for long sequences taken from a genome with recombination, and we call it "partially linked SNPs."
An additional consideration is whether any information at all is available on the number of sites not selected as SNPs. Some methods of data collectionfor example, using anonymous probes to detect SNPswill not give us any information on how many "unobserved" sites were in the region of interest. Other methodsfor example, choosing SNPs from a region of known lengthwill allow us to determine this. As is shown, if this information is available it allows us to use alternative methods of constructing the likelihood, and in the case of recombination it allows analysis of otherwise intractable data.
We first consider the case where information is available only on the SNPs themselves. Here, we modify the usual DNA likelihood model by conditioning on the site being a SNP (by whatever criteria are in use), an event with probability P(SNP|G). The difference between the analysis of linked and unlinked SNPs lies in the choice of genealogies to consider. This approach is referred to as the "conditional likelihood" method.
The conditional probability of observing fully linked SNPs depends on the distribution of SNP and non-SNP sites across a single shared genealogy, since in the absence of recombination all sites must derive from the same genealogy. This probability can then be computed as the sum, over all possible genealogies, of the probability of observing a particular site configuration (necessarily a SNP) divided by the probability that a given site is a SNP. The product is over all SNP sites:
![]() |
(1) |
(The derivation of this equation is given in more detail in the Appendix)
The conditional probability of observing unlinked SNPs, in contrast, depends on the distribution of SNP and non-SNP sites across all possible genealogies. The observed SNPs will tend to come from underlying genealogies that are longer than average (shorter genealogies are less likely to have undergone a mutation yielding a SNP). It would be incorrect to consider them only in the context of their longer-than-average genealogies; to correctly assess
, we must consider the context of the entire distribution of genealogies. Thus, two summations over all genealogies are needed, one assessing probability of the observed site pattern given that the site is a SNP, and one assessing the overall probability of obtaining a SNP:
![]() |
(2) |
(The derivation is given in the Appendix.)
The direct calculation of these likelihoods is not possible because of the summations over all genealogies, but they can be approximated by importance sampling methods such as the Metropolis-Hastings sampler (cf. ![]()
![]()
![]()
The case of partial linkage, where successive sites have correlated genealogies, is more difficult. It may in fact be impossible to solve without information about the number and location of unobserved sites. The difficulty is that the unobserved sites are drawn from a distribution that is correlated with, but not identical to, the distribution of the observed SNPs. Without knowing this distribution, we cannot construct an appropriate correction.
A more complete analysis can be made if we have additional information: the number u of sites not selected as SNPs ("unobserved sites") and the probability
that a potential SNP site will actually be detected as such. If
is less than 1, the unobserved sites will be a mixture of SNPs that were missed during sampling and non-SNPs. For example, if the region is broken into fragments of equal length and 10% of the fragments are exhaustively searched for SNPs while the rest are ignored,
will equal 0.1. In contrast, if SNPs are found by exhaustive sequencing of the entire region,
will equal 1.
We call this the "reconstituted DNA" method because it essentially tries to reconstruct the original DNA sequence. For fully linked data, the reconstituted DNA method can be expressed as follows (where the subscript s refers to the observed SNPs, and u to the unobserved sites),
![]() |
(3) |
which can be reduced to
![]() |
(4) |
For unlinked SNPs,
![]() |
(5) |
For partially linked SNPs, the summation becomes more complex. In the presence of recombination, each site has an individual site tree, but successive sites may have distinct, although correlated, site trees. The data set as a whole is represented by a recombinant genealogy, which is a collection of these site trees. Let Gi stand for the site tree (contained in the full recombinant genealogy G) for site i. The sum over G for partially linked sequences is over all possible recombinant genealogies, that is, all possible combinations of one or more (up to the number of sites) site trees Gi. The term
o is 1 if the site is observed and 0 if it is unobserved. Note that with partially linked SNPs it is not enough to know the total number of unobserved sites: we must know the number of unobserved sites between each successive pair of SNPs so that we can accurately take into account the probability of recombination between SNPs:
![]() |
(6) |
The reconstituted-DNA method, relying on information about the unobserved sites, allows us to analyze partially linked SNPs, which are intractable under the conditional-likelihood method. This approach can be incorporated into a recombination-aware Metropolis-Hastings sampler (![]()
A specialized case worth considering is the one in which short chunks of DNA from well-separated locations are searched for SNPs. If the chunks are short and recombination is infrequent, it may be reasonable to treat such chunks as fully linked internally and completely unlinked with one another. They can then be analyzed under the reconstituted-DNA model using a combination of Equation 4 (within chunks) and the logic for combining multiple unlinked loci in a Markov chain Monte Carlo (MCMC) sampler described in ![]()
The conditional-likelihood approach could also be used if the number of SNPs sampled from one chunk were independent of the density of SNPs in that chunk (for example, if the researcher examined each chunk until five SNPs were found and then stopped). However, in the more usual case, where all SNPs in the chunk are reported, straightforward application of the conditional-likelihood approach leads to a bias. Chunks with a deep genealogy produce many SNPs, and the likelihood curves from such chunks are therefore well defined. Chunks with a shallow genealogy produce few SNPs and a relatively flat likelihood curve. The deeper genealogies thus dominate the combined estimate, leading to an overestimate of
. It may be possible to overcome this effect with an appropriate conditioning on the number of SNPs in each chunk, but the reconstituted-DNA approach seems simplerin effect it does the necessary conditioning automatically.
Methods of defining a SNP:
The next question of interest is how a candidate site is determined to be a SNP once it is drawn. There are at least three possibilities. The site might be classified as a SNP because it is polymorphic in the actual sample under consideration ("sample SNPs"); because it is polymorphic in a panel drawn from the same population ("panel SNPs"); or because it is polymorphic in a panel from a different, though presumably related, population ("different-population panel SNPs").
We consider a site to be polymorphic if at least one nucleotide difference is seen. More restricted definitions, such as "a site is polymorphic if the frequency of the most common allele is less than 0.95," can in principle be handled by modifications of these approaches.
The terms we need to consider are the sitewise data likelihoods P(D&SNP|G) (the probability that a site of a given configuration D will be obtained as a SNP) and P(SNP|G) (the probability that a site will be a SNP). It is useful to define I as an invariant site, Ip as a site that is invariant in the panel, and Is as a site that is invariant in the sample.
Sample SNPs:
Here the data contain only polymorphic sites by definition, and the data likelihood P(D&SNP|G) reduces to P(D|G) since the conditional probability that the observed site is polymorphic is 1. This likelihood can be calculated as a standard DNA likelihood (![]()
![]()
Panel SNPs:
Here the genealogy G must be widened to include relationships not only among the sampled individuals but among the panel individuals and between panel and sample individuals. We assume that as long as a site is found to vary in the panel, it will be included in our calculation even if it proves not to vary in the sample, since this will generally not be known until after the data are collected. The method can readily be modified to cover the case where sites that prove to be invariant in the sampled individuals are discarded, but this should be avoided if at all possible as it loses information unnecessarily.
We call the data for a given site in the actual sample Ds and the data for that site in the panel Dp. If Dp is known, we should merge panel and sample together and use a modified form of the analysis given below. Usually all we know is that Dp was not invariant. The term P(D&SNP|G) can then be written as the probability of observing Ds given that the site is variable in the panel, which is equal to the probability of Ds minus the probability of cases in which the panel was invariant: P(Ds|G) - P(Ds&Ip|G). This is somewhat cumbersome to bookkeep in a Metropolis-Hastings context, but presents no theoretical difficulties. It will, however, slow such an analysis by increasing the size of the search space, since G must now include the possible genealogy of the panel as well as the sample.
Different-population panel SNPs:
This case is more demanding, as the genealogy G that relates sample and panel must take into account the relationship between the two populations. If this relationship is basically that of two static populations undergoing migration, an appropriate method would be to use the logic of Migrate (![]()
If the relationship between the two populations is common descent from an ancestral population, a Metropolis-Hastings sampler incorporating this relationship could also be constructed. There is currently no practical experience to tell us how much power would be available to such a sampler, though unless the separation of the populations is recent relative to the age of the mutations causing the SNPs, many of the SNP sites defined on the panel may be uninformative on the sample.
It is clear that if the approach used to define the SNPs is not known, there is not enough information available to construct a full likelihood method.
Consistency of estimation with SNPs:
In this section, we investigate whether SNP data can be used to make a consistent ML reconstruction of the tree, since if the coalescent tree can be reconstructed consistently, a consistent estimate of
will naturally follow.
Maximum likelihood phylogeny reconstruction from DNA data can be shown to be consistent when a correct model of sequence evolution is used (![]()
It is immediately apparent that the conditional-likelihood approach will fail in some cases where DNA (or reconstituted DNA) estimation would succeed. Consider unlinked sites from two individuals and a Jukes-Cantor model (![]()
However, with three or more individuals, sufficient information is available with SNP data even if the number of unobserved sites is not known. For unlinked SNPs, three individuals allow the possibility of site classes xxy and xyz (for unlinked SNPs, possibilities such as xyx are equivalent to xxy). Using the Jukes-Cantor model, we can derive (by taking the expectation of the multinomial sampling probability over the distribution of allele frequencies in a symmetrical four-allele model, as in ![]()
:

This function is monotonically decreasing in
and thus any value of the ratio corresponds to a unique estimate of
. We believe, although it is difficult to show analytically, that the same is true for linked SNPs where the actual genealogy is being estimated. Since the genealogy has multiple parameters, more information is needed, but more is available since classes xxy, xyx, and yxx can now be distinguished.
It should be noted that if the model of DNA evolution used to derive the site-class expectations is not identical to the one that governed the actual generation of the data, there is no guarantee that the maximum likelihood estimate will be consistent. (This is not a flaw in likelihood analysis: no method can be expected to be consistent if it is based on an incorrect model.) However, as long as three or more sequences are sampled, properly conditioned maximum likelihood analysis of SNPs should be consistent under the same conditions as maximum likelihood analysis of the underlying DNA data.
Bias of estimation with SNPs:
It would be natural to ask whether estimation of
using finite amounts of SNP data is biased: that is, what is the expectation of the estimate with finite data? Surprisingly enough, however, the mean bias is infinite for both SNP data and full DNA data. For both SNPs and full DNA data, there exist possible data configurations for which the estimate of
is infinite. Thus the expectation is a mean over terms, some of which are infinite, and the mean bias must be infinite.
At first this seems very alarming. However, the cases that give an infinite estimate are extremely unlikely with reasonably sized data sets (>510 sites), and for the vast majority of data sets a good estimate is produced. This suggests that if we want practical guidance as to whether the method is working well in a particular case, simulations are still relevant, even though we expect that if enough simulations were performed, the mean estimate would be infinite.
Even in the absence of infinitely large estimates, we expect an upward tendency in the results of estimations using SNPs. In the tree of three tips, all of the information is in the ratio of xxy to xyz sites. Sites of the xyz class are rare for reasonable values of
(cases where they are not rare are unlikely to come up in biological practice, since they would represent DNA so divergent that homology and alignment would become problematic). Since they are rare, their frequency will be poorly estimated by small data sets. The relationship between f(xyz) and the estimate of
is nonlinear, with upward deviations in f(xyz) producing a much larger effect than downward ones. If we estimate f(xyz) with error that is symmetrically distributed around the true value, we expect an upward tendency in our results. The full DNA model is not as vulnerable to this effect because it is not solely dependent on f(xyz) for its information: the proportion of sites that are xxx is also informative, and since they are common, much more information is available.
In the simulation section, we provide some empirical exploration of the amount of SNP data needed to get an accurate estimate.
Computer simulations:
Random coalescent trees for a given value of
and recombining-coalescent trees for given values of
and r were made using a program provided by Richard Hudson (personal communication). DNA data were simulated on these trees using a modification of the program treedna (J. FELSENSTEIN, unpublished results), which uses the Kimura two-parameter model (![]()
To create sample SNP data, we simulated trees containing the desired number of tips (sequences) and recorded all polymorphic sites generated. To create panel SNP data, we simulated trees containing a number of tips equal to the sum of panel and sample sizes and then chose the panel out of these tips at random. The panel was used to determine which sites to sample, and these sites were then sampled from the remaining tips, even if they were not polymorphic among those tips. Data from the panel individuals were then discarded. This method is appropriate because a coalescent tree generated for a given number of individuals is statistically indistinguishable from one generated for the entire population and then subsampled to give that number of individuals.
The parameter
(the chance that, if a site were eligible to be a SNP, it would be detected as one) was set to 1.
Estimates of
were made using the program Recombine from the LAMARC package (http://evolution.genetics.washington.edu/lamarc.html). Recombine is an extension of Coalesce (![]()
and, optionally, r.
The program runs a series of Markov chains that sample these coalescent trees. From the sample of trees in each chain, a new estimate of the parameters is made: this estimate in turn provides a starting point for the next chain. Finally, a longer chain is run to produce the final estimate. In this study, for each estimation we ran 10 short chains of 500 trees each and 1 long chain of 10,000 trees, sampling every 20th tree. The program was provided with the correct nucleotide frequencies and transition/transversion ratio for the DNA model used.
Some data sets did not contain any variable sites. Rather than attempting to make an estimate of
in these cases, we assigned the obvious estimate of zero. (The Recombine program would move toward an estimate of zero, but would encounter problems such as arithmetic underflow before actually arriving there.)
To test whether the presence of some recombination in the sequences disrupts the estimate, we created sequences with various levels of recombination and analyzed them with and without coestimating recombination.
| RESULTS |
|---|
We did not explore the case of unlinked SNPs. The Metropolis-Hastings sampler is overkill on such data, since single unlinked sites do not provide enough information for any kind of tree estimate. An analytic solution may be possible if the mutational model is not too complex.
As expected, representing SNPs as DNA leads to drastic overestimation of
(Table 3A and Table 3B, line 1).
|
For fully linked SNPs, when
was quite high (0.1), both sample SNPs and panel SNPs gave estimates close to the truth and had similar standard deviations (Table 3A, lines 2 and 4). The standard deviations were about twice as high as observed in ![]()
). Use of reconstituted DNA led to standard deviations nearly as good as those from the full DNA model (Table 3A, lines 3 and 5).
When panel SNPs were misrepresented as sample SNPs, the estimates were approximately 10-fold too low (Table 3A, line 6). In this high-
case, both sample and panel SNPs can be successful at recovering
, but only if the method of ascertainment is correctly specified. Why is incorrect specification so disastrous? Mislabeling panel SNPs as sample SNPs forces us to discard sites that are found to be invariant in the sample, reducing the amount of available data, but this is not the main reason for the poor results; if it were, results with 1000 bp of DNA and mislabeled SNPs (yielding an average of 180 analyzable sites) should be superior to results with 500 bp of DNA and panel SNPs (yielding an average of 118 analyzable sites). In fact, they were much inferior. The fundamental problem is that panel SNPs are depleted for variable sites arising from mutations in tipward branches, since such sites will not be shared between panel and sample members: analyzing them as sample SNPs ignores this depletion, leading to misinterpretation of the data.
Use of reconstituted DNA reduced the size of this error, leading to estimates that were
60% of the true value (Table 3A, line 7).
For our lower value of
(0.01), which is still quite high by biological standards, the results (Table 3B) were less encouraging. Even with correct labeling, sample SNPs overestimated
by a factor of
2, and panel SNPs by a factor of
3. The standard deviations were
10 times higher than would have been obtained with use of the full DNA model (compare with Table 1 of ![]()
The results with mislabeled data at the lower
were again drastically too low, and, again, reconstituted DNA appeared to improve the situation but not to produce a correct estimate. The high result for the case with 500 bp and R-mislabeled SNPs is the mean of many results lower than the truth and a single extremely high estimate and probably does not indicate a repeatable upward tendency.
Table 4 shows the result of ignoring recombination when it is present. The higher the recombination parameter r, the more severe the upward bias in
. Between-site inconsistencies that are introduced by recombination must, in a no-recombination model, be interpreted as multiple mutations, inflating the estimate.
|
Table 5 shows that when recombination is explicitly modeled, only a slight upward bias in
remains. It is especially striking that allowing for recombination improves the estimate of
(compare Table 4 and Table 5) even when there is clearly insufficient information to allow a good estimate of the recombination parameter r itself. (Our experience of the Recombine program is that it cannot make accurate estimates of r on such short sequences.)
|
| DISCUSSION |
|---|
When
is relatively low, estimates based on SNP data will be inaccurate because site patterns other than the most common one will be very infrequent, and thus their frequency will be poorly estimated. In such cases, a method that assumes an infinite-sites model and estimates a per-locus rather than per-site 4Neµ, such as the methods of ![]()
![]()
, extremely large numbers of SNPs will be required.
However, when
is relatively high, accurate estimates are possible, though at some loss of efficiency compared to use of the full DNA model. In cases where SNP data is less expensive to collect than full DNA data, this trade-off may be worthwhile. Somewhat surprisingly, a panel of 10 individuals appears sufficient, in the high-
case, to give results nearly as good as those obtained by choosing SNPs from the sample itself.
Intuitively one might expect that considering only the "informative" variable sites from a piece of DNA would preserve most of its information value. While this may be true for estimation of tree topology, it is not true for estimation of
or other parameters that are based on branch lengths. Much of the lost information can be recovered by the reconstituted-DNA approach, though this will be subject to errors if the estimate of
is incorrect or the SNPs are for some reason not characteristic of the sequence in which they are embedded.
If there is any chance that recombination is present, a model that allows for recombination will produce more accurate estimates of
than one that denies it; the gain due to better matching of model to reality appears to easily offset the cost of estimating an additional parameter.
Of the two models we present, conditional-likelihood and reconstituted DNA, the conditional-likelihood method is simpler and requires less additional information, but it appears to be less accurate and cannot be extended to cases in which there is some recombination. The reconstituted-DNA approach appears preferable whenever a reasonable guess can be made about the frequency of SNPs in the unobserved sites.
An important additional question is whether SNPs will be fully informative for gene mapping by coalescent-based methods. On theoretical grounds, we believe that an accurate SNP-likelihood model will be important in obtaining good gene-mapping results: while use of an inaccurate model may produce a mapping curve with the same peaks, it will distort the heights of the peaks and thus lose information about the reliability of the map. The distortion arises because, in the absence of accurate branch lengths, the program will incorrectly weigh the competing alternatives of recombination and multiple mutations. We plan to test this by simulation as mapping algorithms become available.
Information about the makeup of the panel is crucial if an accurate likelihood estimate is to be made from panel SNPs. The panel is not just preliminary work: it is a key part of the final data set and must be treated as such. In some cases, incorrectly specifying the means by which SNPs were determined can change the results by >10-fold. Anyone considering publishing a set of SNP probes for general use should, at a minimum, include the source and number of individuals sampled and the criteria for deciding which sites were considered to be SNPs: ideally, full haplotypes of the entire panel should be made available. If the SNPs are to be used in a population other than the one from which they were sampled, details of the population structure, including subpopulation membership of each panel individual, are also important.
Finally, the more divergent the population on which SNPs are defined is from the population under study, the more analytic power is likely to be lost; and the more complex the procedure by which SNPs are defined, the more difficult and time-consuming the analysis is likely to be. Ad hoc rules for accepting or rejecting a site as a SNP may be attractive in the laboratory, but they will hamper analysis of the resulting data.
| ACKNOWLEDGMENTS |
|---|
We thank Richard Hudson for providing the tree-generating program and Maynard Olson for information on how SNPs are ascertained in practice. Discussions with Lindsey Dubb contributed substantially to our understanding of the SNP likelihoods. The analysis in Table 3 was suggested by Elain Fu. Two anonymous reviewers provided useful comments. This work was supported by National Institutes of Health grants R01 GM51929 and R01 HG01989, both to J.F.
Manuscript received October 15, 1999; Accepted for publication May 22, 2000.
| APPENDIX |
|---|
We present here the derivation of Equation 1 and Equation 2.
- Fully linked SNPs (Equation 1):

We condition on the fact that only SNPs are observed

and for the case of SNPs ascertained from the sample, P(Ds&SNP|G) is simply P(Ds|G)

- Unlinked SNPs (Equation 2):

In this case, we cannot assume that the genealogy G that generated the SNP is the same genealogy that would have generated putative non-SNP sites, so we need two independent summations,

and for the case of SNPs ascertained from the sample, P(Ds&SNP|G) is simply P(Ds|G):

| LITERATURE CITED |
|---|
BEERLI, P. and J. FELSENSTEIN, 1999 Maximum likelihood estimation of migration rates and effective population numbers in two populations using a coalescent approach. Genetics 152:763-773
CHANG, J., 1996 Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math. Biosci. 137:51-73[Medline].
FELSENSTEIN, J., 1981 Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376[Medline].
FELSENSTEIN, J., 1992 Phylogenies from restriction sites: a maximum likelihood approach. Evolution 46:159-173.
GRIFFITHS, R. C. and S. TAVARÉ, 1993 Sampling theory for neutral alleles in a varying environment. Proc. R. Soc. Lond. Ser. B 344:403-410.
JUKES, T. H., and C. R. CANTOR, 1969 Evolution of protein molecules, pp. 21132 in Mammalian Protein Metabolism, edited by H. N. MUNRO. Academic Press, New York.
KIMURA, M., 1980 A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120[Medline].
KUHNER, M. K., J. YAMATO, and J. FELSENSTEIN, 1995 Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling. Genetics 140:1421-1430[Abstract].
KUHNER, M. K., J. YAMATO, and J. FELSENSTEIN, 1998 Maximum likelihood estimation of population growth rates based on the coalescent. Genetics 149:429-434
KUHNER, M. K., J. YAMATO, and J. FELSENSTEIN, 2000 Maximum likelihood estimation of recombination rates from population data. Genetics 156(3 in press).
SYVANEN, A. C., U. LANDEGREN, A. ISAKSSON, U. GYLLENSTEN, and A. BROOKS, 1999 Enthusiasm mixed with scepticism about single nucleotide polymorphism markers for dissecting complex disorders. Eur. J. Hum. Genet. 7:98-101[Medline].
WATTERSON, G. A., 1975 On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7:256-276[Medline].
WATTERSON, G. A., 1977 Heterosis or neutrality? Genetics 85:789-814
| NOTE ADDED IN PROOF |
|---|
The correction we use for conditional-likelihood SNPs is related to one first developed by W. J. Ewens for RFLP data (W. J. EWENS, R. S. SPIELMAN and H. HARRIS, 1981, Estimation of genetic variation at the DNA level from restriction endonuclease data. Proc. Natl. Acad. Sci. USA 78: 37483750).
This article has been cited by other articles:
![]() |
H. J. Ryynanen, A. Tonteri, A. Vasemagi, and C. R. Primmer A Comparison of Biallelic Markers and Microsatellites for the Estimation of Population and Conservation Genetic Parameters in Atlantic Salmon (Salmo salar) J. Hered., November 5, 2007; (2007) esm093v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. Knudsen and M. M. Miyamoto Incorporating Experimental Design and Error Into Coalescent/Mutation Models of Population History Genetics, August 1, 2007; 176(4): 2335 - 2342. [Abstract] [Full Text] [PDF] |
||||
![]() |
E. B. Rosenblum and J. Novembre Ascertainment Bias in Spatially Structured Populations: A Case Study in the Eastern Fence Lizard J. Hered., July 4, 2007; (2007) esm031v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Pariset, I. Cappuccio, P. Ajmone-Marsan, M. Bruford, S. Dunner, O. Cortes, G. Erhardt, E.-M. Prinzenberg, K. Gutscher, S. Joost, et al. Characterization of 37 Breed-Specific Single-Nucleotide Polymorphisms in Sheep J. Hered., September 1, 2006; 97(5): 531 - 534. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. K. Kuhner LAMARC 2.0: maximum likelihood and Bayesian estimation of population parameters Bioinformatics, March 15, 2006; 22(6): 768 - 770. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. G. Clark, M. J. Hubisz, C. D. Bustamante, S. H. Williamson, and R. Nielsen Ascertainment bias in studies of human genome-wide polymorphism Genome Res., November 1, 2005; 15(11): 1496 - 1502. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. Nielsen, M. J. Hubisz, and A. G. Clark Reconstituting the Frequency Spectrum of Ascertained Single-Nucleotide Polymorphism Data Genetics, December 1, 2004; 168(4): 2373 - 2382. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. M. Adams and R. R. Hudson Maximum-Likelihood Estimation of Demographic Parameters Using the Frequency Spectrum of Unlinked Single-Nucleotide Polymorphisms Genetics, November 1, 2004; 168(3): 1699 - 1712. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Polanski and M. Kimmel New Explicit Expressions for Relative Frequencies of Single-Nucleotide Polymorphisms With Application to Statistical Inference on Population Growth Genetics, September 1, 2003; 165(1): 427 - 436. [Abstract] [Full Text] [PDF] |
||||
![]() |
J.-C. Nicod and C. R. Largiader SNPs by AFLP (SBA): a rapid SNP isolation strategy for non-model organisms Nucleic Acids Res., March 1, 2003; 31(5): e19 - e19. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. M. Akey, K. Zhang, M. Xiong, and L. Jin The Effect of Single Nucleotide Polymorphism Identification Strategies on Estimates of Linkage Disequilibrium Mol. Biol. Evol., February 1, 2003; 20(2): 232 - 242. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. M. Akey, G. Zhang, K. Zhang, L. Jin, and M. D. Shriver Interrogating a High-Density SNP Map for Signatures of Natural Selection Genome Res., December 1, 2002; 12(12): 1805 - 1814. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Tsitrone, F. Rousset, and P. David Heterosis, Marker Mutational Processes and Population Inbreeding History Genetics, December 1, 2001; 159(4): 1845 - 1859. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. Beerli and J. Felsenstein Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach PNAS, March 29, 2001; (2001) 81068098. [Abstract] [Full Text] |
||||
![]() |
M. K. Kuhner, J. Yamato, and J. Felsenstein Maximum Likelihood Estimation of Recombination Rates From Population Data Genetics, November 1, 2000; 156(3): 1393 - 1401. [Abstract] [Full Text] |
||||
![]() |
P. Beerli and J. Felsenstein Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach PNAS, April 10, 2001; 98(8): 4563 - 4568. [Abstract] [Full Text] [PDF] |
||||
- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Kuhner, M. K.
- Articles by Felsenstein, J.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Kuhner, M. K.
- Articles by Felsenstein, J.












