## Abstract

Most single-nucleotide polymorphism (SNP) data suffer from an ascertainment bias caused by the process of SNP discovery followed by SNP genotyping. The final genotyped data are biased toward an excess of common alleles compared to directly sequenced data, making standard genetic methods of analysis inapplicable to this type of data. We here derive corrected estimators of the fundamental population genetic parameter θ = 4*N*_{e}*μ* (*N*_{e}, effective population size; *μ*, mutation rate) on the basis of the average number of pairwise differences and on the basis of the number of segregating sites. We also derive the variances and covariances of these estimators and provide a corrected version of Tajima's *D* statistic. We reanalyze a human genomewide SNP data set and find substantial differences in the results with or without ascertainment bias correction.

THE HapMap data (International HapMap Consortium 2007) and other genomewide single-nucleotide polymorphism (SNP) data sets provide a valuable resource for population genetic analysis. Much interest in the analysis of such data has focused on estimating demographic parameters or inferring natural selection (*e.g.*, Bamshad and Wooding 2003; Wooding 2004; Carlson *et al.* 2006; Sabeti *et al.* 2006; Voight *et al.* 2006; Wang *et al.* 2006; Tang *et al.* 2007; Williamson *et al.* 2007). However, many of the studies of genomewide SNP data have been challenged by the fact that the SNP genotyping data have been obtained by a process in which SNPs are first discovered in a small panel of individuals and subsequently typed in a much larger panel (*e.g.*, Picoult-Newberg *et al.* 1999; Altshuler *et al.* 2000; Mead *et al.* 2003). Although this procedure provides a much faster and cheaper way of generating data than direct sequencing of the full panel, it also produces data with a relative excess of alleles of intermediate frequencies compared to directly sequenced data. Rare SNPs are more easily discovered in large panels than in small panels, so an initial discovery process based on a small panel produces an excess of high-frequency alleles in the genotyped sample. As a consequence, the data will be different from what is assumed in standard population genetic models with respect to allele frequency distribution (*e.g.*, Nielsen 2000; Wakeley *et al.* 2001), patterns of linkage disequilibrium (Nielsen and Signorovitch 2003), and level of population subdivision (Nielsen 2004). This ascertainment bias toward high-frequency alleles can have serious consequences when standard population genetic tools (*e.g.*, Tajima 1989; Fu and Li 1993; Fay and Wu 2000; Ramos-Onsins and Rozas 2002) are used for the analysis of the data. For example, Kreitman and Di Rienzo (2004) and Soldevila *et al.* (2005) showed that the apparent effects of balancing selection detected in the prion protein gene (*PRPN*) by Mead *et al.* (2003) in fact were an artifact caused by this type of ascertainment bias.

Three different approaches have been used to address the problem of ascertainment biases in studies of real data: (i) applying methods that may be more robust to the effect of ascertainment bias, such as methods based on haplotype structure (*e.g.*, Sabeti *et al.* 2002), (ii) simulating data under the ascertainment procedure to derive appropriate critical values and confidence intervals using a distribution that directly takes ascertainment into account (*e.g.*, Carlson *et al.* 2004; Voight *et al.* 2006), and (iii) directly correcting the statistical estimators and statistics for the ascertainment bias (*e.g.*, Nielsen 2000; Wakeley *et al.* 2001; Nielsen and Signorovitch 2003; Polanski and Kimmel 2003; Marth *et al.* 2004; Nielsen *et al.* 2004) in specific models. However, hitherto there have been no ascertainment correction methods available for some of the most basic population genetic tools. Here we derive ascertainment corrected estimators of the fundamental population genetic parameter θ = 4*N*_{e}*μ* (*N*_{e}, effective population size; *μ*, mutation rate) and an ascertainment corrected version of the popular statistic used for detecting selection: Tajima's *D*. Our results are for a neutral locus, without recombination, sampled from a panmictic population of constant size.

## THEORY AND METHODS

#### Estimators of θ:

Tajima's *D* (Tajima 1989) is calculated as the difference between Tajima's estimator of θ, θ_{T} (Tajima 1989), and Watterson's estimator of θ, θ_{w} (Watterson 1975). Tajima's estimator is based on the average number of pairwise differences (*π*) and is given by(1)where η_{i} is the number of derived alleles segregating at a frequency of *i/n*, in a sample of *n* chromosomes. The calculation of is identical for arbitrarily labeled alleles; however, we use the definition on the basis of knowing which allele is derived, to keep a consistent notation throughout. Watterson's estimator is given by(2)where is the number of segregating sites.

We assume an ascertainment model in which a subset of *d* chromosomes has been chosen independently among the *n* chromosomes for ascertainment, as the data to be analyzed later are the result of this same discovery procedure. We further assume that the chromosomes chosen for ascertainment are independent among SNPs; that is, each SNP has been ascertained from a different set of chromosomes. This procedures simulates the data obtained from procedures such as shotgun or array-based resequencing (used in Perlegen data), where different individuals are sequenced and the fragments obtained are aligned using a reference sequence. The probability of ascertainment of a SNP with alleles of frequencies *i/n* and (*n − i)*/*n* is then(3)(Nielsen 2004), where we use the definition if *k > n*. The final sample after ascertainment is denoted the *genotyped sample*.

The expected number of segregating sites in the genotyped sample under this ascertainment scheme, *S*^{(A)}, is then simply the sum over all allelic classes of the expected number of segregating sites of that allelic class (*E*[η_{i}] = θ/*i*; Tajima 1989; Fu 1995) multiplied by the probability of ascertainment of the allelic class:(4)An unbiased, ascertainment corrected method-of-moments estimator of θ, similar to Watterson's estimator, is then given by(5)The expected number of pairwise differences in the genotyped sample is similarly given by the sum over all allelic classes of the expected contribution to the pairwise differences of the allelic class multiplied by the probability of ascertainment of the allelic class:(6)An unbiased, ascertainment corrected method-of-moments estimator similar to Tajima's estimator is then given by(7)Note that these estimators are identical to the traditional estimators, and , when there is no ascertainment bias; *i.e.*, .

#### Variances of the estimators:

We use notation and some results from Durrett (2008, Chap. 2), to derive covariance and variances of these estimators assuming no intralocus recombination. In the absence of any ascertainment bias (Fu 1995; Durrett 2008),where equals(8) equalsandUsing the conditional variance formula(9)with and , we get(10)Also, from Equation 8 and from the independence among SNPs of the ascertainment probabilities, we haveThen, recalling that , we obtain(11)We can then easily get the variance of *S*^{(A)}:(12)The variance of the ascertainment corrected estimator of θ based on the number of segregating sites is then given by(13)The variance of the estimator based on the average number of pairwise differences becomes(14)

#### Covariances and Tajima's *D*:

Defining the coefficientswe have(15)by using Equations 10 and 11 and expanding the covariance of the sums as the sum of the covariances.

Also(16)

We now define an ascertainment corrected Tajima's *D* as(17)To calculate for real data we need to know the value of θ and θ^{2}. We estimate θ using , similarly to the usual use of for calculating the classical Tajima's *D* statistic. We estimate θ^{2} as(18)The *D*_{C} statistic is identical to the traditional Tajima's *D* in the absence of an ascertainment, *i.e.*, when .

#### Simulations:

Simulated data were generated using the standard coalescent simulation program ms (Hudson 2002) with 10,000 and/or 1,000,000 replicates. We explored three different values of θ: 2.23, 22.33, and 89.30, corresponding to the estimates of θ based on Watterson's estimator calculated from the minimum, average, and maximum number of segregating sites found in the genes represented in the SeattleSNP database (http://pga.gs.washington.edu/, Crawford *et al.* 2005). We also explored results for an extreme value of θ, θ = 150. To generate ascertainment samples from the simulated data, we subsampled *d* (= 2, 5, or 10) gene copies from each segregating site in the sample of size *n* (= 20 or 50). Moreover, we have also generated a set of samples of size *n* = 100 to explore the relationship between *d* and the variance in the estimators. If the segregating site was polymorphic in the subsample, it was included in the final sample; otherwise it was ignored. In all cases, the recombination rate was set to 0.

#### Perlegen data:

Genotype data from Perlegen were obtained from http://genome.perlegen.com/browser/download.html, and we used information regarding the ascertainment protocol discussed in Clark *et al.* (2005) and Hinds *et al.* (2005). For each SNP, the number of individuals that have been included in the discovery panel is known for 69% of the SNPs (ascertainment panel A), and only these SNPs are included in our analysis. Ascertainment of SNPs in this panel was done by genomewide shotgun resequencing. Data from all populations were pooled, and Tajima's *D* was calculated chromosome by chromosome through a sliding window of 100 and 500 kb, sliding by 10 kb at a time. To take into account varying sample size (*n*) and varying ascertainment sample size (*d*), for each window, we use(19)where is the average sample size in the window, *f*(*d*) is the proportion of SNPs with ascertainment sample size *d*, and *d*_{max} and *d*_{min} are the maximal and minimal values of *d* observed in the window. We use this approach instead of a SNP-by-SNP correction to reduce the computational complexity of the problem. Only those windows that contained at least 10 class A SNPs were included in the analysis. To examine the effect of the ascertainment bias, we have included results for both the uncorrected and the corrected values of Tajima's *D*.

## RESULTS

We evaluate the corrections of and , their variances and covariances, and Tajima's *D* using coalescent simulations in the first three subsections. Subsequently, we apply the corrected Tajima's *D* on the Perlegen data set.

#### Correction of the estimators of θ:

Figure 1 shows the distribution of the estimates based on the uncorrected estimators and , in the presence of an ascertainment bias (*d* = 5) and without an ascertainment bias (*d* = *n* = 50), and the corresponding distributions of the corrected estimates, and , in the presence of an ascertainment bias for *n* = 50 and θ = 150. For , the average estimate of θ is 69.82 with and 150.07 without an ascertainment bias, respectively. However, the ascertainment corrected estimate is = 150.11. For , the average estimate of θ is 104.18 with and 150.11 without an ascertainment bias, respectively, and the ascertainment corrected estimate is = 150.14. For θ = (2.23, 22.33, and 89.20) we also found large differences between the average estimates of θ with and without an ascertainment bias, while the true value was recovered under ascertainment when the corrected estimators were used (see supplemental data A for more details). This shows that the traditional estimators, as expected, are biased in the presence of an ascertainment bias, but that the ascertainment corrected estimators derived here recover an unbiased estimate.

#### Correction of the variances and the covariance:

As seen in Figure 1, the variance in the corrected estimates of θ is increased in the presence of an ascertainment bias when the number of SNPs in the data set is held constant. Equations 13 and 14 quantify the variance in the estimate and have been verified by simulations (not shown).

Figure 2 shows the relationship between *d* and the variance in the estimators for *n* = 100. When the ascertainment sample size is small compared to the size of the sample, the variances and covariances are greatly increased [for *d* = 2 the variance of Tajima's θ () is nearly doubled, and the variance of is nearly multiplied by four]. However, when *d* approaches *n*/2, the difference between the real variance and the estimated variance is drastically reduced.

#### Correction of Tajima's *D*:

Figure 3 shows the distribution of Tajima's *D* and *D*_{C} values for *n* = 50, *d* = 5, and θ = 150. When there is no ascertainment bias, the distribution of Tajima's *D* values using Equation 17 is identical to the one obtained using the standard method, with mean = −0.1103 in both cases, while when there is ascertainment bias and we do not apply the corrected formula, the distribution is greatly skewed toward positive values (mean = 1.5170). If the correction is applied to the simulated data suffering from the ascertainment bias, the nonascertained distribution is approximately recovered and its mean, −0.2497, gets closer to the nonascertainment one. However, because the correction is nonlinear, it does not match the original distribution exactly but is slightly skewed toward negative values compared to the original distribution and has a slightly larger variance. Neither the original Tajima's *D* in the absence of an ascertainment bias nor the current ascertainment corrected estimator in the presence of ascertainment bias has expectation equal to zero. Both rely on a ratio of two correlated statistics, so even though the numerator has expectation equal to zero, the expectation of the ratio is not equal to zero. Also, it is not surprising that the variance is slightly larger for the ascertainment corrected statistic. It suggests that some information has been lost by the ascertainment process. The same tendencies can be seen for the other values of θ explored (see supplemental data A).

#### Analysis of Perlegen data:

To illustrate the use of the correction of Tajima's *D*, we applied it to a Perlegen data set (Hinds *et al.* 2005), previously analyzed by Clark *et al.* (2005) without correcting for ascertainment bias. The Perlegen data were analyzed chromosome by chromosome, taking windows of 100 spanning 10 kb obtaining, on average, 12,221 windows per chromosome. A total of 74.47% of the windows have ≥10 SNPs and are, therefore, included for the comparison between the corrected and the uncorrected Tajima's *D* values.

An example of the result, using windows of 500 kb on chromosome 1, is shown in Figure 4. Positive Tajima's *D* values (1.9) are found in the area containing the genes *TMEM57*, *MAN1C1*, and *LDLRAP1*. The former is a transmembrane protein and the second a mannosidase. The latter encodes for a cytosolic protein that interacts with the LDL receptor, and mutations in it have cause hypercholesterolemia, an autosomal recessive disorder (Mishra *et al.* 2005; Quagliarini *et al.* 2007). Negative Tajima's *D* values ∼ −2 were found in windows containing *HIST2H**, *FCGR1A*, and *PPIAL4*, a histone cluster, a fragment of the IgG receptor, and the peptidylprolyl isomerase A, respectively. *D* values of −1.6 were found around the *SRGAP2* gene, whose mRNA has been found in melanoma, germ cell tumors, chondrosarcoma, and retinoblastoma (Katoh and Katoh 2003).

Figure 5 shows the correlation of Tajima's *D* results with and without correction for all chromosomes. As expected, the *D* values are higher than the *D*_{C} values. We examine windows with extreme values of Tajima's *D*, which we have arbitrarily defined as those with values < −2 or >2, in more detail. While there are 210 windows with *D*_{C} ≤ −2, there are only 17 windows with *D* ≤ −2. Likewise, there are 99 windows with *D*_{C} ≥ 2 and 8317 with *D* ≥ 2. Table 1 summarizes the information about the 50 windows with the most extreme values of *D*_{C} (25 lowest and 25 highest). Of the 25 windows with lowest values of *D*_{C}, 3 would not be found among the 25 most significant windows using *D*, and 8, including the *GPC3* gene, would be excluded on the basis of the *D* ≤ −2 criterion. Among the 25 most significant windows with positive values of *D*_{C}, 10 of them are not included in the set of the 25 most extreme genes based on *D*. Among these windows there are genes such as *BRCA1* or *NF1*. Fourteen of 15 genes are located on the X chromosome. A possible explanation is that selection on the X chromosome is more efficient because recessive mutations are exposed to selection in males (see, *e.g.*, Schaffner 2004).

## DISCUSSION

We have here derived estimators of the population genetic parameter θ, and the variances and covariances of the estimators, under a model with ascertainment bias. This leads us to an ascertainment correction of Tajima's *D*. We note that similar corrections could easily be derived for other statistics as well, particularly if they can be written as functions of site frequency spectrum, *i.e.*, *η*_{i}, *i* = 1, 2, … , *n* − 1. Statistics such as Fu and Li's *D* (Fu and Li 1993) and Fay and Wu's *H* (Fay and Wu 2000) are included in this category. We also emphasize that while the ascertainment scheme here is quite specific, and the results may therefore not always apply to real data, all results are expressed in terms of the probability of ascertainment of a SNP as a function of its frequency, . It is, therefore, quite trivial to extend this work to other ascertainment schemes, including the ones considered in Nielsen *et al.* (2005), as long as appropriate ascertainment information is available.

The methods applied here assume that there is no intralocus recombination, as expressions explicitly incorporating recombination are not tractable for Tajima's *D*. There is a tradition for applying Tajima's *D* and other similar statistics that are derived assuming no recombination, even in the presence of recombination. As recombination tends to reduce the variance of Tajima's *D* among regions, such applications are considered conservative (*e.g.*, Ramírez-Soriano *et al.* 2008).

The analysis of the Perlegen data illustrates that ascertainment bias correction is of great importance when analyzing SNP genotyping data. Even when just applying outlier approaches in studies of natural selection, the ranking of different genes is likely to change with and without ascertainment bias correction. Likewise, any study aimed at quantifying variability on the basis of typical SNP data will be challenged by the ascertainment bias. It is, therefore, highly desirable that SNP genotyping projects keep close track of the SNP discovery/selection protocols used. Only when such detailed data regarding these protocols are available will it be possible to make accurate ascertainment bias corrections of the data.

A computer program implementing the ascertainment bias corrections discussed in this article can be downloaded from http://www.snpator.com/public/downloads/aRamirez/tajimasDCorrector/. A list of corrected Tajima's *D* values for different regions of the human genome can be found in supplemental data B.

## Acknowledgments

We thank Marta Melé and Francesc Calafell for their comments on this manuscript. This work was supported by National Institutes of Health grant U01HL084706 and by the Danish National Science Council.

## Footnotes

Communicating editor: M. K. Uyenoyama

- Received July 17, 2008.
- Accepted December 10, 2008.

- Copyright © 2009 by the Genetics Society of America