## Abstract

In the post-genome era, disease gene mapping using dense genetic markers has become an important tool for dissecting complex inheritable diseases. Locating disease susceptibility genes using DNA-pooling experiments is a potentially economical alternative to individual genotyping. The foundation of a successful DNA-pooling association test is a precise and accurate estimate of allele frequency. In this article, we propose two new adjustment methods that correct for preferential amplification of nucleotides when estimating the allele frequency of single-nucleotide polymorphisms. We also discuss the effect of sample size when calibrating unequal allelic amplification. We conducted simulation studies to assess the performance of different adjustment procedures and found that our proposed adjustments are more reliable than the current approach with respect to estimation bias and root mean square error. This improvement not only increases the accuracy and precision of allele frequency estimates but also leads to more powerful disease gene mapping.

LOCATING disease susceptibility genes is an important topic in the postgenome era. To detect the small genetic effects of many susceptibility genes, a plausible etiological model for complex traits, many large case-control studies have been launched. Advances in biological techniques have made available thousands of single-nucleotide polymorphisms (SNPs) for disease gene mapping. The availability of these dense markers vastly improves the power of association tests and increases the resolution of gene mapping in candidate region research and genome scanning studies.

Conventional case-control association studies are popular for disease gene mapping using individual genotyping data. However, analyses of large samples are often impractical due to the expense of individual genotyping. In this regard, DNA-pooling experiments may represent an economical alternative. As the name implies, DNA pooling involves the mixing of genomic DNA from many different individuals. Allele frequencies for each SNP marker in the pooled DNA are measured using the same principles that apply to genotyping.

A complete DNA-pooling experiment consists primarily of three stages. The first stage is a pilot study in which heterozygous individuals are collected to estimate the coefficient of preferential amplification (CPA). The coefficient is subsequently used to correct the estimates of allele frequencies in the second stage. In the second stage, DNA-pooling experiments are conducted for a large number of SNPs, and pooling association tests are carried out to screen for potential genetic markers. Only a small proportion of markers selected from the second-stage experiments are included in the third stage in which all individuals are genotyped to confirm the validity of the markers selected from the second stage. As a consequence of the preliminary screen in the second stage, the number of SNPs in the third stage is drastically reduced, thereby lowering genotyping costs.

DNA pooling is an efficient screening method for locating disease susceptibility genes (Bansal *et al*. 2002; Sham *et al*. 2002). However, this cost-saving alternative is efficient only if the estimation of allele frequency is accurate and precise. Biased or unreliable estimation of allele frequencies can lead to spurious results in association studies. Variation in the data from a DNA-pooling study may arise from several different sources, such as pool formation, polymerase chain reaction (PCR) amplification, allele frequency measurement, and other uncontrollable experimental errors (Barratt *et al*. 2002; Visscher and Le Hellard 2003). Importantly, preferential amplification is a natural chemical attribute of PCR; it arises from both heterogeneous nucleotide incorporation during primer extension and differential efficiency of nucleotide detection during DNA quantification (Sham *et al*. 2002). These factors perturb the measurement of the intensity of different nucleotides and, consequently, the estimation of allele frequency. Against this background, we focus on the impact of preferential amplification and propose new adjustment methods to correct for it. We also discuss the issue of sample size when correcting for unequal allelic amplification.

## RESEARCH METHODS

### Data and notation:

First, we discuss the design of the pilot stage. Let the two alleles of a SNP be denoted *A* and *a*, where allele *A* is of interest. Given *n*_{total} samples randomly drawn from a target population, individual genotyping results show that there are *n*_{heter} heterozygous individuals and *n*_{homo} homozygous individuals in the sample, *i.e*., *n*_{total} = *n*_{heter} + *n*_{homo}. The pair of peak intensities for each heterozygous individual is determined (*e.g.*, from MALDI-TOF spectrometry) as the area under the nucleotide-mapping curve. The readings for alleles *A* and *a* in the *j*th heterozygous individual are denoted (*h*^{I}_{A,j}, *h*^{I}_{a,j}), *j* = 1, … , *n*_{heter}. These bivariate vectors are used to quantify the magnitude of preferential amplification.

In the second stage of the screening experiment, genomic DNA from *m* different individuals is pooled. Applying the same genotyping principle used in the pilot study, we obtain a reading of the peak intensities for allele typing in a DNA-pooling experiment. The reading is the summary measure of this pool composed of genomic DNA from *m* individuals and is defined as *H*^{P} = (*h*^{P}_{A}, *h*^{P}_{a}). These data are used to estimate the allele frequency.

Let the population allele frequency of allele *A* be *p*_{A}, the main parameter of interest. We define CPA κ as a measure of the peak intensities of allele *A* relative to allele *a*, κ = μ_{A}/μ_{a}, where μ_{A} and μ_{a} denote the average peak intensities of alleles *A* and *a* in the population. In other words, κ is the relative magnitude of the averaged amplified intensities of two different nucleotides and is an unknown calibration parameter that serves as an adjustment factor for allele frequency estimation. For κ > 1, the first nucleotide tends to be amplified more than the second; for κ < 1, the first nucleotide tends to be amplified less than the second; for κ = 1, equal amplification is likely. The following sections introduce the statistical model and estimation of CPA, the estimation of population allele frequency, and association tests.

### Statistical model and estimation of CPA:

The population allele frequency *p*_{A} is defined as *p*_{A} = *N*_{A}/(*N*_{A} + *N*_{a}), where *N*_{A} and *N*_{a} denote the numbers of alleles *A* and *a* in the population. In individual genotyping experiments, the population allele frequency can be estimated by directly counting the number of alleles from representative samples. The direct counting approach does not apply to DNA-pooling experiments because only the peak intensities are measured.

The relationship between peak intensity and allele frequency is the kernel of allele frequency estimation in a DNA-pooling study. CPA is the bridge that connects the two, as expressed in Equation 1, where *H*_{A} and *H*_{a} denote the accumulated peak intensities of alleles *A* and *a*, respectively. In the absence of preferential amplification, the allele frequency can be estimated by calculating the proportion of the peak intensities. However, if the amplification rate varies depending on the specific nucleotide at the SNP, then parameter κ must be estimated and considered in the estimation of allele frequency. Below, we discuss the procedure for estimating preferential amplification using individual genotyping data from heterozygous individuals.

Suppose that there are *n*_{heter} independent heterozygous individuals in the individual genotyping pilot study. Let the intensities of the two peaks for the *j*th heterozygous individual be *h*^{I}_{A,j} and *h*^{I}_{a,j}, *j* = 1, … , *n*_{heter}. The two-dimensional peak intensities are assumed to follow a bivariate distribution *G*(μ_{A}, μ_{a}, σ^{2}_{A}, σ^{2}_{a}, ρ), where (μ_{A}, μ_{a}) are the population means of the peak intensities for alleles *A* and *a*, (σ^{2}_{A}, σ^{2}_{a}) are the variances of the peak intensities for alleles *A* and *a*, and ρ denotes the correlation of the two intensities.

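Under this model, we propose two measures to estimate CPA and compare them with the previous adjustment method for unequal amplification proposed by Hoogendoorn *et al*. (2000). Their adjustment factor was defined as the arithmetic mean of ratios, *i.e.*,

$$\hat{\kappa}_{H} = \frac{1}{n_{\text{heter}}} \sum_{j=1}^{n_{\text{heter}}} \frac{h^{I}_{A,j}}{h^{I}_{a,j}}. \tag{2}$$

This pioneering approach has been adopted by many researchers in cases of preferential amplification (Le Hellard *et al*. 2002; Mohlke *et al*. 2002; Werner *et al*. 2002). Its advantage is that it is very simple in concept and calculation.

Our first proposed adjustment reduces the bias in Hoogendoorn's method using a bias-reduction technique and can be represented as

$$\hat{\kappa}_{U} = \hat{\kappa}_{H} + \frac{n_{\text{heter}}}{n_{\text{heter}} - 1} \cdot \frac{\bar{h}^{I}_{A} - \hat{\kappa}_{H}\,\bar{h}^{I}_{a}}{\bar{h}^{I}_{a}}, \tag{3}$$

where $\bar{h}^{I}_{A} = \sum_{j} h^{I}_{A,j}/n_{\text{heter}}$ and $\bar{h}^{I}_{a} = \sum_{j} h^{I}_{a,j}/n_{\text{heter}}$. The difference between κ̂_{H} and κ̂_{U} in Equation 3 is the estimated bias of Hoogendoorn's method. A detailed derivation is presented in appendix A. The ratio of a pair of peak intensities often exhibits a skewed distribution, and a log transformation is often considered to reduce the skewness and variability. Therefore, our second proposed adjustment factor is the geometric mean of ratios:

$$\hat{\kappa}_{G} = \left( \prod_{j=1}^{n_{\text{heter}}} \frac{h^{I}_{A,j}}{h^{I}_{a,j}} \right)^{1/n_{\text{heter}}} = \exp\left\{ \frac{1}{n_{\text{heter}}} \sum_{j=1}^{n_{\text{heter}}} \log \frac{h^{I}_{A,j}}{h^{I}_{a,j}} \right\}. \tag{4}$$

This factor can be regarded as an exponential function whose power is the arithmetic mean of the log-scale ratios.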
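The three adjustment factors can be compared side by side on a small data set. The following is a minimal sketch, not the authors' implementation: the function names and the intensity values are hypothetical, and the bias-reduced form is an assumed Hartley-Ross-type correction (mean ratio plus the sample covariance between the ratio and the *a* intensity, divided by the mean *a* intensity) written for an infinite population.

```python
import math

def cpa_arithmetic(h_A, h_a):
    """Arithmetic mean of per-individual ratios h_A/h_a (Hoogendoorn-type factor)."""
    ratios = [a / b for a, b in zip(h_A, h_a)]
    return sum(ratios) / len(ratios)

def cpa_geometric(h_A, h_a):
    """Geometric mean of ratios, computed as exp of the mean log-ratio."""
    logs = [math.log(a / b) for a, b in zip(h_A, h_a)]
    return math.exp(sum(logs) / len(logs))

def cpa_bias_reduced(h_A, h_a):
    """ASSUMED Hartley-Ross-type correction (infinite-population form).

    Adds to the mean ratio the sample covariance of (ratio, h_a) divided by
    mean(h_a), which estimates the bias of the arithmetic mean of ratios.
    """
    n = len(h_A)
    r_bar = cpa_arithmetic(h_A, h_a)
    mean_A = sum(h_A) / n
    mean_a = sum(h_a) / n
    return r_bar + (n / (n - 1)) * (mean_A - r_bar * mean_a) / mean_a

# Hypothetical peak intensities for four heterozygous individuals.
h_A = [12.0, 15.0, 9.0, 20.0]
h_a = [10.0, 11.0, 8.5, 13.0]

kappa_H = cpa_arithmetic(h_A, h_a)
kappa_G = cpa_geometric(h_A, h_a)
kappa_U = cpa_bias_reduced(h_A, h_a)
```

By the inequality of arithmetic and geometric means, `kappa_G` never exceeds `kappa_H` on the same data, which is one way to see why the geometric mean damps the upward pull of a few large ratios.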

The standard error of each adjustment measure reflects sampling variability and is critical for the association test in the next stage. Because the number of heterozygous individuals might be small, and an exact statistical distribution of these adjustment measures is difficult to derive, a bootstrapping procedure (Efron and Tibshirani 1993) is recommended to estimate the standard errors. The original data are used to estimate the hyperparameters by a moment-based or likelihood-based approach to obtain the empirical distribution *G*. Pseudo-samples are generated by resampling from the empirical distribution with replacement. Suppose the number of bootstrap replications is *B*. Each adjustment method in Equation 3 or 4 is applied to the pseudo-samples to obtain the corresponding estimates (κ̂_{1}, … , κ̂_{B}). Hence, the standard error of the adjustment measure can be calculated by taking the sample standard deviation of the bootstrap estimates,

$$\widehat{\mathrm{SE}}_{B}(\hat{\kappa}) = \left\{ \frac{1}{B - 1} \sum_{b=1}^{B} \left( \hat{\kappa}_{b} - \bar{\kappa} \right)^{2} \right\}^{1/2}, \tag{5}$$

where $\bar{\kappa} = \sum_{b=1}^{B} \hat{\kappa}_{b}/B$.
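The bootstrap standard error can be sketched in a few lines. Note one deliberate simplification: the text resamples from a fitted distribution *G*, whereas this sketch resamples the observed intensity pairs directly (a nonparametric pairs bootstrap). The function names, the seed, and the data are hypothetical.

```python
import math
import random

def geometric_mean_ratio(pairs):
    """Geometric mean of h_A/h_a over a list of (h_A, h_a) pairs."""
    logs = [math.log(a / b) for a, b in pairs]
    return math.exp(sum(logs) / len(logs))

def bootstrap_se(pairs, estimator, n_boot=500, seed=0):
    """Sample standard deviation of the estimator over n_boot resampled data sets.

    Simplification: pairs are resampled with replacement from the observed
    data rather than from a fitted parametric distribution.
    """
    rng = random.Random(seed)
    n = len(pairs)
    estimates = []
    for _ in range(n_boot):
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(sample))
    mean = sum(estimates) / n_boot
    var = sum((e - mean) ** 2 for e in estimates) / (n_boot - 1)
    return math.sqrt(var)

# Hypothetical (h_A, h_a) readings for eight heterozygous individuals.
pairs = [(12.0, 10.0), (15.0, 11.0), (9.0, 8.5), (20.0, 13.0),
         (11.0, 10.5), (14.0, 9.0), (10.0, 9.5), (16.0, 12.0)]
se_G = bootstrap_se(pairs, geometric_mean_ratio)
```

Resampling whole pairs keeps the within-individual correlation between the two peaks intact, which matters because the estimators are functions of the ratio.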

### Estimation of allele frequency and test of allelic association:

In this section, we discuss the estimation of allele frequency when preferential amplification is involved. The genomic DNA from all cases is mixed together in one pool and that of controls is mixed in another pool. The pairs of peak intensities in control and case groups are denoted by *H*^{P,control} = (*h*^{P,control}_{A}, *h*^{P,control}_{a}) and *H*^{P,case} = (*h*^{P,case}_{A}, *h*^{P,case}_{a}), respectively.

If there is no preferential amplification, then the coefficient κ will be approximately one, and hence no adjustment is necessary. The allele frequencies of allele *A* in control and case groups can be estimated directly by calculating the proportion of peak intensities as follows:

$$\hat{p}_{\text{control}} = \frac{h^{P,\text{control}}_{A}}{h^{P,\text{control}}_{A} + h^{P,\text{control}}_{a}}, \qquad \hat{p}_{\text{case}} = \frac{h^{P,\text{case}}_{A}}{h^{P,\text{case}}_{A} + h^{P,\text{case}}_{a}}. \tag{6}$$

If κ is larger than one, then allele *A* tends to be amplified more than allele *a*, and vice versa. In these two cases, the scales of the two intensities differ, and the population-level ratio of the two amplification efficiencies is simply κ. To adjust for nonequivalent allelic amplification, the method proposed by Hoogendoorn *et al*. (2000) increases the suppressed intensity by multiplying it by the CPA. This transformation standardizes the two intensities in scale. At the population level, substituting *H̃*_{a} = κ*H*_{a} for *H*_{a} makes the equality on the left side of Equation 1 hold even for κ ≠ 1; at the sample level, κ̂*h*^{P,control}_{a} and κ̂*h*^{P,case}_{a} are used to adjust for unequal amplification. The adjusted terms that improve the estimated allele frequency in Equation 6 for control and case groups are as follows:

$$\hat{p}_{\text{control}}(\hat{\kappa}) = \frac{h^{P,\text{control}}_{A}}{h^{P,\text{control}}_{A} + \hat{\kappa}\, h^{P,\text{control}}_{a}}, \qquad \hat{p}_{\text{case}}(\hat{\kappa}) = \frac{h^{P,\text{case}}_{A}}{h^{P,\text{case}}_{A} + \hat{\kappa}\, h^{P,\text{case}}_{a}}. \tag{7}$$

Because the allele frequency estimator is a function of CPA, it varies with the adjustment factor κ̂, and the performance of each factor should therefore be evaluated. In the simulation study section below, we discuss how simulation studies assess the performance of these adjustment factors.

Screening potential SNP markers associated with a disease locus is the main purpose of a DNA-pooling study (Bansal *et al*. 2002; Sham *et al*. 2002). This can be achieved using the pooling-based association test statistic of Equation 8 (Visscher and Le Hellard 2003), whose denominator is the variance defined in Equation 9. Here, *n*_{case} and *n*_{control} are the numbers of individuals in case and control groups, and σ^{2}_{E} denotes the experimental variation. The test statistic asymptotically follows a chi-square distribution with 1 d.f. The first two terms after the equality in Equation 9 are the variance components due to sampling variation; the third term results from the adjustment variation of preferential amplification; the fourth term is the experimental variation from several different sources, such as pool formation. All of the parameters in Equation 9 are unknown and therefore must be estimated.

Parameter κ is estimated by our proposed method in Equation 3 or Equation 4; variance *V*(κ̂) is estimated by the proposed bootstrap variance in Equation 5; the allele frequencies are estimated using Equation 7. Finally, the experimental variance can be estimated by calculating the mean square errors (Barratt *et al*. 2002) or using the restricted maximum-likelihood method (Downes *et al*. 2004) based on a hierarchical experimental design.
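The estimation and testing steps described above can be illustrated with a small sketch. It assumes the adjusted estimator takes the form *h*_{A}/(*h*_{A} + κ̂*h*_{a}) and treats the total variance of the frequency difference as a user-supplied number rather than reproducing the variance decomposition of Equation 9. The function names and numerical inputs are hypothetical.

```python
import math

def allele_freq(h_A, h_a, kappa=1.0):
    """Frequency of allele A from pooled peak intensities.

    kappa=1.0 gives the unadjusted proportion; any other value rescales the
    intensity of allele a before the proportion is taken.
    """
    return h_A / (h_A + kappa * h_a)

def pooling_test(p_case, p_control, variance):
    """Return (statistic, p_value) for a 1-d.f. chi-square test.

    `variance` is the total variance of (p_case - p_control); it is an input
    here, not computed from its components.
    """
    stat = (p_case - p_control) ** 2 / variance
    # Upper tail of chi-square with 1 d.f.: P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical pooled readings and CPA estimate.
kappa_hat = 1.5
p_case = allele_freq(60.0, 40.0, kappa_hat)      # adjusted case frequency
p_control = allele_freq(50.0, 50.0, kappa_hat)   # adjusted control frequency
stat, p_value = pooling_test(p_case, p_control, variance=0.0012)
```

The tail probability uses the identity between the 1-d.f. chi-square distribution and a squared standard normal, so only the standard library is needed.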

Estimation of CPA affects both the denominator and the numerator of the test statistic in Equation 8 simultaneously. The impact of CPA on the denominator is explicit in Equation 9. CPA affects the numerator by way of the allele frequency estimates. On the basis of the adjusted allele frequency defined in Equation 7, the expected difference between the estimated allele frequencies in case and control groups is zero under the null hypothesis (no association). If the adjusted allele frequency in Equation 7 is replaced by the unadjusted allele frequency in Equation 6 when constructing the test statistic, the zero expectation may not hold under the null hypothesis. By algebraic manipulation, we derived the discrepancy between case and control groups with regard to CPA-adjusted and unadjusted allele frequencies (*i.e.*, the difference between the CPA-adjusted case and control group allele frequencies minus the difference between the unadjusted case and control group allele frequencies). The numerator of this discrepancy vanishes in three cases, in each of which the adjustment has no effect: (1) no preferential amplification, (2) no difference in allele frequency between case and control groups, and (3) the sum of the case group allele frequency with (without) adjustment and the control group allele frequency without (with) adjustment is equal to one.

### Sample size in the pilot study:

In the pilot study, the peak intensities of heterozygous individuals are needed to estimate the CPA. An immediate question is how many heterozygous individuals are required to obtain a precise estimate of CPA. Proceeding in the context of confidence intervals, we calculate the sample size under risk α and a specified absolute error ξ as given in Equation 10. Equation 10 is derived on the basis of Hoogendoorn's method, and the details are shown in appendix B. From our simulation study (discussed below), we find that our proposed methods give a lower standard error than Hoogendoorn's method. Hence, the sample size in Equation 10 is regarded as an upper bound for our proposed estimators. Under α = 0.05 and ξ = 0.05, 0.075, and 0.10, the relationships between the sample size and different parameters are shown in Figure 1. The results show that the sample size increases with CV_{*r*} and decreases as ξ increases. The sample size curve is symmetric about an allele frequency of 0.5, where it also reaches its maximum.
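A rough delta-method calculation reproduces the qualitative behaviour described above (growth with CV_{*r*}, shrinkage with ξ, symmetry and a maximum at an allele frequency of 0.5). The sketch below is an approximation under stated assumptions and may differ in detail from Equation 10: it assumes the adjusted estimator *p* = *h*_{A}/(*h*_{A} + κ̂*h*_{a}), takes κ̂ as the arithmetic mean of *n* independent ratios with coefficient of variation `cv_r`, so that Var(*p̂*) ≈ [*p*(1 − *p*)]² CV_{*r*}²/*n*, and uses a normal approximation. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def pilot_sample_size(p, cv_r, xi, alpha=0.05):
    """Approximate number of heterozygous individuals needed so that the
    CPA-induced error in the allele frequency stays within xi with
    probability 1 - alpha.

    Delta-method sketch: n >= (z * p * (1 - p) * cv_r / xi) ** 2,
    where z is the upper alpha/2 standard normal quantile.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    n = (z * p * (1.0 - p) * cv_r / xi) ** 2
    return math.ceil(n)

# Hypothetical design point: allele frequency 0.5, CV of the ratio 0.3.
n_needed = pilot_sample_size(p=0.5, cv_r=0.3, xi=0.05)
```

Under these assumptions the factor *p*(1 − *p*) alone accounts for the symmetry and the peak at 0.5.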

The number of individuals that must be genotyped to obtain the required number of heterozygous individuals derived from Equation 10 must also be evaluated. This number depends on the design of the genotyping experiment, the genetic background of SNP markers, and population characteristics. In a sequential genotyping experiment, *n*_{total} is a random variable that follows a negative binomial distribution with success probability (probability of a heterozygote) *p*_{H} = *p*_{Aa}. However, in large-scale genotyping experiments, individuals are genotyped simultaneously, not sequentially. Under this circumstance, *n*_{total} is prespecified and *n*_{heter} is a random variable from a binomial distribution with success probability *p*_{H} = *p*_{Aa}.

The theory based on the assumption of individual homogeneity is sometimes too stringent. Heterogeneity among individuals may be due to various individual covariates or unobserved attributes. If this potential factor is ignored, the genotyping effort will be underestimated. A possible remedy is to assume a random-effect model. A beta distribution β(θ, τ) with mean θ/(θ + τ) and coefficient of variation (CV) {τ/[θ(θ + τ + 1)]}^{1/2} can be used to model random allele frequency (RAF) and random genotype frequency (RGF), respectively. The corresponding marginal distributions of the random variables *n*_{total} and *n*_{heter}, for the sequential genotyping experiment and the large-scale genotyping experiment, respectively, can be obtained by integrating over the beta distribution; the resulting expressions involve the conventional beta function *B*(·,·). A detailed derivation is presented in appendix C. The relationships between sample size and the observation probability under different genotyping strategies are shown in Figures 2 and 3, respectively. Only the results based on genotype frequency are shown. If hyperparameters in the beta distribution are unity, then the random-effect model reduces to the special case in which no individual heterogeneity exists.
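The probabilities discussed here are straightforward to compute. The following sketch covers the large-scale design with a fixed heterozygote probability (binomial) and with a beta-distributed genotype frequency, for which the marginal count follows the standard beta-binomial distribution. The same tail probability also answers the sequential question, because the *k*th heterozygote is found within *n* genotyped individuals exactly when at least *k* of the first *n* are heterozygous. Function names and parameter values are hypothetical, and only the random genotype frequency case is shown.

```python
import math

def prob_at_least_binomial(k, n_total, p_het):
    """P(n_heter >= k) when n_heter ~ Binomial(n_total, p_het)."""
    return sum(
        math.comb(n_total, i) * p_het**i * (1.0 - p_het) ** (n_total - i)
        for i in range(k, n_total + 1)
    )

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def prob_at_least_beta_binomial(k, n_total, theta, tau):
    """P(n_heter >= k) when the heterozygote probability is Beta(theta, tau).

    Uses the beta-binomial mass function
    C(n, i) * B(theta + i, tau + n - i) / B(theta, tau).
    """
    total = 0.0
    for i in range(k, n_total + 1):
        log_pmf = (
            math.log(math.comb(n_total, i))
            + log_beta(theta + i, tau + n_total - i)
            - log_beta(theta, tau)
        )
        total += math.exp(log_pmf)
    return total

# Chance of seeing at least 8 heterozygotes among 96 genotyped individuals.
p_fixed = prob_at_least_binomial(8, 96, 0.15)
# Same mean genotype frequency (3 / (3 + 17) = 0.15) with heterogeneity.
p_random = prob_at_least_beta_binomial(8, 96, theta=3.0, tau=17.0)
```

Comparing `p_fixed` with `p_random` shows how heterogeneity lowers the chance of reaching the target count for the same mean genotype frequency.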

Figure 2 shows the distribution of *n*_{total} with a genotype frequency range of 0.05–0.5 in increments of 0.05. The pattern reveals the positive correlation between *P*(*N* ≤ *n*_{total}) and *n*_{total}. A heterozygous genotype frequency of a SNP of >0.15 almost guarantees that eight heterozygous individuals can be observed after genotyping 96 individuals (Figure 2A); the probability of observing 16 heterozygous individuals is >0.8 (Figure 2B). Figure 2, C and D, presents the results under the condition of individual random effects. A lower *E*(*p*_{Aa}) corresponds to a less polymorphic case, and therefore many more individuals must be genotyped. The coefficient of variation CV(*p*_{Aa}) affects the curvature of the different lines. In general, a smaller CV(*p*_{Aa}) yields a steeper slope; in other words, the marginal increase in cumulative probability (corresponding to an increase in *n*_{total}) is larger when CV(*p*_{Aa}) is smaller. The required number of genotyped individuals can be obtained using a prespecified probability from these figures.

Figure 3 shows the probability distribution of *n*_{heter}. Figure 3, A and B, presents the cases for *n*_{total} = 48 and *n*_{total} = 96 in the absence of individual heterogeneity, and Figure 3, C and D, presents the case when individual heterogeneity is present. The genotype frequency ranges from 0.05 to 0.5 in increments of 0.05. In general, the correlation between *n*_{heter} and *P*(*N* ≥ *n*_{heter}) is negative. If 96 individuals are genotyped, Figure 3, B and D, shows that, even with individual heterogeneity, eight heterozygous individuals can be observed in most cases except for the nonpolymorphic one. A similar pattern is evident in Figure 3, A and C, for the case *n*_{total} = 48, but the probability *P*(*N* ≥ *n*_{heter}) decreases. The index of heterogeneity CV(*p*_{Aa}) affects the pattern of the probability of identifying heterozygous individuals. Figure 3, C and D, shows that the curves with a smaller CV(*p*_{Aa}) have a sharper reduction, whereas those with a larger CV(*p*_{Aa}) have a gentler slope.

In this section, we focused exclusively on sample size in the pilot stage of a DNA-pooling study. Discussions concerning sample size in the second stage can be found in Barratt *et al*. (2002).

## SIMULATION STUDY

### Procedures:

We carried out simulation studies to assess both the performance of different adjustment factors for estimating CPA in the first stage and the consequential impact on the pooling-based association test in the second stage.

For the first stage, the simulation procedures can be outlined as follows:

1. *Specify the simulation conditions:* The number of heterozygous individuals *n*_{heter} ranges from 8 to 40 in increments of 16, and the true CPA κ is set to 0.5, 1, or 2.
2. *Generate peak intensity data:* Because a gamma distribution can cover many different random patterns, we considered a bivariate gamma distribution of the peak intensities in the simulation study. The parameters for the bivariate gamma distribution were set to yield CVs of 0.1 or 0.3 for the peak intensities, and the correlation of the pair of peak intensities was 0.5.
3. *Estimate the adjustment factor and hyperparameters:* On the basis of the data from step 2, we calculated the adjustment factor κ̂^{(*s*)} and estimated the hyperparameters of the gamma distribution using the moment method, where the superscript *s* is the simulation index.
4. *Calculate bootstrap standard error:* Data bootstrapped *B* times from the empirical gamma distribution Γ(μ̂_{A}, μ̂_{a}, σ̂^{2}_{A}, σ̂^{2}_{a}, ρ̂) were used to estimate the CPA. The bootstrap standard error of the adjustment factor was obtained by calculating the sample standard deviation over the *B* estimates, as in Equation 5.
5. *Calculate the estimation bias, standard error, and root mean square error:* Steps 2–4 were repeated for *S* simulation replications, and the estimation bias and standard error were calculated over the *S* replications. The root mean square error (RMSE) was RMSE = (BIAS^{2} + SE^{2})^{1/2}.
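The first-stage procedure can be sketched as follows. Several choices here are assumptions made for illustration rather than details taken from the study: the bivariate gamma is built from a shared gamma component (one of several possible constructions), the standard error is taken as the standard deviation of the estimates across simulation replicates rather than a bootstrap standard error, and all function names and seeds are hypothetical.

```python
import math
import random

def simulate_pairs(n, kappa, cv, rho, rng, mu_a=1.0):
    """Draw n correlated gamma peak-intensity pairs with E[h_A]/E[h_a] = kappa.

    Each intensity is the sum of a shared and an independent gamma variate,
    which gives both margins CV = cv and correlation rho (0 < rho < 1).
    """
    shape = 1.0 / cv**2          # a gamma variate has CV = 1/sqrt(shape)
    shared = rho * shape         # shape of the component common to both peaks
    own = shape - shared         # shape of each peak's independent component
    pairs = []
    for _ in range(n):
        g0 = rng.gammavariate(shared, 1.0)
        x = g0 + rng.gammavariate(own, 1.0)
        y = g0 + rng.gammavariate(own, 1.0)
        pairs.append((kappa * mu_a * x / shape, mu_a * y / shape))
    return pairs

def arithmetic_mean_ratio(pairs):
    return sum(a / b for a, b in pairs) / len(pairs)

def geometric_mean_ratio(pairs):
    return math.exp(sum(math.log(a / b) for a, b in pairs) / len(pairs))

def summarize(estimator, n=8, kappa=2.0, cv=0.3, rho=0.5, n_sim=200, seed=1):
    """Bias, SE, and RMSE of an estimator over n_sim simulated pilot studies."""
    rng = random.Random(seed)
    est = [estimator(simulate_pairs(n, kappa, cv, rho, rng)) for _ in range(n_sim)]
    mean = sum(est) / n_sim
    bias = mean - kappa
    se = math.sqrt(sum((e - mean) ** 2 for e in est) / (n_sim - 1))
    return bias, se, math.sqrt(bias**2 + se**2)

bias_H, se_H, rmse_H = summarize(arithmetic_mean_ratio)
bias_G, se_G, rmse_G = summarize(geometric_mean_ratio)
```

Because both calls use the same seed, the two estimators are evaluated on identical simulated data, so differences in the summaries reflect the estimators rather than sampling noise.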

In the second stage, we simulated case-control association tests using the following procedures:

1. *Specify the simulation conditions:* The sample sizes were *n*_{case} = *n*_{control} = 500. The population allele frequencies in case and control groups were (*p*_{case} = 0.25, *p*_{control} = 0.25) and (*p*_{case} = 0.25, *p*_{control} = 0.15) for the calculation of type I error and power, respectively. The standard deviation of experimental error was set to σ_{E} = 0.02 or σ_{E} = 0.05.
2. *Generate allele frequency:* The sample frequencies in the case group were generated from a normal distribution *N*(*p*_{case}, Var(*p̂*_{case}|κ̂)), and a similar approach was applied to the control group.
3. *Summarize the test results:* On the basis of the simulated data, we calculated the type I error when the case and control groups had the same true allele frequency; we calculated the power when the groups had different true allele frequencies.
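The second-stage procedure can be sketched in the same spirit. This is a simplified illustration, not the study's simulation: each group's variance is taken as binomial sampling variance on 2*n* alleles plus one experimental-error term, the variance contribution from estimating κ is omitted, the true variances are used in the test, and the number of replications is larger than in the study to stabilize the rates. The function name, seed, and defaults are hypothetical.

```python
import math
import random

CHI2_1DF_95 = 3.841458820694124  # 0.95 quantile of chi-square with 1 d.f.

def simulate_rejection_rate(p_case, p_control, n_case=500, n_control=500,
                            sigma_e=0.02, n_sim=2000, seed=1):
    """Fraction of simulated pooled studies in which the 1-d.f. test rejects.

    Simplified variance per group: p(1 - p) / (2n) + sigma_e**2.
    The CPA-estimation variance term is deliberately left out.
    """
    rng = random.Random(seed)
    var_case = p_case * (1.0 - p_case) / (2 * n_case) + sigma_e**2
    var_control = p_control * (1.0 - p_control) / (2 * n_control) + sigma_e**2
    rejections = 0
    for _ in range(n_sim):
        est_case = rng.gauss(p_case, math.sqrt(var_case))
        est_control = rng.gauss(p_control, math.sqrt(var_control))
        stat = (est_case - est_control) ** 2 / (var_case + var_control)
        rejections += stat > CHI2_1DF_95
    return rejections / n_sim

type1 = simulate_rejection_rate(0.25, 0.25)                   # null: equal frequencies
power_low_noise = simulate_rejection_rate(0.25, 0.15)         # sigma_E = 0.02
power_high_noise = simulate_rejection_rate(0.25, 0.15, sigma_e=0.05)
```

Even in this stripped-down form, raising `sigma_e` inflates the denominator of the statistic and lowers the rejection rate under the alternative, which is the mechanism behind the loss of power at larger experimental error.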

A total of 12 simulation conditions were considered and were arranged in the following order: (σ_{E}, CV, *n*_{heter}) = {(0.02, 0.1, 8), (0.02, 0.1, 24), (0.02, 0.1, 40), (0.02, 0.3, 8), (0.02, 0.3, 24), (0.02, 0.3, 40), (0.05, 0.1, 8), (0.05, 0.1, 24), (0.05, 0.1, 40), (0.05, 0.3, 8), (0.05, 0.3, 24), (0.05, 0.3, 40)}. All simulations were carried out using 200 simulation replications and 500 bootstrap replications.

### Results:

In the simulation studies, we explored the impact of several elements on the adjustment factor and association test. These elements included the structure of peak intensity, degree of preferential amplification, experimental error, and sample size. The simulation results of κ = 0.5, κ = 1, and κ = 2 are summarized in Tables 1, 2, and 3, respectively.

With regard to the performance of adjustment factors, we found several meaningful patterns. The estimated CPA is affected by the CV of the peak intensities. As CV increases, *i.e.*, in the case of large variability of peak intensities, there is a concomitant rise in the estimation bias, standard error, and RMSE of κ̂. Relative to CV, the mean or variance of the peak intensities alone is insufficient to explain the changes in performance of adjustment factors. As more heterozygous individuals were collected in the pilot study, the standard error and RMSE of κ̂ were reduced; however, the effect on the bias of κ̂ was not obvious.

Regarding the performance of the association test, we found that an increase in experimental variation dramatically reduced the statistical power of the association test. Generally speaking, as σ_{E} increases from 0.02 to 0.05, the power decreases from 0.85 to 0.30. Moreover, the type I error correlates positively with the CV of the peak intensities. In most cases, the type I error range was 0.03–0.07. Although larger numbers of heterozygous individuals are useful to reduce the RMSE of the adjustment factor, the efficacy of the association test depends on the degree of experimental error. When σ_{E} = 0.02, the reduction in the variation of κ̂ improves the power; when σ_{E} = 0.05, the improvement in the adjustment is neutralized by the increase in the experimental error. The same idea applies to the CV of peak intensity.

In general, the proposed adjustment measures yielded better performance than Hoogendoorn's method with respect to the estimation of κ and the association test. In all simulation trials, we found that the two proposed adjustment factors yielded a smaller bias, standard error, and RMSE compared with Hoogendoorn's method. Given a prespecified test size, κ̂_{U} yielded the highest power among the three adjustment methods with regard to relatively small experimental errors (σ_{E} = 0.02); κ̂_{G} yielded the highest power in cases of larger experimental errors (σ_{E} = 0.05).

## ANALYSIS OF A LABORATORY EXAMPLE

We conducted a DNA-pooling study to illustrate the efficacy of the proposed adjustment methods and to facilitate a comparison with Hoogendoorn's method (Hoogendoorn *et al*. 2000). Using normal control samples that we collected previously, 95 individuals were randomly chosen and genotyped individually at six SNPs. The peak intensities of heterozygous individuals were used to calculate the various estimates of CPA. Later, the genomic DNA of 30 individuals randomly selected from the 95 individuals was pooled to estimate allele frequency. The true minor allele frequency of the 30-individual population was obtained on the basis of individual genotyping data using the allele-counting approach. The results are shown in Table 4 (column 2).

In the DNA-pooling experiment, each individual's genomic DNA was diluted to 12.5 ng/μl and quantified using the PicoGreen assay (Molecular Probes, Eugene, OR). Equimolar amounts of genomic DNA from the 30 individuals were then pooled. PCR amplification and primer extension reactions were performed using an ABI 9700 system (AME Bioscience, Towaco, NJ). The peak intensities of alleles were measured using a MALDI-TOF spectrometer (Sequenom) based on wavelet technology. Hence, the unadjusted allele frequencies for the six SNPs could be estimated on the basis of Equation 6, and the results are shown in Table 4 (column 3).

Hoogendoorn's adjustment and our two proposed methods were applied to this data set. For the six SNPs, the estimates of CPA were as follows: based on Hoogendoorn's method, 1.259, 0.765, 0.662, 0.873, 1.771, and 2.288; based on the geometric-mean method, 1.252, 0.687, 0.634, 0.859, 1.749, and 2.265; and based on the bias-reduction method, 1.221, 0.674, 0.632, 0.851, 1.788, and 2.320. The adjusted allele-frequency estimates, based on Equation 7, are shown in Table 4 in columns 4 (Hoogendoorn's method), 5 (geometric-mean method), and 6 (bias-reduction method).

To summarize the findings in this analysis of laboratory data, we found that it is essential to adjust for preferential amplification. This adjustment reduced the estimation bias of the allele frequencies except for the second SNP in this data set. In this case, the serious underestimate of allele frequency might have arisen from uncontrollable experimental variations, such as overestimation of the extended primer or an effect of DNA quality on SNP variance (Werner *et al*. 2002). For this specific SNP, the adjustment procedure yielded only limited improvement.

In most cases in our study, Hoogendoorn's method reduced the discrepancy between the allele frequencies estimated from the individual genotyping and pooling experiments. Our proposed adjustments reduced the error further than did Hoogendoorn's method. Moreover, our proposed methods yielded a smaller variation in the CPA compared with Hoogendoorn's method, and in turn our methods gave a smaller variation in allele frequency estimation. Overall, our results demonstrate that the proposed adjustments provide a more accurate and reliable estimation of allele frequency for this data set.

## DISCUSSION

Preferential amplification of nucleotides occurs frequently in DNA-pooling studies. Therefore, it is critical to adjust for this imbalance between the two nucleotides of the same SNP so as to avoid a severe bias in the allele frequency estimation and a consequent reduction in the power of the association test. In this work, we propose two adjustment methods that improve on Hoogendoorn's adjustment (Hoogendoorn *et al*. 2000). Their performance was evaluated by simulation studies. The new methods yield not only lower bias, standard error, and RMSE in allele frequency estimation, but also better statistical power for genetic association mapping under the given test size.

In our method, type I error is usually controlled well except when the CV of the peak intensity is high and κ > 1. Also, the performance of κ̂ is apparently not symmetric about κ = 1. In general, a smaller CPA yields a correspondingly smaller RMSE. Regarding the instances of κ = 2 and κ = 0.5, the latter gives better performance with respect to RMSE, power, and type I error. Hence, we suggest that the preferentially amplified allele be placed in the denominator when calculating the adjustment factor (see Equations 3 and 4).

In addition to our two new proposed adjustment factors, we also investigated several other methods, including the median-based measure, harmonic-mean-based measure, and some modified ratio estimators (Beale 1962; Tin 1965). Although some of these methods yielded a better estimate of CPA than Hoogendoorn's method did during the simulation study (Yang *et al*. 2003), they are not superior to the adjustment factors proposed in this article.

We investigated the role of sample size during the pilot stage of the DNA-pooling study. The use of a large number of heterozygous individuals reduces the RMSE of the adjustment factor. However, 24 heterozygous individuals should be sufficient to obtain a good adjustment factor. To explore the sample size requirement, we considered both fixed-effect and random-effect models for different scenarios. In general, random-effect models yield a larger sample size than fixed-effect models when the same set of parameters is used. For example, given that the genotype frequency in the population is 0.45 and 8 heterozygous individuals are required, 21 individuals need to be genotyped under the fixed-effect model [*i.e.*, CV(*p*_{Aa}) = 0]; 24 and 34 individuals need to be genotyped under CV(*p*_{Aa}) = 0.25 and 0.5, respectively, in the random-effect model. This is because the random-effect models take into consideration variations among individuals, resulting in an increase in total variation that requires a larger sample size to compensate. Ignoring the heterogeneity may lead to a serious underestimation of sample sizes.

DNA-pooling studies can achieve the objective of screening for important genetic variants at a reasonable cost. However, the unavoidable drawback is that genotypic information and individual features are lost once genomic DNA is mixed. Although some advanced studies have attempted to reconstruct the lost information (Ito *et al*. 2003), a bottleneck still exists due to the mass of genotype and haplotype combinations in the pool. Stringent limitations must be satisfied for small pools to reduce the number of combinatorial calculations. To date, DNA-pooling experiments have been used primarily as a screening technique rather than as a legitimate replacement for individual genotyping studies.

The use of DNA pooling is a potentially cost-effective alternative to individual genotyping. Association testing in DNA pools is an efficient method for screening important genetic markers and has been applied successfully in many studies, yielding significant findings (Carmi *et al*. 1995; Bansal *et al*. 2002). Our new approaches provide valid adjustments that improve on the conventional method, leading to more reliable allele frequency estimation and more powerful association tests. These advantages greatly enhance the applicability of the DNA-pooling experiment.

## APPENDIX A:

### PROPERTIES OF THE PROPOSED ADJUSTMENT FACTOR

Suppose that the population size of heterozygous individuals is *N*_{heter} and *n*_{heter} samples are randomly drawn from the population. Let [definition missing]. Hence, the bias of Hoogendoorn's measure can be calculated using the following: [equation missing]. We used the estimated bias to correct the bias in Hoogendoorn's method. This estimator assumes an infinite population, in contrast to the finite-population setting considered by Hartley and Ross (1954).
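The kind of bias at issue can be illustrated with the classical identity behind the Hartley and Ross (1954) argument. The following is only a sketch under stated assumptions: that Hoogendoorn's measure is the mean of the individual peak-intensity ratios *k* = *h*_{A}/*h*_{a} among heterozygotes, and that the target quantity is the ratio of the mean intensities. Neither the notation nor the choice of target is taken from the appendix itself.

```latex
\begin{align*}
E[h_{A}] &= E[k\,h_{a}] = E[k]\,E[h_{a}] + \operatorname{Cov}(k, h_{a}),\\
E[k] - \frac{E[h_{A}]}{E[h_{a}]} &= -\,\frac{\operatorname{Cov}(k, h_{a})}{E[h_{a}]}.
\end{align*}
```

Under these assumptions, the mean of ratios is unbiased for the ratio of means only when the individual ratio is uncorrelated with the denominator intensity, and a sample estimate of the covariance term supplies a bias correction.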

## APPENDIX B:

### SAMPLE SIZE CALCULATION IN A PILOT STUDY

Suppose that the estimated allele frequency *p̂* is a differentiable function of κ̂ and *h*^{P}_{A}/*h*^{P}_{a} at a point (κ, *H*_{A}/*H*_{a}). By a first-order Taylor expansion in the two variables κ̂ and *h*^{P}_{A}/*h*^{P}_{a} at (κ, *H*_{A}/*H*_{a}), we know that [equation missing], where [definition missing] and the remainder *R*_{1} satisfies *R*_{1}/||η|| → 0 as ||η|| → 0. The approximate mean and variance can be calculated as [equation missing] and [equation missing], where [definition missing] and [definition missing]. The pure variation due to the adjustment of preferential amplification can be obtained by setting [condition missing], resulting in the following variance: [equation missing] (B1). If κ̂ = κ, then we obtain the pure variance due to the measurement of the ratio of the peak intensities, [equation missing].
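As a hedged illustration of the expansion described in this paragraph, suppose the pooled estimator takes the commonly used form *p̂* = *r*/(*r* + κ̂), where *r* denotes the pooled peak-intensity ratio. This functional form and the shorthand *r* and *R* are assumptions introduced here for illustration. A first-order expansion then gives:

```latex
\begin{align*}
\hat p(\hat\kappa, r) &= \frac{r}{r+\hat\kappa},
  \qquad r = h^{P}_{A}/h^{P}_{a}, \quad R = H_{A}/H_{a},\\
\hat p(\hat\kappa, r) &\approx \frac{R}{R+\kappa}
  - \frac{R}{(R+\kappa)^{2}}\,(\hat\kappa-\kappa)
  + \frac{\kappa}{(R+\kappa)^{2}}\,(r-R),\\
\operatorname{Var}(\hat p) &\approx
  \frac{R^{2}}{(R+\kappa)^{4}}\operatorname{Var}(\hat\kappa)
  + \frac{\kappa^{2}}{(R+\kappa)^{4}}\operatorname{Var}(r)
  - \frac{2R\kappa}{(R+\kappa)^{4}}\operatorname{Cov}(\hat\kappa, r).
\end{align*}
```

Under this assumed form, fixing *r* = *R* leaves only the first variance term, which is the contribution of the adjustment factor, and fixing κ̂ = κ leaves only the second, which is the contribution of the measured peak-intensity ratio.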

To calculate the required sample size, the variance in Equation B1 must be rewritten as a function of *n*_{heter}. If the adjustment of Hoogendoorn *et al*. (2000) is applied, then the variance in Equation B1 can be represented as [equation missing], where [definition missing] and [definition missing], *j* = 1, … , *n*_{heter}. Under risk α and the specified absolute error ξ, the required number of heterozygous individuals is [equation missing], where *P*{|*p*(κ̂) − *p*| < ξ} ≥ 1 − α.
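A minimal numerical sketch of this type of calculation follows. It rests on several assumptions that are not taken from the appendix: the estimator has the form *p̂* = *r*/(*r* + κ̂), the adjustment factor is the mean of *n*_{heter} independent individual ratios with common standard deviation `sigma_k`, and a normal approximation is used for the coverage requirement. The function name and all parameter names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_n_heter(kappa, ratio, sigma_k, xi, alpha=0.05):
    """Approximate number of heterozygotes so that P(|p_hat - p| < xi) >= 1 - alpha.

    Assumes p_hat = ratio / (ratio + kappa_hat) and Var(kappa_hat) = sigma_k**2 / n,
    so that Var(p_hat) is roughly (ratio / (ratio + kappa)**2)**2 * sigma_k**2 / n.
    """
    if min(kappa, ratio, sigma_k, xi) <= 0 or not 0.0 < alpha < 1.0:
        raise ValueError("kappa, ratio, sigma_k, xi must be positive; 0 < alpha < 1")
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # two-sided normal quantile
    sensitivity = ratio / (ratio + kappa) ** 2     # |d p_hat / d kappa| at the truth
    return ceil((z * sensitivity * sigma_k / xi) ** 2)


if __name__ == "__main__":
    # Illustrative inputs only: equal amplification, allele frequency 0.5.
    print(required_n_heter(kappa=1.0, ratio=1.0, sigma_k=0.2, xi=0.01))
```

The required number scales with the inverse square of the tolerated error ξ, so halving ξ roughly quadruples the number of heterozygotes needed under these assumptions.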

## APPENDIX C:

### THE MARGINAL DISTRIBUTIONS OF SAMPLE SIZES UNDER DIFFERENT GENOTYPING EXPERIMENTS

First, under sequential genotyping experiments, *n*_{total} is a random variable that follows a negative binomial distribution with success probability *p*_{H} (the probability of a heterozygote). Under the RAF model, we assume that the allele frequency *p*_{A} follows a beta distribution beta(θ, τ). Hence, the marginal distribution of *n*_{total} can be derived as follows: [equation missing]. Under the RGF model, we assume that the genotype frequency *p*_{Aa} follows a beta distribution beta(θ, τ). The marginal distribution of *n*_{total} is [equation missing].

Second, under large-scale genotyping experiments, *n*_{heter} is a random variable that follows a binomial distribution with success probability *p*_{H}. Under the beta-binomial random allele frequency model, the marginal distribution of *n*_{heter} is [equation missing]. Under the beta-binomial random genotype frequency model, the marginal distribution of *n*_{heter} is [equation missing].
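For the random genotype frequency case, the two marginals reduce to standard beta mixtures. The following is a sketch under two assumptions: that the negative binomial counts the total number of individuals genotyped until *n*_{heter} heterozygotes are found, and that the heterozygote probability is itself the beta(θ, τ) variable. The notation *B*(·, ·) for the beta function is introduced here. The random allele frequency case additionally requires an expression for the heterozygote probability in terms of the allele frequency and is not covered by this sketch.

```latex
\begin{align*}
P(n_{total} = n)
 &= \int_{0}^{1} \binom{n-1}{n_{heter}-1}\, p^{\,n_{heter}} (1-p)^{\,n-n_{heter}}\,
    \frac{p^{\theta-1}(1-p)^{\tau-1}}{B(\theta,\tau)}\, dp \\
 &= \binom{n-1}{n_{heter}-1}\,
    \frac{B(\theta+n_{heter},\ \tau+n-n_{heter})}{B(\theta,\tau)},
    \qquad n = n_{heter},\, n_{heter}+1,\, \ldots \\[4pt]
P(n_{heter} = k)
 &= \binom{n_{total}}{k}\,
    \frac{B(\theta+k,\ \tau+n_{total}-k)}{B(\theta,\tau)},
    \qquad k = 0, 1, \ldots, n_{total}.
\end{align*}
```

The first line is the beta-negative-binomial law for the sequential design and the last is the beta-binomial law for the large-scale design; both follow from integrating the conditional probability against the beta density.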

## Acknowledgments

We thank Jer-Yuan Wu and the National Genotyping Center and National Clinical Core at Academia Sinica for genotyping support. We appreciate the two anonymous reviewers for providing insightful suggestions and comments, which have greatly enhanced the presentation of this article. This research was supported by a National Science Council grant (NSC 92-3112-B-001-014) and an Academia Sinica grant (93IBMS2PP-C) of Taiwan.

## Footnotes

Communicating editor: R. W. Doerge

- Received June 4, 2004.
- Accepted September 20, 2004.

- Genetics Society of America