## Abstract

Using striped bass (*Morone saxatilis*) and six multiplexed microsatellite markers, we evaluated procedures for estimating allele frequencies by pooling DNA from multiple individuals, a method suggested as cost-effective relative to individual genotyping. Using moment-based estimators, we estimated allele frequencies in experimental DNA pools and found that the three primary laboratory steps, DNA quantitation and pooling, PCR amplification, and electrophoresis, accounted for 23, 48, and 29%, respectively, of the technical variance of estimates in pools containing DNA from 2–24 individuals. Exact allele-frequency estimates could be made for pools of sizes 2–8, depending on the locus, by using an integer-valued estimator. Larger pools of size 12 and 24 tended to yield biased estimates; however, replicates of these estimates detected allele frequency differences among pools with different allelic compositions. We also derive an unbiased estimator of Hardy–Weinberg disequilibrium coefficients that uses multiple DNA pools and analyze the cost-efficiency of DNA pooling. DNA pooling yields the most potential cost savings when a large number of loci are employed using a large number of individuals, a situation becoming increasingly common as microsatellite loci are developed in increasing numbers of taxa.

MICROSATELLITE loci are important in population and quantitative genetic analyses, and the number of known loci is increasing for a wide variety of taxa (Zane *et al.* 2002). Increasing the number of loci in a study typically improves the performance of statistical procedures (*e.g.*, Weir and Cockerham 1984; Waples 1989), and genomewide analyses of microsatellite loci will require many loci (*e.g.*, Lee *et al.* 2004). However, increasing the number of loci also increases laboratory costs, sometimes prohibitively. Accordingly, DNA pooling has been proposed as a means of reducing laboratory costs in population and quantitative genetic studies. Furthermore, DNA pools can also arise naturally in polyploid organisms, microbial samples, and forensic analyses.

Laboratory DNA pooling is carried out by combining DNA from two or more individuals prior to implementing the polymerase chain reaction (PCR) and electrophoresis. The resulting data typically take the form of some measure of copy number (*e.g.*, fluorescence intensity) for each fragment size (allele) in the DNA pool. Quantitative genetic studies in human and agricultural systems have sought to find an association between a trait and a marker by pooling DNA from individuals that share a common phenotype and/or ancestry, a technique sometimes called “bulked segregant analysis.” Methods for comparing microsatellite DNA pools among experimental populations vary from more qualitative approaches such as counting alleles (Hillel *et al.* 2003) or comparing measures of relative fluorescence intensity (Daniels *et al.* 1998; Moritz *et al.* 2003; Lee *et al.* 2004) to more quantitative approaches such as attempting to estimate allele frequencies within a pool (Perlin *et al.* 1995; Barcellos *et al.* 1997; Band and Ron 1998; Lipkin *et al.* 1998, 2002; Ruyter-Spira *et al.* 1998; Kirov *et al.* 2000; Mosig *et al.* 2001; Schnack *et al.* 2004). More generally, DNA pooling might be applied in any genetic study relying on allele frequency estimation, including marker screening and association mapping, analyses of population differentiation, and analyses of effective population size.

The appeal of DNA pooling relative to individual genotyping arises from its potential for cost savings when evaluating many loci. The cost of genotyping *n* individuals is

$$C_{G} = n\,[c_{I} + l\,(c_{A} + c_{E})],$$

where *l* is the number of independent PCRs (*e.g.*, loci, or multiplexed sets), and *c*_{I}, *c*_{A}, and *c*_{E} are the unit costs of DNA isolation, PCR amplification, and electrophoresis, respectively. The cost of estimating allele frequencies in a single pool of *m* individuals is

$$C_{P} = m\,(c_{I} + c_{Q}) + l\,(c_{A} + c_{E}),$$

where *c*_{Q} is the unit cost of DNA quantitation and pooling. DNA pooling incurs the cost of DNA quantitation and pooling, but saves on PCR amplification and electrophoresis since a pool can be used repeatedly for different multiplexes (*e.g.*, Kirov *et al.* 2000; Schnack *et al.* 2004). If the same number of individuals is used in a single pool so that *m* = *n*, then the percentage of savings from DNA pooling is

$$100\left(1 - \frac{C_{P}}{C_{G}}\right)\% = 100\left[1 - \frac{n\,(c_{I} + c_{Q}) + l\,(c_{A} + c_{E})}{n\,[c_{I} + l\,(c_{A} + c_{E})]}\right]\%.$$

As the number of loci employed increases, the percentage of savings approaches 100(1 − 1/*n*)%. Hence, in principle, pools of 2 individuals can yield cost savings of 50% if a sufficiently large number of loci are employed. Pools of 100 individuals can potentially result in 99% cost savings given a sufficient number of loci. However, for a particular situation, the cost savings will depend on several details, such as the unit costs, the size of the pools employed, and the statistical properties of allele-frequency estimates based on these DNA pools.
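The limiting behavior of the savings can be checked numerically. The following is a minimal sketch, assuming the cost structure described in the text (isolation and quantitation costs charged per individual, amplification and electrophoresis costs charged per reaction). The function names and the unit costs in the example are hypothetical and are not the authors' values.

```python
def genotyping_cost(n, l, c_i, c_a, c_e):
    """Cost of individually genotyping n individuals at l independent PCRs."""
    return n * (c_i + l * (c_a + c_e))


def pool_cost(m, l, c_i, c_q, c_a, c_e):
    """Cost of one pool of m individuals typed at l independent PCRs."""
    return m * (c_i + c_q) + l * (c_a + c_e)


def percent_savings(n, l, c_i, c_q, c_a, c_e):
    """Percentage saved by replacing n genotypes with one pool of m = n."""
    pooled = pool_cost(n, l, c_i, c_q, c_a, c_e)
    individual = genotyping_cost(n, l, c_i, c_a, c_e)
    return 100.0 * (1.0 - pooled / individual)


if __name__ == "__main__":
    # Hypothetical relative unit costs, chosen only for illustration.
    costs = dict(c_i=1.0, c_q=0.5, c_a=1.0, c_e=1.0)
    for l in (1, 10, 100, 10_000):
        print(l, round(percent_savings(n=2, l=l, **costs), 2))
```

With these illustrative costs the savings for pools of 2 rise toward 50% as the number of PCRs grows, in line with the limit 100(1 − 1/*n*)%.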

While previous studies in human and agricultural systems have demonstrated applications of DNA pooling, the statistical features of allele-frequency estimates based on DNA pools are not well characterized. Understanding the statistical consequences of DNA pooling is a prerequisite to assessing its methodological potential and cost-effectiveness. Accordingly, we extend previous work by conducting a case study using six loci in two three-locus multiplexes in striped bass (*Morone saxatilis*). In particular, we (i) derive two kinds of allele frequency estimators based on DNA pools; (ii) estimate the variance components of allele-frequency estimates associated with the three primary laboratory steps, DNA quantitation and pooling, PCR amplification, and electrophoresis; (iii) evaluate allele-frequency estimates over a range of pool sizes and types and two different protocols; (iv) derive an estimator of the Hardy–Weinberg disequilibrium coefficient using DNA pools; and (v) use our estimates of technical variance as a function of pool size in a cost analysis of DNA pooling relative to individual genotyping.

## MATERIALS AND METHODS

#### Pooling experiments:

We conducted two experiments (experiments 1 and 2) to evaluate microsatellite allele frequency estimation using DNA pooling with six striped bass loci in two three-locus multiplexes [multiplex 1, *SB91* (Roy *et al.* 2000), *SB6* (García de León *et al.* 1995), and *SB108* (I. Wirgin, personal communication); multiplex 2, *AT150-2#4*, *AG25-1#1* (Brown *et al.* 2003), and *MSM1067* (Couch *et al.* 2006)]. To explore different allele frequency distributions, we used DNA isolated from 12 individuals (appendix a). Each locus had three to six alleles and allele size ranges varied from 8 to 44 bp (Figure 1, appendix a).

In experiment 1, we estimated the variance components of allele-frequency estimates attributable to the three laboratory steps used for DNA pooling: (i) DNA quantitation and pooling, (ii) PCR amplification, and (iii) electrophoresis. For multiplex 1, individuals 1–6 were used to construct pools of sizes 2, 3, 6, 12, and 24 having, respectively, 3, 2, 1, 3, and 3 different allelic compositions (12 total different pool types; appendix a). Individuals 7–12 were used to construct the same kinds of pools for multiplex 2 (appendix a). When necessary, different pool types used DNA from the same individual more than once. To estimate variance components, each pool type was created in triplicate, each pool was PCR amplified in duplicate, and each amplification was electrophoresed in duplicate for a total of 12 observations per pool type. In experiment 1, the total DNA concentration in the pools was diluted to 1 ng/μl (∼1000 haploid genomic copies; 1 pg of DNA contains ∼1 haploid copy of the striped bass genome; Hardie and Hebert 2004), 30 cycles of PCR were performed, and DNA samples were pooled on the basis of one replicate quantitation.

Motivated by our results from experiment 1, we conducted a two-factor PCR experiment to examine the roles of initial amounts of DNA and the number of PCR cycles in determining patterns of fluorescence intensity. Using multiplex 1 with DNA from individual 1, the initial DNA amount was cross-classified with the number of PCR cycles (24 and 26 cycles with 2.5, 5, and 10 ng of initial DNA and 28 and 30 cycles with 1.25, 2.5, and 5 ng of initial DNA using eight replicates of each treatment combination).

In experiment 2, we modified our procedures on the basis of the results in experiment 1 and the PCR experiment and created pools of sizes 4, 6, and 8 having 3, 2, and 3 different allelic compositions, respectively (appendix a). Each pool type was replicated four or six times (for a total of 12 replicates per pool size), PCR amplified, and electrophoresed. In experiment 2, 24 cycles of PCR were performed, the DNA concentration in pools was diluted to contain 1 ng/μl (1000 haploid genomic copies) of DNA per individual, and DNA samples were pooled on the basis of three replicate quantitations.

#### DNA isolation, quantitation, and pool construction:

DNA was isolated from finclips (sampled from a captive striped bass family and stored for 2 years in 70% ethanol) using a modified commercial protocol (PUREGENE DNA purification kit, RNase A step omitted; Gentra Systems, Research Triangle Park, NC), hydrated in TE buffer (10 mM Tris, 0.1 mM EDTA, pH 8.0), and quantified using the PicoGreen assay (Molecular Probes, Eugene, OR) in a Turner Quantech fluorometer (Barnstead Thermolyne, Dubuque, IA). We isolated an average of 1291 ± 54.13 ng (mean ± SE) of DNA per milligram of preserved fin tissue. Prior to pool construction, samples were diluted to a DNA concentration of ∼10 ng/μl. DNA concentration was then estimated as the mean of three to eight replicate quantitations for each sample from our group of 12 individuals (the relative standard error was 1–1.5%; all standard curves had *r*^{2} > 0.99). Different pool types were constructed using the genotypes of these 12 individuals (appendix a). The empirical distribution of quantitation residuals (∼40 residuals) was used to model the laboratory error introduced by quantitation (residual equals the estimated DNA concentration from one quantitation of a sample minus the mean DNA concentration calculated over replicate quantitations of the same sample). Simulated quantitations were constructed by adding randomly drawn residuals to the mean estimated DNA concentration for each individual, and aliquot volumes based on these simulated quantitations were used to construct pools. Separate sets of residuals were generated for multiplexes 1 and 2 in experiments 1 and 2. This procedure allowed us to simulate the laboratory process of DNA pooling using a minimum of DNA isolations and quantitations.
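The resampling step described above can be sketched as follows. This is an illustration only: the function names, the example concentrations, and the residual values are hypothetical, and treating the mean concentration as the true concentration when computing realized contributions is an assumption made here for the sake of the example.

```python
import random


def simulate_quantitation(mean_conc, residuals, rng):
    """One simulated quantitation: the sample mean plus a randomly drawn residual."""
    return mean_conc + rng.choice(residuals)


def aliquot_volumes(mean_concs, residuals, target_ng, rng):
    """Volume (microliters) taken from each sample so that each sample is
    believed to contribute target_ng of DNA, based on one simulated
    quantitation per sample."""
    volumes = []
    for conc in mean_concs:
        estimated = simulate_quantitation(conc, residuals, rng)
        volumes.append(target_ng / estimated)
    return volumes


def realized_proportions(mean_concs, volumes):
    """Proportion of pool DNA actually contributed by each sample,
    treating the mean concentrations as the truth (an assumption)."""
    amounts = [c * v for c, v in zip(mean_concs, volumes)]
    total = sum(amounts)
    return [a / total for a in amounts]


if __name__ == "__main__":
    rng = random.Random(2006)                   # fixed seed for repeatability
    concs = [10.1, 9.8, 10.3]                   # hypothetical ng/ul means
    residuals = [-0.30, -0.10, 0.0, 0.15, 0.25] # hypothetical residual pool
    vols = aliquot_volumes(concs, residuals, target_ng=5.0, rng=rng)
    print(realized_proportions(concs, vols))
```

When every residual is zero the realized proportions are exactly equal; nonzero residuals perturb them, which is the quantitation error the design is meant to capture.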

#### PCR amplification:

PCR amplification was conducted in multiplex using 1 μl of template in a 12-μl reaction volume with 1.5 mM MgCl_{2}, 0.2 mM dNTPs, 0.5 μM forward primer (Integrated DNA Technologies), 0.5 μM reverse primer (5′ labeled with 6FAM, PET, or NED fluorophores; Applied Biosystems, Foster City, CA), and 0.5 units of HotStarTaq DNA polymerase (QIAGEN, Valencia, CA). Thermal-cycling conditions consisted of an initial denaturation at 95° for 15 min; followed by 30 (experiment 1) or 24 (experiment 2) cycles of 94° for 30 sec, 49° (multiplex 1) or 58° (multiplex 2) for 30 sec, and 72° for 40 sec; and with a final extension at 72° for 5 min. We used the same amplification protocols and reagent quantities for individual and pooled samples.

#### Electrophoresis:

Amplification products were purified by gel filtration (PERFORMA DTR 96-well plate; Edge Biosystems), and then 0.70 μl of product was combined with formamide (Applied Biosystems) and GeneScan 500 LIZ size standard (Applied Biosystems), denatured at 95° for 5 min, and separated by capillary electrophoresis for fluorometric quantitation on an ABI PRISM 3700 DNA analyzer (Applied Biosystems). Raw data files containing fluorescence intensities were read using a script written in MATLAB (The Mathworks). Fluorescence peaks associated with amplified fragments were detected by an algorithm that searches for localized regions of high peak intensity relative to baseline noise, a Gaussian model of peak shape was statistically fit to the peak, and peak height was calculated from the fitted model (MATLAB scripts are available from the authors). Examples of fluorescence data are shown in Figure 1 for pools of six individuals for each of the six loci. Each data file contains ∼10^{4} fluorescence intensity values per locus plus ∼10^{4} values for the size standard for a total of ∼4 × 10^{4} data values per sample file. We employed peak height instead of peak area because peak height is less vulnerable to errors introduced by overlapping peaks, and because peak height is proportional to the number of molecules of a given fragment length under a diffusion model of electrophoresis transport.
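To illustrate how a peak height can be read from a fitted Gaussian rather than from the highest raw sample, the sketch below uses a simplified stand-in for the statistical fit described above. It is not the authors' MATLAB code. It assumes a single isolated peak, a known baseline, and little noise, and it swaps in a three-point log-parabola interpolation, which recovers the height and position of a noise-free Gaussian exactly. The function name is hypothetical.

```python
import math


def gaussian_peak_height(intensities, baseline=0.0):
    """Estimate (height, position) of a single Gaussian-shaped peak.

    The logarithm of a Gaussian is a parabola, so a parabola through the
    log-intensities of the highest sample and its two neighbors gives the
    peak height and the sub-sample peak position in closed form.
    """
    y = [v - baseline for v in intensities]
    if len(y) < 3:
        raise ValueError("need at least three samples")
    # Highest interior sample (so that both neighbors exist).
    k = max(range(1, len(y) - 1), key=lambda i: y[i])
    if min(y[k - 1], y[k], y[k + 1]) <= 0:
        raise ValueError("peak must exceed the baseline at all three points")
    a, b, c = math.log(y[k - 1]), math.log(y[k]), math.log(y[k + 1])
    curvature = a - 2.0 * b + c
    if curvature >= 0:
        raise ValueError("no local maximum at the highest sample")
    offset = 0.5 * (a - c) / curvature        # vertex relative to sample k
    height = math.exp(b - 0.25 * (a - c) * offset)
    return height, k + offset


if __name__ == "__main__":
    # Synthetic trace: a peak of height 1000 centered between two samples.
    trace = [1000.0 * math.exp(-((x - 50.3) ** 2) / (2 * 4.0 ** 2))
             for x in range(100)]
    print(gaussian_peak_height(trace))
```

For noisy traces a least-squares fit over many samples would be more robust than this three-point version; the sketch only shows why a model-based height need not coincide with any single raw intensity value.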

#### Statistical analysis:

Microsatellite allele frequencies are estimated from DNA pools using the fluorescence intensities associated with each allele's peak height (Figure 1). To complicate matters, the PCR amplification process introduces stutter peaks and differential amplification errors that are reflected in the fluorescence intensities. A stutter peak occurs when an allele of repeat length *i* produces PCR products of length *j* ≠ *i*, where *j* is typically <*i* (Figure 1). Differential amplification is the process by which shorter alleles amplify with higher efficiency than longer alleles, and thus two alleles that are present in equal quantities (*e.g.*, as in a single heterozygote individual) can produce different fluorescence intensities. However, statistical modeling of stutter peaks and differential amplification permits estimation of allele frequencies from DNA pools. Using a simple statistical model, the allele counts in a pool of size *m* can be estimated using

$$\hat{\mathbf{x}} = 2m\,\frac{\hat{\mathbf{R}}^{-1}\mathbf{y}}{|\hat{\mathbf{R}}^{-1}\mathbf{y}|},$$

where $\hat{\mathbf{x}}$ is the vector of estimated allele counts, **y** is the vector of the observed fluorescent peak intensities (fluorescent peak height or area; we employed peak height), $\hat{\mathbf{R}}^{-1}$ is the inverse of an estimate of a parameter matrix **R** that models stutter and differential amplification, and | | denotes the operation of summing vector elements (appendix b). The matrix $\hat{\mathbf{R}}$ is obtained using fluorescence data from individual genotypes. In appendix b, we show that the estimator $\hat{\mathbf{x}}$ depends only on the relative fluorescence values, **y/**|**y**|, and that the matrix **R** can be estimated using only relative fluorescence values. The need for only relative fluorescence intensities is fortunate, because, in our experience, absolute fluorescence intensities are quite variable.
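A small numerical sketch may help make the estimator concrete. It assumes the moment-based form in which the peak vector is multiplied by the inverse of the estimated matrix and then rescaled so that the counts sum to 2*m*. The 3 × 3 matrix below is hypothetical (column *j* holds the relative peak heights assumed to be produced by one copy of allele *j*, with a stutter contribution one repeat shorter and weaker amplification of longer alleles), and the helper names are illustrative.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def real_valued_estimate(y, r_hat, m):
    """Estimated allele counts: 2m * R^-1 y / |R^-1 y|, where |.| sums elements."""
    z = solve(r_hat, y)
    total = sum(z)
    return [2 * m * v / total for v in z]


if __name__ == "__main__":
    # Hypothetical stutter / differential-amplification matrix (3 alleles).
    r_hat = [[1.0, 0.2, 0.0],
             [0.0, 0.9, 0.2],
             [0.0, 0.0, 0.8]]
    # Peak heights for a pool of m = 2 with true counts (2, 1, 1),
    # multiplied by an arbitrary overall intensity of 37.
    y = [37 * 2.2, 37 * 1.1, 37 * 0.8]
    print(real_valued_estimate(y, r_hat, m=2))
```

Because the result is rescaled by the sum of its elements, multiplying every peak height by the same constant leaves the estimate unchanged, which is the sense in which only relative fluorescence matters.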

The estimator $\hat{\mathbf{x}}$ contains real numbers, whereas allele counts are nonnegative integers. Accordingly, we introduce an integer-valued estimator, $\tilde{\mathbf{x}}$, that is the value of **u** that minimizes $(\mathbf{u} - \hat{\mathbf{x}})'(\mathbf{u} - \hat{\mathbf{x}})$ subject to the constraint that **u** contains nonnegative integers, and |**u**| = 2*m* (the ′ denotes matrix transposition). We refer to $\hat{\mathbf{x}}$ as the “real-valued estimator” and to $\tilde{\mathbf{x}}$ as the “integer-valued estimator.” We estimated allele frequencies in pools in experiments 1 and 2 using both the real- and integer-valued estimators and evaluated their statistical properties, including their technical bias (estimated minus expected value) and variance (variance of replicate estimates). During variance components estimation in experiment 1, we focused on the role of pool size. Hence, we describe the relative contributions of quantitation and pooling, PCR amplification, and electrophoresis to total variation by averaging over all alleles, loci, and pool types for each pool size. Variance components in experiment 1 were estimated by maximum likelihood (Searle *et al.* 1992), using the real-valued estimator for each allele and locus separately. We used the real-valued estimator to estimate variance components and not the integer-valued estimator because the act of rounding in the latter may obscure variance components. Variance components were averaged over all alleles within each locus–pool type combination, and then proportional variance components were averaged over all loci–pool type combinations within each pool size to control for differences in total variance among loci.
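The constrained rounding behind an integer-valued estimate can be sketched by brute force. The example below assumes that the quantity minimized is the squared Euclidean distance between the candidate integer vector and the real-valued estimate; the function names are hypothetical. Exhaustive enumeration is practical only for small pools and few alleles, which is the setting in which exact estimates were sought.

```python
def compositions(total, parts):
    """Yield every tuple of `parts` nonnegative integers that sums to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def integer_valued_estimate(x_hat, m):
    """Nonnegative integer vector u with sum(u) == 2m that is closest to
    x_hat in squared Euclidean distance (ties broken by enumeration order)."""
    def distance(u):
        return sum((ui - xi) ** 2 for ui, xi in zip(u, x_hat))
    return min(compositions(2 * m, len(x_hat)), key=distance)


if __name__ == "__main__":
    # A real-valued estimate for a pool of m = 2 individuals (4 gene copies).
    print(integer_valued_estimate([2.3, 0.9, 0.8], m=2))
    # Slightly negative real-valued entries are mapped to zero counts.
    print(integer_valued_estimate([4.2, -0.2, 0.0], m=2))
```

The nonnegativity and sum constraints are satisfied by construction, since only vectors of nonnegative integers summing to 2*m* are ever considered.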

The statistical properties of allele-frequency estimates may be allele specific. Using maximum-likelihood estimates of the bias and variance of the real-valued estimator for each locus and allele as our dependent variables, we tested for allele-specific effects within each locus using a one-way ANOVA with allele as the factor. To further understand allele-specific effects, we used a general linear model to address how locus, allele frequency, and the presence/absence of stutter peaks (a stutter effect is scored as present for an allele of length *i* if a stutter peak exists, generated by an allele of length *j* ≠ *i* that overlaps the fluorescence peak associated with allele *i*; otherwise the stutter effect is scored as absent) affect the bias and variance of real-valued estimates. Finally, for the PCR experiment we measured the magnitude of differential amplification (as the ratio of the fluorescent peak height of the shorter allele to that of the longer allele) and stutter peaks (as the ratio of the fluorescent peak height of the stutter fragments of length *i* − 1 to the peak height of the fragments of length *i*, where *i* denotes the longer allele in a heterozygote) for each locus in multiplex 1, using individual 1's heterozygous genotypes. We analyzed these measures of differential amplification and stutter peaks, using multiple linear regression with cycle number and initial DNA amount as independent variables.

## RESULTS

#### Experiment 1—variance components:

The estimates of laboratory variance components are similar across pool sizes (Figure 2). Averaging over pool sizes, DNA quantitation and pooling, PCR amplification, and electrophoresis accounted for 23, 48, and 29% of the observed variation in allele-frequency estimates based on the real-valued estimator. Any laboratory step may introduce a large error into an allele-frequency estimate, and these proportional variance components represent averages. Nonetheless, these results suggest that optimizing the PCR amplification step offers the most room for improvement in terms of reducing variance in allele-frequency estimates based on DNA pools. These variance component estimates also should apply to individual genotyping when the quantitation and pooling variance is set to zero, since, for example, diploid individual genotypes are pools containing two genomic copies in equal quantities. Thus, PCR amplification and electrophoresis should account for ∼62% [62% = 48%/(48% + 29%)] and 38% [38% = 29%/(48% + 29%)], respectively, of the variation in fluorescence intensity in individual genotyping results.
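The rescaling used for individual genotyping can be written out as a short calculation. The sketch below takes the three proportional variance components reported above and renormalizes the two that remain when the quantitation and pooling component is set to zero; the function and key names are illustrative.

```python
def renormalize(components, drop):
    """Drop one variance component and rescale the rest to sum to 100%."""
    kept = {name: value for name, value in components.items() if name != drop}
    total = sum(kept.values())
    return {name: 100.0 * value / total for name, value in kept.items()}


if __name__ == "__main__":
    # Proportional variance components (%) averaged over pool sizes.
    pooled = {"quantitation_pooling": 23, "pcr": 48, "electrophoresis": 29}
    individual = renormalize(pooled, drop="quantitation_pooling")
    for name, value in individual.items():
        print(name, round(value))
```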

#### Experiment 1—allele-frequency estimates:

When we used the integer-valued estimator to estimate allele frequencies, all of the loci exhibited similar levels of technical bias and variance with the exception of locus *AT150-2#4*, which exhibited higher bias and variance than the other loci (Figure 3). For small pools of sizes 2 and 3, all of the loci except for *AT150-2#4* yielded exact estimates for all replicates. The locus *MSM1067* yielded exact estimates for all replicates for pools of sizes 2, 3, and 6. For most loci, the bias and variance began to increase above zero at pools of size 6 and then showed little difference between larger pools of sizes 12 and 24. Bias and variance in the real-valued estimator were constant across all pool sizes (data not shown), and these values were very similar to those exhibited by the integer-valued estimator in pools of sizes 12 and 24. These results indicate that technical bias and variance are essentially constant across pool sizes except for small pool sizes when real-valued estimates can be rounded to their expected integer values using the integer-valued estimator.

The previous results assess the properties of the estimators in terms of absolute accuracy and precision. In many actual applications (*e.g.*, Daniels *et al.* 1998; Moritz *et al.* 2003; Lee *et al.* 2004), however, the actual value of population allele frequencies is not relevant; instead the main concern is the relative frequencies of alleles between two or more populations. In this situation bias may not be a concern, and we can ask how well larger DNA pools detect differences in allelic composition between populations. For pools of sizes 12 and 24 for which we created three pool types, we calculated the mean estimated difference in allele counts (using the real-valued estimator) between two of the pool types relative to the third and plotted it against the expected difference in pool types (Figure 4). There was a strong correspondence between estimated and expected differences, and, most importantly, estimates were positive or negative when their true values were positive or negative, respectively. These results suggest that larger pools, even in the presence of bias, can be used to detect relative differences in allele-frequency composition among populations.

The above results summarize averages taken over alleles. A one-way ANOVA indicated significant allele-specific effects on the bias and variance of estimates for most loci (bias, *P* < 0.05 for all loci except *AG25-1#1*; variance, *P* < 0.05 for all loci except *SB6* and *MSM1067*). On average, allele-frequency and stutter effects explained little of the among-allele variation in bias (*P* > 0.05), with the effect of allele frequency being somewhat locus dependent (allele frequency by locus interaction, *F*_{5,160} = 7.24, *P* < 0.001). Bias increased with increasing allele frequency for loci *AT150-2#4* and *SB6* and was independent of allele frequency for the other loci. In contrast, the variance for all loci was affected by allele frequency (*F*_{1,160} = 16.52, *P* < 0.001), stutter effects (*F*_{1,160} = 13.15, *P* < 0.001), and their interaction (*F*_{1,160} = 10.05, *P* = 0.002). The presence of stutter effects increased the variance. The variance increased with increasing allele frequency, and allele frequency had a slightly stronger effect on the variance in the absence of stutter effects. Finally, bias and variance were strongly locus dependent (bias, *F*_{5,160} = 5.35, *P* < 0.001; variance, *F*_{5,160} = 7.72, *P* < 0.001), as indicated in Figure 2.

#### PCR experiment and experiment 2—allele-frequency estimates in small pools:

Motivated by the magnitude of the variance component attributable to the PCR amplification and the role of stutter effects in affecting allele-frequency estimates, we examined by experiment the effects of initial amount of DNA and cycle number on differential amplification and stutter peaks using multiplex 1. The magnitude of both differential amplification and stutter peaks increased with increasing cycle number in a graded manner across all loci (linear coefficients for all loci: *P* < 0.001). Similarly, the magnitude of differential amplification increased with increasing initial amount of DNA for all loci (linear coefficients for all loci: *P* < 0.001). In contrast, the effect of DNA amount on stutter was more ambiguous, causing increased stutter in *SB91* (*F*_{1,87} = 10.55, *P* = 0.002), marginally significant increased stutter in *SB6* (*F*_{1,87} = 3.56, *P* = 0.063), and decreased stutter in *SB108* (*F*_{1,87} = 18.23, *P* < 0.001). On average, the effect of cycle number on differential amplification and stutter peaks was stronger than the effect of increasing the initial amount of DNA; hence, PCR products from a 24-cycle reaction starting with 10 ng of DNA yielded less differential amplification and stutter than a 30-cycle reaction starting with 1.25 ng of DNA.

On the basis of results from experiment 1 and the PCR experiment, we further explored the estimation of allele frequencies using the integer-valued estimator with small pools with two aims in mind. First, we wanted to assess the extent to which technical variance could be reduced relative to experiment 1 by using 24 PCR cycles instead of 30 and constructing pools on the basis of three replicate quantitations instead of one. We hypothesized that reducing the number of PCR cycles could reduce the technical variance via the kind of reduction in differential amplification and stutter peaks that we observed in the PCR experiment. Given a reduction in technical variance, our second aim was to estimate with increased replication (12 independent replicates per pool size) the extent to which allele frequencies can be estimated without error using the integer-valued estimator. We focused on pools of sizes 4, 6, and 8, since the integer-valued estimator began to show nonzero technical bias and variance for most loci at pools of size 6 in experiment 1.

In experiment 2, all loci except *AT150-2#4* produced exact estimates for all replicates in pools of size 4 (*AT150-2#4* yielded exact estimates in all but one replicate), four loci (*SB91*, *AG25-1#1*, *SB108*, and *MSM1067*) produced exact estimates for all replicates in pools of size 6, and three loci (*AG25-1#1*, *SB108*, and *MSM1067*) produced exact estimates for all replicates in pools of size 8. Hence, depending on the locus, allele frequencies can be estimated exactly for pools of sizes 4, 6, and 8 using three replicate quantitations per contribution to a pool, 24 cycles of PCR, and the integer-valued estimator. Focusing on pools of size 6 since they were constructed in both experiments 1 and 2, the average variance of allele frequency estimates over all loci decreased from 8.8 × 10^{−4} in experiment 1 to 2.0 × 10^{−4} in experiment 2, a significant decrease of ∼77% (*F*_{5,11} = 4.40, *P* = 0.019).

#### Estimation of Hardy–Weinberg disequilibrium coefficient using DNA pools:

The pooling of genes over individuals in DNA pools implies that information concerning within-individual covariances among alleles, such as Hardy–Weinberg disequilibrium coefficients, is lost during pooling. However, some information on among-allele covariances is retained during pooling. The Hardy–Weinberg disequilibrium coefficient can be estimated if allele frequencies from more than one pool are available. For the multiallelic case, the Hardy–Weinberg disequilibrium coefficient for alleles *a* and *b*, *D*_{ab}, is defined in terms of *p*_{ab}, *p*_{a}, and *p*_{b}, the genotype and respective gene frequencies for alleles *a* and *b* (Weir 1996). For simplicity, we consider the situation where allele frequencies are estimated exactly using small pools. Given the observed frequencies of alleles *a* and *b* in each pool *i*, *D*_{ab} can be estimated from *r* replicate pools of size *m* using an estimator that is unbiased (appendix c). Simulations suggest that the variance of this estimator increases with increasing pool size and decreases with increasing numbers of replicate pools. These results show that some information regarding Hardy–Weinberg equilibrium can be recovered from pooled DNA, and this estimator could be used with individual genotype data (which would also be used to estimate **R**) to provide information on Hardy–Weinberg disequilibrium. Further, given estimates of the disequilibrium coefficient over all pairs of alleles, population genotype frequencies can be estimated. We have considered the idealized case of zero technical bias and variance; thus, the effect that errors in allele frequency estimation from DNA pools may have on these estimates should be examined in specific cases prior to implementing these procedures.
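The idea that among-pool covariances carry information about Hardy–Weinberg disequilibrium can be illustrated with a sketch. The estimator below is not claimed to be the form derived in appendix c. It is one unbiased construction under stated assumptions: the heterozygote frequency for distinct alleles *a* and *b* is written as 2*p*_{a}*p*_{b} − 2*D*_{ab}, each pool contains *m* individuals sampled independently from a large population, and pool allele frequencies are known exactly. Under those assumptions the expected within-pool product of the two frequencies is *p*_{a}*p*_{b} − (*p*_{a}*p*_{b} + *D*_{ab})/(2*m*), while the expected between-pool product is *p*_{a}*p*_{b}, and the two can be combined linearly. The function name is hypothetical.

```python
def hwd_estimate(freq_a, freq_b, m):
    """Estimate D_ab from per-pool frequencies of alleles a and b.

    freq_a[i], freq_b[i] are the exact frequencies in replicate pool i,
    and m is the number of diploid individuals per pool.
    """
    r = len(freq_a)
    if r < 2 or r != len(freq_b):
        raise ValueError("need matching frequencies from at least two pools")
    within = sum(pa * pb for pa, pb in zip(freq_a, freq_b))
    # Sum of products over ordered pairs of *different* pools.
    cross = sum(freq_a) * sum(freq_b) - within
    w_hat = within / r                 # estimates the within-pool expectation
    b_hat = cross / (r * (r - 1))      # estimates p_a * p_b without bias
    return -2 * m * w_hat + (2 * m - 1) * b_hat


if __name__ == "__main__":
    # If every individual is an a/b heterozygote, every pool has frequency
    # 0.5 for each allele, and D_ab = -0.25 under the assumed convention.
    print(hwd_estimate([0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5], m=3))
```

As a check on unbiasedness in a tiny case, a population made up of equal numbers of *aa* and *bb* homozygotes has *D*_{ab} = 0.25 under this convention, and averaging the estimate over the four equally likely outcomes for two pools of size 1 returns 0.25.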

#### Cost analysis:

Whether DNA pooling should be implemented depends, beyond its feasibility, on its cost-effectiveness. To evaluate cost-effectiveness, estimates of technical variance as a function of pool size are needed because the experimental design decision is to choose the best pool size from the set of possible pool sizes (including pools of size 1 that correspond to individual genotyping). Knowledge of the relative costs of laboratory procedures is also required to evaluate cost-effectiveness. Experiments 1 and 2 provide estimates of the technical variance for different pool sizes and allow us to calculate cost-effectiveness. In the following calculations, we assume that the technical bias is zero, because small pools are used, or that bias is not a concern because only relative differences among population allele frequencies are of interest. We calculate, for a given pool size, the percentage of cost savings realized by DNA pooling that achieves the same statistical precision as genotyping *n* individuals. Using the cost equations presented earlier and assuming individual genotyping of *n* individuals for an allele having frequency *p*, *r* replicate pools of size *m* yield a percentage of cost savings given by Equation 1, where σ_{T}^{2} is the technical variance; *l* is the number of separate PCRs (*e.g.*, multiplexes); and *c*_{I}, *c*_{Q}, *c*_{A}, and *c*_{E} are the relative unit costs (*e.g.*, per individual) of DNA isolation, quantitation and pooling, PCR amplification, and electrophoresis, respectively (appendix d).
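The equal-precision comparison can be made concrete with a sketch. The derivation in appendix d is not reproduced here, so the code below is one form consistent with the definitions in the text under assumptions of its own: the sampling variance of an allele frequency from *n* genotyped individuals is *p*(1 − *p*)/(2*n*) with no technical variance, the variance from *r* pools of size *m* is [*p*(1 − *p*)/(2*m*) + σ_{T}^{2}]/*r*, and *r* is treated as a real number. Equating the two variances gives the number of pools needed per genotyped individual, which is then priced using the per-individual and per-reaction costs. All names and example values are hypothetical.

```python
def pools_per_individual(m, p, sigma2):
    """r / n: replicate pools of size m needed per genotyped individual
    to match the variance p(1-p)/(2n) of individual genotyping."""
    return 1.0 / m + 2.0 * sigma2 / (p * (1.0 - p))


def equal_precision_savings(m, l, p, sigma2, c_i, c_q, c_a, c_e):
    """Percentage cost savings of pooling at equal statistical precision."""
    cost_per_pool = m * (c_i + c_q) + l * (c_a + c_e)
    cost_per_genotype = c_i + l * (c_a + c_e)
    ratio = pools_per_individual(m, p, sigma2) * cost_per_pool / cost_per_genotype
    return 100.0 * (1.0 - ratio)


if __name__ == "__main__":
    # Hypothetical relative unit costs and a technical variance inside the
    # 0 to 0.001 range mentioned in the text.
    params = dict(p=0.5, sigma2=0.0005, c_i=1.0, c_q=0.5, c_a=1.0, c_e=1.0)
    for m in (2, 4, 8, 16):
        print(m, round(equal_precision_savings(m, l=10, **params), 1))
```

In this form the number of genotyped individuals cancels out, so the percentage of savings does not depend on *n*, and with zero technical variance the expression reduces to the single-pool comparison with *m* = *n*.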

Three qualitative points can be illustrated, using a graphical example based on our laboratory costs (estimated labor plus reagent costs) and Equation 1, and assuming, on the basis of our data, that the technical variance is constant as a function of pool size. First, the percentage of savings increases with both pool size and the number of PCRs (Figure 5; the parameters σ_{T}^{2}, *p*, *c*_{I}, *c*_{Q}, *c*_{A}, and *c*_{E} are held fixed in Equation 1 and Figure 5, while pool size and number of PCRs are varied). Most of the savings is acquired as pool size initially increases, and there are diminishing returns with subsequent increases in pool size. The same pattern holds for increasing the number of PCRs, with most of the savings being realized after 10 PCRs in our example. These diminishing returns suggest that small pools can be almost as cost-effective as large pools. Second, the percentage of cost savings reaches a maximum value at an optimum pool size (appendix d). However, while an optimum pool size exists, it is not much more cost-effective than many other pool sizes because the cost savings curves tend to be flat near the optimum pool size (optimum pool size is not shown in Figure 5 because the curves are very flat for larger pool sizes). Finally, Equation 1 shows that the percentage of cost savings is independent of *n*, where *n* (the number of individuals genotyped) determines the desired level of statistical precision. Hence, absolute cost savings will be proportional to Equation 1. The qualitative features of the cost savings curves depicted in our example are relatively insensitive to realistic variations in the statistical and cost parameters. Changing the allele frequency *p* has a small effect, as does varying the technical variance between 0 and 0.001, a realistic range based on our experiments.
Of course, the relative costs associated with DNA pooling will be laboratory specific, depending on the costs and procedures of individual genotyping, available equipment, reagents used, institution-specific fees, the capacity for multiplexing, and labor costs. Hence, laboratories can employ Equation 1 and the associated ideas using costs that match their situation. We have considered only recurring costs associated with DNA pooling. For some laboratories, there can be substantial preliminary costs, including optimization of laboratory procedures, the acquisition of appropriate software for analyzing pooled data, and the purchasing of equipment for DNA quantitation. Laboratory-specific cost analyses should consider whether these costs can be recovered through subsequent savings in recurring costs. DNA pooling will be most cost-effective when recurring costs are large, a situation that arises when many individuals are used with many loci.
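The flatness of the savings curve near its optimum can be explored numerically. The sketch below is an illustration under assumptions, not the expression from appendix d: it uses an equal-precision savings function in which the cost ratio is (1/*m* + 2σ_{T}^{2}/[*p*(1 − *p*)]) × [*m*(*c*_{I} + *c*_{Q}) + *l*(*c*_{A} + *c*_{E})]/[*c*_{I} + *l*(*c*_{A} + *c*_{E})], treats pool size as continuous, and obtains the optimum by setting the derivative with respect to *m* to zero. Function names and parameter values are hypothetical.

```python
import math


def savings(m, l, p, sigma2, c_i, c_q, c_a, c_e):
    """Equal-precision percentage savings under the assumed cost model."""
    k = 2.0 * sigma2 / (p * (1.0 - p))
    pool = m * (c_i + c_q) + l * (c_a + c_e)
    genotype = c_i + l * (c_a + c_e)
    return 100.0 * (1.0 - (1.0 / m + k) * pool / genotype)


def optimum_pool_size(l, p, sigma2, c_i, c_q, c_a, c_e):
    """Pool size maximizing `savings` (requires sigma2 > 0)."""
    if sigma2 <= 0:
        raise ValueError("no finite optimum when technical variance is zero")
    return math.sqrt(l * (c_a + c_e) * p * (1.0 - p)
                     / (2.0 * sigma2 * (c_i + c_q)))


if __name__ == "__main__":
    args = dict(l=10, p=0.5, sigma2=0.001, c_i=1.0, c_q=0.5, c_a=1.0, c_e=1.0)
    m_star = optimum_pool_size(**args)
    print("optimum pool size:", round(m_star, 1))
    for m in (8, 16, 24, m_star, 80):
        print(round(m, 1), round(savings(m, **args), 1))
```

With these illustrative values the savings change only slowly over a wide range of pool sizes around the optimum, which is the pattern the text describes for the curves in Figure 5.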

## DISCUSSION

DNA pooling has been proposed as a cost-effective means to estimate microsatellite allele frequencies in studies using many loci. As microsatellite markers are developed in growing numbers in many species, the cost of analyzing many loci using many individuals will become increasingly relevant. Using six multiplexed microsatellite loci in striped bass, we evaluated the statistical properties of procedures that estimate allele frequencies using DNA pools and used these results to assess the feasibility and cost-effectiveness of DNA pooling.

We provide a statistical and theoretical basis for estimating microsatellite allele frequencies using DNA pools by deriving a moment-based estimator. The procedures implemented in previous studies (Perlin *et al.* 1995; Barcellos *et al.* 1997; Band and Ron 1998; Lipkin *et al.* 1998, 2002; Kirov *et al.* 2000; Mosig *et al.* 2001; Schnack *et al.* 2004) have used some version of this estimator, but a probabilistic basis for the estimator has not been presented. Our results show explicitly that the estimator depends only on the relative fluorescence intensities associated with each allele, explaining why allele frequency estimation from DNA pools is tractable empirically, even if absolute fluorescence intensities are quite variable.

Our results suggest that the technical bias, technical variance, and proportional variance components are approximately constant as a function of pool size, except when the integer-valued estimator was employed for small pools, in which case the estimates of bias and variance were zero for all but one locus. Allele frequency estimates from larger pools can be biased, but such estimates are still able to detect relative differences in allele frequencies among DNA pools. Previous microsatellite DNA pooling studies have focused on pool sizes of ∼ ≥20 and information is lacking concerning estimators over a number of pool sizes and types, laboratory variance components, and integer-valued estimators. The comparison of quantitative results among different studies is complicated by differences in the properties of microsatellite loci and methodologies used in various studies. Nonetheless, our estimates of technical bias and variance as a function of pool size are consistent in terms of magnitude and pattern with the findings of other studies for which this kind of comparison can be made in an approximate way (Band and Ron 1998; Lipkin *et al.* 1998, 2002). Our results also demonstrate that allele frequency estimation using DNA pools is feasible using a simple DNA isolation protocol with ethanol-preserved finclips, a convenient tissue sampling procedure in many fishes. Further research on the effect of pooling methodology and allele size range and diversity on estimation procedures as well as the contrasting properties of di-, tri-, and tetranucleotide markers in DNA pools is needed.

We suspect that the unbiased estimation of allele frequencies from large DNA pools could be difficult to achieve for many loci, and that the difficulty increases with increasing pool size. Because **R** is estimated rather than known exactly, even if relative fluorescence intensities are measured without error (which is not the case), there will be some bias between the estimated allele counts, $\hat{\mathbf{x}}$, and the expected allele counts, **x**, given by

$$\hat{\mathbf{x}} - \mathbf{x} = 2m\left(\frac{\hat{\mathbf{R}}^{-1}\mathbf{R}\mathbf{q}}{\left|\hat{\mathbf{R}}^{-1}\mathbf{R}\mathbf{q}\right|} - \mathbf{q}\right),$$

where $\hat{\mathbf{R}}$ is the estimate of **R** and **q** is the vector of allele frequencies, a quantity that is proportional to pool size *m*. Hence, as *m* increases, the bias in allele counts increases. Once the bias becomes >0.5, the integer-valued estimator will not round to the true value. The above equation also predicts that the bias in the estimated allele frequencies, $\hat{\mathbf{x}}/2m$, will be constant across pool sizes, which is the pattern that we observed empirically. Thus, as *m* increases, **R** must be estimated with increasing precision to yield unbiased estimates. In our present study, we found that unbiasedness vanishes between pool sizes of ∼6 and 12, depending on the locus. Additional research on the best methods for estimating **R** is needed (Schnack *et al.* 2004).
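The scaling argument above can be illustrated numerically. The following Python sketch is only an illustration under stated assumptions: a hypothetical two-allele locus, a made-up true matrix **R** and a slightly mis-estimated version of it (none of these values come from our data), error-free relative fluorescence intensities, and an integer-valued estimator taken to be simple rounding of the estimated counts.

```python
def estimated_frequency(q1, s, d, s_hat, d_hat):
    """Estimated frequency of allele 1 at a hypothetical two-allele locus.

    Assumed true matrix:      R     = [[1, s],     [0, d]]
    Assumed estimated matrix: R_hat = [[1, s_hat], [0, d_hat]]
    Intensities are taken to be error-free, so y is proportional to R q.
    """
    q2 = 1.0 - q1
    y1 = q1 + s * q2       # allele-1 peak plus stutter contributed by allele 2
    y2 = d * q2            # allele-2 peak, scaled by differential amplification
    z2 = y2 / d_hat        # back-substitution with the estimated matrix
    z1 = y1 - s_hat * z2
    return z1 / (z1 + z2)  # normalise so the two frequencies sum to 1


def count_bias(m, q1, s, d, s_hat, d_hat):
    """Bias in the estimated count of allele 1 for a pool of m diploids."""
    return 2 * m * (estimated_frequency(q1, s, d, s_hat, d_hat) - q1)


# Hypothetical parameter values, chosen only for illustration.
PARAMS = dict(q1=0.5, s=0.30, d=0.90, s_hat=0.28, d_hat=0.92)

for m in (2, 6, 12, 24):
    bias = count_bias(m, **PARAMS)
    true_count = 2 * m * PARAMS["q1"]
    rounded = round(true_count + bias)
    print(f"m={m:2d}  count bias={bias:.3f}  "
          f"frequency bias={bias / (2 * m):.4f}  "
          f"rounds to true count: {rounded == true_count}")
```

In this sketch the frequency bias is the same for every pool size, while the count bias doubles whenever *m* doubles, so rounding recovers the true count only while the count bias stays below 0.5.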

The results of experiment 1 suggest that the DNA quantitation and pooling, PCR amplification, and electrophoresis steps account for ∼23, 48, and 29%, respectively, of the total technical variance of allele frequency estimates based on DNA pools. These results imply that optimizing the PCR amplification step offers the most potential, in terms of variance reduction, to minimize the technical variance. Our PCR experiment showed that differential amplification and stutter peaks decrease with decreasing cycle number, findings consistent with related theoretical and empirical work on the PCR process (Lai and Sun 2003; Shinde *et al.* 2003). In experiment 2, we found that the technical variance could be reduced relative to experiment 1 by reducing cycle number and increasing the number of replicate quantitations. The results from our experiments suggest that microsatellite allele frequencies can be estimated exactly, or very nearly exactly, for pools of sizes 2–8, depending on the locus. Employing replicate quantitations and a reduction in PCR cycle number in concert with an integer-valued estimator may be a means for improving DNA pooling protocols.

We also investigated the estimation of Hardy–Weinberg disequilibrium coefficients from DNA pools by deriving an unbiased estimator of Hardy–Weinberg disequilibrium coefficients for multiple alleles, using allele frequencies estimated from multiple DNA pools. The estimator also can be used to recover genotype frequencies using only information on allele frequencies. In these analyses, we assumed that allele frequencies were estimated without error using DNA pools; however, the effects of violations of this assumption on the performance of the estimator should be explored. The estimation of Hardy–Weinberg disequilibrium coefficients is similar in concept to the estimation of linkage disequilibrium coefficients and haplotype estimation, a topic that has been explored in association studies (Wang *et al.* 2003; Yang *et al.* 2003). The effects of allele frequency estimation using DNA pools on the estimation of other quantities, such as effective population size and measures of population differentiation, also warrant investigation.

The utility of DNA pooling depends upon both its feasibility and its cost-effectiveness. Our study shows that DNA pooling is feasible for both small (sizes 2–8) and larger pools (sizes 12 and 24). Small pools offer the possibility of unbiased and even exact estimates. Larger pools can yield biased estimates, but may remain useful for assessing relative differences in allele frequencies among populations, the main approach of previous pooling studies (Barcellos *et al.* 1997; Band and Ron 1998; Lipkin *et al.* 1998, 2002; Kirov *et al.* 2000; Mosig *et al.* 2001; Schnack *et al.* 2004). We used our estimates of technical variance to assess the cost-effectiveness of DNA pooling relative to individual genotyping. Analyses with cost estimates from our laboratory suggest that most of the increase in relative cost savings with increasing pool size can be realized with smaller pools. Multiple small pools are cost-effective relative to fewer large pools because averaging many replicate pools acts to decrease the technical variance. The cost-effectiveness of small pools coupled with the possibility of estimating both allele frequencies and Hardy–Weinberg disequilibrium coefficients using an integer-valued estimator adds to the appeal of small pools. The performance of these methods across a large number of loci remains to be investigated, but we expect that some loci, such as the tetranucleotide locus *AG25-1#1*, will work well in DNA pools, while others, such as the dinucleotide locus *SB6*, with its large range in allele sizes and accompanying differential amplification, will prove more problematic. Costs will be situation specific, and the cost equations that we provided will allow laboratories to conduct individualized cost analyses. High-throughput software that implements the estimation procedures will be required. Because diploid individuals are the special case of pools containing two genomic copies, software for analyzing DNA pools should be similar to existing packages employed in individual genotyping (*e.g.*, Pálsson *et al.* 1999). DNA pooling should be most cost-effective when many individuals are typed at many loci, a situation that will become increasingly common as microsatellite loci are developed in increasing numbers in many different species.

Microsatellite allele frequency estimation using DNA pools can potentially be applied to a variety of quantitative and population genetic situations, including any situation where the allele frequency in a population is the fundamental quantity of interest. Allelic association studies, analyses of spatial population structure and effective population size, and the estimation of genetic diversity can potentially be addressed with DNA pooling. Aspects of marker development, such as the screening of many loci for polymorphisms and for sex linkage (Lee *et al.* 2004), should be amenable to DNA pooling. The possibility of pooling certain tissues directly, such as pooling eggs, blood samples, or groups of larvae, also merits consideration. Finally, more natural DNA pools such as polyploid, single-celled organisms and forensic DNA mixtures arise in the course of many investigations, and the procedures of DNA pooling might be applied in these situations.

## APPENDIX A: INDIVIDUAL GENOTYPES AND POOL COMPOSITIONS

## APPENDIX B: ALLELE-FREQUENCY ESTIMATION WITH DNA POOLS

We consider a simple linear model of the process by which template DNA is amplified into large numbers of locus-specific fragments and subsequently fluorescently detected following electrophoresis. For a given locus, let **x** denote the vector of allele counts, where the alleles are ordered in **x** by increasing fragment length. Hence, the length of **x** is the number of alleles, the *i*th element of **x** corresponds to the *i*th shortest allele, each element of **x** is the number of copies of a particular allele, and the sum of the elements of **x** is the total number of gene copies. For pools of diploid individuals, the sum of the elements of **x**, |**x**|, is 2*m*, where *m* is the number of individuals in the pool. Let **y** denote the observed vector of fluorescence intensities (calculated as fluorescent peak heights or peak areas) associated with each of the alleles in **x**. A simple stochastic model of the dependence of **y** on **x** is the linear model

$$\mathbf{y} = \boldsymbol{\Lambda}\mathbf{x} + \boldsymbol{\epsilon},$$

where **Λ** is a nonnegative square matrix of parameters describing the amplification and fluorescent detection of the alleles in **x** that results in the observed fluorescence intensities in **y**, and **ϵ** is a vector of stochastic errors introduced in the laboratory having expectation *E*(**ϵ**) = 0. In general, the elements of **Λ** will depend upon the details of the laboratory protocol, including the properties of the microsatellite locus as well as the PCR and electrophoresis conditions. The off-diagonal parameters in **Λ** model the stutter effects, so that $\lambda_{ij}$ is the stutter peak produced at position *i* by allele *j*. The relative values of the diagonal elements of **Λ** determine differential amplification, so that $\lambda_{ii}/\lambda_{jj}$ measures the amplification of allele *i* relative to allele *j*. All of the diagonal elements in **Λ** will be positive, but many of the off-diagonal elements will be zero because each allele's stutter peaks usually affect only the next few smaller alleles (Figure 1).

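Thus, if **y** and **Λ** are known or estimated, then **x** can be estimated. We show that **x** can be estimated using only the relative fluorescence intensities. We can define the matrix of relative amplification and fluorescence detection parameters as **R** = **Λ**/λ*, where λ* is any nonzero element of **Λ**, and note that **x** = **q**2*m*, where **q** is the vector of allele frequencies. Accordingly,

$$E(\mathbf{y}) = \boldsymbol{\Lambda}\mathbf{x} = 2m\lambda^{*}\mathbf{R}\mathbf{q} \quad\text{and}\quad |E(\mathbf{y})| = 2m\lambda^{*}|\mathbf{R}\mathbf{q}|,$$

implying that the ratio of these expectations is

$$\frac{E(\mathbf{y})}{|E(\mathbf{y})|} = \frac{\mathbf{R}\mathbf{q}}{|\mathbf{R}\mathbf{q}|}.$$

Hence,

$$\mathbf{q} = |\mathbf{R}\mathbf{q}|\,\mathbf{R}^{-1}\frac{E(\mathbf{y})}{|E(\mathbf{y})|}.$$

Thus, **x** can be calculated via

$$\mathbf{x} = 2m\,\frac{\mathbf{R}^{-1}\,E(\mathbf{y})/|E(\mathbf{y})|}{\left|\mathbf{R}^{-1}\,E(\mathbf{y})/|E(\mathbf{y})|\right|},$$

which, upon replacing expectations with observed values and using an estimate of **R**, denoted $\hat{\mathbf{R}}$, leads to the moment-based estimator of **x**,

$$\hat{\mathbf{x}} = 2m\,\frac{\hat{\mathbf{R}}^{-1}\,\mathbf{y}/|\mathbf{y}|}{\left|\hat{\mathbf{R}}^{-1}\,\mathbf{y}/|\mathbf{y}|\right|}. \tag{B1}$$

The fluorescence intensities in **y** are absolute values, but Equation B1 shows, explicitly, that the estimation of allele counts depends only on the relative fluorescence intensities in a DNA pool, **y**/|**y**|. Moreover, the elements of **R** are parameters that are scaled relative to one of the elements of **Λ**; thus, only relative fluorescence intensities are required to estimate **R**. This result is important for two reasons: first, it provides a theoretical basis for the procedure of estimating allele counts from DNA pools, and second, it shows that, in a simple model, relative fluorescence intensities contain all of the information needed for this calculation.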
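The moment-based estimator described above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the study: it assumes that the estimated matrix of relative parameters is upper triangular with a nonzero diagonal (so that back-substitution suffices), it treats the integer-valued estimator as simple rounding of each estimated count, and the function names, the example matrix `R_hat`, and the peak heights are hypothetical.

```python
def solve_upper_triangular(R, y):
    """Solve R z = y by back-substitution.

    Assumes R is upper triangular with nonzero diagonal entries.
    """
    k = len(y)
    z = [0.0] * k
    for i in range(k - 1, -1, -1):
        tail = sum(R[i][j] * z[j] for j in range(i + 1, k))
        z[i] = (y[i] - tail) / R[i][i]
    return z


def estimate_allele_counts(R_hat, y, m):
    """Moment-based estimate of allele counts in a pool of m diploids.

    Only relative intensities matter: rescaling y leaves the result unchanged.
    """
    z = solve_upper_triangular(R_hat, y)
    total = sum(z)
    return [2 * m * zi / total for zi in z]


def integer_valued_estimate(x_hat):
    """Round each estimated count to the nearest integer (assumed form)."""
    return [round(v) for v in x_hat]


# Hypothetical three-allele locus, alleles ordered by increasing length.
R_hat = [[1.0, 0.30, 0.00],
         [0.0, 0.90, 0.25],
         [0.0, 0.00, 0.80]]
# Hypothetical peak heights for a pool of m = 2 individuals
# whose true allele counts are (2, 1, 1).
y = [3450.0, 1725.0, 1200.0]

x_hat = estimate_allele_counts(R_hat, y, m=2)
print(x_hat, integer_valued_estimate(x_hat))
```

Note that rounding each count separately does not guarantee that the rounded counts sum to 2*m*; a production implementation would need to handle that case.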

Previous studies that estimate microsatellite allele frequencies from DNA pools either used an estimation procedure that is equivalent to Equation B1 (Barcellos *et al.* 1997; Kirov *et al.* 2000) or employed a simplification thereof by imposing constraints on the structure of **R** (Perlin *et al.* 1995; Band and Ron 1998; Lipkin *et al.* 1998, 2002; Mosig *et al.* 2001; Schnack *et al.* 2004). We estimated the matrix **R** by nonlinear least-squares regression, using a model in which the $y_i$'s are the dependent variables measured from individuals of known genotype (12–24 individual samples per multiplex, except for locus *SB6* in experiment 2, when 24 independent pools of known allelic content were used) so that **x** is known, $(\mathbf{z})_i$ denotes the *i*th element of a vector **z**, and *e* is random, uncorrelated error. Typically, estimates of **R** have nonzero diagonal elements, some nonzero upper-diagonal elements, and all other elements are zero (see Perlin *et al.* 1995; Barcellos *et al.* 1997).

## APPENDIX C: ESTIMATION OF HARDY–WEINBERG DISEQUILIBRIUM COEFFICIENTS FROM DNA POOLS

Consider the observed frequencies of alleles *a* and *b* in pool *i*. Using the definition of $D_{ab}$, the expected values of the products of these frequencies can be written both within and among *r* pools of size *m*. Taking an appropriate linear combination of the within-pool and among-pool products yields an unbiased estimator of $D_{ab}$. Simulations suggest that the variance of this estimator increases with increasing *m* and decreases with increasing *r*.

## APPENDIX D: COST EQUATIONS FOR DNA POOLING

For individual genotyping of *n* individuals, the statistical variance of the allele frequency estimate is

$$\frac{p(1-p)}{2n},$$

where *p* is the population allele frequency. We assume that the equivalent variance of an estimate calculated from *r* pools of size *m* is the sum of the technical variance and the statistical variance, yielding

$$\frac{\sigma_T^2}{r} + \frac{p(1-p)}{2mr},$$

where $\sigma_T^2$ is the technical variance. Equating these expressions and solving for *r* yields the number of replicate pools of size *m* that must be constructed to obtain the same level of precision as genotyping *n* individuals:

$$r = \frac{n}{m} + \frac{2n\sigma_T^2}{p(1-p)}.$$

To calculate the percentage of cost savings of DNA pooling relative to individual genotyping, this expression for *r* is used in the cost-savings equation presented in the Introduction. Differentiating the resulting expression with respect to *m* leads to the pool size that maximizes the percentage of cost savings.
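The calculation of the number of replicate pools can be sketched in Python. This is an illustration under explicit assumptions rather than a definitive implementation: the variance for individual genotyping is taken to be the binomial sampling variance of 2*n* gene copies, *p*(1 − *p*)/(2*n*); each pool is assumed to contribute an independent technical variance plus a sampling variance *p*(1 − *p*)/(2*m*); and the estimate is assumed to be an average over *r* pools. The function name and the example values are hypothetical.

```python
import math


def replicate_pools_needed(n, m, p, sigma_t2):
    """Number of pools of size m matching the precision of n individuals.

    Assumes Var(individual genotyping) = p(1 - p) / (2n) and
    Var(mean of r pools) = [sigma_t2 + p(1 - p) / (2m)] / r.
    Returns the (generally non-integer) r that equates the two.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    var_individual = p * (1.0 - p) / (2.0 * n)
    var_single_pool = sigma_t2 + p * (1.0 - p) / (2.0 * m)
    return var_single_pool / var_individual


# Hypothetical example: match the precision of 96 individuals
# using pools of 8, with p = 0.5 and a technical variance of 0.001.
r = replicate_pools_needed(n=96, m=8, p=0.5, sigma_t2=0.001)
print(f"r = {r:.3f}, so {math.ceil(r)} pools would be constructed")
```

With zero technical variance the result reduces to *n*/*m*, so any excess over *n*/*m* is the number of additional pools needed to offset laboratory error.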

## Acknowledgments

We thank Chris Smith for assistance with programming, the North Carolina State University Genome Research Laboratory for assistance with fragment analysis, and two anonymous reviewers for comments on an earlier draft of this manuscript. This research was supported by grants from the National Institutes of Health (NIH-ES 07329) and the National Science Foundation (NSF-DEB 03-43761).

## Footnotes

Communicating editor: M. Nordborg

- Received November 18, 2005.
- Accepted March 17, 2006.

- Copyright © 2006 by the Genetics Society of America