Genetics, Vol. 169, 399-410, January 2005, Copyright © 2005
doi:10.1534/genetics.104.032052
New Adjustment Factors and Sample Size Calculation in a DNA-Pooling Experiment With Preferential Amplification
Hsin-Chou Yang*, Chia-Ching Pan, Richard C. Y. Lu and Cathy S. J. Fann*,1
* Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan 115
Institute of Public Health, Yang-Ming University, Taipei, Taiwan 112
National Genotyping Center, Academia Sinica, Taipei, Taiwan 115
1 Corresponding author: Institute of Biomedical Sciences, Academia Sinica, 128, Academia Rd., Section 2, Nankang, Taipei, Taiwan 115.
E-mail: csjfann{at}ibms.sinica.edu.tw
ABSTRACT
In the post-genome era, disease gene mapping using dense genetic markers has become an important tool for dissecting complex inheritable diseases. Locating disease susceptibility genes using DNA-pooling experiments is a potentially economical alternative to those involving individual genotyping. The foundation of a successful DNA-pooling association test is a precise and accurate estimation of allele frequency. In this article, we propose two new adjustment methods that correct for preferential amplification of nucleotides when estimating the allele frequency of single-nucleotide polymorphisms. We also discuss the effect of sample size when calibrating unequal allelic amplification. We conducted simulation studies to assess the performance of different adjustment procedures and found that our proposed adjustments are more reliable with respect to the estimation bias and root mean square error compared with the current approach. The improved performance not only improves the accuracy and precision of allele frequency estimations but also leads to more powerful disease gene mapping.
Locating disease susceptibility genes is an important topic in the postgenome era. Many large case-control studies have been launched to detect the small genetic effects of many susceptibility genes, a plausible etiological model for complex traits. Advances in biological techniques have made available thousands of single-nucleotide polymorphisms (SNPs) for disease gene mapping. The availability of these dense markers vastly improves the power of association tests and increases the resolution of gene mapping in candidate region research and genome scanning studies.
Conventional case-control association studies are popular for disease gene mapping using individual genotyping data. However, analyses of large samples are often impractical due to the expense of individual genotyping. In this regard, DNA-pooling experiments may represent an economical alternative. As the name implies, DNA pooling involves the mixing of genomic DNA from many different individuals. Allele frequencies for each SNP marker in the pooled DNA are measured using the same principles that apply to genotyping.
A complete DNA-pooling experiment consists primarily of three stages. The first stage is a pilot study in which heterozygous individuals are collected to estimate the coefficient of preferential amplification (CPA). The coefficient is subsequently used to correct the estimates of allele frequencies in the second stage. In the second stage, DNA-pooling experiments are conducted for a large number of SNPs, and pooling association tests are carried out to screen for potential genetic markers. Only a small proportion of markers selected from the second-stage experiments are included in the third stage in which all individuals are genotyped to confirm the validity of the markers selected from the second stage. As a consequence of the preliminary screen in the second stage, the number of SNPs in the third stage is drastically reduced, thereby lowering genotyping costs.
DNA pooling is an efficient screening method for locating disease susceptibility genes (BANSAL et al. 2002; SHAM et al. 2002). However, this cost-saving alternative is efficient only if the estimation of allele frequency is accurate and precise. Biased or unreliable estimation of allele frequencies can lead to spurious results in association studies. Variation in the data from a DNA-pooling study may arise from several different sources, such as pool formation, polymerase chain reaction (PCR) amplification, allele frequency measurement, and other uncontrollable experimental errors (BARRATT et al. 2002; VISSCHER and LE HELLARD 2003). Importantly, preferential amplification is a natural chemical attribute of PCR; it arises from both heterogeneous nucleotide incorporation during primer extension and differential efficiency of nucleotide detection during DNA quantification (SHAM et al. 2002). These factors perturb the measurement of the intensity of different nucleotides and, consequently, the estimation of allele frequency. Against this background, we focus on the impact of preferential amplification and propose new adjustment methods to correct for it. We also discuss the issue of sample size when correcting for unequal allelic amplification.
RESEARCH METHODS
Data and notation:
First, we discuss the design of the pilot stage. Consider that the two SNP-containing alleles are denoted A and a, where allele A is of interest. Given ntotal samples randomly drawn from a target population, individual genotyping results show that there are nheter heterozygous individuals and nhomo homozygous individuals in the sample, i.e., ntotal = nheter + nhomo. The pair of peak intensities for each heterozygous individual is determined (e.g., from MALDI-TOF spectrometry) as the area under the nucleotide-mapping curve. The readings for alleles A and a are denoted $(h^I_{A,j}, h^I_{a,j})$, j = 1, ... , nheter. These bivariate vectors are used to quantify the magnitude of preferential amplification.
In the second stage of the screening experiment, genomic DNA from m different individuals is pooled. Applying the same genotyping principle used in the pilot study, we obtain a reading of the peak intensities for allele typing in a DNA-pooling experiment. The reading is the summary measure of this pool composed of genomic DNA from m individuals and is defined as $(h^P_A, h^P_a)$. These data are used to estimate the allele frequency.
Let the population allele frequency of allele A be pA, the main parameter of interest. We define CPA $\kappa$ as a measure of the peak intensities of allele A relative to allele a, $\kappa = \mu_A/\mu_a$, where µA and µa denote the average peak intensities of alleles A and a in the population. In other words, $\kappa$ is the relative magnitude of the averaged amplified intensities of two different nucleotides and is an unknown calibration parameter that serves as an adjustment factor for allele frequency estimation. For $\kappa > 1$, the first nucleotide tends to be amplified more than the second; for $\kappa < 1$, the first nucleotide is likely to be amplified less than the second; for $\kappa = 1$, equal amplification is likely. The following sections introduce the statistical model/estimation of CPA, the estimation of population allele frequency, and association tests.
Statistical model and estimation of CPA:
The population allele frequency pA is defined as pA = NA/(NA + Na), where NA and Na denote the number of alleles A and a in the population. In individual genotyping experiments, the population allele frequency can be estimated by directly counting the number of alleles from representative samples. The direct counting approach does not apply to DNA-pooling experiments because only the peak intensities are measured.
The relationship between peak intensity and allele frequency is the kernel of allele frequency estimation in a DNA-pooling study. CPA is the bridge that connects the two, as indicated in the following expression:
[Equation 1]
The CPA $\kappa$ must be estimated and taken into account in the estimation of allele frequency. Below, we discuss the procedure for estimating preferential amplification using individual genotyping data from heterozygous individuals.
Suppose that there are nheter independent heterozygous individuals in the individual genotyping pilot study. Let the intensities of the two peaks for the jth heterozygous individual be $h^I_{A,j}$ and $h^I_{a,j}$, j = 1, ... , nheter. The two-dimensional peak intensities $(h^I_{A,j}, h^I_{a,j})$ are assumed to follow a bivariate distribution $G(\mu_A, \mu_a, \sigma^2_A, \sigma^2_a, \rho)$, where (µA, µa) are the population means of the peak intensities for alleles A and a, $(\sigma^2_A, \sigma^2_a)$ are the variances of the peak intensities for alleles A and a, and $\rho$ denotes the correlation of the two intensities.
Under this model, we propose two measures to estimate CPA and compare them with the previous adjustment method for unequal amplification proposed by HOOGENDOORN et al. (2000). Their adjustment factor was defined as the arithmetic mean of ratios, i.e.,
$$\hat\kappa_H = \frac{1}{n_{heter}} \sum_{j=1}^{n_{heter}} \frac{h^I_{A,j}}{h^I_{a,j}}. \tag{2}$$
Our first proposed adjustment, $\hat\kappa_U$, reduces the bias in Hoogendoorn's method using a bias-reduction technique and can be represented as
[Equation 3]
The difference between $\hat\kappa_H$ and $\hat\kappa_U$ in Equation 3 is the estimated bias of Hoogendoorn's method. A detailed derivation is presented in APPENDIX A. The ratio of a pair of peak intensities often exhibits a skewed distribution, and a log transformation is often considered to reduce the skewness and variability. Therefore, our second proposed adjustment factor, $\hat\kappa_G$, is the geometric mean of ratios:
$$\hat\kappa_G = \left( \prod_{j=1}^{n_{heter}} \frac{h^I_{A,j}}{h^I_{a,j}} \right)^{1/n_{heter}}. \tag{4}$$
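As an illustration of the arithmetic-mean and geometric-mean adjustment factors, the following Python sketch computes both from paired heterozygote peak intensities. The function names and the intensity values are hypothetical and serve only as an example; the bias-reduced factor of Equation 3 is not covered.

```python
import math
from statistics import fmean


def cpa_arithmetic(h_A, h_a):
    """Arithmetic mean of the per-individual ratios h_A[j] / h_a[j]."""
    ratios = [x / y for x, y in zip(h_A, h_a)]
    return fmean(ratios)


def cpa_geometric(h_A, h_a):
    """Geometric mean of the same ratios, computed on the log scale."""
    log_ratios = [math.log(x / y) for x, y in zip(h_A, h_a)]
    return math.exp(fmean(log_ratios))


# Hypothetical peak intensities for four heterozygous individuals.
h_A = [1.20, 0.90, 1.50, 1.10]
h_a = [1.00, 1.00, 1.00, 1.00]

k_arith = cpa_arithmetic(h_A, h_a)
k_geom = cpa_geometric(h_A, h_a)
print(f"arithmetic mean of ratios: {k_arith:.4f}")
print(f"geometric mean of ratios:  {k_geom:.4f}")
```

By the inequality of arithmetic and geometric means, the geometric-mean factor never exceeds the arithmetic-mean factor on the same data.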
The standard error of each adjustment measure reflects sampling variability and is critical for the association test in the next stage. Because the number of heterozygous individuals might be small, and an exact statistical distribution of these adjustment measures is difficult to derive, a bootstrapping procedure (EFRON and TIBSHIRANI 1993) is recommended to estimate the standard errors. The original data are used to estimate the hyperparameters by a moment-based or likelihood-based approach to obtain the empirical distribution $\hat G(\hat\mu_A, \hat\mu_a, \hat\sigma^2_A, \hat\sigma^2_a, \hat\rho)$. Pseudo-samples are generated by resampling from the empirical distribution with replacement. Suppose the number of bootstrap replications is B. Each adjustment method in Equation 3 or 4 is applied to the pseudo-samples to obtain the corresponding estimates $(\hat\kappa^*_1, \ldots, \hat\kappa^*_B)$. Hence, the standard error of the adjustment measure can be calculated by taking the sample standard deviation of the bootstrap estimates,
$$\widehat{SE}(\hat\kappa) = \left[ \frac{1}{B-1} \sum_{b=1}^{B} \left( \hat\kappa^*_b - \bar\kappa^* \right)^2 \right]^{1/2}, \tag{5}$$
where $\bar\kappa^* = \sum_{b=1}^{B} \hat\kappa^*_b / B$.
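The bootstrap calculation can be sketched in Python as follows. One assumption should be noted: for brevity, this sketch resamples the observed intensity pairs directly (a nonparametric bootstrap) instead of resampling from a fitted bivariate distribution as described above. The data, function names, and seed are hypothetical.

```python
import math
import random
from statistics import fmean, stdev


def geometric_cpa(pairs):
    """Geometric mean of the ratios h_A / h_a over (h_A, h_a) pairs."""
    return math.exp(fmean([math.log(a / b) for a, b in pairs]))


def bootstrap_se(pairs, estimator, n_boot=500, seed=0):
    """Sample standard deviation of the estimator over bootstrap resamples.

    Assumption: pairs are resampled with replacement from the observed
    data, not from a fitted parametric distribution.
    """
    rng = random.Random(seed)
    n = len(pairs)
    estimates = []
    for _ in range(n_boot):
        resample = [pairs[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return stdev(estimates)  # uses the B - 1 denominator


# Hypothetical (h_A, h_a) readings for eight heterozygous individuals.
pairs = [(1.20, 1.00), (0.90, 1.05), (1.50, 0.95), (1.10, 1.00),
         (1.35, 1.10), (1.00, 0.90), (1.25, 1.02), (1.15, 0.98)]

se_geom = bootstrap_se(pairs, geometric_cpa, n_boot=500, seed=3)
print(f"bootstrap SE of the geometric-mean factor: {se_geom:.4f}")
```

Any other estimator that accepts a list of pairs can be passed in place of `geometric_cpa`.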
Estimation of allele frequency and test of allelic association:
In this section, we discuss the estimation of allele frequency when preferential amplification is involved. The genomic DNA from all cases is mixed together in one pool and that of controls is mixed in another pool. The pairs of peak intensities in control and case groups are denoted by $H_{P,control} = (h^P_{A,control}, h^P_{a,control})$ and $H_{P,case} = (h^P_{A,case}, h^P_{a,case})$, respectively.
If there is no preferential amplification, then the coefficient $\kappa$ will be approximately one, and hence no adjustment is necessary. The allele frequencies of allele A in control and case groups can be estimated directly by calculating the proportion of peak intensities as follows:
$$\hat p_{A,control} = \frac{h^P_{A,control}}{h^P_{A,control} + h^P_{a,control}}, \qquad \hat p_{A,case} = \frac{h^P_{A,case}}{h^P_{A,case} + h^P_{a,case}}. \tag{6}$$
If $\kappa$ is larger than one, then allele A tends to be amplified more than allele a, and vice versa. In these two cases, the scales of the two intensities differ, and the population-level relative proportion of the two amplification abilities is simply $\kappa$. To adjust for nonequivalent allelic amplification, the method proposed by HOOGENDOORN et al. (2000) increased the suppressed intensity by multiplying it by the CPA. This transformation procedure standardizes the two intensities in scale. At the population level, substituting $\tilde H_a = \kappa H_a$ for Ha makes the equality on the left side of Equation 1 hold even for $\kappa \neq 1$; at the sample level, $\hat\kappa h^P_{a,control}$ and $\hat\kappa h^P_{a,case}$ are used to adjust for unequal amplification. The adjusted terms that improve the estimated allele frequency in Equation 6 for control and case groups are as follows:
$$\hat p_{A,control}(\hat\kappa) = \frac{h^P_{A,control}}{h^P_{A,control} + \hat\kappa h^P_{a,control}}, \qquad \hat p_{A,case}(\hat\kappa) = \frac{h^P_{A,case}}{h^P_{A,case} + \hat\kappa h^P_{a,case}}, \tag{7}$$
where $\hat\kappa$ is an estimate of CPA obtained from Equation 2, 3, or 4. The performance of each adjustment factor should be evaluated. In the SIMULATION STUDY section below, we discuss how simulation studies assess the performance of these adjustment factors.
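The effect of the adjustment can be seen in a small Python sketch. It assumes idealized, noise-free pooled intensities that are proportional to allele count times mean peak intensity; the function name and the numbers are hypothetical.

```python
def allele_frequency(h_A, h_a, cpa=1.0):
    """Estimate the frequency of allele A from pooled peak intensities.

    With cpa = 1.0 this is the unadjusted proportion of intensities;
    otherwise the intensity of allele a is rescaled by the CPA first.
    """
    return h_A / (h_A + cpa * h_a)


# Idealized pool (assumption): true frequency 0.3, mean intensities
# mu_A = 2 and mu_a = 1, so the true CPA is 2 and there is no noise.
true_p, mu_A, mu_a = 0.3, 2.0, 1.0
h_A = true_p * mu_A
h_a = (1.0 - true_p) * mu_a

p_unadjusted = allele_frequency(h_A, h_a)
p_adjusted = allele_frequency(h_A, h_a, cpa=mu_A / mu_a)
print(f"unadjusted: {p_unadjusted:.4f}, adjusted: {p_adjusted:.4f}")
```

In this noise-free setting the unadjusted estimate overstates the frequency of the preferentially amplified allele, while the adjusted estimate recovers the true value.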
Screening potential SNP markers associated with a disease locus is the main purpose of a DNA-pooling study (BANSAL et al. 2002; SHAM et al. 2002). This can be achieved using the pooling-based association test on the CPA-adjusted allele-frequency estimates,
$$T = \frac{\left[ \hat p_{A,case}(\hat\kappa) - \hat p_{A,control}(\hat\kappa) \right]^2}{\mathrm{Var}\left[ \hat p_{A,case}(\hat\kappa) - \hat p_{A,control}(\hat\kappa) \right]} \tag{8}$$
(VISSCHER and LE HELLARD 2003), where the variance in the denominator is
[Equation 9]
ncase and ncontrol are the numbers of individuals in case and control groups, and $\sigma^2_E$ denotes the experimental variation. The test statistic asymptotically follows a chi-square distribution with 1 d.f. The first two terms after the equality in Equation 9 are the variance components due to sampling variation; the third term results from the adjustment variation of preferential amplification; the fourth term is the experimental variation from several different sources, such as pool formation. All of the parameters in Equation 9 are unknown and therefore must be estimated.
Parameter $\kappa$ is estimated by our proposed method in Equation 3 or Equation 4; variance $V(\hat\kappa)$ is estimated by the proposed bootstrap variance in Equation 5; the allele frequencies are estimated using Equation 7. Finally, the experimental variance can be estimated by calculating the mean square errors (BARRATT et al. 2002) or using the restricted maximum-likelihood method (DOWNES et al. 2004) based on a hierarchical experimental design.
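Once the allele-frequency estimates and their total variance are available, the test itself is a one-line computation. The sketch below assumes the statistic is the squared difference of the case and control estimates divided by the variance of that difference, referred to a chi-square distribution with 1 d.f.; the variance is taken as an input and the function name is hypothetical.

```python
import math


def pooling_association_test(p_case, p_control, variance):
    """Return (statistic, p_value) for a 1-d.f. chi-square test.

    `variance` is the estimated variance of (p_case - p_control) and is
    assumed to be supplied by the caller with all components included.
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    statistic = (p_case - p_control) ** 2 / variance
    # Upper tail of a chi-square variable with 1 d.f.:
    # P(Z^2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical inputs for illustration only.
stat, p_val = pooling_association_test(0.25, 0.15, 0.0011)
print(f"statistic = {stat:.3f}, p-value = {p_val:.4f}")
```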
Estimation of CPA affects both the denominator and the numerator of the test statistic in Equation 8 simultaneously. The impact of CPA on the denominator is explicit in Equation 9. CPA affects the numerator by way of the allele frequency estimates. On the basis of the adjusted allele frequency defined in Equation 7, the expectation of the difference between the estimated allele frequencies in case and control groups is zero under the null hypothesis (no association). If the adjusted allele frequency in Equation 7 is replaced by the unadjusted allele frequency in Equation 6 for constructing the test statistic, the zero expectation may not hold true under the null hypothesis. Upon algebraic manipulation, we have proved that the discrepancy between case and control groups with regard to CPA-adjusted and unadjusted allele frequencies (i.e., the difference between the CPA-adjusted case and control group allele frequencies minus the difference between the unadjusted case and control group allele frequencies) is
[Equation]
Sample size in the pilot study:
In the pilot study, the peak intensities of heterozygous individuals are needed to estimate the CPA. An immediate question is how many heterozygous individuals are required to obtain a precise estimate of CPA. Proceeding in the context of confidence intervals, we calculate the sample size under risk $\alpha$ and a specified absolute error $\delta$ as
[Equation 10]
where CVr denotes the coefficient of variation of the peak-intensity ratios of the heterozygous individuals, j = 1, ... , nheter. Equation 10 is derived on the basis of Hoogendoorn's method, and the details are shown in APPENDIX B. From our simulation study (discussed below), we find that our proposed methods give a smaller standard error than Hoogendoorn's method. Hence, the sample size in Equation 10 is regarded as an upper bound for our proposed estimators. Under $\alpha$ = 0.05 and $\delta$ = 0.05, 0.075, and 0.10, the relationships between the sample size and different parameters are shown in Figure 1. The results show that the sample size increases with CVr and is inversely related to $\delta$. The sample size curve is symmetric about, and highest at, an allele frequency of 0.5.
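To make the qualitative behavior concrete, the following Python sketch computes a pilot sample size from a delta-method approximation. The formula used here, n >= [z p(1 - p) CVr / error]^2, is an assumption chosen to be consistent with the properties stated above (increasing in CVr, decreasing in the absolute error, symmetric about and largest at an allele frequency of 0.5); it is not a transcription of Equation 10, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def required_heterozygotes(p, cv_ratio, abs_error, risk=0.05):
    """Approximate number of heterozygotes needed in the pilot study.

    Assumed formula (delta-method sketch):
        n >= (z * p * (1 - p) * cv_ratio / abs_error) ** 2
    where z is the upper risk/2 quantile of the standard normal.
    """
    z = NormalDist().inv_cdf(1.0 - risk / 2.0)
    return math.ceil((z * p * (1.0 - p) * cv_ratio / abs_error) ** 2)


for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    n = required_heterozygotes(p, cv_ratio=0.3, abs_error=0.05)
    print(f"allele frequency {p:.1f}: n_heter >= {n}")
```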
Sample sizes for additional genotyping must be evaluated to achieve the required number of heterozygous individuals derived from Equation 10. This depends on the design of the genotyping experiment, the genetic background of SNP markers, and population characteristics. In a sequential genotyping experiment, ntotal is a random variable that follows a negative binomial distribution with success probability (probability of a heterozygote) pH = pAa. However, in large-scale genotyping experiments, individuals are genotyped simultaneously, not sequentially. Under this circumstance, ntotal is prespecified and nheter is a random variable from a binomial distribution with success probability pH = pAa.
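The two designs can be compared numerically. The Python sketch below evaluates the binomial probability of observing at least k heterozygotes among ntotal individuals genotyped at once, and the negative binomial probability that the kth heterozygote has appeared by the time ntotal individuals have been genotyped sequentially. The function names are hypothetical; the two probabilities coincide because they describe the same event.

```python
from math import comb


def prob_at_least_k_hets(n_total, k, p_het):
    """Binomial design: P(n_heter >= k) when n_total are genotyped at once."""
    below = sum(comb(n_total, i) * p_het**i * (1 - p_het) ** (n_total - i)
                for i in range(k))
    return 1.0 - below


def prob_kth_het_within(n_total, k, p_het):
    """Sequential design: P(N <= n_total), where N is the number of
    individuals genotyped until the kth heterozygote is observed."""
    return sum(comb(n - 1, k - 1) * p_het**k * (1 - p_het) ** (n - k)
               for n in range(k, n_total + 1))


for n_total in (48, 96):
    prob = prob_at_least_k_hets(n_total, k=8, p_het=0.15)
    print(f"n_total = {n_total}: P(at least 8 heterozygotes) = {prob:.3f}")
```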
The theory based on the assumption of individual homogeneity is sometimes too stringent. Heterogeneity among individuals may be due to various individual covariates or unobserved attributes. If this potential factor is ignored, the genotyping effort will be underestimated. A possible remedy is to assume a random-effect model. A beta distribution beta($\lambda_1$, $\lambda_2$) with the corresponding mean $\lambda_1/(\lambda_1 + \lambda_2)$ and coefficient of variation (CV) $\{\lambda_2/[\lambda_1(\lambda_1 + \lambda_2 + 1)]\}^{1/2}$ can be used to model random allele frequency (RAF) and random genotype frequency (RGF), respectively. The corresponding marginal distributions of random variables ntotal and nheter with respect to the sequential genotyping experiment and large-scale genotyping experiment can be obtained by integrating the random frequency out over the beta distribution as
[Equation]
[Equation]
Figure 2 shows the distribution of ntotal with a genotype frequency range of 0.05–0.5 in increments of 0.05. The pattern reveals the positive correlation between P(N ≤ ntotal) and ntotal. A heterozygous genotype frequency of a SNP of >0.15 almost guarantees that eight heterozygous individuals can be observed after genotyping 96 individuals (Figure 2A); the probability of observing 16 heterozygous individuals is >0.8 (Figure 2B). Figure 2, C and D, presents the results under the condition of individual random effects. A lower E(pAa) corresponds to a less polymorphic case, and therefore genotyping requires many more individuals. The coefficient of variation CV(pAa) affects the curvature of different lines. In general, a smaller CV(pAa) yields a steeper slope; in other words, the marginal increase in cumulative probability (corresponding to an increase in ntotal) is larger when CV(pAa) is smaller. The required number of genotyped individuals can be obtained using a prespecified probability from these figures.
Figure 3 shows the probability distribution of nheter. Figure 3, A and B, presents the cases for ntotal = 48 and ntotal = 96 in the absence of individual heterogeneity, and Figure 3, C and D, presents the case when individual heterogeneity is present. The genotype frequency range is 0.05–0.5 in increments of 0.05. In general, the correlation between nheter and P(N ≥ nheter) is negative. If 96 individuals are genotyped, Figure 3, B and D, shows that, even with individual heterogeneity, eight heterozygous individuals can be observed in most cases except for the nonpolymorphic one. A similar pattern is evident in Figure 3, A and C, for the case ntotal = 48, but the probability P(N ≥ nheter) decreases. The index of heterogeneity CV(pAa) affects the pattern of the probability of identifying heterozygous individuals. Figure 3, C and D, shows that the curves with a smaller CV(pAa) have a sharper reduction whereas those with a larger CV(pAa) have a gentler slope.
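The effect of heterogeneity on the number of observed heterozygotes can be explored with a beta-binomial model, which is the standard marginal distribution obtained when a binomial success probability follows a beta distribution. The Python sketch below covers only the random genotype frequency case with simultaneous genotyping; whether it matches the displayed formulas exactly is an assumption, and the function and parameter names are hypothetical.

```python
from math import comb, exp, lgamma


def beta_params(mean, cv):
    """Convert a mean and CV of a beta distribution into its two shape
    parameters, using mean = s1 / (s1 + s2) and
    CV^2 = s2 / (s1 * (s1 + s2 + 1))."""
    s1 = (1.0 - mean) / cv**2 - mean
    if s1 <= 0:
        raise ValueError("cv is too large for this mean")
    s2 = s1 * (1.0 - mean) / mean
    return s1, s2


def log_beta(x, y):
    """Logarithm of the beta function."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def beta_binomial_pmf(k, n, s1, s2):
    """P(n_heter = k) among n individuals under a beta-binomial model."""
    return comb(n, k) * exp(log_beta(k + s1, n - k + s2) - log_beta(s1, s2))


def prob_at_least_k(n_total, k, mean, cv):
    """P(n_heter >= k) when the genotype frequency varies among individuals."""
    s1, s2 = beta_params(mean, cv)
    return 1.0 - sum(beta_binomial_pmf(i, n_total, s1, s2) for i in range(k))


for cv in (0.01, 0.25, 0.5):
    prob = prob_at_least_k(n_total=24, k=8, mean=0.45, cv=cv)
    print(f"CV = {cv}: P(at least 8 heterozygotes among 24) = {prob:.3f}")
```

A very small CV approximates the homogeneous (binomial) case, so the loop shows how the probability of reaching the target changes as heterogeneity grows.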
In this section, we focused exclusively on sample size in a pilot study of a DNA-pooling study. Discussions concerning sample size in the second stage can be found in BARRATT et al. (2002).
SIMULATION STUDY
Procedures:
We carried out simulation studies to assess both the performance of different adjustment factors for estimating CPA in the first stage and the consequential impact on the pooling-based association test in the second stage. For the first stage, the simulation procedures can be outlined as follows:
1. Specify the simulation conditions: The number of heterozygous individuals nheter ranges from 8 to 40 in increments of 16, and the true CPA $\kappa$ is set to 0.5, 1, and 2.
2. Generate peak intensity data: Because a gamma distribution can cover many different random patterns, we considered a bivariate gamma distribution of the peak intensities in the simulation study. The parameters for the bivariate gamma distribution were set to yield CVs of 0.1 or 0.3 for the peak intensities, and the correlation of the pair of peak intensities was 0.5.
3. Estimate the adjustment factor and hyperparameters: On the basis of the data from step 2, we calculated the adjustment factor $\hat\kappa^{(s)}$ and estimated the hyperparameters of the gamma distribution using the moment method, where the superscript is the simulation index.
4. Calculate bootstrap standard error: Data were bootstrapped B times from the empirical gamma distribution $\hat G(\hat\mu_A, \hat\mu_a, \hat\sigma^2_A, \hat\sigma^2_a, \hat\rho)$ to estimate the CPA $\kappa$. The bootstrap standard error of the adjustment factor was obtained by calculating the sample standard deviation over the B estimates.
5. Calculate the estimation bias, standard error, and root mean square error: Steps 2–4 were repeated for S simulation replications, and the estimation bias and standard error were calculated over these replications. The root mean square error (RMSE) was $\mathrm{RMSE} = (\mathrm{BIAS}^2 + \mathrm{SE}^2)^{1/2}$.
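The first-stage procedure can be sketched in Python as below. Several choices are assumptions made for this illustration: the bivariate gamma pairs are built from a shared gamma component (one of several possible constructions), both alleles share the same CV, the standard error is taken as the standard deviation of the estimates across replications instead of an average of bootstrap standard errors, and only the arithmetic-mean and geometric-mean factors are compared.

```python
import math
import random
from statistics import fmean, stdev


def draw_pair(rng, mu_A, mu_a, cv, rho):
    """Draw one (h_A, h_a) pair with gamma margins, common CV `cv`,
    and correlation `rho`, using a shared gamma component."""
    shape = 1.0 / cv**2
    shared = rng.gammavariate(rho * shape, 1.0)
    g_A = shared + rng.gammavariate((1.0 - rho) * shape, 1.0)
    g_a = shared + rng.gammavariate((1.0 - rho) * shape, 1.0)
    # Dividing by the shape gives unit mean; then scale to the target mean.
    return mu_A * g_A / shape, mu_a * g_a / shape


def simulate(kappa, cv, n_heter, n_rep=200, rho=0.5, seed=1):
    """Return bias, SE, and RMSE of two CPA estimators over n_rep runs."""
    rng = random.Random(seed)
    estimates = {"arithmetic": [], "geometric": []}
    for _ in range(n_rep):
        ratios = []
        for _ in range(n_heter):
            h_A, h_a = draw_pair(rng, kappa, 1.0, cv, rho)
            ratios.append(h_A / h_a)
        estimates["arithmetic"].append(fmean(ratios))
        estimates["geometric"].append(
            math.exp(fmean([math.log(r) for r in ratios])))
    summary = {}
    for name, values in estimates.items():
        bias = fmean(values) - kappa
        se = stdev(values)
        summary[name] = {"bias": bias, "se": se, "rmse": math.hypot(bias, se)}
    return summary


result = simulate(kappa=2.0, cv=0.3, n_heter=24)
for name, stats in result.items():
    print(name, {key: round(value, 4) for key, value in stats.items()})
```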
In the second stage, we simulated case-control association tests using the following procedures:
- Specify the simulation conditions: The sample sizes were ncase = ncontrol = 500. The population allele frequencies in case and control groups were (pcase = 0.25, pcontrol = 0.25) and (pcase = 0.25, pcontrol = 0.15) for the calculation of type I error and power, respectively. The standard deviation of experimental error was set to $\sigma_E$ = 0.02 or $\sigma_E$ = 0.05.
- Generate allele frequency: The sample frequencies in the case group were generated from a normal distribution N(pcase, Var($\hat p_{case}$ | $\hat\kappa$)), and a similar approach was applied to the control group.
- Summarize the test results: On the basis of the simulated data, we calculated the type I error when the case and control groups had the same true allele frequency; we calculated the power when the groups had different true allele frequencies.
A total of 12 simulation conditions were considered and were arranged in the following order: ($\sigma_E$, CV, nheter) = {(0.02, 0.1, 8), (0.02, 0.1, 24), (0.02, 0.1, 40), (0.02, 0.3, 8), (0.02, 0.3, 24), (0.02, 0.3, 40), (0.05, 0.1, 8), (0.05, 0.1, 24), (0.05, 0.1, 40), (0.05, 0.3, 8), (0.05, 0.3, 24), (0.05, 0.3, 40)}. All simulations were carried out using 200 simulation replications and 500 bootstrap replications.
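A rough feel for how experimental error drives power under these conditions can be obtained from a normal approximation. The Python sketch below is a simplification under stated assumptions: the variance of the frequency difference contains only binomial sampling variance for each group plus one experimental-error variance per pool, and the contribution of CPA estimation is ignored. The function name is hypothetical.

```python
from statistics import NormalDist


def approx_power(p_case, p_control, n_case, n_control, sigma_e, test_size=0.05):
    """Approximate power of a two-sided test on the frequency difference.

    Assumed variance: p(1 - p) / (2n) for each group (2n alleles per n
    diploid individuals) plus sigma_e**2 for each of the two pools.
    The variance due to estimating the CPA is deliberately left out.
    """
    variance = (p_case * (1 - p_case) / (2 * n_case)
                + p_control * (1 - p_control) / (2 * n_control)
                + 2 * sigma_e**2)
    normal = NormalDist()
    z = abs(p_case - p_control) / variance**0.5
    z_crit = normal.inv_cdf(1 - test_size / 2)
    return normal.cdf(z - z_crit) + normal.cdf(-z - z_crit)


for sigma_e in (0.02, 0.05):
    power = approx_power(0.25, 0.15, 500, 500, sigma_e)
    print(f"sigma_E = {sigma_e}: approximate power = {power:.3f}")
```

Setting the two frequencies equal returns the nominal test size, which gives a quick sanity check of the approximation.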
Results:
In the simulation studies, we explored the impact of several elements on the adjustment factor and association test. These elements included the structure of peak intensity, degree of preferential amplification, experimental error, and sample size. The simulation results of $\kappa$ = 0.5, $\kappa$ = 1, and $\kappa$ = 2 are summarized in Tables 1, 2, and 3, respectively.
With regard to the performance of adjustment factors, we found several meaningful patterns. The estimated CPA is affected by the CV of the peak intensities. As CV increases, i.e., in the case of large variability of peak intensities, there is a concomitant rise in the estimation bias, standard error, and RMSE of $\hat\kappa$. Relative to CV, the mean or variance of the peak intensities alone is insufficient to explain the changes in performance of adjustment factors. As more heterozygous individuals were collected in the pilot study, the standard error and RMSE of $\hat\kappa$ were reduced; however, the effect on the bias of $\hat\kappa$ was not obvious.
Regarding the performance of the association test, we found that the increase in experimental variation dramatically reduced the statistical power of the association test. Generally speaking, from $\sigma_E$ = 0.02 to $\sigma_E$ = 0.05, the power decreases from 0.85 to 0.30. Moreover, the type I error correlates positively with the CV of the peak intensities. In most cases, the type I error range was 0.03–0.07. Although larger numbers of heterozygous individuals are useful to reduce the RMSE of the adjustment factor, the efficacy of the association test depends on the degree of experimental error. When $\sigma_E$ = 0.02, the reduction in the variation of $\hat\kappa$ improves the power; when $\sigma_E$ = 0.05, the improvement in the adjustment is neutralized by an increase in the experimental error. The same idea applies to the CV of peak intensity.
In general, the proposed adjustment measures yielded better performance than Hoogendoorn's method with respect to the estimation of $\kappa$ and the association test. In all simulation trials, we found that the two proposed adjustment factors yielded a smaller bias, standard error, and RMSE compared with Hoogendoorn's method. Given a prespecified test size, $\hat\kappa_U$ yielded the highest power among the three adjustment methods with regard to relatively small experimental errors ($\sigma_E$ = 0.02); $\hat\kappa_G$ yielded the highest power in cases of larger experimental errors ($\sigma_E$ = 0.05).
ANALYSIS OF A LABORATORY...
In the DNA-pooling experiment, each individual's genomic DNA was diluted to 12.5 ng/µl and quantified using the PicoGreen assay (Molecular Probes, Eugene, OR). Equimolar amounts of genomic DNA from the 30 individuals were then pooled. PCR amplification and primer extension reactions were performed using an ABI 9700 system (AME Bioscience, Towaco, NJ). The peak intensities of alleles were measured using a MALDI-TOF spectrometer (Sequenom) based on wavelet technology. Hence, the unadjusted allele frequencies for the six SNPs could be estimated on the basis of Equation 6, and the results are shown in Table 4 (column 3).
Hoogendoorn's adjustment and our two proposed methods were applied to this data set. For the six SNPs, the estimates of CPA were as follows: based on Hoogendoorn's method, 1.259, 0.765, 0.662, 0.873, 1.771, and 2.288; based on the geometric-mean method, 1.252, 0.687, 0.634, 0.859, 1.749, and 2.265; and based on the bias-reduction method, 1.221, 0.674, 0.632, 0.851, 1.788, and 2.320. The adjusted allele-frequency estimates, based on Equation 7, are shown in Table 4 in columns 4 (Hoogendoorn's method), 5 (geometric-mean method), and 6 (bias-reduction method).
To summarize the findings in this analysis of laboratory data, we found that it is essential to adjust for preferential amplification. This adjustment reduced the estimation bias of the allele frequencies except for the second SNP in this data set. In this case, the serious underestimate of allele frequency might have arisen from uncontrollable experimental variations, such as overestimation of the extended primer or an effect of DNA quality on SNP variance (WERNER et al. 2002). For this specific SNP, the adjustment procedure yielded only limited improvement.
In most cases in our study, Hoogendoorn's method reduced the discrepancy between the allele frequencies estimated from the individual genotyping and pooling experiments. Our proposed adjustments reduced the error further than did Hoogendoorn's method. Moreover, our proposed methods yielded a smaller variation in the CPA compared with Hoogendoorn's method, and in turn our methods gave a smaller variation in allele frequency estimation. Overall, our results demonstrate that the proposed adjustments provide a more accurate and reliable estimation of allele frequency for this data set.
DISCUSSION
In our method, type I error is usually controlled well except when the CV of the peak intensity is high and $\kappa$ > 1. Also, the performance of $\hat\kappa$ is apparently not symmetric about $\kappa$ = 1. In general, a smaller CPA yields a correspondingly smaller RMSE. Comparing $\kappa$ = 2 and $\kappa$ = 0.5, the latter gives better performance with respect to RMSE, power, and type I error. Hence, we suggest that the preferentially amplified allele be placed in the denominator when calculating the adjustment factor (see Equations 3 and 4).
In addition to our two new proposed adjustment factors, we also investigated several other methods, including the median-based measure, harmonic-mean-based measure, and some modified ratio estimators (BEALE 1962; TIN 1965). Although some of these methods yielded a better estimate of CPA than Hoogendoorn's method did during the simulation study (YANG et al. 2003), they are not superior to our proposed adjustment factors in this article.
We investigated the role of sample size during the pilot stage of the DNA-pooling study. The use of a large number of heterozygous individuals reduces the RMSE of the adjustment factor. However, 24 heterozygous individuals should be sufficient to obtain a good adjustment factor. To explore the sample size requirement, we considered both fixed-effect and random-effect models for different scenarios. In general, random-effect models yield a larger sample size than fixed-effect models when the same set of parameters is used. For example, given that the genotype frequency in the population is 0.45 and 8 heterozygous individuals are required, 21 individuals need to be genotyped under the fixed-effect model [i.e., CV(pAa) = 0]; 24 and 34 individuals need to be genotyped under CV(pAa) = 0.25 and 0.5, respectively, in the random-effect model. This is so because the random-effect models take into consideration variations among individuals, resulting in an increase in total variation that requires a larger sample size to compensate. Ignoring the heterogeneity may lead to a serious underestimation of sample sizes.
DNA-pooling studies can achieve the objective of screening for important genetic variants at a reasonable cost. However, the unavoidable drawback is that genotypic information and individual features are lost once genomic DNA is mixed. Although some advanced studies have attempted to reconstruct the lost information (ITO et al. 2003), a bottleneck still exists due to the mass of genotype and haplotype combinations in the pool. Stringent limitations must be satisfied for small pools to reduce the number of combinatorial calculations. To date, DNA-pooling experiments have been used primarily as a screening technique rather than as a legitimate replacement for individual genotyping studies.
Use of DNA pooling is a potentially cost-effective alternative to individual genotyping. Association testing in DNA pools is an efficient method for screening important genetic markers and has been applied successfully in many practical studies, yielding significant findings (CARMI et al. 1995; BANSAL et al. 2002). Our new approaches provide valid adjustments and further improve the conventional method, thereby yielding reliable allele frequency estimation and powerful association tests. These advantages greatly enhance the applicability of the DNA-pooling experiment.
APPENDIX A:
PROPERTIES OF THE PROPOSED ADJUSTMENT FACTOR
Suppose that the population size of heterozygous individuals is Nheter and nheter samples are randomly drawn from the population. Let [Equation]. Hence, the bias of Hoogendoorn's measure can be calculated using the following:
[Equation]
[Equation]
APPENDIX B:
SAMPLE SIZE CALCULATION IN A PILOT STUDY
Suppose that the estimated allele frequency [...] is a differentiable function of [...] and hPA/hPa at a point ([...], HA/Ha). By a first-order Taylor expansion in the two variables [...] and hPA/hPa at ([...], HA/Ha), we know that

[equation missing]

and satisfies R1/||[...]|| → 0 as ||[...]|| → 0. The approximate mean and variance can be calculated as

[equation missing]

[equation missing]

and [...]. The pure variation due to the adjustment of preferential amplification can be obtained by setting [...], resulting in the following variance:

[equation missing] (B1)

If [...] = [...], then we obtain the pure variance due to the measurement of the ratio of the peak intensities,

[equation missing]
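For orientation, the generic form of a first-order Taylor (delta-method) approximation for a function of two random quantities is sketched below. The symbols f, X, Y, \mu_X and \mu_Y are placeholders introduced here and are not the article's own notation; the partial derivatives are evaluated at the expansion point (\mu_X, \mu_Y), which is assumed to be the mean of (X, Y).

```latex
\begin{align*}
f(X, Y) &= f(\mu_X, \mu_Y)
  + \frac{\partial f}{\partial x}\,(X - \mu_X)
  + \frac{\partial f}{\partial y}\,(Y - \mu_Y) + R_1, \\
\mathrm{E}[f(X, Y)] &\approx f(\mu_X, \mu_Y), \\
\mathrm{Var}[f(X, Y)] &\approx
  \left(\frac{\partial f}{\partial x}\right)^{2} \mathrm{Var}(X)
  + \left(\frac{\partial f}{\partial y}\right)^{2} \mathrm{Var}(Y)
  + 2\,\frac{\partial f}{\partial x}\,\frac{\partial f}{\partial y}\,\mathrm{Cov}(X, Y).
\end{align*}
```

In this generic form, setting \mathrm{Var}(Y) = 0 and \mathrm{Cov}(X, Y) = 0 leaves only the contribution of X, which is the kind of step used to isolate a single source of variation.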
To calculate the required sample size, the variance in Equation B1 must be rewritten as a function of nheter. If the adjustment of HOOGENDOORN et al. (2000) is applied, then the variance in Equation B1 can be represented as

[equation missing]

and [...], j = 1, ... , nheter. Under risk [...] and the specified absolute error [...], the required number of heterozygous individuals is

[equation missing]

such that Pr{|[...] − p| < [...]} ≥ 1 − [...].
APPENDIX C:
THE MARGINAL DISTRIBUTIONS OF SAMPLE SIZES UNDER DIFFERENT GENOTYPING EXPERIMENTS
First, under sequential genotyping experiments, ntotal is a random variable that follows a negative binomial distribution with success probability pH (the probability of a heterozygote). Under the RAF model, we assume that the allele frequency pA follows a beta distribution beta([...], [...]). Hence, the marginal distribution of ntotal can be derived as follows:

[equation missing]

[...] beta([...], [...]). The marginal distribution of ntotal is

[equation missing]

[equation missing]

[equation missing]
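The first case above (a negative binomial total with a beta-distributed allele frequency) can be explored numerically with a small Monte Carlo sketch. Two assumptions are made here that are not stated in this appendix: the heterozygote probability is taken as pH = 2 pA (1 - pA), i.e., Hardy-Weinberg proportions, and the beta shape parameters `a` and `b` are arbitrary illustrative values. All function names are hypothetical.

```python
import math
import random


def draw_n_total(n_heter, p_heter, rng):
    """One negative binomial draw: individuals genotyped until n_heter heterozygotes.

    Each waiting time between successive heterozygotes is geometric and is
    sampled by inverse transform.
    """
    total = 0
    for _ in range(n_heter):
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        total += math.floor(math.log(u) / math.log1p(-p_heter)) + 1
    return total


def simulate_marginal(n_heter, a, b, n_draws, seed=0):
    """Monte Carlo draws from the marginal distribution of the total sample size."""
    rng = random.Random(seed)
    draws = []
    while len(draws) < n_draws:
        p_a = rng.betavariate(a, b)       # random allele frequency
        p_h = 2.0 * p_a * (1.0 - p_a)     # assumption: Hardy-Weinberg proportions
        if p_h <= 0.0:                    # degenerate draw, resample
            continue
        draws.append(draw_n_total(n_heter, p_h, rng))
    return draws


if __name__ == "__main__":
    sample = simulate_marginal(n_heter=8, a=5.0, b=5.0, n_draws=10000, seed=1)
    sample.sort()
    print("mean:", sum(sample) / len(sample))
    print("95th percentile:", sample[int(0.95 * len(sample))])
```

The empirical quantiles of such draws give a rough picture of how many individuals may need to be genotyped once uncertainty in the allele frequency is taken into account; they are a numerical check rather than a substitute for the closed-form marginal distribution.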
LITERATURE CITED
BANSAL, A., D. VAN DEN BOOM, S. KAMMERER, C. HONISCH, G. ADAM et al., 2002 Association testing by DNA pooling: an effective initial screen. Proc. Natl. Acad. Sci. USA 99: 16871-16874.
BARRATT, B. J., F. PAYNE, H. E. RANCE, S. NUTLAND, J. A. TODD et al., 2002 Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design. Ann. Hum. Genet. 66: 393-405.
BEALE, E. M. L., 1962 Some uses of computers in operational research. Indust. Organ. 31: 51-52.
CARMI, R., T. ROKHLINA, A. E. KWITEK-BLACK, K. ELBEDOUR, D. NISHIMURA et al., 1995 Use of a DNA pooling strategy to identify a human obesity syndrome locus on chromosome 15. Hum. Mol. Genet. 4: 9-13.
DOWNES, K., B. J. BARRATT, P. AKAN, S. J. BUMPSTEAD, S. D. TAYLOR et al., 2004 SNP allele frequency estimation in DNA pools and variance components analysis. Biotechniques 36: 840-845.
EFRON, B., and R. J. TIBSHIRANI, 1993 An Introduction to the Bootstrap. Chapman & Hall, New York.
HARTLEY, H. O., and A. ROSS, 1954 Unbiased ratio estimates. Nature 174: 270-271.
HOOGENDOORN, B., N. NORTON, G. KIROV, N. WILLIAMS, M. L. HAMSHERE et al., 2000 Cheap, accurate and rapid allele frequency estimation of single nucleotide polymorphisms by primer extension and DHPLC in DNA pools. Hum. Genet. 107: 488-493.
ITO, T., S. CHIKU, E. INOUE, M. TOMITA, T. MORISAKI et al., 2003 Estimation of haplotype frequencies, linkage-disequilibrium measures, and combination of haplotype copies in each pool by use of pooled DNA data. Am. J. Hum. Genet. 72: 384-398.
LE HELLARD, S., S. J. BALLEREAU, P. M. VISSCHER, H. S. TORRANCE, J. PINSON et al., 2002 SNP genotyping on pooled DNAs: comparison of genotyping technologies and a semi-automated method for data storage and analysis. Nucleic Acids Res. 30: e74.
MOHLKE, K. L., M. R. ERDOS, L. J. SCOTT, T. E. FINGERLIN, A. U. JACKSON et al., 2002 High-throughput screening for evidence of association by using mass spectrometry genotyping on DNA pools. Proc. Natl. Acad. Sci. USA 99: 16928-16933.
SHAM, P., J. S. BADER, I. CRAIG, M. O'DONOVAN and M. OWEN, 2002 DNA pooling: a tool for large-scale association studies. Nat. Rev. Genet. 3: 862-871.
TIN, M., 1965 Comparison of some ratio estimators. J. Am. Stat. Assoc. 60: 294-307.
VISSCHER, P. M., and S. LE HELLARD, 2003 Simple method to analyze SNP-based association studies using DNA pools. Genet. Epidemiol. 24: 291-296.
WERNER, M., M. SYCH, N. HERBON, T. ILLIG, I. R. KONIG et al., 2002 Large-scale determination of SNP allele frequencies in DNA pools using MALDI-TOF mass spectrometry. Hum. Mutat. 20: 57-64.
YANG, H.-C., C.-L. CHEN and C. S. J. FANN, 2003 Estimation of allele frequencies with preferential amplification in a DNA-pooling study. Am. J. Hum. Genet. 73: 2625.