Recent research has developed various promising methods to control for population structure in genomewide association mapping of complex traits, but systematic examination of how well these methods perform under different genetic scenarios is still lacking. Appropriate methods for controlling genetic relationships among individuals need to balance the concern of false positives and statistical power, which can vary for different association sample types. We used a series of simulated samples and empirical data sets from cross- and self-pollinated species to demonstrate the performance of several contemporary methods in correcting for different types of genetic relationships encountered in association analysis. We proposed a two-stage dimension determination approach for both principal component analysis and nonmetric multidimensional scaling (nMDS) to capture the major structure pattern in association mapping samples. Our results showed that by exploiting both genotypic and phenotypic information, this two-stage dimension determination approach balances the trade-off between data fit and model complexity, resulting in an effective reduction in false positive rate with minimum loss in statistical power. Further, the nMDS technique of correcting for genetic relationship proved to be a powerful complement to other existing methods. Our findings highlight the significance of appropriate application of different statistical methods for dealing with complex genetic relationships in various genomewide association studies.
ASSOCIATION mapping of genetic factors underlying major human diseases has yielded very promising results, demonstrating the power of both genomewide and candidate-gene association mapping approaches (Newport et al. 2007; Samani et al. 2007; Scott et al. 2007; Wellcome Trust Case Control Consortium 2007; Yeager et al. 2007). Significant reductions in the costs of sequencing and genotyping make it feasible to conduct association mapping studies in many other species (Thornsberry et al. 2001; Aranzana et al. 2005; Borevitz et al. 2007; Clark et al. 2007). However, one obstacle inherent to drawing causal inferences from earlier association studies is the confounding effect of population structure (Lander and Kruglyak 1995; Marchini et al. 2004; Hirschhorn and Daly 2005; Wang et al. 2005; Shriner et al. 2007). Because this effect generally increases in proportion to population size (Reich and Lander 2001), population structure remains a major concern in large-scale association analyses (McCarthy et al. 2008).
Several methods have been proposed for using association mapping in populations with complex genetic structure. Besides the earlier methods of genomic control (Devlin and Roeder 1999; Devlin et al. 2004) and structured association based on Bayesian clustering (Pritchard et al. 2000; Falush et al. 2003, 2007), principal component analysis (PCA) was recently suggested as a fast and effective way to diagnose population structure (Price et al. 2006; Patterson et al. 2007). The PCA method summarizes variation observed across all markers into a number of underlying component variables. Outputting each individual's coordinates along axes of variation describes the population membership, or ancestry, of each individual (Novembre and Stephens 2008; Reich et al. 2008). The PCA method makes it computationally feasible to handle a large number of markers (tens of thousands) and correct for subtle population stratification (Hunter et al. 2007; Wellcome Trust Case Control Consortium 2007; Yeager et al. 2007). Statistically, however, because the information compression capability of PCA is generally weak a small number of dimensions may not be able to capture most of the overall variation even when the data are in fact on a very low dimensional manifold (Taguchi and Oono 2005). Thus, PCA may fail to properly correct for population structure in certain situations (Kimmel et al. 2007). Other multivariate statistical analyses such as correspondence analysis and supervised machine learning have also been used to account for population structure (Kraakman et al. 2004; Epstein et al. 2007), and their performance requires further investigation.
Multidimensional scaling (MDS) is a major branch of multivariate analysis that has been widely used to visualize hidden relations among objects in data (Borg and Groenen 2005) and has been applied to genomic data to unravel relational patterns among genes from time series DNA microarray data (Taguchi and Oono 2005; Tzeng et al. 2008). The essence of MDS is to find the configuration of points compatible with the given dissimilarity relations among them, and the resulting configuration should visually exhibit the structures or relations hidden in the original data. The MDS techniques fall into metric MDS and nonmetric MDS (nMDS). Metric MDS transforms the dissimilarity data into a set of Euclidean distances so that the final configuration (i.e., an arrangement of points in a smaller number of dimensions) matches the original dissimilarities as closely as possible. nMDS focuses on the rank order of the dissimilarity matrix and requires the final configuration to contain distances that follow the rank of the original dissimilarity matrix as closely as possible. More recently, metric MDS was applied to controlling for population structure (Purcell et al. 2007; Li and Yu 2008). However, metric MDS based on Euclidean distance is numerically identical to PCA, which also forms the foundation of other methods (Price et al. 2006; Tian et al. 2008). The metric MDS solution may result in Euclidean positions that conflict with the ranking of the original observations; the nMDS solves this problem by iterating between monotone algorithm steps and least squares projection steps (Hardle and Simar 2003). This makes applying nMDS a logical choice in our study.
A unified mixed-model approach for association mapping that accounts for multiple levels of genetic relatedness was recently proposed (Yu et al. 2006) for association studies in human, maize, Arabidopsis, and potato panels (Yu et al. 2006; Malosetti et al. 2007; Zhao et al. 2007). Although the complex genealogical history of a species or a population prevents us from drawing a clear delineation among various scenarios, our attempt to classify samples for association analysis into pragmatically defined categories allows a more systematic comparison of various methods (Yu and Buckler 2006; Yu et al. 2006; Kang et al. 2008) (Figure 1).
We evaluated a range of methods including nMDS, mixed model, and PCA. We first investigated how dimensions of nMDS and PCA can be determined with a two-stage approach. Then we systematically examined how the different methods perform, reducing false positive rates while maintaining power, by using comprehensive simulations of different types of samples (Figure 1) followed by analyses of two real data sets. We concluded that the nMDS method is a powerful method for association mapping in populations with either subtle or complex genetic structure.
MATERIALS AND METHODS
Dissimilarity matrix calculation:
Given a molecular marker data set as a large rectangular matrix M, with n individuals indexed by rows and m markers indexed by columns, let Mij be the marker genotypic value for individual i and marker j. For biallelic markers such as single nucleotide polymorphisms (SNPs), Mij = 0, 1, or 2. The average taxonomic dissimilarities among individuals can be computed as follows because nMDS analysis demands a less rigid relationship between the dissimilarities and distances among individuals (Sneath and Sokal 1973):
The resulting dissimilarity matrix is an n × n inherently symmetrical matrix (so that ), with a dimension equal to the number of individuals in the data set. We also ignore the self-dissimilarities . A monotonic increasing function f, , can be used to generate a set of distances () given dissimilarities (). The function f has the property that if , then .
nMDS analysis based on the distance matrix:
After obtaining the dissimilarities among individuals, the most common approach used to determine the elements and to obtain the coordinates of the individuals given only rank order information is an iterative process commonly referred to as the Shepard–Kruskal algorithm (Kruskal 1964a,b; Shepard 1980). For clarity, we redescribe here the basic steps in a preliminary algorithm.
Step 1: Randomly choose an initial configuration for n individuals in k dimension (; i = 1, … ; n; l = 1,…, k). Results from principal coordinates analysis may be used as initial configuration (Spence 1972). Then calculate the average taxonomic distances (; i = 1,…, n; j = 1,…, n).
Step 2: Determine , from the distances , by pool-adjacent violation algorithm, which is described as follows: This begins with the lowest ranked value of after sorting ascendingly [there should be n(n − 1)/2 ranks]. The adjacent values are compared for each to determine whether they are monotonically related (i.e., if , then ). Whenever a block of consecutive values violates the required monotonic property, the values are averaged together with the most recent nonviolator value to obtain an estimator. Eventually, this value is assigned to all points in the particular block.
Step 3: Compute the STRESS (Kruskal's goodness-of-fit index) value, which assesses how well the derived con-figuration fits the given dissimilarities: .
Step 4: Update the configuration using steepest decent algorithm, , l = 1,…, k, where k is the number of dimensions. The choice of step width α is crucial. Kruskal (1964a) proposed a starting value of α = 0.2.
Step 5: Go back to step 2 unless a maximum number of iterations or a specified small STRESS value is reached.
Association mapping analysis:
Our mixed model approach closely followed that of Yu et al. (2006). The vector of phenotypes, y, was modeled aswhere α is vector of allele effects to be estimated, β is vector of dimension effects, and u is a vector of polygene background effects. The term N contains the coordinates of the individuals of p dimensions by nMDS analysis; X and Z are incidence matrices of 1's and 0's relating y to α and u, respectively, and e is a vector of residual effects. The phenotypic covariance matrix was assumed to have the form , where K is an n × n matrix of relative kinship coefficients that define the degree of genetic covariance between a pair of individuals (Loiselle et al. 1995), I is an n × n identity matrix, is the genetic variance attributable to genomewide effects, and is the residual variance.
For model testing, maximum likelihood rather than restricted maximum likelihood was used to solve the mixed model and obtain the variance component estimates of and because comparisons among models with different random or fixed variables were conducted (Lynch and Walsh 1998). The −2 log-likelihood function of parameter space Ω (i.e., ) given marker and phenotype data (X, N, y) is
The null hypothesis that there is no association between a specific marker locus and trait (H0: α = 0) was tested using the likelihood-ratio (LR) test statisticUnder the null model, LR is approximately distributed as a chi-square variable with 1 d.f.
We compared nMDS with PCA. Estimating population structure using PCA has been described elsewhere, but briefly the marker data were normalized and then the covariance of the resulting matrix was used to compute PCA (Price et al. 2006). We could not apply structured association to our simulated data sets because of the high computational cost of this method. However, we know that samples in these simulated data sets have already been grouped into known subpopulations. Thus, we were able to compare differences among nMDS, PCA, and known structure models.
Dimension determination for nMDS and PCA:
We used two-stage determination of the number of dimensions in nMDS or the number of axes of variation in PCA.
Stage 1: The number of dimensions in nMDS that satisfy STRESS <0.05 was determined. In contrast, for PCA, the number of statistically significant principal components, as measured by the Tracy–Widom (TW) statistic without correcting for linked markers and linkage disequilibrium (Tracy and Widom 1994), was treated as an initial number of axes of variation (significance level at 0.05). In this stage, only marker genotypic data were used to determine the number of dimensions for both nMDS and PCA.
Stage 2: Phenotypic data were incorporated into the process to determine the final number of dimensions for both nMDS and PCA. Using the multilinear regression model, we fitted trait values with different numbers of significant dimensions that had been determined by the first stage. The number of dimensions with minimum Bayesian information criterion (BIC) value (Schwarz 1978) was then selected as the final dimension.
Mixture beta models for allele frequency used in simulation studies:
Assume that we have allele frequencies in S subpopulations at L loci, each locus having two alleles A1 and A2. Let pik be the allele frequency of A1 at locus i in subpopulation k, i = 1,…, L, k = 1,…, S. Working with pik is sufficient because allele frequencies of A1 and A2 sum to 1. Allele frequency can be modeled by the mixture model(Fu et al. 2005), where xik is a set of independent beta variates, yi is another set of independent beta variates, and w is the mixture coefficient of the two beta variates and a number between 0 and 1. Further, xik and yi are assumed to be independent of each other. That is, for any locus, the allele frequency at each subpopulation could be expressed as the weighted sum of two independent components, an individual component for that subpopulation (xik), and a common component across all populations (yi). The common component could be considered as the contribution of shared history or gene flow to allele frequency because both subpopulations share history and gene flow, making them similar to each other. A smaller w is associated with high correlation. When w = 1, pik = xik, which reduces to a formulation similar to the usual beta model (Balding and Nichols 1995; Holsinger et al. 2002; Nicholson et al. 2002; Falush et al. 2003; Marchini et al. 2004; Price et al. 2006). In addition, mixture beta models can be easily extended to the genetic data from different geographical regions (supporting information, File S1).
For each locus, we assumed the same probabilistic structure and focus on loci that have been subjected to similar evolutionary processes. In particular, we assumedwhere is the mean allele frequency across loci for each subpopulation, which can be interpreted as the allele frequency in an ancestral population. The allele frequency across loci has a wide spectrum and its variance depends on ck. Our ck parameters are analogous to Fst values, but with one for each subpopulation (Nicholson et al. 2002; Balding 2003). Using a different value of ck for each subpopulation (Fu et al. 2005) introduces a considerable amount of extra flexibility into the model. The size of ck tells us about the effective population size of subpopulation k during the time since divergence. Large values of ck indicate a smaller effective population size (Nicholson et al. 2002). It follows thatThese formulas show that the same loci from different populations are correlated but different loci from the same or different populations are not.
Simulating samples from different types of populations:
To assess the ability of nMDS and PCA to correct for population structure, we simulated data sets with five types of samples: ideal population and samples with population structure (type I), samples with familial relationship (type II), samples with population structure (type III), samples with both population structure and familial relationship (type IV), and samples with severe population structure and familial relationship (type V) (File S1).
To mimic different types of samples used for association mapping (Figure 1), we considered two different scenarios. For convenience, we assumed the population size was 216 and the total number of SNPs was 2000, each chromosome containing 200 SNP loci. For three subpopulations, c1 = 0.15, c2 = 0.10, c3 = 0.05; π1 = 0.15, π2 = 0.30, π3 = 0.45; and w = 1, c = 0.05, π = 0.5. For eight subpopulations, c1 = 0.15, c2 = 0.10, c3 = 0.05, c4 = 0.01, c5 = 0.15, c6 = 0.10, c7 = 0.05, c8 = 0.01; π1 = 0.05, π2 = 0.10, π3 = 0.15; π4 = 0.20, π5 = 0.25, π6 = 0.30, π7 = 0.35, π8 = 0.40; and w = 1, c = 0.05, π = 0.5.
Genotypes and genotypic values of individuals:
A linkage map of 2000 cM comprising 10 chromosome segments, each 200 cM in length, was considered. An additive genetic model with no dominance or epistasis was assumed. Twenty-five quantitative trait nucleotides (QTNs) at 25 different positions randomly selected from 2000 SNPs were simulated. In all simulations, we set each QTN genotypic value with genotype QQ as 0.5 and with genotype qq as 0, and the overall mean was set at 10. Overall genotypic value of an individual was obtained as the sum of genotypic values across all QTNs plus the overall mean. Individual phenotype was generated as the genotypic value plus a random variable sampled from a standard normal distribution. In addition, we varied these parameters depending on the purpose of the analysis.
Design of simulation studies:
Here we describe different simulation studies carried out for different questions we addressed. Unless specified in the following description, the parameter values for sample simulation were the same as indicated in the previous two sections. The adjustment of parameter values enabled us to use the general simulation framework to answer different questions.
Dimension determination: For all five sample types, the simulation for each parameter configuration was repeated 1000 times. In each replication, we first determined the number of dimensions through Kruskal's STRESS value in nMDS (Kruskal 1964a) (STRESS value <0.05) or the Tracy–Widom test in PCA (P-value <0.05). Then we carried out model fitting and genomewide association tests to get the BIC value and false positive rate at each dimension. Average statistics were calculated over 1000 simulated data sets. We counted the number of cases when the minimal BIC measure exactly corresponded to the minimal false positive rate over 1000 simulated data sets.
False positive rate and power comparison: The preparatory set of experiments was conducted only on type III samples to assess the ability of nMDS in correcting for population structure. We assumed that c1 = 0.015, c2 = 0.060, c3 = 0.075. We considered two different scenarios: (1) π1 = 0.15, π2 = 0.45, π3 = 0.75, and (2) π1 = 0.15, π2 = 0.20, π3 = 0.25. The average false positive rate for each parameter set was computed over 100 repetitions (Figure 2).
After preparatory effectiveness judgment of the nMDS method in correcting for population structure, we further compared the nMDS, PCA, and mixed-model methods with different types of samples. Because of the high computational cost incurred by the mixed-model methods containing the K, nMDS + K, and PCA + K, we estimated the empirical type I error and power by averaging results from 20 simulated data sets under each sample type. The same results were shown for both type I and type II samples because the number of subpopulations did not apply in these cases. In these simulations, 10 additional chromosomes of each 200-cM length were added to the original genetic map of 2000 cM to calculate false positive rate and threshold for power comparison. There were no QTNs located on the additional chromosomes and the 2000 SNPs on these chromosomes were used to calculate the false positive rate and threshold comparison. Both TW statistics and STRESS values were calculated using genotypic data for these 2000 SNPs without QTNs. Phenotypic data were then used to determine the dimension on the basis of BIC values. False positive rate was determined with the test statistics resulting from these 2000 SNPs and the simulation was repeated 1000 times. The average of empirical threshold values was used to make power calculations by comparing the test statistics of the simulated QTN with the chosen threshold value.
Number of SNPs: To examine the effect of number of SNPs required for population structure estimation with either PCA or nMDS, we simulated type III samples with varying numbers of independent SNPs. Parameter values were the same as described previously, except that 40 QTNs at 40 different positions randomly selected from 2000 SNPs were simulated. The simulations were carried out 100 times.
Common correlation: To examine the effects of correlation across subpopulations on false positive rate and detection power, we simulated type III samples with varied mixture coefficients (i.e., different correlation between allele frequencies across subpopulations): w = 0, 0.2, 0.4, 0.6, 0.8, or 1. We generated 100 data sets of 216 individuals from three subpopulations.
Empirical data sets:
Genotypes and three phenotypes (i.e., flowering time, ear height, and ear diameter) from 277 maize strains across 553 SNPs as described previously (Yu et al. 2006) were downloaded from http://www.maizegenetics.net. Arabidopsis genotypes and phenotypes were obtained from a published data set (Zhao et al. 2007) and www.plosgenetics.org. For the Arabidopsis data, 5419 SNPs were used for further analysis with three traits: SDV, JIC4W, and JIC8W. These traits are flowering time measurements under either short or long days with different lengths of vernalization. The missing entries were filled using a least-squares regression-based technique (Alter et al. 2000) and complete marker data were used to carry out kinship, PCA, and nMDS analyses. In both maize and Arabidopsis data sets, the number of dimensions was determined by the two-stage dimension determination approach. The Q, PCA, and nMDS matrices were computed accordingly.
Two-stage dimension determination:
We first investigated dimension determination by traditional methods in nMDS and PCA and the two-stage approach we proposed. Using type V samples with three subpopulations, we computed relevant statistics for models containing different numbers of dimensions chosen by nMDS and PCA methods (Table 1). With a threshold of 0.05, 16 dimensions were suggested to account for the genetic structure of the sample. A well-known method of selecting the dimension in nMDS analysis is the elbow test in which STRESS is plotted against number of dimensions. Ideally, the choice would be obvious from the elbow in the STRESS plot, the point at which stress is not reduced substantially after a certain number of dimensions. However, the STRESS values decreased smoothly with increasing dimension, which made selecting appropriate number of dimensions difficult. In the second step, trait values were further incorporated into the process of dimension determination by computing BIC values for models containing different nMDS dimensions. The minimal BIC measure occurred when the dimension was 4, a point at which the minimal false positive rate was also observed (Table 1). For PCA, using BIC measures reduced the number of dimensions from 22 to 6, the point at which the lowest false positive rate was observed.
Results were similar when we evaluated the empirical type I error using other types of samples with three or eight subpopulations (Table S1, Table S2, Table S3, Table S4, Table S5, Table S6, Table S7), although nMDS and PCA are less sensitive to the number of dimensions when type III samples were simulated than other samples types. In addition, when investigating samples from our simulation with subtle population structure and familial relatedness (type I), the minimum BIC values for both the nMDS and PCA methods occurred when the dimension was one (Table S1). As expected, BIC measures for the simple model were smaller than for the nMDS and PCA models, validating that no population structure is required for an ideal population. After validating its superiority, we used only the final dimensions of PCA and nMDS determined by this two-stage dimension determination method to compare false positive rates and power under different scenarios.
When the above procedures were repeated for type V samples, we found that in >75% of replications, the minimum BIC value exactly corresponded at the point at which the minimum false positive rate occurred with the two-stage dimension determination method for nMDS and PCA (Table 2). Extending the procedures to other types of samples led to very similar results: >72% of replications had a dimension chosen by the two-stage method for which the minimum BIC value exactly corresponded to the minimum false positive rate.
False positive rates:
The first set of experiments used type III samples to assess the ability of nMDS to correct for population structure (Figure 2). Given the structure of these samples, we compared nMDS with PCA but excluded mixed-model methods. The labeled model (i.e., group membership as simulated) and the simple model were included for comparison. The proposed nMDS method could adjust appropriately for the effect of population structure in all scenarios considered, whereas PCA had a notably more inflated false positive rate than the labeled model (Figure 2). The PCA method performed better when the ancestral allele frequency difference among subpopulations was relatively small (π1 = 0.15, π2 = 0.20, and π3 = 0.25) than when the difference was large (π1 = 0.15, π2 = 0.45, and π3 = 0.75).
After evaluating how well the nMDS method corrected population structure, we compared nMDS, PCA, and mixed-model methods (Table 3) for five different sample types. For types III, IV, and V samples with either three or eight subpopulations, the false positive rate for nMDS was lower than or equal to PCA (Figure 3). For types II and IV samples, the K model yielded a lower false positive rate than the nMDS and PCA models because familial relatedness existed only in type II and there was no admixture among subpopulations in type IV samples. Except for type I samples, the false positive rate for the nMDS + K model was lowest across different sample types. For type I samples, overcorrecting population structure with mixed models either containing a PCA or nMDS component or not, slightly increased the false positive rate. Findings of false positive rate were consistent when there were three or eight subpopulations for types III, IV, and V samples.
Comparing the power of different models requires fixing the false positive rate because when population structure exists, even models with PCA, nMDS, K, or combinations of these elements can still yield a false positive rate higher than the expected 0.05 level (hence, higher but meaningless “power”). To overcome this problem, we generated additional panels with the same population structure without causal SNP. We applied each method to these data sets. The 5% quantile of the P-values obtained was then used as the statistical significance threshold in analyzing data generated with causal SNP for power comparison.
As expected, because minimum population structure existed in type I samples, all methods resulted in very similar power (Figure 4). For other sample types, all methods with PCA, nMDS, K, or combinations of these had higher power than the simple model. In addition, the nMDS model, either standing alone or combined with K matrix in a mixed model, was consistently better than PCA. For type V samples, the nMDS had an advantage in the power over PCA of up to 12% with three subpopulations (Figure 4A) and 16% with eight subpopulations (Figure 4B). For types II and IV samples, the K model worked better than nMDS and PCA. But for type III samples, both nMDS and PCA models gave higher power than the K model. Overall, the nMDS + K mixed-model approach performed best, followed by either PCA + K or nMDS (Figure 4).
Number of SNPs needed for population structure correction:
We further investigated the effect of number of SNPs required to estimate population structure. Simulation methods followed type III samples, but we varied the number of independent SNPs. Clearly, using a large panel of SNPs to correct for population structure was better in both nMDS and PCA (Table 4). For example, when 50,000 SNPs were used for population structure, the PCA method performed reasonably well when there was moderate ancestral difference. nMDS performed relatively robustly when the number of SNPs used was >5000. For types IV and V samples, though nMDS consistently outperformed PCA in false positive rate, both nMDS and PCA still had inflated false positive rates even when 50,000 independent SNPs were used to correct for population structure. Mixed-model methods were, not surprisingly, better for types IV and V samples because of the existence of familial relationships.
Effects of common correlation on association analysis:
Most models for analyzing genetic diversity in geographically structured populations have ignored the correlation in allele frequency among population (Fu et al. 2005). Unfortunately, correlation across subpopulations affects estimates of genetic differentiation and population structure. There was a trend toward a decrease in false positive rate under different models as the mixture coefficient decreased (Figure 5). On the other hand, the power of underlying methods tended to increase as correlation across populations decreased. When there was no correlation across subpopulations (w = 1), the simple model had the highest false positive rate, followed by PCA and nMDS. The labeled model had the lowest false positive rate because this model exactly controlled population structure. As expected, all models had the same false positive rate when there was no population structure (w = 0).
Application to empirical data sets:
We compared the nMDS method with PCA, structured association, and mixed-model methods (Table 3) using maize and Arabidopsis data. We first determined the optimal dimension for these two data sets of various phenotypes using our two-stage dimension determination approach. According to the Tracy–Widom test, the significant axes of variation to account for genetic structure were 13 for maize and 17 for Arabidopsis, whereas in nMDS, the dimensions for which STRESS was <0.05 were 10 for maize and 8 for Arabidopsis (Kruskal 1964a) (Tables 5 and 6 ). When trait values are further incorporated into the process of dimension determination, the final numbers of dimensions for different phenotypes were determined to be smaller for both nMDS and PCA models.
With the optimal number of dimensions, we then compared goodness-of-fit of all relevant models over three phenotypes. In most cases, the nMDS model, adjusted for sample size and number of parameters in the model, showed a relatively significant improvement in goodness-of-fit compared with other models (Tables 7 and 8). For all three maize traits, the nMDS + K model resulted in the best approximation to expected F-values, followed by the Q + K, K, and PCA + K models when accounting for kinship among individuals (Figure 6). Without considering familial relatedness, the nMDS model had the best approximation, followed by the Q and PCA models. Results were similar for all three Arabidopsis traits (Figure 7). One main difference between the maize and Arabidopsis data set analysis was that the K model in the maize data set was closer to the standard model than that in the Arabidopsis data set. After evaluating a broad range of methods in both maize and Arabidopsis data sets, we determined that the mixed-model approach generally performed best.
We have presented a methodology for controlling population structure on the basis of nMDS analysis. The method is a multidimensional reduction approach that takes into account dissimilarity among samples and order of distances. Incorporating nMDS analysis in association mapping yielded an increase in power and a decrease in false positive rate relative to other methods. Our simulations showed that existing methods may fail to completely correct for population structure on several types of populations, yielding high false positive rates. In contrast, nMDS maintained a relatively low false positive rate especially for types IV and V samples (i.e., moderate-to-severe population structure and familial relationship). The nMDS method accounts for the population structure effect more accurately than PCA because instead of requiring that the projection of the points into the k-dimensional space explain a maximum percentage of the variation found, nMDS requires only that the distance in the k-dimensional space has a monotonic relationship to the original distances. Furthermore, the nMDS approach has the desirable property of preserving smaller interpoint distances more faithfully than PCA; nMDS maximizes variances, which gives greater weight to the larger distances (Hardle and Simar 2003).
Applications of mixed models to association studies in many plant species have shown that a mixed-model approach outperforms much-discussed methods developed in the context of human genetics (Zhu et al. 2008). To the best of our knowledge, this is the first instance of systematic examination through a series of extensive simulation studies to verify related results. After incorporating pairwise genetic relatedness between every pair of individuals into the model, the mixed model (e.g., nMDS + K or PCA + K) were clearly better than simple multiple regression models (i.e., PCA or nMDS), especially when types IV and V samples were considered. It seems that an nMDS + K model performs best in almost all common sample types (types II, III, IV, and V) when fewer independent SNPs are used to correct population structure. Neither an nMDS method alone nor a K matrix alone is sufficient for population structure correction (Figures 3 and 4). The nMDS matrix was required in addition to kinship effect even when type II samples were analyzed. We suspected that this might be an artifact of how kinship was estimated. Our K matrix was computed on the basis of traditional work on estimating kinship, which is defined as the fraction of the genome shared identically by descent using marker data. Properties and applications of other types of kinships have been recently examined (Kang et al. 2008; Stich et al. 2008). This issue deserves further investigation because of renewed interest in kinship estimation with its new application in association mapping.
An important issue in multidimensional reduction analysis is choosing the number of dimensions for the scaling solution. A configuration with a high number of dimensions achieves very low stress values but frequently results in relatively high false positive rates and is apt to be determined more by noise than by the essential structure in the data. Conversely, a solution with too few dimensions might not reveal enough of the structure in the data. Furthermore, the number of dimensions identified by each method depends on the underlying population structure the data contains, not just the method itself. Traditionally, the number of dimensions has been generally determined on the basis of random marker information without considering phenotypic information. However, the effects of population structure on different complex traits vary dramatically (Aranzana et al. 2005; Flint-Garcia et al. 2005) and it is logical to hypothesize that the numbers of dimensions required for cofactors in detecting marker–trait association are not necessarily the same.
Finding the appropriate number of dimensions in classical nMDS requires looking for an elbow in the STRESS plot, or simply selecting the appropriate dimension with a sufficiently small STRESS value (i.e., <0.05) (Kruskal 1964a). Lee (2001) developed a Bayesian approach to dimension determination on the basis of a probabilistic formulation of nonmetric multidimensional scaling. Patterson et al. (2007) proposed using the Tracy–Widom statistical test (Tracy and Widom 1994) in selecting the appropriate number of dimensions. They showed that PCA results are not sensitive to the number of axes of variation used, provided a sufficient number of axes capture true population structure effects. Simulation results presented here cast doubt on this method, especially with familial relationships among samples. Structured association suggests that the number of subpopulations chosen should be the value that maximizes estimated model log likelihood (Pritchard et al. 2000). Gao et al. (2007) demonstrated that selfing leads to spurious signals of population substructure. Our simulated experiments showed that population structure can exhibit phenotypic specificity and a two-stage dimension determination performs better.
The simulated model of generating samples with population structure is similar to the method of Nicholson et al. (2002) in essence, but we extended the model to characterize allele frequency distribution among populations and incorporated the correlation among populations as well as extended the mixture beta models to data with different clusters of populations (Fu et al. 2005). This situation is commonly encountered in plants, wildlife, model organism populations, and human genetics (Pritchard et al. 2000; Novembre and Stephens 2008). The model assumes that the populations all diverged from a common ancestral population at the same time but allows that populations may have experienced different amounts of drift since this divergence. The assumption of simultaneous divergence, while unrealistic, is appealing in its simplicity (Falush et al. 2003). The alternative approaches would be to use the coalescent-based simulation (Long and Langley 1999; Zollner and Pritchard 2005) or the rescaling technique (Hoggart et al. 2007) but we admit that all three approaches approximate the complex reality to a certain degree. In our simulation designs, we generated subpopulations followed by successive matings within or among subpopulations to examine in detail the effects of subdivision and admixture viewed as processes in population history (Ewens and Spielman 1995), which up to now, have seldom been reported.
In summary, we demonstrated the feasibility and effectiveness of the nMDS method in accounting for population structure in different types of samples with computer simulations and further validated this with empirical data analyses. The challenge of association mapping lies in its multidisciplinary nature. With further advances in genomics technology, experimental design, and methodology development, we believe large-scale association mapping studies will result in more findings that greatly improve our understanding of genetic architecture of complex traits.
We appreciate Dindo Tabanao and two anonymous reviewers for very helpful comments to improve the manuscript. We thank Jennifer Alexander for technical editing of the manuscript. We thank M. Nordborg and K. Zhao for making the Arabidopsis data available for analysis. This project is supported by the National Research Initiative (NRI) Plant Genome Program of the United States Department of Agriculture Cooperative State Research, Education and Extension Service (CSREES) grant no. 2006-35300-17155.
Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.108.098863/DC1.
Communicating editor: T. P. Brutnell
- Received November 19, 2008.
- Accepted April 21, 2009.
- Copyright © 2009 by the Genetics Society of America