Association or linkage disequilibrium (LD)-based mapping strategies are receiving increased attention for the identification of quantitative trait loci (QTL) in plants as an alternative to more traditional, purely linkage-based approaches. An attractive property of association approaches is that they do not require specially designed crosses between inbred parents, but can be applied to collections of genotypes with arbitrary and often unknown relationships between the genotypes. A less obvious additional attractive property is that association approaches offer possibilities for QTL identification in crops with hard to model segregation patterns. The availability of candidate genes and targeted marker systems facilitates association approaches, as will appropriate methods of analysis. We propose an association mapping approach based on mixed models with attention to the incorporation of the relationships between genotypes, whether induced by pedigree, population substructure, or otherwise. Furthermore, we emphasize the need to pay attention to the environmental features of the data as well, i.e., adequate representation of the relations among multiple observations on the same genotypes. We illustrate our modeling approach using 25 years of Dutch national variety list data on late blight resistance in the genetically complex crop of potato. As markers, we used nucleotide binding-site markers, a specific type of marker that targets resistance or resistance-analog genes. To assess the consistency of QTL identified by our mixed-model approach, a second independent data set was analyzed. Two markers were identified that are potentially useful in selection for late blight resistance in potato.
OVER the past decade, marker-assisted selection has become a standard tool in breeding programs (Young 1999; Asins 2002; Dekkers and Hospital 2002; Collard et al. 2005). The selection of genetically superior individuals using molecular marker alleles linked to quantitative trait loci (QTL) affecting trait variation instead of using phenotypic trait values can speed up genetic improvement considerably (Hospital et al. 1997). As a first step to marker-assisted selection (MAS), marker loci need to be identified that are linked to QTL. In plants, QTL mapping up until now relied principally on the development of segregating populations derived from crosses between inbred lines (F2, backcross, recombinant inbred lines, etc.). In such populations, observed linkage disequilibrium (LD) between markers and QTL alleles by the nature of the population must necessarily be the result of physical linkage. Although designed segregating populations are easy to create, they come with a number of disadvantages. First, the amount of segregating genetic variation within the population is limited, because per locus at most two alleles can segregate (in a diploid species), whereas in the absence of allele polymorphisms between the parents no QTL can be identified. Second, the genetic backgrounds within which mapping studies take place are generally not representative of backgrounds used in elite germplasm (Jannink et al. 2001). Third, the relatively low number of generations after maximum LD, where the maximum LD is reached in the F1, implies a reduced number of sampled meioses within designed populations (typically a few hundred), leading to relatively long stretches of chromosomes being in LD. Consequently, the characteristic size of confidence intervals for QTL locations is between 10 and 20 cM (Darvasi et al. 1993). The recognition of these and other limitations has motivated the interest in alternative mapping strategies popular in human and animal genetics, where these techniques are known as LD mapping and association mapping (Lynch and Walsh 1998; Walsh 2002), designations we use interchangeably.
For association mapping, the population under study consists of a collection of genotypes, and not a highly designed population of genotypes, and the LD required for finding marker–trait associations comes from unrecorded sources of LD in that population (Jannink et al. 2001). Good examples of populations for use in association mapping in plant genetics are breeding and gene bank collections of cultivars, breeding lines, germplasm accessions, etc. Association mapping strategies offer larger flexibility in the choice of genotypes for mapping, which is advantageous from a plant breeding perspective. The genetic variation under study can increase quantitatively, because more than two alleles per locus can be present, and qualitatively, because the QTL alleles can be mapped in germplasm that is directly relevant to breeding programs. Estimates for QTL effects will be more realistic since they are estimated within the relevant genetic background for plant breeders. Moreover, marker–trait associations will result from tight linkage, tighter than in most designed populations, making it possible to map QTL with higher precision. An important asset of association mapping strategies is the straightforward utilization of large amounts of historical phenotypic data that are available for mapping efforts at no or little extra costs. For example, breeders are routinely evaluating their germplasm in the relevant target population of environments, thereby producing a wealth of valuable phenotypic information readily available for QTL detection by association mapping efforts, a procedure referred to as “in silico mapping” by Parisseaux and Bernardo (2004).
In association mapping, linkage is not the only cause of LD. When the set of genotypes for inclusion in the mapping study is not homogeneous in the sense of, for example, not being an F2, recombinant inbred line, or double-haploid population, observed LD could be the consequence of mechanisms other than linkage, such as population admixture, specific mating systems, genetic drift, and selection (Jannink and Walsh 2002; Flint-Garcia et al. 2003). Although possibly confusing, the term LD refers strictly to the dependence of alleles at different loci within gametes, which may be a result of physical linkage between the loci, but not necessarily so (Jannink and Walsh 2002). The success of association mapping efforts depends on the possibilities of separating LD due to linkage from LD due to other causes. Population structure is seen as a second major cause of marker–trait associations in addition to linkage (Pritchard et al. 2000b; Jannink et al. 2001; Flint-Garcia et al. 2003). Most association–mapping strategies therefore start by first inspecting the population to assess whether groups can be discerned within the population and then testing for marker–trait association after correction for group effects, i.e., testing for marker–trait association within groups. The principle is that only associations caused by linkage will survive after removing the genetic correlations due to group effects. A popular method to detect population structure is that proposed by Pritchard et al. (2000b). An implementation of this method is available in the software STRUCTURE (http://pritch.bsd.uchicago.edu/software/structure2_1.html) in which, within a Bayesian framework, molecular marker information is used to assign group membership probabilities to the genotypes. The estimated probabilities for group membership can then be used to assign genotypes to groups within which marker–trait associations are tested (Remington et al. 2001; Simko et al. 2004b; Skøt et al. 2005). Alternatively, the groups can be integrated as an extra factor or a set of covariables in a statistical model relating phenotype to genotype (Thornsberry et al. 2001; Wilson et al. 2004).
As a valid alternative to the use of STRUCTURE, classical multivariate analysis methods can be used to classify genotypes. In that case a matrix of genetic/genotypic distances is calculated from molecular marker information and used as input for clustering and/or scaling techniques (Ivandic et al. 2002; Kraakman et al. 2004). For collections of varieties and breeding lines, genotypic relationships as obtained from the pedigree or from similarities in neutral marker profiles (Yu et al. 2006) can be translated to distances that are subsequently analyzed by cluster analysis. The groups detected by such a cluster analysis can be interpreted as representing population structure and form an approximation to the original relationships between the genotypes as present before the grouping. The identified groups can be used as a kind of correction factor in association analyses. An example of this approach was given by Simko et al. (2004a), where pedigree data produced a relationship matrix that was used as input to a cluster analysis with the aim of classifying potato cultivars. When pedigree or neutral marker information is present for a set of genotypes, this information can also be incorporated directly in a QTL mapping analysis in the form of a genotypic relationship matrix that structures the variance–covariance matrix between genotypes, without making the detour of using the same matrix for finding groups. For maize, Parisseaux and Bernardo (2004) show how to integrate the pedigree-based relationship matrix into a QTL mapping analysis within a mixed-model framework. In Yu et al. (2006), published shortly after submission of this article, both a marker-based relationship matrix and a factor representing population structure are included in a mixed model for association analysis of a single trait in a single environment. In this article, we follow Parisseaux and Bernardo (2004) for the direct incorporation of the pedigree relationships in an association analysis for multienvironment data. However, like Simko et al. (2004a,b) and Gebhardt et al. (2004), we focus our modeling efforts on potato.
Potato (Solanum tuberosum) is an autotetraploid species (2n = 4x = 48), but the simpler genetics at the diploid level combined with the fact that many resistance sources are introgressed from wild relatives have motivated the use of diploid interspecific crosses for most QTL studies in resistance breeding (Collins et al. 1999; Oberhagemann et al. 1999; Ewing et al. 2000; Visker et al. 2003). Nevertheless, examples of mapping of resistance QTL on the basis of tetraploid crosses have also been reported in the literature (Meyer et al. 1998; Bormann et al. 2004). Although classical QTL mapping approaches at the diploid and the tetraploid level have been proven successful, the use of tetraploid potato cultivars in association mapping has advantages both of testing for QTL at the original ploidy level and of testing within a more representative genetic background for the crop. The first examples of successful association mapping studies in tetraploid potato were presented by Simko et al. (2004a,b) and Gebhardt et al. (2004). These applications involved disease resistances for which candidate genes were defined. The statistical methodology for assessing marker–trait associations consisted principally of the comparison of trait means for two allele states by some form of t-test, either the classical t-test or a robust (permutation-based) or nonparametric alternative (Wilcoxon/Mann–Whitney U-test). When population structure was detected, the tests were performed within identified groups (Simko et al. 2004a).
The use of t-tests for establishing marker–trait associations imposes restrictions on the types of analyses and inference. In the context of association mapping, it is attractive to use historical collections of phenotypic data, with the archetypical example of such a collection consisting of the series of trials performed to test candidate varieties on value for cultivation and use (VCU), with the ultimate goal of compiling national and recommended lists of varieties. Such historical data sets are highly unbalanced across years as new genotypes are added each year, while others are discarded. A good method for association analysis should be able to combine information across multiple years and to cope with unbalancedness. Similarly, it should be possible to incorporate pedigree information, as for major crops pedigree information is available on most if not all genotypes submitted to national testing authorities investigating VCU.
In this article, we show how powerful and flexible a mixed-model framework can be for association mapping in plant species. The use of mixed models for genetic studies is more or less the standard in animal breeding applications, but it is far less common in plant studies. We illustrate the use of mixed models by analyzing two independent data sets on resistance to late blight, the most important potato disease, caused by the oomycete Phytophthora infestans, where the second data set served as an empirical check on the QTL identified in the first set. The objective was to identify markers linked to disease resistance, where we used a special type of marker that targets resistance-related regions in the genome (van der Linden et al. 2004; Calenge et al. 2005). Since the structure of the data in this article is common to many breeding programs and evaluation trials, the described association approach represents an example that can be adapted to specific programs, in potato and other species. The approach does not require the use of special-purpose software and can be implemented in any statistical package with extended facilities for mixed-model analyses, like SAS (SAS Institute 1999), GENSTAT (Payne and Arnold 2002), and ASREML (Gilmour et al. 1998).
MATERIALS AND METHODS
Phenotypic and pedigree data:
A first, exploratory association analysis was performed on phenotypic data stemming from VCU trials supporting the Dutch Potato Variety List. The data included evaluations for resistance against leaf blight (P. infestans) for 123 cultivars collected in the period 1975–2002. The number of years in which a particular cultivar was evaluated varied from a few years, 68 cultivars were evaluated from 1 to 3 years, to up to 25 years for 2 cultivars. Since in each year some cultivars were removed from the variety list while others were introduced, the phenotypic data set was highly unbalanced. The VCU trials were always inoculated with the same P. infestans isolate and cultivars were scored for resistance to leaf late blight on a scale from 1 (highly susceptible) to 9 (highly resistant). Although the original trials contained replicates, we worked with the trial means across replicates, as these were the only data available to us. We performed our statistical analyses on the original, ordinal measurements (1–9), as preliminary analyses showed that diagnostic plots of residuals showed no violation of the classical assumptions for analysis of variance and mixed models.
A second phenotypic data set for another 81 cultivars was constructed to corroborate significant marker–trait associations found in the study on the 123 above-mentioned cultivars. The 81 cultivars for our confirmatory study were a subset from the set of cultivars used by Gebhardt et al. (2004). Cultivars in the confirmatory study were chosen differently from the cultivars in the exploratory study. Phenotypic data for the cultivars in the confirmatory study were passport data in which genotypes (accessions) were scored for resistance to late blight again on a scale from 1 (susceptible) to 9 (resistant). Because the data set was composed of evaluations performed in different countries, it is likely that not all the infections were produced by the same isolate (Gebhardt et al. 2004).
Pedigree information for the 123 cultivars in the exploratory study and the 81 cultivars in the confirmatory study was retrieved from the Online Pedigree Database of the Laboratory of Plant Breeding of Wageningen University (Hutten and van Berloo 2001).
Molecular marker data and LD between markers:
The 123 cultivars in the exploratory study and the 81 in the confirmatory study have been genotyped with so-called nucleotide-binding-site (NBS) markers. These markers were developed upon the observation that a number of resistance genes for P. infestans encoded for proteins with a nucleotide-binding domain (NBS) and a leucine-rich repeat (LRR) domain (Ballvora et al. 2002; van der Vossen et al. 2003; Bormann et al. 2004; Huang et al. 2004). The production of NBS markers, NBS profiling, involves a PCR-based system that efficiently and reproducibly targets resistance genes (R genes) and resistance gene analogs (van der Linden et al. 2004; Calenge et al. 2005). For our genotypes, 49 NBS markers were dominantly scored as present (1) or absent (0). None of the 49 markers has been placed on a potato map, so their position in the genome is not yet known.
For the 81 cultivars in the confirmatory study, we had information about an additional marker, R1-1400, corresponding to a major resistance gene to late blight located on chromosome V (Ballvora et al. 2002). After checking independence (nonassociation) of R1-1400 with the full set of NBS markers, we used R1-1400 as a cofactor in the model screening for marker–trait associations in the confirmatory data set.
Before looking at marker–trait associations, we calculated as descriptive statistics pairwise LD between markers using the square of the coefficient of correlation, r2 (Pritchard and Przeworski 2001).
Models for the identification of marker–trait associations:
In association mapping the objective is to introduce molecular marker information tagging specific DNA regions to test whether those regions have a significant association with or effect on the trait. Within our mixed-model framework, we partition the total genetic variation, i.e., the variation between genotypes, into a fixed part associated with a particular QTL/marker and a random part associated with the genetic background consisting of QTL elsewhere in the genome. A simple, but probably unrealistic (see below), starting model (model E1) to test for marker–trait associations in our exploratory late blight variety trial data is(1)(random terms underlined), with the late blight resistance score for genotype i in year j, μ an intercept term, α the fixed (QTL) effect of the chromosomal region tagged by a particular marker, Xi the state of the marker in individual i (for a biallelic marker, Xi = 0 for marker band absence and Xi = 1 for marker band presence), a residual random genotypic effect that results from not further identified background QTL, the random year effect, and a residual term. Since the data consisted of averages over plots within a year, the residual term contains both genotype-by-year interaction and plot residual. The random terms are assumed to be normally and independently distributed with zero mean and a proper variance: , , and . The mixed model (model E1, Equation 1) can be fitted by restricted maximum likelihood (REML), and marker–trait association assessed by testing for the H0: α = 0 by means of a Wald test (Verbeke and Molenberghs 2000). For ease of presentation, for all fitted models used in screens for marker–trait associations, the associations were classified according to the P-value of the Wald test statistic as: strong (P < 0.001), moderate (0.001 < P < 0.01), weak (0.01 < P < 0.05), and not significant (P > 0.05).
The distributional assumption for the background genetic effect, , in model E1 (Equation 1) imposes a genetic variance–covariance structure, G, that assumes no relatedness between genotypes, so with N genotypes, , or, equivalently, an N × N matrix with on the diagonals and zeroes on the off-diagonals. If the genotypes are barely related this is a reasonable assumption, but usually collections of cultivars will exhibit tighter relations due to shared breeding history. On the basis of an additive genetic model, the covariance between two genotypes i and i′ with a coefficient of coancestry is (Lynch and Walsh 1998). Therefore, a more realistic model would allow G to take structures other than just diagonal, depending on the degree of relatedness between the genotypes. When A is an N × N matrix containing the coefficients of coancestry for all pairs of genotypes, calculated from pedigree records, and is the genetic variance due to undetected QTL elsewhere in the genome, . The choice between alternative models for the G matrix can be based on a model selection criterion such as the Bayesian information criterion (Wolfinger 1993). Assumptions about the other random effects can also be relaxed (for example, introducing heterogeneity of residual variance between years) and comparisons between different models for these random effects can be based on the same model selection criterion.
In a second screen of the exploratory data, each of the 49 NBS markers was tested for association with leaf resistance to late blight within a mixed model that now included pedigree information, that is, model E1 (Equation 1) was converted to model E2 (Table 1) by generalizing the distributional assumption for the genotypic random effect to , with G in this case a genetic variance–covariance matrix equal to , while retaining the distributional assumptions for the year and residual effects: , and .
Relations between genotypes are caused by a number of processes. Genotypes are related on the basis of pedigree, but additional relations can be superimposed by selection and drift. Especially the latter process will lead to the formation of subpopulations differing in allele frequencies, which will cause spurious marker–trait associations to occur if this population structure is ignored.
Assuming subpopulations known, within our mixed-model framework, a correlation structure can be imposed on the genotypes by including the subpopulations as the levels of a random factor. As a consequence, all pairs of genotypes will be equally correlated within groups, while being uncorrelated between groups. Such an approach could provide a crude, and sometimes satisfactory, approximation to the “real” genotypic correlations. The corresponding model (E3) becomes(2)where Ck refers to the kth subpopulation of genotypes, and genotypes are now nested within subpopulation effects, leading to the genotypic, residual, effects (Table 1). We fitted model E3 (Equation 2) under the assumptions of and , effectively ignoring the pedigree relationships. We constructed model E4 from model E3 by incorporating the pedigree in the form of (Table 1).
Marker–trait association analyses in plant genetics have come to be to some extent equivalent to first identifying sub-populations and then performing some form of within-subpopulation association analysis (Remington et al. 2001; Kraakman et al. 2004). The method of choice for classifying genotypes in homogenous groups showing Hardy–Weinberg and linkage equilibrium uses the popular software STRUCTURE (Pritchard et al. 2000a). Within a Bayesian framework, the method implemented in this package assigns genotypes to a user-defined number of groups on the basis of the information given by molecular markers. Either the model can assume that individuals can be classified into unrelated groups (no admixture model) or that assumption can be relaxed to allow individuals to be the result of admixtures between groups (admixture model).
To assess subpopulations in our exploratory data, we used the 49 biallelic NBS markers, although we admit that these marker data look already a priori not very powerful and appropriate for this purpose. To identify subpopulations that subsequently define the levels of a random genotype-grouping factor in our models E3 and E4, we chose in STRUCTURE the admixture model and assumed varying numbers of groups, K, with K = 2, 3, and 4. For each assumed population structure, we ran the program discarding the first 10,000 iterations as burn-in period and using the following 100,000 iterations to produce the results on group membership probabilities for individual genotypes.
As standard fixed regression and analysis of variance models, including t-tests and simple correlations, are still more popular than mixed models for association analyses (Ford-Lloyd et al. 2001; Remington et al. 2001; Ivandic et al. 2002; Kraakman et al. 2004; Skøt et al. 2005), we performed, for comparison with the above mixed models, association analyses under the following fixed regression model (E5):(3)
To corroborate associations found in our exploratory data set, we performed additional association analyses in a second, independent, confirmatory data set. The confirmatory data did not have replications over years, so that the year term disappears from the model, together with the index j, leading to a confounding of residual genotypic effects and residual effects due to experimental errors (and possible genotype-by-year interactions) in one random term, ui. For the confirmatory data set, scores were available for the resistance-related binary (1/0) R1-1400 marker, Ri, and we thought it sensible to include this marker as a covariable/cofactor. The model for the confirmatory data then becomes(4)(model C1), where β represents the effect of the R1-1400 marker. We fitted model C1 (Equation 4) with , thus ignoring the pedigree relationships (Table 1). Finally, we included the pedigree, , in our model C2 for the confirmatory data (Table 1).
In general, for the genotypes included in the exploratory and confirmatory data set, marker bands were present at moderate to low frequencies (median frequency = 0.22 and 0.27 in the exploratory and confirmatory data sets, respectively), although band presence ranged from rare (0.05 and 0.02, respectively) to almost fixed (0.92 and 0.86, respectively) (Figure 1). Marker band frequencies are not very useful as estimates of allele frequencies, because the number of alleles in a genotype (from one in a simplex to four in a quadruplex) cannot be determined with a dominant marker system like NBS. The fact that most of the bands were at relatively low frequencies points to one of the advantages of using an association mapping approach, as many of the polymorphisms would have passed unnoted in a conventional approach based on the segregation of a specific cross. A negative effect of the low frequencies of the markers is that the power of the test is reduced.
LD was generally low between markers, with most r2-values <0.10 (Figure 1). While the low LD between markers indicates that most markers are independent of each other, it is difficult to conclude about distances between markers. Values of r2 of <0.10 have been observed between markers as close as 1500 bp apart in maize (Remington et al. 2001) or at distances >10 cM in barley (Kraakman et al. 2004). The extent of LD in potato is not known, but it is likely to be large as genotypes are autotetraploid cultivars with a narrow genetic base and with relatively few generations since their introduction in Europe. There are, however, some examples in potato where a strong decay in LD was observed at distances of <1 cM (Gebhardt et al. 2004).
For the exploratory data, the clustering of genotypes using STRUCTURE did not produce a clear discrimination of the genotypes into hypothetical groups (K = 2, 3, and 4). Figure 2 shows the results for a three-groups solution (K = 3). Although the bars indicate that some genotypes have a genetic background with a large fraction from one of the three hypothetical subpopulations, substantial intermixing between the groups was evident. Nevertheless, we concluded on the existence of three groups and as Remington et al. (2001) did, we assigned each genotype to the group with which the genotype shared the largest genetic background fraction. For the confirmatory data, from the analysis by STRUCTURE we could not derive any useful grouping for later use in association analyses.
Before fitting the association models E1–E5 to the exploratory data and models C1 and C2 to the confirmatory data (see Table 1 for all models), we investigated whether the pedigree relationships contributed anything to a better description of the genetic covariances between genotypes. For the exploratory data we compared the fit of models E2 and E1, leaving out marker information, where the first model did contain a pedigree-structured genotypic variance–covariance matrix and the second model did not. Comparison of the Bayesian information criterion indicated a clear improvement of model E2 over E1, 654.1 vs. 684.3 (a lower number is better). An analogous comparison can be made between models E4 and E3, with a Bayesian information criterion (BIC) of 659.9 vs. 687.1, leading to the same conclusion. For the confirmatory data, we compared model C2 with C1, with a BIC of 131.1 vs. 133.6, showing that for this data set inclusion of pedigree information led to only marginal improvement of the model fit.
The P-values of the Wald tests for NBS markers with at least one significant result across models E1–C2 are presented in Figure 3. On the basis of the BIC (lowest values) the models of choice should be E2, with pedigree-structured genotypic variances for the exploratory data, and C2 for the confirmatory data, again with pedigree-structured genotypic variances and covariances. The other models are presented to show the effects of omitting pedigree information and/or structured association. Model E5 (Equation 3) serves as an anchoring point as it represents a default model of choice for many association analyses in the literature. The rows in Figure 3 are sorted in ascending order by the P-values found in model E5. At one glance it is obvious from Figure 3 that model E5 produced far more significant marker–trait associations than any other model. However, as deviance tests pointed to the desirability of inclusion of pedigree information, plus as it appeared natural to choose random instead of fixed genotype-by-year interaction, we know that model E2 provides a better fit than model E5. Therefore, the significant associations for model E5 that are not reproduced by model E2 are false positives, stemming from the use of an inadequate marker–trait association model. With an analogous argument, we find that because of the omission of pedigree information, model E1 is too liberal in comparison to model E2, as is model E3 to E4 and model C1 to C2. In all three model comparisons, P-values tend to increase, i.e., become less significant, with the inclusion of pedigree information, emphasizing too liberal testing in the absence of appropriate structuring of the genotypic variance–covariance matrix. From the significant markers detected by the application of model E2 only the markers NBS2_8 and NBS5a6_10 carry over from the exploratory data to the confirmatory data as analyzed by model C2.
Marker NBS5a6_10 showed the highest association with the trait in the exploratory data set (P < 0.001) and had a positive effect on resistance (Table 2). Band presence of this marker was associated with an increase in resistance score of 0.95 ± 0.244 as estimated in model E2, an effect that was consistent with the one of 0.74 ± 0.326 in model C2 found for the confirmatory data set. The other marker consistently associated with resistance was NBS2_8 with an increase of 0.98 ± 0.370 in resistance score for the exploratory data set as estimated in model E2, while for the confirmatory data fitting model C2 the effect was 1.91 ± 0.697. In the exploratory data, NBS5a6_10 and NBS2_8 had effects of similar magnitude, but the test statistic was much lower for NBS2_8, possibly because of the marker band being present at low frequency. In the confirmatory data set, this marker appeared at an even lower band frequency, but still showed a significant effect, supporting its association with the trait.
Although widely used in human genetics, association mapping has recently emerged in the plant breeding arena. A major reason for the interest in association mapping approaches in plant breeding is that the extensive phenotypic evaluations routinely produced by breeding programs and cultivar evaluation systems contain a large potential for studying the genetic mechanisms underlying trait variation in many crops. Sufficient conditions for unlocking that potential are targeted marker systems in combination with sound statistical procedures. In this article, we showed how 25 years of historical phenotypic data as collected within the Dutch potato cultivar evaluation system could be exploited to investigate the genetic basis for late blight resistance in potato, a challenging disease with large economic consequences. We tested for associations between 49 NBS markers, a new type of marker targeting resistance genes and their analogs, and late blight resistance, using mixed-model methodology. To assess the consistency of our procedure we compared the results of the association analyses as performed on the Dutch variety trials with analyses on a second independent set made up of passport data that were published as additional material to Gebhardt et al. (2004).
When LD decays over short distances, the number of random molecular markers that are necessary to find significant associations can be substantial. NBS markers allowed us to focus directly on interesting genomic regions. With a relatively reduced number of NBS markers we singled out two markers that showed consistent effects across populations for resistance to late blight. The absence of map information hindered comparison of our detected associations with previously reported QTL. Nevertheless, since we had information on R1-1400, a marker linked to a major resistance gene that maps on chromosome V, we could check for its association with either one of the two NBS markers. The observed LD (r2) between R1-1400 and NBS5a6_10 was 0.023 and that between R1-1400 and NBS2_8 was 0.011, suggesting that the markers map either to a distant region on chromosome V or to other chromosomes. Sequence information for marker NBS2_8 did not produce a clear indication for its location, but for NBS5a6_10 a strong homology with genes of the R3 complex (results not shown) located this marker on chromosome XI (Huang et al. 2004).
The mixed-model framework is for various reasons highly appropriate to study marker–trait associations. Many historical data sets in plant breeding are severely unbalanced in that the set of genotypes changes between years and often also between trials within years. Within plant breeding, mixed models constitute the method of choice to deal with unbalanced data across multiple trials. We extended the standard phenotypic analysis of multiple trials by mixed models to arrive at models suitable for association mapping by introducing marker genotype information as covariables at the levels of the genotypic factor, thereby creating so-called mixed factorial regression models (van Eeuwijk et al. 1996; Denis et al. 1997; Malosetti et al. 2004). The introduction of molecular marker information at the genotypic level, corresponding to the simultaneous modeling of multiple trials while acknowledging the correlation structure between those trials, distinguishes our approach from that of Yu et al. (2006), who present a model for identifying marker–trait associations for a single set of observations on a collection of genotypes.
Mixed models guarantee reliable inference through the explicit modeling of correlations induced by genetic and environmental causes. In genetic contexts, from the genotypic dimension, correlations between observations can arise in many ways. First, observations taken on the same genotype across multiple trials and replicates within trials can be expected to be correlated. Second, genetic correlations between observations made on different genotypes follow from these genotypes sharing ancestral history, i.e., pedigree relations, or from these genotypes having been subjected to similar selection forces and/or drift processes. To model correlations between multiple observations on the same genotype, we can choose genotypic effects to be random in an appropriate mixed model. To account for correlations due to pedigree, selection, and drift, we can further structure the variance–covariance matrix of the random genotypic effects. For rough-grained structure following from especially drift and to a lesser extent selection, we can introduce a random factor whose levels coincide with groups of genotypes in such a way that most or all of LD is explained by differences between those groups, where the groups themselves are relatively homogeneous and in, or close to, Hardy–Weinberg and linkage equilibrium. For the imposition of more subtle structure on the genotypic variance–covariance matrix, coancestry relations could be calculated from known pedigree relations, while as an alternative or complementary means for capturing fine-grained correlation structure molecular marker-based estimates for pairwise genotypic relatedness can be used (Lynch and Ritland 1999; Wang 2002; Milligan 2003).
Fine-grained pedigree information can be hard to acquire and therefore many articles on association mapping address a coarse level of genotypic correlation by classifying genotypes into more or less homogenous groups. Correcting for these group effects is an indirect way of correcting for relatedness between genotypes and therefore less spurious associations are expected. However, defining homogeneous groups in a collection of genotypes is not always a simple task. For our potato data, we could not find a clear group structure in the exploratory data and no structure at all in confirmatory data using the methodology proposed by Pritchard et al. (2000b). The low number of dominant markers that we used will certainly have contributed to our problems in identifying groups. Still, the absence of a clear group structure in collections of elite breeding germplasm has been observed before in similar works (Kraakman et al. 2004; Simko et al. 2004b). Elite breeding genotypes have a relatively narrow genetic base resulting from closely related breeding history, so it can be expected that such populations consist mainly of one major gene pool rather than of the intermixing of very distinct subpopulations. Studies where a clear population structure was observed generally involved broad-based germplasm and the groupings discriminated between rather distinct genetic backgrounds. For example, in ryegrass, groups corresponded to different species (Skøt et al. 2005) and in maize to different heterotic groups (Remington et al. 2001). In addition to the difficulty of finding structure in the population, even when groups are found, the correction for group effects may remove part but not all of the genetic correlation. Therefore, substantial polygenic effects may still persist and affect the results. For our data, the influence of the groups (model E3 vs. E1 and model E4 vs. E2) was very small, especially in comparison to the influence of the pedigree.
Ignoring correlations between genotypes will lead to liberal tests of marker–trait associations in the same way as testing of whole-plot treatments over subplot errors in split-plot designs will in general lead to liberal tests for the whole-plot treatment factor. With regard to QTL detection, Kennedy et al. (1992) used simulation to show that failure to account for the variance–covariance structure generated by relatedness in complex pedigree populations facilitates an increase of false positive QTL detection. Our potato data emphasized the same point, as only the inclusion of pedigree information in the mixed model produced consistent significant associations across two independent studies (associations identified by models E2 and C2). In agreement with our results for potato, Parisseaux and Bernardo (2004) identified marker–trait associations for maize that were reproducible over populations, also by using mixed models including pedigree information.
A fixed approach to association mapping, implying neglect of correlations between multiple observations on the same genotypes and of coarse- and fine-grained correlation structures between genotypes, produces a large amount of associations of which the majority will be false positives that will not be reproducible. Our model E5, although including a correction for population structure, closely resembled this approach and indeed produced many not reproducible associations. Still, the fixed approach is not uncommon in the literature (Ford-Lloyd et al. 2001; Remington et al. 2001; Ivandic et al. 2002; Kraakman et al. 2004; Skøt et al. 2005), probably due to the larger availability of and familiarity with fixed than mixed models.
Relations between genotypes need to be adequately represented in statistical models for marker–trait associations to guarantee reliable inference. Analogous to Yu et al. (2006), we presented a mixed-model framework that offered two ways of representing genotypic relations: by structuring the variance–covariance matrix of the genotypic effects using pedigree or marker information and by introducing a grouping factor to represent structured association. When the pedigree is accurate or when enough marker information is available to estimate reliably individual pairwise genotypic relations, the inclusion of a factor representing structured association is not expected to contribute much to the model fit. The structured association factor merely provides a crude approximation to the real matrix of pairwise genotypic relations, given that the groups can be identified reasonably. When population structure is suspected, but methods to identify groups are hard to apply or fail, the method of genomic control (Devlin and Roeder 1999; Devlin et al. 2001) can be an attractive procedure to follow. Genomic control is based on the assumption that population structure will induce overdispersion that will affect the test statistics for association. The amount of overdispersion can be estimated from the empirical values for the test statistic, after which the empirical test statistics can be adjusted on the basis of the estimated overdispersion. This procedure is simple to apply and seems to work well in simple clinical settings (case–control studies). For our purposes, however, when modeling marker–trait associations in plant data coming from multiple environments, using mixed models with complex variance–covariance structures, it is not so obvious how the theory of genomic control can be generalized, as it will be difficult to motivate a unique overdispersion factor across the whole of the data set. Therefore, we feel that a mixed-model approach as discussed in this article is a highly attractive framework for studying marker–trait association for plant data across multiple environments.
↵1 Present address: Biometris, Department of Plant Sciences, Wageningen University, P.O. Box 100, 6700 AC, Wageningen, The Netherlands.
Communicating editor: J. B. Walsh
- Received December 20, 2005.
- Accepted November 19, 2006.
- Copyright © 2007 by the Genetics Society of America