Abstract
Arabidopsis thaliana is a highly selfing plant that nevertheless appears to undergo substantial recombination. To reconcile its selfing habit with the observations of recombination, we have sampled the genetic diversity of A. thaliana at 14 loci of ~500 bp each, spread across 170 kb of genomic sequence centered on a QTL for resistance to herbivory. A total of 170 of the 6321 nucleotides surveyed were polymorphic, with 169 being biallelic. The mean silent genetic diversity (πs) varied between 0.001 and 0.03. Pairwise linkage disequilibria between the polymorphisms were negatively correlated with distance, although this effect vanished when only pairs of polymorphisms with four haplotypes were included in the analysis. The absence of a consistent negative correlation between distance and linkage disequilibrium indicated that gene conversion might have played an important role in distributing genetic diversity throughout the region. We tested this by coalescent simulations and estimate that up to 90% of recombination is due to gene conversion.
GENOME projects facilitate evolutionary studies, which in turn help to interpret the information uncovered by large-scale sequencing (Charlesworth et al. 2001). As a consequence, interest in the population genetics of the model plant Arabidopsis thaliana has grown steadily over the past decade. Three central observations have emerged from the analyses of the seven or so loci that have been subjected to comparative sequencing in this cruciferous weed (Kawabe et al. 1997; Purugganan and Suddith 1998, 1999; Kuittinen and Aguadé 2000; Savolainen et al. 2000): (i) there is an excess of rare polymorphisms, (ii) a number of genes have alleles that fall into two distinct classes (allelic dimorphism), and (iii) there is more recombination than might be expected, given that A. thaliana is a selfer.
The excess of rare polymorphisms, often indicated by a negative value of Tajima's D, is perhaps the least surprising of these findings. Most structural genes are subject to purifying selection, leading to an excess of low-frequency segregating sites. The converse, i.e., an excess of genetic diversity (π), is an infrequent but often highly significant exception, as seen, for example, at the Rpm1 resistance locus of A. thaliana (Stahl et al. 1999).
There is a slight tension between the observation of an excess of rare polymorphisms and allelic dimorphism. Extreme cases of the latter correspond to a deficiency of rare polymorphisms, as observed at the Rpm1 locus (Stahl et al. 1999), and may therefore be evidence for balancing selection. However, the reason for the apparent dimorphism may simply be that in a sample of n sequences the expected time required for the last two lineages to coalesce is approximately equal to that taken by the preceding n − 2 coalescent events (Kingman 1982a,b). In other words, even neutral genealogies tend to have deep splits, and since the expected number of segregating sites on a branch is proportional to its length, apparent dimorphism might result from such a neutral process.
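The coalescent argument above can be illustrated numerically. The following is a minimal sketch, assuming the standard neutral coalescent in which the interval with k lineages lasts 2/(k(k − 1)) coalescent time units on average. The function names are hypothetical and the sample size of 39 is borrowed from this study purely as an example.

```python
from fractions import Fraction

def expected_coalescence_times(n):
    """Expected duration of each coalescent interval for a sample of n sequences.

    Under the standard neutral coalescent the interval during which k
    lineages persist lasts 1 / C(k, 2) = 2 / (k (k - 1)) time units on average.
    """
    return {k: Fraction(2, k * (k - 1)) for k in range(2, n + 1)}

def deep_split_summary(n):
    """Return (time for the last two lineages, time for the first n - 2 events)."""
    times = expected_coalescence_times(n)
    last_two = times[2]
    earlier = sum(times[k] for k in range(3, n + 1))
    return last_two, earlier

# Illustrative sample size (assumption: 39, as in the accessions surveyed here).
last_two, earlier = deep_split_summary(39)
print(float(last_two), float(earlier))
```

In this sketch the final coalescence takes one time unit on average, while all earlier events together take 1 − 2/n units, so the deepest split accounts for roughly half of the expected tree height.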
The most enigmatic observation concerns recombination. The outcrossing rate of A. thaliana has been estimated as 0.3% (Abbot and Gomes 1989), which is very low. On the other hand, Miyashita et al. (1999) found no significant linkage disequilibrium among 472 amplified fragment length polymorphism (AFLP) markers scored in 38 ecotypes. This contrasts with the situation in another well-studied selfing plant species, wild barley (Hordeum spontaneum). Its outcrossing rate has been estimated as 1.6% (Brown et al. 1978), and in an extensive allozyme study 20 out of 28 populations investigated displayed significant genome-wide linkage disequilibrium (Brown et al. 1980). However, using an extension of the test for linkage disequilibrium applied to H. spontaneum (Haubold et al. 1998), Sharbel et al. (2000) detected highly significant linkage disequilibrium among 79 AFLP loci scored in 142 ecotypes (henceforth referred to as accessions) of A. thaliana. Nevertheless, the extent of recombination in A. thaliana has remained unclear, prompting the present study.
Figure 1. The effects of reciprocal recombination (R) and gene conversion (C) on the distribution of genetic material.
We employed the genomic sequence of the Columbia accession of A. thaliana to sample the genetic diversity among 39 accessions at 14 loci of ~500 bp each in a region spanning 170 kb on chromosome 5. The region was chosen because a polymorphism for production of defensive metabolites maps within this interval (Kroymann et al. 2001).
In this study we examine the mode of recombination in this region. Depending on the way the Holliday junction is resolved, recombination may result either in reciprocal recombination or in gene conversion (Figure 1). Reciprocal recombination affects a series of homologous loci downstream of the recombination breakpoint. Gene conversion, on the other hand, leads to the alteration of single segments only. Therefore, reciprocal recombination causes the decay of linkage disequilibrium with distance, while no such effect results from gene conversion if conversion tracts are short (Wiehe et al. 2000). We have applied this idea to our data and discovered that a substantial input from gene conversion is likely.
MATERIALS AND METHODS
Plant material and DNA sequencing: The 39 accessions used in this study are listed in Figure 3. Primers were designed using the published sequence of the accession Columbia and the software PRIMER3 (Rozen and Skaletsky 1998). Total DNA was extracted from leaves of single plants, amplified, and sequenced directly on both strands. All primer pairs are shown in Table 1.
Data analysis: Alignment: DNA sequences were aligned using PILEUP (GCG Wisconsin Package) and all computations were carried out after gap removal.
Nucleotide diversity: Most loci in our sample contained coding as well as noncoding regions (Table 2). To compute the average number of silent substitutions between pairs of sequences (πs) we included the third codon positions of the coding regions as well as the complete noncoding segment and applied
Confidence intervals for πs were estimated using the bootstrap procedure (Efron 1979) across taxa: Rows of the aligned data matrix were resampled with replacement and the average number of pairwise mismatches per nucleotide was recalculated 10,000 times. The resulting mismatch values were sorted, and the 2.5 and 97.5% quantiles were looked up in the sorted array.
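The bootstrap procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the published analysis: the alignment is assumed to be already reduced to silent sites and free of gaps, no multiple-hit correction is applied, and the function names, the toy alignment, and the fixed random seed are hypothetical.

```python
import random
from itertools import combinations

def mean_pairwise_mismatch(rows):
    """Average number of pairwise mismatches per nucleotide among aligned rows."""
    n_sites = len(rows[0])
    pairs = list(combinations(rows, 2))
    total = sum(sum(a != b for a, b in zip(r1, r2)) for r1, r2 in pairs)
    return total / (len(pairs) * n_sites)

def bootstrap_ci(rows, n_boot=10_000, seed=0):
    """95% bootstrap interval obtained by resampling rows (taxa) with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    values = sorted(
        mean_pairwise_mismatch([rows[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    # Look up the 2.5% and 97.5% quantiles in the sorted array.
    lower = values[int(0.025 * n_boot)]
    upper = values[int(0.975 * n_boot) - 1]
    return lower, upper

# Toy alignment (hypothetical data): four sequences, four silent sites.
toy = ["ACGT", "ACGA", "ACTT", "ACGT"]
print(mean_pairwise_mismatch(toy))
print(bootstrap_ci(toy, n_boot=1000))
```

Resampling whole rows keeps the sites of each accession together, so the interval reflects sampling variation among taxa rather than among sites.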
Pairwise linkage disequilibrium: We use the normalized linkage disequilibrium, D′, to quantify pairwise linkage disequilibria (Lewontin 1964). Consider two biallelic loci,
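The normalized disequilibrium named above can be sketched as follows, using the textbook form of Lewontin's D′: D = p_AB − p_A p_B, divided by its maximum attainable magnitude given the allele frequencies. The notation, the function name, and the input format (one two-character haplotype string per sequence) are assumptions for illustration and may differ from the definitions used in this study.

```python
def d_prime(haplotypes):
    """Lewontin's normalized disequilibrium D' for two biallelic sites.

    `haplotypes` holds one 2-character string per sampled sequence,
    e.g. "AT" means allele A at the first site and T at the second.
    The sign of D' depends on which alleles are taken as reference
    (here: those of the first sequence), so |D'| is often reported.
    """
    n = len(haplotypes)
    a, b = haplotypes[0][0], haplotypes[0][1]      # reference alleles
    p_a = sum(h[0] == a for h in haplotypes) / n
    p_b = sum(h[1] == b for h in haplotypes) / n
    p_ab = sum(h == a + b for h in haplotypes) / n
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    # Monomorphic sites give d_max == 0; report 0 in that case.
    return d / d_max if d_max > 0 else 0.0

# Hypothetical samples: three haplotypes present, then all four present.
print(d_prime(["AT", "AT", "AC", "GC"]))
print(d_prime(["AT", "AC", "GT", "GC"]))
```

With this definition |D′| equals 1 whenever fewer than four haplotypes are present, which is why pairs showing all four haplotypes are the informative ones for detecting recombination.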
Now consider three loci,
Simulating the distribution of
Table 1. Primers used to amplify the 14 loci investigated in this study
An example of such a block is provided by positions 136,818–137,110 in our data set (Figure 3), which all belong to locus 10 and have an identical haplotype structure. Although the adjacent polymorphic position 138,531 also maintains the haplotype structure, it is not included in the block as it belongs to locus 11 (Figure 2). From triplets of such blocks of polymorphisms we computed
Multilocus disequilibrium: Multilocus disequilibrium was investigated by treating each distinct sequence at the 14 loci as an allele and calculating the number of loci at which each pair of haplotypes differed. The observed variance of this “mismatch distribution,” VD, was then compared to the variance expected under linkage equilibrium, Ve (Brown et al. 1980; Haubold et al. 1998). The ratio between these variances serves as a measure of the strength of multilocus association in the sample.
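The variance comparison described above can be sketched in code. The sketch assumes the commonly used form of the standardized index of association, (VD/Ve − 1)/(l − 1) for l loci, with Ve computed as the sum over loci of h(1 − h), where h is the probability that two distinct sampled sequences differ at a locus. Both the exact formula and the function name are assumptions, since the expression is not reproduced in the text here.

```python
from itertools import combinations

def standardized_index_of_association(profiles):
    """Standardized index of association from multilocus allele profiles.

    `profiles` is a list of tuples, one per accession, holding one allele
    label per locus. Assumes at least two loci and some polymorphism.
    """
    n = len(profiles)
    n_loci = len(profiles[0])
    # Observed variance V_D of the pairwise mismatch distribution.
    dists = [sum(x != y for x, y in zip(p, q))
             for p, q in combinations(profiles, 2)]
    mean_d = sum(dists) / len(dists)
    v_d = sum((d - mean_d) ** 2 for d in dists) / len(dists)
    # Expected variance V_e under linkage equilibrium.
    v_e = 0.0
    for j in range(n_loci):
        alleles = [p[j] for p in profiles]
        same = sum(alleles.count(a) * (alleles.count(a) - 1)
                   for a in set(alleles))
        h = 1 - same / (n * (n - 1))       # chance two sequences differ here
        v_e += h * (1 - h)
    return (v_d / v_e - 1) / (n_loci - 1)

# Hypothetical sample with complete association among three loci.
linked = [(0, 0, 0), (0, 0, 0), (1, 1, 1), (1, 1, 1)]
print(standardized_index_of_association(linked))
```

Dividing by l − 1 keeps the index comparable across data sets with different numbers of loci. Under linkage equilibrium the value is expected to be close to zero.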
Table 2. Coding parts and annotations of the 14 loci investigated
Figure 2. Silent genetic diversity, πs (●), including 95% confidence intervals, and Tajima's D (★) across the 170-kb region studied. Tajima's D was significant at locus 6 (P = 0.01), perhaps indicating balancing selection. MRN17, T2007, and MYJ24 designate bacterial P1 clones of genomic A. thaliana DNA used in the Arabidopsis genome sequencing project (Arabidopsis Genome Initiative 2000); MAM1 encodes a methylthioalkylmalate synthase involved in glucosinolate chain elongation (Kroymann et al. 2001); and MAML encodes a duplication of MAM1.
RESULTS
Sequence data and genetic diversity: A total of 39 accessions were sequenced at 14 loci distributed over a 170-kb region (Figure 2). After gap removal this amounted to 6321 nucleotides in 39 accessions. Of this data set, 170 sites distributed among 35 haplotypes were polymorphic (Figure 3). With the exception of one hypervariable position (46,750; Figure 3), all segregating sites had only two nucleotide states. In addition, there were three heterozygous positions in accession Kondara (37,062, 37,304, and 37,351; Figure 3). The hypervariable and heterozygous sites were removed from the computation of pairwise disequilibria.
Most of the loci were located within predicted genes and contained both protein-coding and noncoding parts (Table 2). However, locus 4 was entirely noncoding, while loci 3 and 10 consisted of coding sequence only. With the exception of loci 6 and 9, functions had been assigned to the investigated loci in the context of the Arabidopsis genome project (Arabidopsis Genome Initiative 2000). These functions were diverse, ranging from putative alanyl-tRNA synthetase (locus 1) to histone (locus 3), peptidases (loci 7 and 14), and acetyl-CoA synthetase (locus 8; Table 2).
The genetic diversity varied by a factor of 30 between πs = 0.001 at locus 1 and πs = 0.030 at locus 3 across the region (Figure 2). To assess whether these diversity values were compatible with neutral equilibrium expectations, we investigated the frequency spectrum of the single-nucleotide polymorphisms using Tajima's D test statistic (Tajima 1989). This test is based on the assumption that the data have not been subject to recombination. We explored this assumption by computing the minimum number of recombination events for each locus (Rm; Hudson and Kaplan 1985). Only locus 7 showed evidence of a recombination event and with this background information we proceeded to calculate Tajima's D.
The only locus with a significant value of Tajima's D was locus 6 (D = 2.66, P = 0.01; Figure 2). Unfortunately, its function is unknown. Further, the signs of the test statistics showed no consistent pattern, with 9 out of the 14 loci having D < 0 and the rest D > 0 (Figure 2).
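For readers who wish to reproduce this kind of test, Tajima's D can be computed from an alignment as sketched below. The sketch follows the standard formula of Tajima (1989) and returns the statistic only, without a P value. The function name and the toy alignments are hypothetical, and gaps are assumed to have been removed beforehand.

```python
from itertools import combinations
from math import sqrt

def tajimas_d(seqs):
    """Tajima's D for a list of equal-length aligned sequences."""
    n = len(seqs)
    # S: number of segregating sites.
    s = sum(len(set(col)) > 1 for col in zip(*seqs))
    if s == 0:
        return 0.0
    # k: average number of pairwise differences.
    pairs = list(combinations(seqs, 2))
    k = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (k - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))

# Hypothetical samples: only singletons versus two equally frequent classes.
singletons = ["TAAA", "ATAA", "AATA", "AAAT", "AAAA", "AAAA"]
dimorphic = ["AAAA"] * 3 + ["TTTT"] * 3
print(tajimas_d(singletons), tajimas_d(dimorphic))
```

An excess of rare variants pulls D below zero, whereas two deeply diverged allele classes at intermediate frequency push it above zero, which is the pattern discussed for locus 6.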
Multilocus linkage disequilibrium: Hanfstingl et al. (1994) hypothesized that recombination in A. thaliana was frequent enough to erode linkage disequilibrium between sites just 350 bp apart. Since all the loci investigated in our survey were >350 bp apart (Figure 2), we assessed the strength of association between these loci by calculating the standardized index of association,
Phylogeny: Given that there was strong linkage disequilibrium between the surveyed loci, we used the exploratory tool of statistical geometry to investigate the phylogeny of the genomic region (Eigen et al. 1988; Maynard Smith 1989). Statistical geometry proceeds by first generating a phylogeny on the basis of the parsimony criterion for each quartet of sequences in the sample. These phylogenies are averaged to generate the graph shown in Figure 4. Note that there are three ways in which this graph can be reduced to a conventional unrooted tree: Collapse dimensions X and Y of the central box, collapse dimensions X and Z, or collapse dimensions Y and Z. In other words, a statistical geometry graph simultaneously represents the three unrooted trees that can be formed from four taxa. If no recombination has taken place, only one of these three possible trees should be supported by the data. High support for all three possible trees is indicated by a large central box.
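The quartet-based reasoning above can be illustrated with a simplified sketch. It is not the full statistical geometry procedure of Eigen et al. (1988): it only counts, for every quartet of sequences, the sites supporting each of the three possible unrooted topologies, sorts the three counts, and averages them over quartets. The function name and the use of raw site counts rather than any per-site scaling are assumptions made for illustration.

```python
from itertools import combinations

def quartet_box_dimensions(seqs):
    """Average sorted support (largest first) for the three quartet topologies."""
    totals = [0.0, 0.0, 0.0]
    n_quartets = 0
    for a, b, c, d in combinations(seqs, 4):
        counts = [0, 0, 0]
        for w, x, y, z in zip(a, b, c, d):
            if w == x and y == z and w != y:
                counts[0] += 1          # site supports ab|cd
            elif w == y and x == z and w != x:
                counts[1] += 1          # site supports ac|bd
            elif w == z and x == y and w != x:
                counts[2] += 1          # site supports ad|bc
        # Sort so that the best-supported topology comes first.
        for i, v in enumerate(sorted(counts, reverse=True)):
            totals[i] += v
        n_quartets += 1
    return tuple(t / n_quartets for t in totals)

# Hypothetical quartets: one tree-like, one with conflicting sites.
print(quartet_box_dimensions(["AA", "AA", "TT", "TT"]))
print(quartet_box_dimensions(["AA", "AT", "TA", "TT"]))
```

In this simplified view, tree-like data leave the second and third values at zero, while recombination or recurrent mutation inflates them, corresponding to a larger central box.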
Figure 3. Prettyplot of all 170 polymorphic sites. Positions are indicated by numbers, which should be read top to bottom. At each site a dot indicates agreement with the nucleotide shown in the top row.
Figure 4. Statistical geometry phylogeny for the combined nucleotide data. X, Y, and Z indicate the dimensions of the three-dimensional box from which the terminal branches (a–d) stick out. If the data were tree-like, the small Z-dimension as well as the larger Y-dimension would be zero. The bar indicates the number of substitutions per polymorphic site.
For our data the deviation from the ideal tree topology was considerable (Figure 4), and we did not attempt to further reconstruct the phylogenetic history of the region.
Disequilibrium as a function of distance: Given that over a stretch of 170 kb the phylogeny of A. thaliana does not conform to a tree, reciprocal recombination or gene conversion has probably contributed considerably to the evolution of this species. Under reciprocal recombination the disequilibria between pairs of polymorphic sites are expected to fall off exponentially with distance. In contrast, gene conversion should generate no distance effect on disequilibria, if the average tract length is short.
We started our investigation of the relationship between distance and disequilibrium by grouping the single-nucleotide polymorphisms (SNPs) into 24 “blocks” as outlined in materials and methods. Pairwise linkage disequilibria between these blocks were negatively correlated with distance (r = −0.11, P = 6 × 10⁻⁵; Figure 5A). However, if only pairs of blocks with all four possible haplotypes were included in the analysis, i.e., pairs where a recombination event could be detected, the correlation between distance and linkage disequilibrium turned positive (r = 0.353, P = 10⁻⁵; Figure 5B). There is no neutral mechanism that results in a significant positive correlation between distance and disequilibrium. When we removed the one locus with a significant Tajima's D from the analysis (locus 6), the correlation between distance and disequilibrium vanished altogether (r = 0.07, P > 0.05; Figure 5C). This indicated that reciprocal recombination may not have been the primary mechanism for exchanging homologous DNA in the region.
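The restriction to pairs with four haplotypes amounts to a four-gamete filter, which can be sketched as follows. The sketch works on individual biallelic sites rather than on the blocks used in this study, and the function name, input layout, and toy data are hypothetical.

```python
from itertools import combinations

def four_gamete_pairs(columns, positions):
    """Return (distance, i, j) for site pairs showing all four haplotypes.

    `columns[i]` is a string holding the allele of every sequence at site i;
    `positions[i]` is the physical position of that site in base pairs.
    Sites are assumed to be biallelic.
    """
    out = []
    for i, j in combinations(range(len(columns)), 2):
        # Collect the distinct two-site haplotypes observed in the sample.
        if len(set(zip(columns[i], columns[j]))) == 4:
            out.append((abs(positions[j] - positions[i]), i, j))
    return out

# Hypothetical data: three sites scored in four sequences.
cols = ["AATT", "ATAT", "AATT"]
pos = [100, 600, 5000]
print(four_gamete_pairs(cols, pos))
```

Only the pairs returned by such a filter carry direct evidence of recombination, assuming no recurrent mutation, so they are the pairs for which a relationship between disequilibrium and distance is most informative.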
To quantify the mode of recombination more directly, we applied a statistical test designed to distinguish between reciprocal recombination and gene conversion (Wiehe et al. 2000).
Gene conversion vs. reciprocal recombination: We carried out coalescent simulations with a recombination rate of one-tenth the rate of mutation, which appears to be a reasonable value given estimates in the literature (Kuittinen and Aguadé 2000) and the observation of genome-wide linkage disequilibrium (Sharbel et al. 2000). In our simulations we distributed this “effective” rate of recombination between reciprocal recombination and gene conversion. A graph of the mean value of sign
Figure 5. Linkage disequilibrium as a function of distance. (A) All pairs of blocks of haplotypes included in the analysis. (B) Only those pairs of blocks are included where all four possible haplotypes were present, i.e., where a recombination event certainly has taken place. (C) Same as B, except that the nonneutral locus 6 was removed from the analysis.
DISCUSSION
Completely asexual reproduction halves the rate of adaptation compared to panmixis and is therefore usually regarded as a rare exception, if it exists at all (Fisher 1930/1999, p. 123). This may appear surprising, given the large number of selfing plant species and other asexual organisms, including bacteria. However, in most selfing plants inbreeding is not complete and even the existence of purely clonal bacterial populations has been doubted (Feil et al. 2001). Different accessions of A. thaliana can be crossed in the laboratory, which forms the basis of the large amount of classical mapping work carried out using this organism. However, in the wild A. thaliana is a selfer with a very low outcrossing rate of 0.3% (Abbot and Gomes 1989). Recent studies of this plant's molecular population genetics suggested that in spite of its selfing habit, it underwent recombination rather frequently (Kuittinen and Aguadé 2000), leading to a decay of linkage disequilibrium in worldwide samples over ~250 kb (Nordborg et al. 2002). In this study we contribute to the clarification of the apparent contradiction between selfing and the molecular data.
Nucleotide polymorphism: The genetic diversity in the MAM region is highly variable (Figure 2). In 13 out of the 14 cases the polymorphisms do not contradict neutral expectations. The one exception (locus 6, Figure 2) is currently annotated as a gene of unknown function. Every new genome that is sequenced reveals a large number of predicted genes to which no function can be assigned. Given that sequencing is usually easier than elucidating a gene's function, comparative sequencing combined with tests of neutrality might point to those genes whose products are most relevant to an organism's biology.
Figure 6. The mean of the test statistic sign as a function of the extent of gene conversion. Plotted are mean values (●) and 50% confidence intervals (▪). All blocks of polymorphisms were included in the analysis and the corresponding maximum-likelihood estimate (MLE) for gene conversion is ~90%. See materials and methods for details on the test statistic.
Multilocus disequilibrium: If multiple loci have been investigated, linkage disequilibrium can be assessed either by performing pairwise tests or by calculating the overall linkage disequilibrium. Pairwise tests are difficult to interpret, as they are not independent of each other. The test based on the mismatch distribution used in this study does not suffer from this uncertainty about its interpretation (Haubold et al. 1998). Moreover, it leads to the discovery of strong linkage disequilibrium not only in our data set, but also in a set of genome-wide AFLP markers (Sharbel et al. 2000). A lack of genome-wide linkage disequilibrium as suggested by Miyashita et al. (1999) would be hard to reconcile with the selfing habit of A. thaliana and previous findings in other selfing plant species (Brown et al. 1980).
Phylogeny: The average phylogeny differed from an ideal tree topology, which indicated that there was substantial recombination in the region (Figure 4). This difference becomes more pronounced if the genome-wide AFLP data published by Sharbel et al. (2000) are subjected to statistical geometry (Figure 7). This is not surprising, since disequilibrium decreases exponentially with distance. However, even in this situation the loci display significant genome-wide linkage disequilibrium (Sharbel et al. 2000). Having rejected the two extreme hypotheses of no recombination and of linkage equilibrium, we were interested in investigating the rate and mode of recombination.
Disequilibrium as a function of distance: It is clear that linkage disequilibrium should reflect genetic distance rather than physical distance. However, genetic positions are rather unreliable over short distances and hence we have used physical positions as a substitute (Nordborg et al. 2002).
In the MAM region linkage disequilibrium apparently decreases with distance (Figure 5A). However, a positive correlation with distance was observed when we analyzed only pairs of blocks displaying all four possible haplotypes (Figure 5B). The puzzle of finding a positive correlation was resolved when we removed the one locus with significant evidence for selection from the sample. The resulting data set showed no correlation between distance and disequilibrium (Figure 5C). It is clear that three haplotypes can be generated by mutation alone, while four haplotypes between two markers must be the result of recombination, assuming no recurrent mutation. Hence, Figure 5C shows a sample that has certainly been shaped by recombination, while in Figure 5A the pairs of positions may or may not have been affected by recombination. Nevertheless, under neutrality and reciprocal recombination the two samples should yield a similar decay of linkage disequilibrium with distance. This suggests that gene conversion has shaped the distribution of polymorphisms in this region.
Mode of recombination: Gene conversion has been at the center of recent empirical and theoretical population genetic studies. Langley et al. (2001) investigated the extent of linkage disequilibrium in the su(s) and su(wᵃ) loci on the Drosophila melanogaster X chromosome that are located in a region of reduced crossing over. In spite of low reciprocal recombination, the authors observed a similar genomic scale of linkage disequilibrium at the su(s) and su(wᵃ) loci as found in regions with normal rates of crossing over. This suggests that gene conversion is high in this region (Langley et al. 2001).
Wiuf and Hein (2000) have introduced gene conversion into coalescent models. These authors noted that there was no statistic available to assess the relative extent of recombination and gene conversion. Such a statistic,
These simulations were based on the assumption of neutrality, which may not apply throughout the region, especially at locus 6 (Figure 2). Removal of this locus from the plot of linkage disequilibrium as a function of distance resulted in zero correlation between the two variables (Figure 5C), which would be expected with high gene conversion rates.
We show that with an effective recombination rate of one-tenth the rate of mutation (c/μ = 1/10) and a 90% gene conversion rate the experimental data can be explained quite adequately (Figure 6). However, this calculation should be treated with caution, as the distribution of
Figure 7. Statistical geometry phylogeny for the 87 AFLP loci in 115 accessions in A. thaliana published by Sharbel et al. (2000). X, Y, and Z indicate the dimensions of the three-dimensional box from which the terminal branches (a–d) stick out. If the data were tree-like, the small Z-dimension as well as the larger Y-dimension would be zero. The bar indicates the number of substitutions per locus.
Acknowledgments
We thank Richard Hudson for providing his gene conversion simulation code and comments. Thanks are also due to U. Priedemuth for help with data handling. This work was supported by the Max-Planck-Gesellschaft; the Bundesministerium für Forschung, Germany, BMBF grant 0312705A to T.W.; the U.S. National Science Foundation (grant DEB-9527725); and the European Union.
Footnotes
- Sequence data from this article have been deposited with the EMBL/GenBank Data Libraries under accession nos. AF471728–AF472273.
- Communicating editor: S. W. Schaeffer
- Received September 5, 2001.
- Accepted April 22, 2002.
- Copyright © 2002 by the Genetics Society of America