Abstract
In Brassica species, self-incompatibility is controlled genetically by haplotypes involving two known genes, SLG and SRK, and possibly an as yet unknown gene controlling pollen incompatibility types. Alleles at the incompatibility loci are maintained by frequency-dependent selection, and diversity at SLG and SRK appears to be very ancient, with high diversity at silent and replacement sites, particularly in certain “hypervariable” portions of the genes. It is important to test whether recombination occurs in these genes before inferences about function of different parts of the genes can be made from patterns of diversity within their sequences. In addition, it has been suggested that, to maintain the relationship between alleles within a given S-haplotype, recombination is suppressed in the S-locus region. The high diversity makes many population genetic measures of recombination inapplicable. We have analyzed linkage disequilibrium within the SLG gene of two Brassica species, using published coding sequences. The results suggest that intragenic recombination has occurred in the evolutionary history of these alleles. This is supported by patterns of synonymous nucleotide diversity within both the SLG and SRK genes, and between domains of the SRK gene. Finally, clusters of linkage disequilibrium within the SLG gene suggest that hypervariable regions are under balancing selection, and are not merely regions of relaxed selective constraint.
The self-incompatibility (SI) recognition system in species of the mustard family (Brassicaceae) appears to be controlled by multiple alleles of members of a multi-gene family. Two of the genes that are probably involved have been well characterized and are expressed in the epidermal cells of the stigma during self-incompatibility. These two genes, SLG (S-locus glycoprotein) and SRK (S-receptor kinase), are physically linked, with the region between them spanning anywhere from a few to possibly as much as 100–200 kb (Boyes and Nasrallah 1993; Goring and Rothstein 1996; Yu et al. 1996; Conner et al. 1998), and they cosegregate with incompatibility types as S-allele haplotypes (Boyes and Nasrallah 1993; Boyes et al. 1997). The two genes share a region of homology (an S-domain within SRK is homologous to SLG), and both are highly polymorphic (Dwyer et al. 1991; Hinata et al. 1995; Kusaba et al. 1997). These properties have led to a number of suggestions about their functional relationships. Although it is not yet certain whether both genes have recognition functions, the SRK gene has been shown to be essential for self-incompatibility (Goring and Rothstein 1996), and there is evidence for a role for SLG (Nasrallah et al. 1992).
The fact that the S-domains of both loci are extremely polymorphic suggests that their role probably involves recognition functions. Major questions are then which portions of these domains are involved in recognition, whether both loci have such functions, and how the two loci interact. Observations that similarity between the S-domains of the two component loci in the same S-haplotype is greater than that between haplotypes (Stein et al. 1991; Dwyer et al. 1994; Hinata et al. 1995) suggested that, for functional self-incompatibility, alleles from the two loci must be similar (but see below). In this view, these two loci must be in linkage disequilibrium. This further suggests that recombination between the two loci is suppressed to maintain the matching (Stein et al. 1991). Low recombination would also be required in two-locus models of the genetic basis of incompatibility in which the pollen component of recognition requires a separate, linked gene, because recombination would otherwise create haplotypes with the pistil reaction of one type and the pollen reaction of another, which has never been detected (Lewis 1962). Sequence rearrangements and high sequence divergence in the Brassica S-locus region (Boyes et al. 1997) and in the flanking regions of the S-locus of Petunia inflata, a species with gametophytic self-incompatibility (Coleman and Kao 1992), seem to support rarity of recombinational exchange. These data are not conclusive, however, as divergence could be caused by relaxed selection in these flanking regions. Moreover, such differences are not unique to S-loci; equally striking differences are also found in maize intergenic regions, where recombinational exchange may also be rare (Sanmiguel et al. 1997).
It is thus important to find means to test whether recombination does or does not occur in the S-loci. A further reason for interest in recombination within the S-loci is the hope that sequence data can illuminate the question of which parts of the sequence encode the recognition functions. Given the large numbers of S-alleles in homomorphic incompatibility systems, it seems reasonable to think that the most polymorphic regions of the genes, the hypervariable (HV) regions, may encode recognition regions of S-proteins. There is some direct evidence that exchanging these regions between S-alleles in species with gametophytic self-incompatibility can change their specificity in some cases (Matton et al. 1997), though tests involving different alleles have yielded different results (Kao and McCubbin 1996; Zurek et al. 1997) and it seems clear that in these species other regions of the gene can affect incompatibility types. The sequences currently analyzed do not include any independently isolated alleles having the same incompatibility type. Such data could help provide evidence about which parts of the gene are not involved in determining specificity differences, though in the only available comparison, in P. rhoeas, only silent site differences were found (Walker et al. 1996).
If recombination were totally suppressed, however, the inference of functional importance from high levels of variability in particular parts of the sequence could not be sustained. The strong balancing selection in the S-loci would be expected to lead to extremely long-term maintenance of amino acid polymorphisms involved in recognition, and all parts of the locus should exhibit similar long coalescence times. In other systems, convincing evidence of selection has been provided by the finding of regions with amino acid polymorphism, i.e., by Ka/Ks ratios (Nei 1987) exceeding unity, as in the case of major histocompatibility complex (MHC) loci (Hughes et al. 1990). Such analyses are inconclusive for the S-loci, apparently because the polymorphism is too ancient (see below). Theoretical studies of neutral diversity at sites closely linked to sites under balancing selection show that, once stochastic equilibrium has been reached, diversity should be high at linked sites not subject to selection, and that this falls off as a function of recombination frequency between the sites (Strobeck 1983; Hudson and Kaplan 1988; Nordborg et al. 1996; Charlesworth et al. 1997; Takahata and Satta 1998). Without recombination, there should thus be no consistent regional heterogeneity in silent diversity, but just stochastic differences (Hudson 1990), even if selective constraints differ in the sequence. Linkage disequilibrium should also be high throughout the region. Recombination, however, allows different segments of a gene to have different evolutionary histories (Hudson 1983, 1990), so that differences in functional constraint could generate different levels of polymorphism (Maynard Smith et al. 1993; Klitz et al. 1995). High Ks values, and perhaps even amino acid replacements, are then expected at sites closely linked to regions where balancing selection is acting to maintain amino acid diversity (Strobeck 1983; Kreitman and Hudson 1991).
Thus our capacity to detect functionally different regions in genes where variant alleles have persisted for very long time periods, as the S-alleles have clearly done, depends on the occurrence of recombination.
No estimates of recombination rates from SI sequence data have yet been made. Such estimates may be impossible, because the high variability obscures patterns, and apparent “recombinant” sequence motifs may be caused by independent origination of similar sequences. It may thus be difficult to differentiate between convergent or parallel evolution and recombination (Gustafsson and Andersson 1994; O’Huigin 1995; Hughes and Yeager 1998). Ks values for pairs of S-locus sequences are very high (see below), suggesting that these sequences have reached saturation for silent substitutions, in which case measures of recombination are not informative as patterns of recombination become lost through mutation. Several methods exist to test for or estimate recombination in DNA sequence data (e.g., Stephens 1985; Hudson 1987; Sawyer 1989; Hey and Wakeley 1997), but balancing selection at the S-loci causes violations of the methods’ underlying assumptions. One such violation is that multiple substitutions at individual sites are frequent in the data for these loci (∼14% of the total number of sites in the Brassica campestris SLG gene, and 19% in B. oleracea), whereas the methods are based on the infinite sites model, under which sites should segregate for only two different bases. Furthermore, recombination will generally be incorrectly estimated for alleles maintained for long periods of time by balancing selection. Multiple substitutions can mimic recombination, but balancing selection can create genealogical effects similar to those caused by population structure, and thus reduce estimates (Hey and Wakeley 1997). Nevertheless, using two tests sensitive to low rates of recombination (Stephens 1985; Sawyer 1989), Clark and Kao (1991) found evidence for some recombination at the self-incompatibility locus in species of the Solanaceae, even though the analyses were performed with few alleles and the alleles analyzed came from four different species.
Another difficulty with the data currently available is that the sequences are not from a natural population sample. Rather, every sequence is from a known and different S-allele type sequenced from cultivated strains, which would on average be expected to differ more than randomly picked alleles, as is the case for MHC alleles (Takahata and Satta 1998).
The possibility of intragenic recombination within the SLG gene was suggested by Kusaba et al. (1997), who analyzed the sequences of a subset of 6 out of 21 B. oleracea Type I alleles and found evidence for HV regions having been “shuffled” between alleles. Their approach examined whether the topologies of the gene trees differed when they were estimated from different regions of the set of sequences. However, statistical testing for whether differences are significant is a problem for this approach, because different regions of a sequence have different mutational histories and would yield different topologies (which might have support from bootstrap tests) even if there were no recombination. It would be possible to perform tests using likelihood ratios for one region based on its own topology vs. that estimated from other regions, but our attempts to do this yielded ambiguous results, which depended on which particular sequences were included. A related, somewhat less ad hoc approach, based on differences in diversity in different gene regions, has been used as a test for recombination between MHC alleles, for which there is good evidence for the action of balancing selection (Hughes and Yeager 1998); it uses sliding windows analysis to estimate the variability in levels of nucleotide diversity between different parts of the sequence of the set of alleles, after removing the peptide binding region codons. This approach, however, requires prior knowledge of the functionally important regions of the protein product. It also suffers from difficulties similar to the tree-based approach and from the well-known problems of sliding windows methods, which can be very sensitive to window size.
To avoid some of the problems just mentioned and to attempt to test for recombination in S-loci, we calculated linkage-disequilibrium estimates between pairs of informative (segregating) sites. With recombination, pairs of sites relatively far apart should exhibit less linkage disequilibrium than sites close together in the sequence (e.g., Miyashita et al. 1993; Schaeffer and Miller 1993). This would be true even if recombination were infrequent (e.g., Guttman and Dykhuizen 1994) and even if reciprocal recombination did not occur, but only gene conversion events occur, causing exchange of sequence information between different alleles at S-loci (whether between homologous loci or between different members of the S gene family). The chief purpose of this article is thus to examine the evidence concerning whether any recombinational process has occurred in the Brassica S-locus region. Using published nucleotide sequence data from B. oleracea and B. campestris we do this by analyzing patterns of sequence diversity and linkage disequilibrium in the SLG gene. We further suggest that the data support the view that sites in the HV regions are under diversifying selection.
MATERIALS AND METHODS
DNA sequences of SLG and SRK loci from B. oleracea, B. campestris, and B. napus were obtained from GenBank. A total of 39 SLG sequences and 6 SRK sequences were analyzed (Table 1). These included only functional SLG-specific and SRK-specific sequences and only sequences encoding dominant (type I) S-alleles. Currently, there are only a few published type II SLG sequences, and these differ greatly in sequence composition from type I (Hatakeyama et al. 1998). Coding sequences were aligned with ClustalX version 1.64b (Thompson et al. 1997), after removal of introns in the case of SRK sequences (SLG genes and S-domains of SRK, which are the main focus of the analyses reported below, have no introns). Clustal outputs were manually edited using Seqpup version 0.6f (Gilbert 1997), and reading frames and amino acid positions were checked against published amino acid sequences (Kusaba et al. 1997). A number of gaps are necessary to align the sequences. In our alignments of SLG, four indels are polymorphic in both species (one consisting of 3 nucleotides, two of length 6 nucleotides, and one 15 nucleotides long in an otherwise relatively conserved region); in B. campestris a further three gaps are necessary: two are due to single codon additions each present in just one sequence, and one is a polymorphic indel of 6 nucleotides, while in B. oleracea one additional codon is present in just one sequence. Four of these indels are in HV regions I and II. The total sizes, including alignment indels, of the SLG and the SRK coding regions were 1350 and 2780 bp, respectively. After removal of all indels and portions of incomplete information at the ends of the sequences, 508 of the 1116 remaining coding sequence sites in B. campestris (46% of all sites) and 585 (52%) in B. oleracea were polymorphic, i.e., about half of all nucleotide positions in both species.
Mean pairwise proportions of synonymous substitutions (Ks) and nonsynonymous substitutions per site (Ka), and their standard errors, were calculated for the regions analyzed (see below) using MEGA version 1.01 software (Kumar et al. 1994). These values are equivalent to the nucleotide site diversity, or π, values of Nei (1987). Regions were classified as HV or “conserved” following Dwyer et al. (1991) and Kusaba et al. (1997). Expected means and standard errors for Ka:Ks ratios were calculated using the Delta method (Bulmer 1979). Differences in these ratios were tested for significance by z tests, using the standard error estimates. Estimates of substitution rate heterogeneity across the SLG gene were done by calculating maximum-likelihood estimates of the shape parameter, α, of the discrete-gamma model of substitution rate, using PAML version 1.3 (Yang 1996, 1997).
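The Delta-method step can be illustrated with a minimal sketch. It assumes the usual first-order approximation for the variance of a ratio and, as a simplification not stated in the text, zero covariance between the Ka and Ks estimates unless one is supplied. The function names and the numerical inputs are hypothetical and are not values from this study.

```python
import math


def ratio_se_delta(ka, se_ka, ks, se_ks, cov=0.0):
    """First-order Delta-method estimate of Ka/Ks and its standard error.

    cov is the covariance between the Ka and Ks estimates; it defaults to
    zero, which is an assumption of this sketch.
    """
    ratio = ka / ks
    # Relative variance of a ratio: sum of the relative variances of
    # numerator and denominator, minus twice the relative covariance.
    rel_var = (se_ka / ka) ** 2 + (se_ks / ks) ** 2 - 2.0 * cov / (ka * ks)
    return ratio, abs(ratio) * math.sqrt(max(rel_var, 0.0))


def z_difference(ratio_1, se_1, ratio_2, se_2):
    """z statistic for the difference between two independent ratios."""
    return (ratio_1 - ratio_2) / math.sqrt(se_1 ** 2 + se_2 ** 2)


# Hypothetical inputs, for illustration only.
conserved = ratio_se_delta(ka=0.10, se_ka=0.01, ks=0.20, se_ks=0.02)
hypervariable = ratio_se_delta(ka=0.30, se_ka=0.03, ks=0.30, se_ks=0.03)
z = z_difference(*conserved, *hypervariable)
```

A negative z in this layout means that the first region has the lower ratio.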
Linkage disequilibrium estimates and values of the Hill and Robertson (1968) measure, r², were calculated between pairs of sites in SLG segregating for two nucleotides, using DnaSP version 2.91b (Rozas and Rozas 1997). No measure of disequilibrium is completely independent of the nucleotide (allele) frequencies at the sites compared (Lewontin 1988), so normalization is desirable because of the very different allele frequencies observed at different sites in the S genes. We used D′, a measure of the degree of association between nucleotide variants of different polymorphic sites, normalized by Dmax, the largest possible value of D given the nucleotide frequencies at the sites (Brown 1975; Lewontin 1988), which appears to be preferable to other measures (Hedrick 1987). A disadvantage of D′, however, is that its distribution includes a high proportion of values close or equal to 0 or 1 (Golding 1984; Hudson and Kaplan 1985; Hedrick 1987; Schaeffer and Miller 1993). Also, sites with multiple substitutions tend to have low Dmax values (because they often have low allele frequencies), so that pairs of sites in which one or both are in this category have D′ values higher than those for pairs of sites with just two variants. Other linkage-disequilibrium measures were also calculated. Significance of the disequilibria was evaluated by two-tailed Fisher’s exact tests, and the set of results was adjusted for multiple comparisons using Bonferroni correction (Sokal and Rohlf 1995). First, second, and third nucleotide positions of codons were analyzed separately.
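The two measures can be made concrete with a short sketch for a pair of sites that each segregate for two nucleotides. It uses the standard textbook definitions of D, D′, and r². It is not the DnaSP implementation; the function name and the toy site columns are hypothetical.

```python
def pairwise_ld(site_a, site_b):
    """Return (D, D_prime, r_squared) for two biallelic sites.

    site_a and site_b give the nucleotide carried by each sequence,
    listed in the same sequence order.
    """
    if len(site_a) != len(site_b):
        raise ValueError("sites must cover the same sequences")
    if len(set(site_a)) != 2 or len(set(site_b)) != 2:
        raise ValueError("both sites must segregate for exactly two nucleotides")

    n = len(site_a)
    ref_a, ref_b = site_a[0], site_b[0]  # arbitrary reference nucleotides
    p_a = sum(a == ref_a for a in site_a) / n
    p_b = sum(b == ref_b for b in site_b) / n
    p_ab = sum(a == ref_a and b == ref_b for a, b in zip(site_a, site_b)) / n

    d = p_ab - p_a * p_b
    # Dmax is the largest |D| attainable given the nucleotide frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r_squared


# Toy columns: only three of the four possible haplotypes are present.
d, d_prime, r_squared = pairwise_ld("AAAATTTT", "CCCGGGGG")
```

In this toy case D′ equals 1 while r² is about 0.6, which illustrates why D′ values pile up at the extremes when one haplotype class is missing. The sign of D and D′ depends on the arbitrary choice of reference nucleotides; their absolute values do not.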
In both B. campestris and B. oleracea, <20% of polymorphic sites in third positions of codons segregate for three or four nucleotides, but it is necessary to check that these sites do not alter the conclusions. For comparisons between pairs of sites where at least one had more than two nucleotides segregating, a program was therefore written to calculate D′. The significance of each pairwise measure of disequilibrium was tested by a permutation approach suggested to us by W. G. Hill. For each contingency table, 100 randomizations of the observed values within the cells were performed, preserving the row and column totals. A contingency index (CI) was calculated for each permutation (Keeping 1962), the CI values were sorted, and linkage disequilibrium for the pair of sites was deemed significant at the P = 0.05 level if its CI was ≥95% of the values of the permuted tables. Only polymorphic sites in third positions of codons were analyzed in this way, owing to the very large numbers of pairs of polymorphic sites in these sequences.
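The permutation procedure can be sketched as follows. Shuffling the nucleotides at one site among the sequences keeps the row and column totals of the contingency table fixed, as described above. As a labeled substitution, Pearson's chi-square statistic stands in for the contingency index of Keeping (1962), which is not reproduced here. The function names, the seed, and the toy data are hypothetical.

```python
import random
from collections import Counter


def chi_square(site_a, site_b):
    """Pearson chi-square for the contingency table of two sites."""
    n = len(site_a)
    rows, cols = Counter(site_a), Counter(site_b)
    cells = Counter(zip(site_a, site_b))
    stat = 0.0
    for a, row_total in rows.items():
        for b, col_total in cols.items():
            expected = row_total * col_total / n
            stat += (cells[(a, b)] - expected) ** 2 / expected
    return stat


def permutation_p_value(site_a, site_b, n_perm=100, seed=0):
    """Share of permuted tables at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = chi_square(site_a, site_b)
    shuffled = list(site_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # margins of the table are unchanged
        if chi_square(site_a, shuffled) >= observed - 1e-12:
            hits += 1
    return observed, hits / n_perm


# Toy sites segregating for three nucleotides each, fully associated.
site_1 = "AAAAAACCCCCCGGGGGG"
site_2 = "TTTTTTAAAAAACCCCCC"
observed, p_value = permutation_p_value(site_1, site_2)
```

With only 100 permutations the smallest nonzero P value is 0.01, so the procedure can support a test at the 0.05 level but not much finer resolution.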
Table 1. DNA sequences used
The relationship between linkage disequilibrium and distance between the polymorphic sites was tested by Spearman’s rank correlations. Because multiple pairwise tests involve the same polymorphic sites, we did randomization tests of the significance of the relationships found. We used the second procedure of Schaeffer and Miller (1993), generating a large number of datasets in which the distances between polymorphic sites were randomly assigned from actual distances between these sites. (This is simpler than their first procedure, which assigns polymorphic sites to random positions in the sequence under study, but it should give similar results, as the S-allele sequences are polymorphic at almost half of all sites; see above.) For each randomization, the rank correlation was calculated. The value from the actual sequence data was compared with the distribution of values from the randomized sets of data and was considered significant at a given level if it exceeded the relevant percentile. As so many sites are segregating in the SLG sequences, this analysis was done only for third position sites of codons with two nucleotide variants.
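A minimal sketch of this kind of randomization test is given below. It assumes that the observed distances are simply shuffled among the site pairs, which is one reading of the procedure described above, and it is not the authors' program. The function names, the seed, and the toy data are hypothetical, and the inputs are assumed not to be constant.

```python
import math
import random


def average_ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


def distance_randomization_test(distances, ld_values, n_rand=500, seed=0):
    """Observed correlation and the share of shuffles at least as negative."""
    rng = random.Random(seed)
    observed = spearman(distances, ld_values)
    shuffled = list(distances)
    as_extreme = 0
    for _ in range(n_rand):
        rng.shuffle(shuffled)  # reassign the observed distances at random
        if spearman(shuffled, ld_values) <= observed:
            as_extreme += 1
    return observed, as_extreme / n_rand


# Toy data: disequilibrium falls steadily as distance increases.
toy_distances = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
toy_ld = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
rho, p_value = distance_randomization_test(toy_distances, toy_ld)
```

The test is one-sided because the prediction under recombination is a decline of disequilibrium with distance, so only correlations at least as negative as the observed one count against the null.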
RESULTS
Analyses of patterns of sequence diversity in the S genes: i. Nucleotide sequence differences within and between species for the SLG gene and the receptor-domain of the SRK gene: Some evidence about whether the S genes undergo recombination or not can be gleaned from analyses of sequence diversity in different regions of the loci. Much is already known about variability in the SI genes and its structure within genes (Hinata et al. 1995; Kusaba et al. 1997). Overall, mean pairwise diversity values for both genes are very high (Hinata et al. 1995) relative to estimates for most coding loci for substitutions between species (Li 1997) or polymorphisms (Moriyama and Powell 1995; Charlesworth and Awadalla 1998). Within the S-domains of both the SLG and SRK loci, HV regions occur in the same nucleotide positions (Kusaba et al. 1997; Charlesworth and Awadalla 1998). As sequence data from more SLG alleles have accumulated, the pattern of diversity within the locus has remained stable, showing that these patterns are common across alleles. Table 2 shows mean estimates of silent and nonsynonymous (replacement) differences per site in the coding portions of the S-domains of the SLG and SRK loci, separately for the HV and the conserved regions and in the kinase domain of the SRK gene. Surprisingly, Ks values for comparisons of alleles between B. oleracea and B. campestris are no larger than for within-species comparisons, except perhaps for the conserved region of SRK S-domains and the kinase domains (Table 2), which show some slight phylogenetic signal. Thus the estimated coalescence times for the polymorphic alleles appear so greatly to exceed the time since divergence of the two species that no silent substitution with respect to the other species is detectable. This is consistent with previous analyses suggesting very ancient divergence of S-alleles (Ioerger et al. 1990; Dwyer et al. 1991; Uyenoyama 1995), together with the evidence, based on the few available data (Lagercrantz 1998), that these Brassica species are relatively recently diverged. They differ in chromosome number (Goldblatt 1981), but no sequences are yet available for reference loci not subject to balancing selection, from which the expected silent divergence level could be estimated.
Table 2. Distribution of diversity values expressed as mean proportions of pairwise synonymous and nonsynonymous differences per site (Ks and Ka, respectively) in different parts of the coding regions of the SLG and SRK loci of Brassica oleracea (21 SLG and 3 SRK alleles) and B. campestris (18 SLG and 3 SRK alleles)
In the S-domains of both genes of both Brassica species, diversity in the HV regions is higher for synonymous as well as nonsynonymous substitutions than in the conserved regions, though Ka values differ more than Ks (Table 2). The extensive differences between alleles are consistent with the fact that the incompatibility alleles are a balanced polymorphism, maintained by frequency-dependent selection, so that long coalescence times of alleles are expected (Vekemans and Slatkin 1994). In the HV regions, Ka/Ks ratios (Nei 1987) are ∼1, significantly different from values elsewhere in the genes. For the SLG gene, z for the comparison between HV and other regions is –14.5 for B. oleracea and –18.2 for B. campestris, and for the S-domain of the SRK gene, the test values are –2.64 and –3.81, respectively; all these are significant at P < 0.01. These differences are particularly striking because Ks values are also extremely high in the HV regions. Even in the HV regions, however, the mean values of this ratio were not significantly >1. Because the SRK S-domain and the SLG gene have a similar level of polymorphism, it is possible that both participate in recognition of incompatibility types.
ii. Evidence for variability in substitutions at different sites in the SLG sequence: These analyses demonstrate that different regions of the S-domains have significantly different diversity values for both replacement and silent sites. The similarity in positions of the hypervariable regions in different species (see above) argues against a nonselective interpretation, but this is not conclusive because the species may be very close relatives (perhaps even able to hybridize occasionally), and gene conversion between the two loci could potentially cause similarity between them (though not, of course, if there is no recombination in this region of the genome).
As a further test for heterogeneity in substitution rates across the gene sequences, and to compare variability at the S-loci with data from other loci, substitution rates per amino acid site were estimated using the discrete gamma model (Yang 1996, 1997). This method is intended for analyses of divergence between gene sequences of different species, but it should also be appropriate for allele sequences in a nonrecombining region of genome. Increased rate variability might be expected if recombination occurs, though no explicit study of this appears to have been published. Because our aim is to test for recombination, nonsignificant rate variability would imply that there is no evidence for recombination from this type of analysis. The coefficients of variation for B. oleracea and B. campestris were 109 and 107, respectively. The shape parameters, α, of the gamma distribution of substitution rates (see Yang 1996) were also estimated. Low α values suggest rate variability within the sequence. Sites in SLG show rate variation (which was significant with P < 0.01 for both species by likelihood ratio tests): α = 0.57 ± 0.064 for B. oleracea and 1.28 ± 0.075 for B. campestris. These values are not particularly low (that for B. campestris is >74% of the 51 available values for vertebrate nuclear genes; see Zhang and Gu 1998), reflecting the fact that the S-alleles show high variability throughout the sequence. There are too few SRK sequences to do this kind of analysis.
iii. Variation between different domains of the SRK gene: The coding sequence of the kinase domain of the SRK gene has reduced nucleotide diversity relative to the S-domain (Hinata et al. 1995; Charlesworth and Awadalla 1998). Such polymorphism differences again indicate some independence in the evolution of these linked sequences. As just discussed, the polymorphism in the S-domain of the few SRK alleles currently available appears similar to that at the SLG locus (both have overall mean diversity per base of 0.13; see Hinata et al. 1995). The SRK kinase coding sequence, however, has fewer nonsynonymous and synonymous differences, and Ka/Ks lower than the conserved regions of the SLG gene or the SRK S-domain (Table 2). Consistent with the kinase domain’s overall lower variability, probabilities of radical amino acid differences between alleles within both species average less than half those of conservative differences, unlike the situation in the S-domain (Charlesworth and Awadalla 1998). Finally, Ks estimates from the kinase domain are less within species than between them, unlike the situation in the S-domains (Table 2). There are thus clear signs of selective constraint acting at the SRK locus, although even in the kinase domain the Ks values show ancient divergence between alleles. The regional variability suggests that the polymorphism in the kinase domain is caused by linkage to sites elsewhere in the SRK (perhaps the S-domain) that are under balancing selection, with some recombination in the locus. If this were true, the polymorphism in this domain should be highest at the 5′ end and would be expected to decline sharply. There is, however, no clear tendency in this direction. Silent polymorphism in the kinase domain declines slightly but nonsignificantly in both species (for B. oleracea, Ks in the 3′ half of the kinase coding sequence is 86% of its value in the 5′ half, and for B. campestris it is 87%).
Of the three intron sequences from multiple alleles, indel polymorphisms are also very abundant in introns 2 and 3, and much less so in intron 4 (Nishio et al. 1997); further data on this domain may illuminate this question in the future.
iv. Within- vs. between-haplotype comparisons of the SLG gene and the S-domain of the SRK gene: The view that recombinational exchange in the S-gene region is restricted initially appeared to be supported by the observation that the sequences of the SLG gene and the S-domain of the SRK gene from each haplotype tend to be much more similar than those from different haplotypes (Stein et al. 1991), but it has become clear that the similarity is far from complete, and that large differences between the two members of a given haplotype are not unusual (Goring and Rothstein 1996; Kusaba et al. 1997). For the few haplotypes sequenced for both loci, quantitative analysis of the data from the two Brassica species shows that both Ks and Ka values between haplotypes within species average almost double their within-haplotype values (Table 3). These differences are highly significant. For the conserved regions, the z value for Ka is –6.70, and for the HV region sites it is –11.3, while for Ks the value is –9.96, and for the HV region sites it is –9.90 (all P values < 0.001). Furthermore, when the different regions of the S-domains are analyzed separately, the HV regions again stand out as different from the conserved regions. Conserved regions differ rather more between than within haplotypes in proportions of both synonymous and replacement substitutions, but the HV regions show greater differences between, though not within, haplotypes.
Table 3. Comparisons of synonymous and nonsynonymous differences per site between SLG and the S-domain of SRK alleles, within haplotypes and between different haplotypes
Testing for linkage disequilibrium within the SLG gene: Figure 1 summarizes the patterns of linkage disequilibrium between pairs of segregating sites in third positions of codons within the SLG locus. There are currently too few sequences available to analyze the SRK locus in this manner. Three kinds of tests were done to ask whether linkage disequilibrium declines with distance in the SLG locus. In the first test, pairs of sites segregating for only two nucleotides were analyzed. These results are summarized in Figure 1, which shows, for the two species, estimates of r² for pairs of sites grouped according to the nucleotide distance between them. Results are also shown for sequences from which the HV regions, which exhibit clusters of linkage disequilibrium (see below), had been removed. Linkage-disequilibrium values decrease significantly with distance. Spearman rank correlations of r² with distance were –0.071 and –0.134 for B. oleracea and B. campestris, respectively, both with P < 0.01; the corresponding values when the HV regions were removed from the sequences were –0.074 and –0.083, both with P < 0.01. Out of 500 data sets with randomized distances between sites, none equalled or exceeded these correlation values for either species, either for all sites segregating for just two different nucleotides in third positions of codons, or for the set of sites excluding the HV regions.
A second test was based on all segregating sites in third positions of codons. These are too numerous to analyze the individual distance/disequilibrium values. Spearman rank correlations of D′ with distance between sites were therefore performed using mean values for the distance categories in Figure 1. For the two species, the correlations were –0.54 (P > 0.05) and –1.0 (P < 0.01), respectively. This is a very conservative test for a decline in association with distance, owing to the high frequency of large D′ values expected even in the presence of free recombination (see materials and methods section). Also, the means for pairs of sites at the greatest distances apart (which represent quite small proportions of all site pairs tested) can be inflated by a few high, but nonsignificant, D′ values.
Figure 1. Relationship of linkage disequilibrium with distance for the SLG sequences of two Brassica species. Linkage disequilibrium estimates in terms of r² are shown for sites segregating for only two nucleotide variants, excluding pairs of sites with singleton variants (black bars), and after excluding HV region sites (gray bars). (a) Brassica oleracea. (b) Brassica campestris.
The frequencies of significant associations between sites at different distances are therefore better for testing the relationship of linkage disequilibrium to distance between sites. It is also of interest to examine which parts of the gene show significant linkage disequilibrium (see below). This third kind of test is still very conservative as it takes no account of the values of the disequilibria and is based on small numbers of mean values (all tests became more highly significant with finer division of distances than in the figures). Figure 2 shows the effect of distance on the chance of observing a pairwise association that was significant at the P = 0.01 level. The Spearman rank correlations for B. oleracea were –0.99 (P < 0.01) for sites segregating for only two nucleotides and –0.77 (P < 0.05) for all sites; the corresponding values for B. campestris were –0.63 (P > 0.05) and –0.81 (P < 0.05). The results were similar when the HV regions were removed from the sequences (Figure 2). In all analyses, the frequency of significant linkage disequilibria drops off sharply when sites are >600 nucleotides apart (Figure 2). Calculations of linkage disequilibrium for the first and second nucleotide positions showed similar relationships for all measures of linkage disequilibrium.
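The third test can be sketched as follows, assuming a two-sided Fisher's exact test on the 2 × 2 table of haplotype counts for each pair of sites. Whether the original analysis was one- or two-sided is not stated here, so that choice is an assumption. The function names and the distance-bin labels are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided P value for the 2x2 table [[a, b], [c, d]] of haplotype counts."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more probable than the observed one
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

def fraction_significant(pairs, alpha=0.01):
    """pairs: iterable of (distance_bin, (a, b, c, d)); returns {bin: fraction with P < alpha}."""
    totals, hits = {}, {}
    for bin_label, table in pairs:
        totals[bin_label] = totals.get(bin_label, 0) + 1
        if fisher_exact_two_sided(*table) < alpha:
            hits[bin_label] = hits.get(bin_label, 0) + 1
    return {k: hits.get(k, 0) / totals[k] for k in totals}
```

The per-bin fractions returned by `fraction_significant` are the kind of quantity that can then be rank-correlated with distance.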
It is very unlikely that alignment errors could have produced these results, even though in principle such errors might obscure linkage disequilibrium between distant sites. As explained in materials and methods, most of the sequences align unambiguously despite the very high diversity, and, of the eight indels, three are additions of single codons in just one of the sequences, so errors in these could not greatly affect the results; in addition, one 12-nucleotide indel is present only in B. oleracea and cannot affect the results from the other species. We tested whether there is evidence for recombination within a portion of the sequence that contains no indels (in which alignment is clear), but this portion is only 123 codons in length and does not show the pattern of linkage disequilibrium found overall, perhaps unsurprisingly for such a short region (see Figures 1 and 2). However, the results in general cannot be explained by alignment errors. With such errors, linkage disequilibria should be particularly infrequent in the regions near the indels, but this is not the case; four of the eight indels are in HV regions, which thus contain half of the indels but include more significant disequilibria than other regions (see above). Furthermore, alternative alignments yield essentially the same results (the evidence for decline in linkage disequilibrium within SLG either becomes very slightly stronger, or remains the same). The same is true when we aligned the sequences using a different algorithm (PILEUP of the University of Wisconsin Genetics Computer Group, GCG). Finally, omitting the HV regions yields a decline of linkage disequilibrium similar to that seen in the analyses of the entire sequence, which again suggests that the addition of gaps to the sequences is not obscuring linkage disequilibrium.
Figure 2. Frequencies of linkage disequilibria in the SLG sequences that are significant by Fisher’s exact tests, for sites at different distances in the two Brassica species. The analyses were done for sites segregating for only two nucleotide variants, excluding pairs of sites with singleton variants (black bars) and after excluding HV region sites (gray bars). (a) Brassica oleracea. (b) Brassica campestris.
Structure of associations between variants within the SLG gene: Many of the significant nonrandom associations (P values <0.001 that remain significant after Bonferroni correction) involve sites within HV regions I and II and between pairs of sites from these two regions (Figure 3), and these sites are significantly overrepresented among the associations that are significant (Table 4), even though scattered sites with very high diversity at third codon positions are also found elsewhere throughout the SLG sequence. Moreover, the C-terminal regions, with high diversity similar to the HV regions, include no sites in linkage disequilibrium. Significant linkage disequilibria were also detected for first and second segregating positions of codons across the SLG gene, and again the sites involved tended to be those within the HV regions. Given the effect of distance on linkage disequilibrium, the excess representation of these sites argues for functional importance of the HV regions. All pairs of first and second position segregating sites with significant values after Bonferroni correction involved replacement substitutions. Segregating second position sites that exhibit strong linkage disequilibria may thus be candidate amino acid positions that contribute to differences in allelic specificity.
Figure 3. Results of Fisher’s exact tests of linkage disequilibrium in the SLG gene. The analyses were done using pairs of third position polymorphisms segregating for only two nucleotides, excluding singletons. Empty boxes indicate pairs of sites that showed no significant associations, black boxes indicate significance after adjustment for multiple comparisons using Bonferroni correction, and gray boxes indicate all (uncorrected) tests that were significant with P < 0.01.
Table 4. Tests of significance of the HV regions’ representation among pairs of sites involved in linkage disequilibria, based on third position sites segregating for two nucleotides only
DISCUSSION
The effects of distance and region on linkage disequilibrium: Our analyses reveal a strongly significant effect of distance between sites in the SLG gene for two of the three estimators of linkage disequilibrium in both species analyzed (Figures 1 and 2), though the bimodal distribution of D′ (Hudson and Kaplan 1985; Schaeffer and Miller 1993) obscures its relationship with distance. Reduced recombination has been documented for a number of different coadapted gene complexes such as in regions of genomes involved in recognition processes (Ferris and Goodenough 1994; May and Matzke 1995). Among these are the “supergenes” that appear to control the phenotypic differences, including incompatibility types, of the two morphs in distylous plants (Haldane 1933; Ernst 1936; Charlesworth and Charlesworth 1979). The evidence for these supergenes is at present entirely from classical genetic studies; no molecular analyses have yet been possible, so definitive evidence is lacking. The evidence for a tightly linked complex of genes in the S-locus region is based on molecular data, but we have argued here that it is also not definitive, and that the sequence data give serious grounds for suspecting that recombination has occurred over the evolutionary time during which this polymorphism has persisted. Over the relatively small lengths of DNA in the S-locus region, the estimated average recombination frequency for Brassica suggests that the map distance would be <1 cM, so recombination is unlikely to be detected in classical linkage tests with the family sizes that can be tested (Boyes and Nasrallah 1993; Yu et al. 1996; Conner et al. 1998). Nevertheless, over evolutionary time even rare events can cause different parts of a gene to have essentially independent evolutionary histories (see, e.g., Hudson 1990; Guttman and Dykhuizen 1994; Nordborg et al. 1996; Charlesworth et al. 1997; Kelly 1997; Andolfatto and Nordborg 1998).
In MHC loci, for instance, sites outside the peptide-binding regions are much less diverse than sites within those regions themselves (Takahata and Satta 1998).
Significant clusters of linkage disequilibrium were found predominantly to involve sites in the HV regions. If recombination occurs in the S-domains of the incompatibility loci, this may suggest that these sites have functional importance in recognition processes. Before we can conclude that this is true, we must, however, consider alternative possibilities. Population subdivision cannot explain these findings, as it should cause disequilibrium across the whole gene and does not produce a clear relationship between distance and linkage disequilibrium. The relationship of linkage disequilibrium with distance exists even when the HV regions are removed from analyses, so it is not simply caused by greater power to detect associations in the most variable regions. The evidence is therefore consistent with the view that sites in the SLG gene have recombined over evolutionary time, both within and outside the HV regions, such that sites far apart are not in linkage disequilibrium, even if sites close together are. Even if the HV regions have less recombination than the rest of the gene, this could not account for these regions’ locally higher diversity, unless they differ in their selective regime.
The view that the S-alleles recombine or undergo some other kind of exchange, such as gene conversion, is in apparent contradiction with some recent evidence that recombinant alleles (chimeric constructs between two Nicotiana alata alleles; Zurek et al. 1997) may fail to be recognized by pollen of either allele. Such tests have not as yet been performed in Brassica species, but if this turns out to be a general property of S-loci, it would imply that recombination yields alleles that are nonfunctional in incompatibility. If such alleles are regularly generated in self-incompatible species and regularly eliminated, a selective force against recombination would be generated. Similar tests using different alleles have yielded different results (Matton et al. 1997), so it is clear that at present more evidence is needed, including evidence from the Brassicaceae, on the extent to which different parts of the S-allele sequences are essential for correct recognition.
Patterns of diversity: balancing selection or reduced selective constraints? Taking the conserved regions as a reference, it appears that HV regions are evolving in a manner different from that of other parts of the SLG gene. There are two very different possibilities for these regions. The hypothesis that they are under balancing selection is attractive, given that self-incompatibility alleles are known to be subject to frequency-dependent selection, but it is difficult to rule out the possibility that they are evolving neutrally. Ka/Ks ratios ∼1 are usually considered evidence of neutral evolution, whereas genes under balancing selection may have values of this ratio >1 (Nei 1987). Under balancing selection, we may expect to see an initial increase in Ka relative to Ks early in the evolution of polymorphism at these loci (Nei 1987). Ka/Ks ratios >1 may thus indicate regions under balancing selection (Hughes et al. 1990).
However, initially high values are expected to change over evolutionary time: once silent substitutions are close to saturation, Ka/Ks will tend to increase. If, on the other hand, Ka ceases to increase, the opposite change could occur. This might happen if only a subset of amino acid residues undergoes adaptive substitutions, while others are conserved. Thus an initially high ratio could fall below 1 (Nielsen and Yang 1998). In addition, assignment of sites as replacement or silent becomes uncertain over long evolutionary times, when divergence between sequences is high. Ka/Ks ∼1 could therefore result from balancing selection or neutrality, and it is impossible to infer selection unless ratios well above 1 are found, which is not the case for the sequences examined here.
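The effect of silent-site saturation on the ratio can be illustrated numerically. This sketch is not part of the analyses reported here. It assumes a Jukes-Cantor model for the expected proportion of differing sites, uses uncorrected proportions in place of estimated Ka and Ks, and the rates and divergence times are hypothetical values chosen only to show the trend.

```python
from math import exp

def observed_difference(true_distance):
    """Expected proportion of differing sites under Jukes-Cantor,
    given the true number of substitutions per site."""
    return 0.75 * (1 - exp(-4 * true_distance / 3))

def uncorrected_ratio(ka_rate, ks_rate, time):
    """Ratio of observed replacement to observed silent differences at a given time."""
    return observed_difference(ka_rate * time) / observed_difference(ks_rate * time)

# Hypothetical rates: replacement substitutions accumulate at one-fifth the silent rate.
times = (0.1, 0.5, 1.0, 2.0, 5.0)
ratios = [uncorrected_ratio(0.2, 1.0, t) for t in times]
```

With these assumed values the ratio starts near the true rate ratio of 0.2 and climbs toward 1 as silent differences approach their ceiling while replacement differences continue to accumulate. This matches the direction of change described above.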
HV regions have been identified in essentially identical regions in S-genes of other species, including B. napus (a species of hybrid origin with B. campestris and B. oleracea as putative parents; see Goring et al. 1992) and Raphanus sativus (Sakamoto et al. 1998). These similar locations, found also between HV regions in the SLG and in the S-domain of the SRK, tell us that the high diversity in HV regions is not merely chance variability in substitution rates, as it would then be highly unlikely that different genes in different species should share similar diversity patterns. The data might then be taken as evidence for selection acting in similar functional parts of the proteins these genes encode. The similarity in location of HV regions could, however, be caused by relaxed selective constraints in these regions, as could the greater divergence between these regions in different haplotypes, compared with other regions of the S-domains (see above). Regions under relaxed selective constraint should, however, have diverged since the two Brassica species analyzed here separated from one another. There is no evidence for this. Between-species Ks values for the HV regions are similar to those for different alleles within species. It therefore appears that, not only do the HV regions exhibit significant linkage disequilibrium, but variability in these regions has been maintained since before these two species diverged. Thus the diversity data alone do not allow us to distinguish definitively between these possibilities, though, in conjunction with the evidence that recombination has occurred, they tend to support the view that diversifying selection acts in these regions.
Our analyses suggest that HV regions are important in allelic specificity, but this does not imply that no other regions play any part. Two functionally distinct B. campestris SLG alleles with 97% amino acid similarity, Bca8 and Bca46, studied by Kusaba et al. (1997) proved to have identical HV region sequences. This suggests that, at least for some alleles, changes outside these regions can determine specificity or that changes at other genes (most likely SRK) affect allelic specificity either interactively with SLG alleles or alone. Interestingly, 12 of the 23 amino acid substitutions between the Bca8 and Bca46 alleles lie within the first 70 amino acid positions of the SLG gene, and this region is also quite polymorphic in the entire set of alleles.
Acknowledgments
We thank Brian Charlesworth, Mikkel Heide Schierup, Gil McVean, W. G. Hill, Molly F. Przeworski, and Bryant McAllister for discussions and advice on analyses. We also thank Jody Hey and D. S. Guttman. D.C. was supported by the Natural Environment Research Council of Great Britain, and P.A. by an Edinburgh University Faculty of Science and Engineering Scholarship.
Footnotes
- Communicating editor: G. B. Golding
- Received September 30, 1998.
- Accepted January 22, 1999.
- Copyright © 1999 by the Genetics Society of America