Natural selection and neutral processes such as demography, mutation, and gene conversion all contribute to patterns of polymorphism within genomes. Identifying the relative importance of these varied components in evolution provides the principal challenge for population genetics. To address this issue in the nematode Caenorhabditis remanei, I sampled nucleotide polymorphism at 40 loci across the X chromosome. The site-frequency spectrum for these loci provides no evidence for population size change, and one locus presents a candidate for linkage to a target of balancing selection. Selection for codon usage bias leads to the non-neutrality of synonymous sites, and despite its weak magnitude of effect (Nes ∼0.1), is responsible for profound patterns of diversity and divergence in the C. remanei genome. Although gene conversion is evident for many loci, biased gene conversion is not identified as a significant evolutionary process in this sample. No consistent association is observed between synonymous-site diversity and linkage-disequilibrium-based estimators of the population recombination parameter, despite theoretical predictions about background selection or widespread genetic hitchhiking, but genetic map-based estimates of recombination are needed to rigorously test for a diversity–recombination relationship. Coalescent simulations also illustrate how a spurious correlation between diversity and linkage-disequilibrium-based estimators of recombination can occur, due in part to the presence of unbiased gene conversion. These results illustrate the influence that subtle natural selection can exert on polymorphism and divergence, in the form of codon usage bias, and demonstrate the potential of C. remanei for detecting natural selection from genomic scans of polymorphism.
NEUTRAL and selective processes interact to shape polymorphism and divergence across genomes, yet it continues to be a difficult problem to derive a robust understanding of the relative influence of the different component forces that contribute to evolution. Demographic change in populations is a neutral process that affects the entire genome, so population samples for a large number of loci are required to accurately infer the general features of demographic history that are recorded in patterns of DNA sequence polymorphism. Selection, on the other hand, acts locally on specific targets, so a reasonable characterization of the background patterns molded in the genome by global forces (like population size change) is necessary to accurately detect signatures of selection from molecular variation relative to this background. Here, I characterize nucleotide polymorphism and divergence across the X chromosome of the nematode Caenorhabditis remanei, an obligately outbreeding relative of the classic model organism C. elegans, to elucidate the contributions of selective and neutral processes in the evolutionary history of this organism.
Due to recent completion of a genome-sequencing effort and its close relationship with C. elegans, C. remanei is experiencing new interest as a focal taxon for comparative research in evolutionary genetics and genomics (Haag et al. 2007). The high nucleotide diversity previously reported for this species provides ample data for population genetic inference from even short stretches of DNA sequence (Graustein et al. 2002; Jovelin et al. 2003; Haag and Ackerman 2005; Cutter et al. 2006a). Analysis of a sample in a previous study suggested that the effective population size of C. remanei is large (Ne ∼1.6 × 106) and that its demographic history as reflected in sequence polymorphisms may be relatively uncomplicated (Cutter et al. 2006a), which should facilitate attempts to infer the action of natural selection from patterns of polymorphism. Beyond this, however, the structure and dynamics of C. remanei populations are unknown. An understanding of how natural selection shapes the C. remanei genome will provide important evolutionary insights from a species that represents the ancestral obligately outcrossing breeding system in the genus, and therefore a contrast to the derived condition of hermaphroditism in C. elegans and C. briggsae (Cho et al. 2004; Kiontke et al. 2004).
The large effective population size and obligate outcrossing of C. remanei, in contrast to its self-fertile relatives (Sivasundar and Hey 2003; Barrière and Félix 2005; Cutter 2006; Cutter et al. 2006b), imply that even very weak selection may yield an evolutionary response and that the response to selection may be tempered by the local recombination environment. Selection among alternative synonymous codons for translational efficiency and/or accuracy is expected to be due to very small selection differentials (Li 1987; Bulmer 1991), yet is a notable force shaping coding sequences within a variety of taxa (Duret 2002; Merkl 2003), including nematodes (Stenico et al. 1994; Duret and Mouchiroud 1999; Cutter and Charlesworth 2006; Cutter et al. 2006c). Directional selection in general eliminates neutral genetic variation in the neighborhood of a selected target (Maynard Smith and Haigh 1974; Charlesworth et al. 1993). However, the size of the region that is subject to reduced variation will depend on the number of recombination events that occur by the time that the selected variant becomes fixed in (or eliminated from) the population, resulting in a prediction of reduced neutral genetic diversity in regions that experience low levels of recombination (Wiehe and Stephan 1993; Hudson and Kaplan 1995). However, some neutral processes might result in a similar pattern (Lercher and Hurst 2002; Hellmann et al. 2003), making empirical support mixed for selective explanations of diversity–recombination correlations. Thus, it is of considerable interest to understand the relative importance of selective and neutral forces in shaping patterns of polymorphism across genomes.
Here, I survey nucleotide polymorphism for 40 loci across the X chromosome in a sample of C. remanei to (i) infer the potential role of demographic history in shaping patterns of genetic variation, (ii) quantify weak selection on codon usage, and (iii) evaluate the potential for selection at linked sites to alter diversity levels in genomic regions experiencing different recombination environments.
MATERIALS AND METHODS
Genomic DNA was amplified from C. remanei individuals by directly adding single males to QIAGEN (Valencia, CA) Repli-g mini kit reactions. Diluted aliquots of the resulting whole-genome-amplification DNA samples were then used as template for locus-specific amplification by PCR. I designed PCR primers with Primer3 (http://primer3.sourceforge.net) for putatively X-linked loci with predicted long exons, using contigs that had strong synteny to the C. elegans X chromosome from the preliminary assembly of the C. remanei genome sequence produced by the Washington University School of Medicine Genome Sequencing Center (supplemental Table 1). The X chromosome experiences the greatest interspecific synteny in Caenorhabditis (Stein et al. 2003; Hillier et al. 2007). Both strands of the resulting PCR products were then sequenced directly on an ABI 3730 through the University of Arizona's Genomic Analysis and Technology Core sequencing service. Five of 45 primer sets yielded obvious double peaks in all or some sequence traces, indicating the presence of heterozygotes and therefore linkage to autosomes; in this study, I focus only on the remaining putatively X-linked loci (supplemental Table 1). Sequences for each of 40 loci were obtained for 16 C. remanei individuals (supplemental Table 1): 14 were derived from isofemale lines collected by S. E. Baird in the Wright State University forest in Dayton, Ohio (PB207, PB210, PB211, PB213, PB215, PB219, PB242, PB243, PB244, PB247, PB249, PB252, PB253, PB275), as well as strains PB4641 (Brooklyn, NY) and SB146 (Freiburg, Germany) from the Caenorhabditis Genetics Center (Cutter et al. 2006a). All analyses of polymorphism focus on the sample of strains from Ohio, unless stated otherwise. All loci include coding sequence only, and average 742 bp in length after trimming primer and ambiguous bases. Two loci have slightly smaller sample sizes because PCR was unsuccessful for one (Cre-lam-2) or three (Cre-mbk-1) strains. Assuming that the C. remanei X chromosome is of a size similar to the C. elegans X chromosome (∼20 Mb, ∼50 cM), these loci are expected to be spaced ∼2/Mb and 1/cM.
Sequence trace editing and alignment were performed using Sequencher v. 4.7 and BioEdit v. 7.05.3 to confirm sequence quality and to remove primer sequences. I used DnaSP v. 4.10.9 to calculate diversity from pairwise differences (π) and from the number of segregating sites (θ) (Watterson 1975; Nei and Li 1979) and to conduct tests of neutrality [Tajima's (1989) D, Fu and Li's (1993) D* and F*], each computed separately for synonymous (denoted with subscript “s”) and nonsynonymous sites (denoted with subscript “a”). SITES was used to calculate Tajima's D for the restricted set of sites corresponding to preferred–preferred and unpreferred–unpreferred polymorphisms (Hey and Wakeley 1997). All correlations use the nonparametric Spearman's rank correlation, rSRC.
The population recombination parameter (ρ = 4Ner; effective population size Ne, recombination rate r) was inferred with LDhat (ρLDhat; McVean et al. 2002; G. McVean, http://www.stats.ox.ac.uk/∼mcvean/LDhat/) and maxhap (ρH01 and ρH01+f, which accounts for gene conversion, f; R. Hudson, http://home.uchicago.edu/∼rhudson1) based on Hudson's (2001) composite-likelihood method. LDhat uses a finite-sites mutational model, unlike the infinite-sites model used by maxhap, which results in slightly different estimates of ρ despite an otherwise identical composite-likelihood methodology. Values of ρLDhat and ρH01 were strongly correlated (rSRC = 0.87, P < 0.0001; Ohio sample only). With maxhap, the ratio of gene conversion to recombination rate (f) was inferred jointly with ρH01+f. Although gene conversion tract lengths are unknown in this species, transgene-mediated gene conversion in C. elegans following double-strand break repair of transposon excision yields tracts at least 191 bp long (Plasterk and Groenen 1992) and reports of a few hundred base pairs are typical of other taxa (Hilliker et al. 1994; Jeffreys and May 2004); a gene conversion tract length of 400 bp is assumed in the calculations presented here. Estimates of ρH01+f made with other tract lengths (50, 200, 1000) did not dramatically alter the values (not shown), although shorter tract lengths yielded somewhat higher estimates of f and slightly lower estimates of ρH01+f. Values for ρH01 could not be obtained for two loci (Cre-myo-2 exon 8, Cre-let-2), which were excluded from corresponding analyses.
The C. elegans and C. briggsae orthologs of the C. remanei loci were inferred by identification in the TreeFam database on the WormBase website and by manual reciprocal-best-hits Blast. Alignments of the coding sequences that correspond to the regions assayed for polymorphism were made by eye or with the aid of ClustalW in BioEdit. I then used the program Kaks_calculator (Li et al. 2006) to compute pairwise rates of synonymous (dS) and nonsynonymous (dN) site divergence with the Goldman–Yang method (Goldman and Yang 1994) for the coding regions containing C. remanei polymorphism data, where a consensus sequence was used for C. remanei. Because C. remanei shares a most recent common ancestor with C. briggsae (Kiontke et al. 2004), C. briggsae–C. remanei measures of divergence are used unless stated otherwise. The C. remanei consensus sequences were also used for computation of the codon bias statistics Fop (frequency of optimal codons; Stenico et al. 1994) and ENC (effective number of codons; Ikemura 1985) as well as G + C base content, using the program CodonW (J. Peden, http://codonw.sourceforge.net) and applying the C. elegans optimal codon table (Stenico et al. 1994). The resulting Fop and GC-content values were then used to generate corrected estimates of C. briggsae–C. remanei dS (dS′) via multiple regression that removes the correlation between synonymous divergence, codon bias, base composition, and their interaction.
C. remanei synonymous-site polymorphisms were characterized as encoding preferred–unpreferred variants (PU) if one variant yielded a preferred codon in the gene sequence and the other segregating variant produced an unpreferred codon, according to the C. elegans optimal codon table, as in Cutter and Charlesworth (2006). Similarly, unpreferred–unpreferred (UU), preferred–preferred (PP), and GC-AT variant sites were identified (211 total sites in 36 loci formed GC-AT variants at UU or PP sites). Using the frequencies of the preferred variant at PU sites, I applied a maximum-likelihood procedure (Cutter and Charlesworth 2006) based on the model of McVean and Charlesworth (1999) to infer the selection intensity (γ = 4Nes, selection coefficient s) acting on preferred codons for each locus individually and for all loci considered together (a total of 529 PU sites in 40 loci). All calculations are based on polymorphism in the Ohio sample of strains. The implementation used here corrects an error in normalization of the likelihoods (B. Charlesworth, personal communication) that was discovered in the program used in Cutter and Charlesworth (2006); however, the effect is small, and the error was conservative with respect to detecting selection on codon usage. The statistical significance of the maximum-likelihood estimates for γ were inferred from overlap with γ = 0 of the 2lnL interval. Correlations involving per-locus maximum-likelihood estimates of γ for the Ohio sample exclude Cre-let-2.
Coalescent simulations using Hudson's (2002) program ms were implemented to test the distribution of Tajima's D for consistency with expectations under neutrality. Specifically, 10,000 replicates of 9 or 35 loci (see below) were generated using corresponding observed values of the number of segregating synonymous sites and sample size for the Ohio strains only (Table 1; supplemental Table 2) from which Tajima's D was calculated to compute the expected mean and variance in D for the loci under neutrality (using E. Stahl's program samplestats, http://molpopgen.org/software). The observed mean and variance in D was then compared to the simulated distributions. Different loci were assumed to be unlinked, and intralocus recombination was either excluded or included using observed population recombination parameter values of ρLDhat for corresponding loci. To reduce the potential for selection on synonymous sites to compromise the analysis, I restricted the empirical data set in each of two ways (in both cases also excluding one locus with evidence of selection). First, I considered polymorphic synonymous sites for the 9 loci that exhibited low overall codon bias (Fop < 0.5). Second, I considered only sites that defined preferred–preferred or unpreferred–unpreferred polymorphisms for the 35 loci containing such variants, because these variant sites are unlikely to experience direct selection.
I also performed a series of coalescent simulations to test for correlations between θ and Hudson's (2001) estimator of ρ (ρH01). Sets of 1000 neutral genealogies were simulated for each combination of fixed values of θsim (1, 5, 10, or 20), ρsim (1, 4, 10, or 16), fsim (0, 1/2, 2, 4, or 8), and gene conversion tract length (100, 400, or 1600 bp). Gene conversion was incorporated in the simulation of neutral genealogies with the parameter f, the ratio of gene conversion to recombination rate (Wiuf and Hein 2000; Hudson 2002). Then, θs, ρH01, and ρH01+f were estimated from the resulting simulated neutral genealogies (using samplestats and R. Hudson's programs exhap and maxhap, http://home.uchicago.edu/∼rhudson1). Estimates of θ and ρ (ρH01, ρH01+f) for a set of input values therefore varied solely due to stochasticity in the coalescent process and were computed along with Spearman-rank correlations between the θ and ρ estimates across each of the 1000 replicates. The resulting correlation coefficients were then evaluated in an ANOVA model as a function of the input values to determine the influence of each parameter, their second-order polynomials, and their first-order interactions on the correlation between estimated θ and ρ. A recursive partition analysis (JMP 5.0.1) was also applied to derive a heuristic interpretation of the effects of the dominant explanatory variables. This simulation regime tests for whether stochasticity in the coalescent process alone can generate correlations between estimates of θ and ρ, even when there is no underlying variation among loci in these two parameters.
Multilocus patterns of polymorphism and divergence:
The 40 loci sampled here for 16 individuals, putatively linked to the C. remanei X chromosome, provided 953 polymorphic sites and yielded average nucleotide diversities at synonymous sites (πs) of 3.6% and at nonsynonymous sites (πa) of 0.1%. For the subsample of 14 strains from Ohio, the diversity values are comparable (Table 1). These mean values for 29.7 kb of sequence per sampled chromosome are somewhat lower than the averages previously reported for 9 loci (Cutter et al. 2006a), although the range is nearly identical. If a 4/3 correction for diversity estimates is appropriate for these loci, due to a reduction in effective population size caused by hemizygosity of the X chromosome in males, then the numerical average diversity value previously reported (πs = 4.7%) will hold, but additional data from autosomal loci are necessary to confirm this possibility. One locus demonstrated particularly high polymorphism (Cre-D1005.1; πs = 0.128), confirming the high diversity observed previously for an adjacent portion of sequence in this gene (Cutter et al. 2006a).
As reported previously (Cutter et al. 2006a), linkage disequilibrium decays rapidly with distance in C. remanei. For these loci, the r2 measure of linkage disequilibrium averages 0.19 within loci (40 loci) and 0.031 between loci (38 loci with complete data for Ohio sample), although 4 loci had intralocus r2 values >0.4 (supplemental Table 2).
This polymorphism data set for a large collection of loci along the X chromosome is useful for testing for any influence of demographic perturbations in shaping genetic diversity in the sample. Demographic processes such as population size change or migration should skew the site-frequency spectrum for loci across the genome, whereas selection is likely to skew allele frequencies only in the vicinity of a target of selection. Across loci for the Ohio sample, synonymous sites yielded an average value of −0.259 for Tajima's D (Tajima 1989), a statistic that quantifies skew in the frequency spectrum of variant sites on the basis of alternative measures of nucleotide diversity (Table 1; Figure 1; mean D for the full sample = −0.322). For two loci individually, D differs significantly from the standard neutral expectation (Cre-F47A4.5, Cre-spc-1; Table 1) although the background skew makes the test for Cre-spc-1 nonconservative; three other loci deviate significantly for the related metric Fu and Li's D* (Cre-K10C2.1, Cre-col-19, Cre-sym-4; D* ∼1.5 for all three loci; supplemental Table 2). Only Cre-F47A4.5 shows significance for more than one test of neutrality (D = 2.31, P < 0.05; D* = 1.49, P < 0.02; F* = 1.96, P < 0.02). C. elegans spc-1 encodes a spectrin, which is involved in body morphogenesis (Norman and Moerman 2002), and col-19 encodes a collagen protein, which is associated with the cuticle (Thein et al. 2003). The function of the C. elegans orthologs of the other genes is largely unknown.
To determine whether the distribution of D-values across the X chromosome is consistent with the distribution expected under a neutral scenario at mutation-drift equilibrium, I conducted coalescent simulations conditioned on the observed number of segregating synonymous sites and the sample size for each locus. Weak selection on synonymous sites can lead to negative D of a magnitude relevant to this study (McVean and Charlesworth 2000), which would complicate interpretations of a negative skew in the site-frequency spectrum. Selection on synonymous sites was detected in these data (see below; 70% of polymorphic synonymous sites form a PU site that is potentially subject to selection), although D does not correlate significantly with measures of codon bias (D × Fop: rSRC = −0.12, P = 0.47; D × γ: rSRC = −0.23, P = 0.15; Ohio samples only; see below for details on codon bias). Consequently, I analyzed the data in two ways to limit the potential for selection at synonymous sites to impact inferences about demography.
First, I limited the tests for an overall departure in D to a set of nine loci with low codon bias (Fop < 0.5) and excluded Cre-F47A4.5 due to the potential impact of selection on this locus; I also limited these analyses to the 14 strains from Ohio. This restricted sample exhibited mean D = −0.230. Neutral simulations based on the observed number of segregating sites and sample size, without recombination, indicate that a mean D ≤ −0.230 is expected 31.9% of the time, suggesting no deviation from equilibrium in the Ohio sample on a timescale that can be detected with single nucleotide polymorphism (Figure 1). The variance in D (var[D]) also conforms well to the neutral expectation (25.2% of simulations showed a variance in D less than the observed 0.527). Two caveats are that D exhibits notoriously low power and is conservative under the assumption of no recombination (Wall 1999). When the simulations for the Ohio sample are repeated, including observed estimates of ρLDhat, virtually identical results are obtained (23.2% of simulations have D ≤ −0.230).
Second, I calculated Tajima's D only for those synonymous sites that corresponded to preferred–preferred (PP) or unpreferred–unpreferred (UU) polymorphisms. Selection for codon bias should be negligible at such sites. Among the 35 loci in the Ohio sample that contained PP or UU polymorphisms (again excluding Cre-F47A4.5), Tajima's D averaged −0.113. Neutral coalescent simulations without recombination, conditioned on the number of segregating sites and the size of the sample, indicate that D ≤ −0.113 is expected in 39.3% of cases (14.7% of simulations have var[D] less than or equal to the observed 0.702), which is indicative of no deviation from equilibrium (Figure 1). Simulations with recombination corroborate this result (31.5% of simulations with D ≤ −0.113, 36.5% of simulations with var[D] ≤ 0.702). This approach has the advantage of considering most of the loci (35 of 40), but is limited by using a restricted set of polymorphic sites per locus. This contrasts with the first approach, which is limited in the number of loci considered (9 of 40), but which evaluates all polymorphic synonymous sites. The results of both approaches support the same conclusion that there is no evidence for a change in population size in the site-frequency spectrum for sites with little codon bias. The particularly high value of D for the gene of unknown function Cre-F47A4.5 (2.33 entire sample, 2.31 Ohio only, both P < 0.05) therefore suggests that it can be considered as a candidate for linkage to a target of balancing selection.
In calculations of pairwise divergence between homologous sequences in C. remanei, C. briggsae, and C. elegans, no significant differences were observed in mean dN, dS′, or dN/dS′ (supplemental Table 3). This accords with the findings of Cutter and Payseur (2003a). Focusing on the C. remanei–C. briggsae sequence comparison, divergence at nonsynonymous sites correlates positively with polymorphism at nonsynonymous sites in the Ohio sample (dN × πa: rSRC = 0.41, P = 0.0086; dN × θa: rSRC = 0.41, P = 0.0087), indicating that both diversity and divergence at nonsynonymous sites reflect a similar selective regime. Divergence at synonymous sites correlates positively with diversity at synonymous sites in the Ohio sample (dS × πs: rSRC = 0.48, P = 0.0019; dS × θs: rSRC = 0.42, P = 0.0065), but only when it is uncorrected for its association with codon bias and GC content (P ≥ 0.8 for dS′ × θs). This supports the observation that synonymous sites are not strictly neutral (see below). Nonsynonymous and synonymous divergence, however, are not associated (dN × dS: rSRC = −0.21, P = 0.18; dN × dS′: rSRC = −0.02, P = 0.9). The average nonsynonymous/synonymous ratio of polymorphism (θa/θs = 0.034) is similar to that of divergence (dN/dS′ = 0.031), although the two are not correlated across loci (P = 0.2). Unfortunately, the saturated divergence (i.e., mean dS > 1) at synonymous sites between the C. elegans–C. briggsae–C. remanei species pairs rules out the application of divergence-corrected tests of selection, such as HKA and McDonald–Kreitman tests (Hudson et al. 1987; McDonald and Kreitman 1991).
Weak selection on codon usage:
Selection on alternative degenerate codons is responsible for codon usage bias in C. remanei, as evidenced by several patterns. First, I quantify the selection intensity (γ = 4Nes) on preferred codons to be significantly greater than zero overall among the 40 new loci assessed here for the Ohio sample (γ = 0.44, 2lnL range = 0.23–0.65) and for two loci individually (Cre-vit-2, Cre-let-2; Table 2, supplemental Table 4). Similar results were obtained for the full sample (not shown). To the extent that Hill–Robertson interference (Hill and Robertson 1966) affects allele frequencies at these weakly selected sites, the strength of selection on preferred codons may be underestimated (McVean and Charlesworth 2000). Second, codon usage bias and synonymous site divergence are strongly negatively correlated in this sample of loci (Fop × dS rSRC = −0.85, P < 0.0001; Figure 2), as predicted for the action of selection at synonymous sites (Sharp and Li 1987). Although I have applied the C. elegans optimal codon table to C. remanei, this is a reasonable approach because C. briggsae uses an identical optimal codon table to C. elegans (Stein et al. 2003) and C. elegans is as divergent from C. remanei as it is from C. briggsae. In addition, codon bias is strongly correlated for these loci among all three species (rSRC of Fop for C. elegans–C. briggsae = 0.94, for C. elegans–C. remanei = 0.96, and for C. briggsae–C. remanei = 0.95, all P < 0.0001). Consistent with the association between accumulated codon bias and divergence, contemporary selection on codon usage measured with γ correlates negatively with synonymous site divergence (γ × dS rSRC = −0.45, P = 0.0039). Furthermore, the per-locus contemporary estimates of selection on codon usage (γ) that are based on preferred–unpreferred codon polymorphisms correlate positively with codon bias that has accumulated over time and is reflected in the overall frequency of optimal codons in the sequences (γ × Fop rSRC = 0.46, P = 0.0034; Figure 2). Finally, loci with strong codon bias exhibit reduced synonymous site diversity (Fop × πs: rSRC = −0.49, P = 0.0013; Fop × θs: rSRC = −0.47, P = 0.0021). All of these patterns are consistent with selection differentiating between alternative synonymous codons in C. remanei.
Because a positive correlation was observed between GC content and codon bias (GC × Fop: rSRC = 0.89, P < 0.0001; GC × γ: rSRC = 0.37, P = 0.057), the concern arises that biased gene conversion might be responsible for skewed frequencies of preferred variants (Marais 2003; Galtier et al. 2006). This is plausible because most preferred codons in Caenorhabditis are rich in guanine and cytosine (Stenico et al. 1994; Duret and Mouchiroud 1999). Consequently, I calculated γ for G or C variants at polymorphic sites that defined a GC-AT polymorphism for only those sites at which the corresponding alternatively encoded codons did not shift between a preferred and unpreferred designation (i.e., for unpreferred–unpreferred and preferred–preferred variant sites, for which the selection differential is likely to be negligible). The maximum-likelihood estimate of γ on these G or C variants was −0.37 (2lnL interval: −0.71 to −0.04), a value that is of opposite sign to that expected for biased gene conversion [for the full data set, γ = −0.30 (−0.61–0.01)]. This result implies that biased gene conversion is not the driving force behind the skewed variant frequencies for preferred codons, consistent with previous results in this species (Cutter and Charlesworth 2006).
Selection at linked sites:
Empirical recombination rates are not available for C. remanei, so inferences about recombination currently must rely on inverse measures of linkage disequilibrium used to estimate the population recombination parameter, ρ. Overall, the ratio ρ/θ gives a measure of the number of recombination events per mutation; at neutral equilibrium, ρ/θ = 4Ner/4Neμ = r/μ. In the Ohio sample, ρLDhat/θs has a median of 0.163 (ρH01/θs = 0.256 and ρH01+f/θs = 0.020), and few per-locus values exceed 1. These values are substantially lower than observed for Drosophila melanogaster and wild barley (Andolfatto and Przeworski 2000; Morrell et al. 2006) and might reflect the high mutation rate in Caenorhabditis (Denver et al. 2004).
When measured for the full data set, nucleotide diversity at synonymous sites (πs and θs) correlates positively with the population recombination parameter in this sample (πs × ρLDhat: rSRC = 0.52, P = 0.0004; θs × ρLDhat: rSRC = 0.52, P = 0.0005; πs × ρH01: rSRC = 0.40, P = 0.013; θs × ρH01: rSRC = 0.43, P = 0.0074; Figure 3). These data also show no correlation of synonymous-site divergence with ρ or f (dS′ × ρLDhat: rSRC = −0.13, P = 0.4; dS′ × f: rSRC = −0.02, P = 0.9; similarly, nonsignificant results were obtained for dS and other measures of ρ), which could occur if mutation rates were greater in regions of high recombination and also could cause a θ–ρ correlation. However, for the sample restricted to individuals from Ohio, which likely better represents a single population, a positive correlation between θ and ρ is no longer present (rSRC < 0.27, P > 0.1 for all measures of θ and ρ). Furthermore, when the relative rate of gene conversion was estimated simultaneously with the population recombination parameter for the full sample, the θ × ρ correlation also disappeared (θs × ρH01+f: rSRC = 0.09, P = 0.6; Figure 3). The relative incidence of gene conversion (f), when estimated simultaneously with ρH01+f for the Ohio sample, is observed to correlate positively with both ρH01 and ρLDhat, but not with ρH01+f (f × ρH01: rSRC = 0.40, P = 0.012; f × ρLDhat: rSRC = 0.45, P = 0.0042; f × ρH01+f: rSRC = −0.22, P = 0.19). On the basis of the restricted Ohio sample, the average gene conversion parameter is inferred to be f = 5.6 (Table 2), which suggests that unbiased gene conversion is common. It will be important in the future to obtain empirical recombination rate estimates with which to contrast ρ, θ, and gene conversion.
The observation that simultaneous estimation of gene conversion and the population recombination parameter resulted in a reduced θ–ρ correlation for the full data set prompted me to investigate this issue further with coalescent simulations. I simulated neutral genealogies for each of 300 parameter combinations that controlled mutation, recombination, and gene conversion and then calculated the correlation between subsequently estimated values of θ and ρH01 from the simulated data, which differed from the input values due to stochasticity in the coalescent process (Figure 4). Each of the 300 correlation coefficients is based on 1000 paired estimates of θ and ρ. An ANOVA model describes 66.5% of the variation in correlation coefficients as a function of each of the input variables, their first-order interactions, and their second-order polynomials (Table 3). The most important factor contributing to variation in the θ × ρH01 correlation is the input value of ρsim (and its quadratic term), followed by the interaction between the input θsim and ρsim, the gene conversion parameter (fsim) and its interaction with the input ρsim, and the interaction between input values of fsim and θsim (Table 3). A partition analysis also illustrates how the combination of high recombination and high mutation results in a positive correlation between the estimated values of these two parameters, whereas low recombination, infrequent gene conversion, and a high mutation rate together yield negative correlations between estimates of θ and ρ (Figure 4). For the range of parameters considered here, the length of gene conversion tracts did not strongly influence the observed correlations. Although correlations between θ and ρH01+f were observed in these simulations, the magnitudes were generally about half of those seen for θ × ρH01 (Figure 4). These simulations show that this linkage-disequilibrium-based estimator of ρ (ρH01) can create spurious correlations between estimates of θ × ρ, particularly when the true values of both parameters are large, and even when there is no underlying variation among loci in the population mutation and recombination rates.
I also report that none of Tajima's D, Fop, γ, or GC content correlates significantly with any measure of ρ (or gene conversion) in this sample. Such associations have previously been used in arguments for genetic hitchhiking (Andolfatto and Przeworski 2001; Stajich and Hahn 2005), Hill–Robertson interference (Hey and Kliman 2002), and a mutational influence of recombination (Marais 2003).
Nucleotide diversity is high across the C. remanei X chromosome, reflecting a large effective population size for this species, as also inferred from patterns of polymorphism in previously surveyed nuclear loci (Graustein et al. 2002; Jovelin et al. 2003; Haag and Ackerman 2005; Cutter et al. 2006a). In the population subsample from Ohio, there is no evidence of an overall departure from the neutral expectation in the site-frequency spectrum for putatively neutral sites, implying that panmixis is approximated in the sample. A lack of demographic change is a useful property for population genetic inference of natural selection because it is not necessary to account for complicated influences of population size change or structure in the history of the sample. Therefore, genome scans of selection using silent-site variant frequency spectra may be fruitful in C. remanei, provided that weak selection for codon bias is handled appropriately. Here I identify one locus (Cre-F47A4.5) that exhibits a significantly skewed frequency spectrum in a manner consistent with balancing selection, as measured by several statistics, motivating further analysis of this region.
I find that selection among alternative degenerate codons generates a robust pattern at polymorphic sites in which preferred codon variants are skewed toward high frequency. This pattern is particularly evident among loci with strong overall biases in codon usage, indicating the concerted action of contemporary selection pressures shaping allele frequencies and the historical selection that generated fixed codons in genes. This result corroborates previous findings from a smaller sample of loci (Cutter and Charlesworth 2006). One unusual observation is that the selection intensities on preferred vs. unpreferred codons seem rather weak for selection on codon usage to have generated such dramatic effects on synonymous-site divergence (McVean and Charlesworth 1999). I speculate that the polymorphism-based method used here to estimate selection on alternative codons might be overly conservative relative to other approaches that permit polarization of changes relative to the ancestral state.
An association between crossover rate and nucleotide diversity is predicted by theory to be caused by background selection (Hudson and Kaplan 1995) or by genetic hitchhiking of neutral variants if selective sweeps are common (Wiehe and Stephan 1993). While supporting evidence for these processes comes from some species (Begun and Aquadro 1992; Andolfatto and Przeworski 2001; Cutter and Payseur 2003b), other species show no such correlation (Nordborg et al. 2005; Schmid et al. 2005; Wright et al. 2006), a weak or inconsistent correlation (Baudry et al. 2001; Tenaillon et al. 2002; Roselius et al. 2005), or a correlation that can be explained by neutral factors (Lercher and Hurst 2002; Hellmann et al. 2003). Here, I measured recombination using inverse-linkage disequilibrium estimators of the population recombination parameter (ρ = 4Ner) and found that there is not a consistent significant association between diversity (θ = 4Neμ) and ρ. The abundance of polymorphism per recombination unit (i.e., low ρ/θ) suggests that the power to detect selection at linked sites should be high in C. remanei. However, empirical crossover rates will provide a more appropriate test for the potential of genetic hitchhiking and background selection to be important forces shaping patterns of genomic diversity (Maynard Smith and Haigh 1974; Charlesworth et al. 1993; Wiehe and Stephan 1993; Hudson and Kaplan 1995). It will be invaluable to obtain empirical recombination rate estimates for C. remanei to compare with observed levels of diversity so as to contrast with the population recombination parameter estimates (Andolfatto and Przeworski 2000; Przeworski and Wall 2001). In particular, this will be important to determine whether C. remanei mirrors the evidence of selection at linked sites found in C. elegans using map-based recombination rates (Cutter and Payseur 2003b). It is conceivable that self-fertilization in C. elegans reduces the effective recombination rate sufficiently such that the impact of selection at linked sites is easier to detect than for C. remanei, in which the window of linked polymorphism will be narrow due to extensive recombination and large effective population size. Therefore, regions of very low recombination in C. remanei might be required to detect a general effect of genetic hitchhiking and/or background selection in the form of an association between diversity and recombination.
In addition, I report that coalescent simulations indicate that spurious correlations between θ and linkage-disequilibrium-based estimators of ρ can be generated even under neutrality. Previous simulation work on estimators of the population recombination rate also showed that ρ tends to be overestimated when gene conversion is frequent or θ is high (Smith and Fearnhead 2005). When the relative strength of gene conversion (f) is estimated simultaneously with ρ using a composite maximum-likelihood method (Hudson 2001), the correlations are weaker, suggesting that estimators of ρ that do not account for gene conversion will be more likely to exhibit spurious associations with measures of diversity. Consequently, an artifactual effect of gene conversion on ρ estimators might best explain results such as those of Tenaillon et al. (2001, 2002), in which nucleotide diversity correlates with measures of recombination that are based on linkage disequilibrium but not the genetic map, rather than invoking selection or demography as causal factors.
The prevalence of gene conversion in C. remanei, as measured by the average ratio of gene conversion to recombination rate (mean f = 5.6), is comparable to or slightly higher than values reported in other species (Frisse et al. 2001; Padhukasahasram et al. 2004; Ptak et al. 2004; Morrell et al. 2006). In C. elegans, gene conversion has been observed to occur at a frequency of ∼10−5 in the 38-kb-long gene unc-22 (Moerman and Baillie 1979) with tracts extending at least 191 bp (Plasterk and Groenen 1992). Crossover and gene conversion appear to be associated in a number of taxa (Borts and Haber 1987; Jeffreys and May 2004) and crossover rates do correlate with ρ in some species (Andolfatto and Przeworski 2000; Ptak et al. 2004). Andolfatto and Nordborg (1998) astutely detailed the important influence of gene conversion in breaking down linkage disequilibrium across short stretches of sequence and Langley et al. (2000) suggested that gene conversion might be an important piece of the diversity–recombination puzzle, but the potential role of gene conversion in generating a spurious correlation between diversity and ρ has not been reported previously.
I thank Scott Baird for sharing nematode strains and Stephen Wright and Bret Payseur for helpful discussions and comments on the manuscript. Three reviewers also provided particularly helpful comments. Some strains were made available by the Caenorhabditis Genetics Center. I also thank the Washington University School of Medicine Genome Sequencing Center for making C. remanei genome sequences publicly available. This research was supported by startup funds from the University of Toronto Department of Ecology and Evolutionary Biology and the Connaught Fund.
- Received December 13, 2007.
- Accepted January 10, 2008.
- Copyright © 2008 by the Genetics Society of America