The house mouse is a well-established model organism, particularly for studying the genetics of complex traits. However, most studies of mice use classical inbred strains, whose genomes derive from multiple species. Relatively little is known about the distribution of genetic variation among these species or how variation among strains relates to variation in the wild. We sequenced intronic regions of five X-linked loci in large samples of wild Mus domesticus and M. musculus, and we found low levels of nucleotide diversity in both species. We compared these data to published data from short portions of six X-linked and 18 autosomal loci in wild mice. We estimate that M. domesticus and M. musculus diverged <500,000 years ago. Consistent with this recent divergence, some gene genealogies were reciprocally monophyletic between these species, while others were paraphyletic or polyphyletic. In general, the X chromosome was more differentiated than the autosomes. We resequenced classical inbred strains for all 29 loci and found that inbred strains contain only a small amount of the genetic variation seen in wild mice. Notably, the X chromosome contains proportionately less variation among inbred strains than do the autosomes. Moreover, variation among inbred strains derives from differences between species as well as from differences within species, and these proportions differ in different genomic regions. Wild mice thus provide a reservoir of additional genetic variation that may be useful for mapping studies. Together these results suggest that wild mice will be a valuable complement to laboratory strains for studying the genetics of complex traits.
The house mouse presents an excellent mammalian model for studies of the genetic basis of complex traits, including many diseases. Dozens of inbred strains are available with sufficient genetic variability among them for linkage mapping and association studies (Paigen 2003a,b; Peters et al. 2007). A variety of molecular genetic tools are available for the mouse, making it possible to identify and functionally characterize candidate genes for some traits. The mouse genome has been sequenced (Waterston et al. 2002), patterns of expression in different tissues have been described for nearly all genes (e.g., Su et al. 2004), there is a large and growing set of knockouts, and phenotypes have been associated with >10% of all genes (Grimm 2006).
The classical inbred strains in which these resources have been developed, including the sequenced C57Bl/6J (Waterston et al. 2002), derive from matings among different species of the house mouse (Silver 1995; Wade et al. 2002). Thus, understanding the genetic variation among inbred strains requires understanding the evolutionary history of the species from which they were derived. The house mouse consists of three main lineages: Mus domesticus in Western Europe, Mus musculus in Eastern Europe and Asia, and Mus castaneus in Southeast Asia and India (also referred to as subspecies of Mus musculus: i.e., M. m. musculus, M. m. domesticus, and M. m. castaneus; Silver 1995). M. musculus and M. domesticus diverged between ∼350,000 and 1,000,000 years ago (She et al. 1990; Boursot et al. 1996; Suzuki et al. 2004). These species recently came into secondary contact following the spread of M. domesticus into Western Europe from the Middle East with the spread of agriculture over the last few thousand years (Cucchi et al. 2005). M. domesticus and M. musculus form a stable hybrid zone where they meet, and laboratory crosses between these species result in sterile hybrid males (Britton-Davidian et al. 2005). Classical lab strains of mice derive principally from M. domesticus and M. musculus, with a smaller contribution from M. castaneus (Silver 1995; Wade et al. 2002; Frazer et al. 2007; Yang et al. 2007). The partitioning of genetic variation among M. musculus and M. domesticus is largely unknown. In particular, the recent separation of these lineages raises the possibility that some loci will retain ancestral polymorphisms and that other loci will show fixed differences. 
From a genealogical perspective, some loci may be monophyletic within each species (i.e., all alleles within a species are more closely related to each other than to any alleles in the other species), while other loci may be paraphyletic or polyphyletic (i.e., some alleles within a species might be more closely related to alleles in the sister species than to other alleles within the same species; Figure 1).
The relationship of classical inbred strains to wild mice is important for several reasons. First, it is likely that the different inbred strains capture a small amount of the naturally occurring variation, but the amount of variation in the wild is still unclear. What proportion of haplotypes and single nucleotide polymorphisms (SNPs) found in wild mice are also present among inbred strains? Does this pattern differ between the X chromosome and the autosomes? Differences between the X chromosome and autosomes among inbred strains might arise if selection acted differently on the X and autosomes during the founding of these strains or if unequal numbers of males and females were used in the founding of these strains. Wild mice might provide a reservoir of additional genetic variation for studies that seek to understand the genetic basis of complex traits, but quantifying variation in the wild is a necessary first step. Second, some of the variation among inbred strains derives from fixed differences between species, while some of the variation reflects differences within one or several species (Wade et al. 2002). Epistatic interactions between alleles from different species are likely to have shaped the current variation among strains (Payseur and Hoekstra 2005; Petkov et al. 2005), and this is likely to affect the genetic architecture underlying complex traits.
Although several large-scale efforts have characterized variation among classical and wild-derived inbred strains of mice (Wade et al. 2002; Wiltshire et al. 2003; Frazer et al. 2004; Ideraabdullah et al. 2004; Petkov et al. 2004; Yalcin et al. 2004; Frazer et al. 2007; Yang et al. 2007), these studies do not adequately describe the amount of variation in natural populations for two reasons. First, and most importantly, <10 wild-derived inbred strains of each species (and often only 1) have been included in these studies, representing a very small portion of the geographic range of the wild house mice. Second, some of these studies have described variation among wild-derived inbred strains for polymorphisms that were previously ascertained among the classical inbred strains. This ascertainment bias may hide the true distribution of variation in natural populations (Boursot and Belkhir 2006).
To begin to address these issues, we compared variation among nine of the most commonly used classical inbred strains with variation among wild M. domesticus and M. musculus. First, we present data from five X-linked loci sequenced in relatively large samples of wild mice and we compare these data to previously published X-linked and autosomal data. Second, we sequenced eight inbred strains and analyzed them with the already sequenced C57Bl/6J for nearly all genes for which polymorphism data have been published from wild M. domesticus and M. musculus for a total of 11 X-linked and 18 autosomal loci. Finally, we compared all data to the publicly available SNP databases.
MATERIALS AND METHODS
We sequenced five X-linked loci in large samples of wild mice and then resequenced eight classical inbred strains and one each M. spretus and M. caroli for 11 X-linked and 18 autosomal loci. These include the five X-linked loci as well as 24 loci for which wild mouse population data have already been published (Harr 2006a; Baines and Harr 2007).
Male M. domesticus and M. musculus were wild caught in Europe (Table 1). Males were used to obtain unambiguous haplotypes for X-linked loci. The standard karyotype of M. domesticus is 2n = 40, with all acrocentric chromosomes. However, M. domesticus has many chromosomal races with 2n < 40; all M. musculus have 2n = 40 (Pialek et al. 2005). To exclude chromosomal races, all M. domesticus were karyotyped as described previously (Nachman et al. 1994), and only mice with 2n = 40 were used. Genomic DNA representing the classical inbred strains 129S1/SvImJ, A/J, AKR/J, BALB/cByJ, C3H/HeJ, DBA/2J, FVB/NJ, and SJL/J, as well as one M. spretus and one M. caroli, was purchased from The Jackson Laboratory (Bar Harbor, ME). Sequence for C57Bl/6J was downloaded from NCBI or Ensembl (Build 36).
Five X-linked loci were surveyed in wild mice: Maoa, Dmd, Msn, Dach2, and Amelx (supplemental Figure 1 at http://www.genetics.org/supplemental/). Loci were chosen so as to (i) be evenly distributed along the X chromosome, (ii) have human homologs for which population genetic data have been published (Hammer et al. 2004), and (iii) have at least one long intron (>5 kb). We focused on introns to capture a large amount of genetic variation. Our survey of Amelx included four small exons, which were excluded from analysis. All sequence information was based on Build 36 of the mouse genome.
Amplification and sequencing primers were designed using Primer3 (Rozen and Skaletsky 2000) on sequences prescreened using RepeatMasker (Smit et al. 1996–2004). Amplification primer sequences are given in supplemental Table 1 at http://www.genetics.org/supplemental/. Portions of Dach2, Amelx, Maoa, and Msn were PCR amplified in a single amplicon using high-fidelity Taq polymerase (Invitrogen, Carlsbad, CA). The following PCR conditions were used: initial denaturation (94° for 2 min) was followed by 35 cycles of 30 sec at 94°, 30 sec at 57°, and 5 min at 65°. High-fidelity (Invitrogen) and HotMaster (Eppendorf, Westbury, NY) Taq polymerases were used to amplify a portion of one intron of Dmd in multiple amplicons. PCR products were cleaned using either the QIAGEN (Valencia, CA) PCR clean-up kit or the 96-well format by the Genomic Analysis and Technology Core at the University of Arizona. Sequencing was performed on both strands using either an ABI 377 or a 3731 automated sequencer.
Sequences were trimmed and assembled into contigs using Sequencher (Gene Codes, Ann Arbor, MI). These contigs have been deposited in GenBank under accession nos. EF067347–EF067807 and EU220489–EU220697. Alignments generated with Sequencher were manually edited using MacClade (Sinauer Associates, Sunderland, MA). Summary statistics describing the level and pattern of nucleotide variability were calculated manually or by DnaSP (Rozas and Rozas 1999). We also used DnaSP to calculate Hudson's minimum number of recombination events, Rm (Hudson and Kaplan 1985), and to describe linkage disequilibrium. Haplotype networks were drawn manually and neighbor-joining trees were generated using PAUP*, version 4.0b10 (Swofford 2003).
We performed several statistical tests of a neutral model of molecular evolution. Tajima's D (Tajima 1989), Fu and Li's D (Fu and Li 1993), and Fay and Wu's H (Fay and Wu 2000) all describe the allele-frequency spectrum by comparing different estimators of θ, the population mutation parameter (θ = 3Neμ for the X chromosome and θ = 4Neμ for the autosomes, where Ne is the effective population size and μ is the neutral mutation rate). We used several tests because they differ in their power to detect different perturbations from neutral equilibrium conditions (Braverman et al. 1995; Simonsen et al. 1995; Przeworski 2002). These tests were conducted using software available from Y. X. Fu (http://hgc.sph.uth.tmc.edu/fu) and from J. Fay (http://www.genetics.wustl.edu/jflab/htest.html). The HKA test compares the ratio of polymorphism to divergence among two or more genes (Hudson et al. 1987). Pairwise and multilocus HKA tests were performed using software developed by Jody Hey (http://lifesci.rutgers.edu/∼heylab/HeylabSoftware.htm#HKA). FST values and analysis of molecular variance (AMOVA) results were calculated using Arlequin (Schneider et al. 2000).
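As an illustration of how one of these frequency-spectrum statistics compares two estimators of θ, the following is a minimal Python sketch of Tajima's D using the standard formulas of Tajima (1989). It is not the software used in this study; the function name and the toy alignments are hypothetical, and the sketch assumes gap-free, equal-length sequences and treats any variable column as a single segregating site.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Return Tajima's D for aligned, equal-length sequences (None if undefined)."""
    n = len(seqs)
    if n < 4:
        return None  # the variance term vanishes or is undefined for n < 4
    # S: number of segregating (variable) columns in the alignment
    seg_sites = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if seg_sites == 0:
        return None
    # pi: mean number of pairwise differences between sequences
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # Constants defined by Tajima (1989)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    theta_w = seg_sites / a1  # Watterson's estimator of theta (per locus)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - theta_w) / sqrt(variance)


# Hypothetical toy alignments, not data from this study
rare_variants = ["AAAA", "TAAA", "ATAA", "AATA", "AAAT"]  # every variant is a singleton
balanced_variants = ["AA", "AA", "TT", "TT"]              # two equally common haplotypes

d_rare = tajimas_d(rare_variants)
d_balanced = tajimas_d(balanced_variants)
print(d_rare, d_balanced)
```

An excess of rare variants pulls π below Watterson's estimator and gives a negative D, whereas intermediate-frequency haplotypes give a positive D.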
We also analyzed the data in the Mouse Phenome Database (MPD; http://www.jax.org/phenome) for the corresponding portions of the 11 X-linked and 18 autosomal loci. MPD includes a comprehensive database of all known mouse SNPs and it integrates SNPs from many sources, including the Broad Institute (Lindblad-Toh et al. 2000), Celera (e.g., Lemon et al. 2003), Perlegen (Frazer et al. 2007), The Jackson Laboratory (Petkov et al. 2004), The Wellcome Trust Centre for Human Genetics (e.g., Yalcin et al. 2004), The Genomics Institute of the Novartis Research Foundation (Pletcher et al. 2004), and others. This database contains >10 million SNPs representing >100 classical and wild-derived inbred strains of mice. We queried MPD for all SNPs in the regions that we sequenced. We conducted mouse-genome-specific BLAST searches using NCBI or BLAST-like alignment tool (BLAT) searches with the University of California at Santa Cruz mouse genome server and identified sites that overlapped with our resequencing targets. All SNPs identified solely by differences from representatives of M. castaneus or M. molossinus (a hybrid between M. castaneus and M. musculus) were excluded to match the sampling used here.
Level and pattern of genetic variation at five X-linked genes:
We sequenced 4–5 kb of intronic DNA at each of five X-linked genes (supplemental Figure 1 at http://www.genetics.org/supplemental/) in 60–64 M. domesticus and 18–22 M. musculus. Levels of nucleotide variability were low in both species, with 1–29 segregating sites/locus (Table 2; Figure 2; supplemental Figures 2 and 3). In M. domesticus, nucleotide diversity ranged from 0.03 to 0.15% among the five genes, with an average value of 0.07%, almost identical to the average value for X-linked genes in humans (e.g., Hammer et al. 2004). In M. musculus, nucleotide diversity was considerably lower, ranging from 0 to 0.06% among genes, with an average value of 0.03%. We used the estimates of nucleotide diversity to estimate species-wide effective population sizes, assuming a mutation rate of 4 × 10⁻⁹ (Waterston et al. 2002). Thus, for M. domesticus, Ne = θ/3μ = (7 × 10⁻⁴)/(1.2 × 10⁻⁸) = 5.8 × 10⁴, and for M. musculus, Ne = (3 × 10⁻⁴)/(1.2 × 10⁻⁸) = 2.5 × 10⁴. To the extent that selection has reduced levels of variation on the X chromosome (Baines and Harr 2007), these will be underestimates.
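The arithmetic behind these effective population size estimates can be reproduced with a short Python sketch. The function and variable names are hypothetical; the inputs are the average nucleotide diversities and the mutation rate quoted in the text, and the factor of 3 is the X-linked scaling of θ given in the Methods (a factor of 4 would apply to autosomal loci).

```python
def effective_population_size(theta, mu, scaling):
    """Solve theta = scaling * Ne * mu for Ne (scaling = 3 for X-linked, 4 for autosomal loci)."""
    return theta / (scaling * mu)


MU = 4e-9  # neutral mutation rate assumed in the text

# Average X-linked nucleotide diversity reported for each species
ne_domesticus = effective_population_size(theta=7e-4, mu=MU, scaling=3)
ne_musculus = effective_population_size(theta=3e-4, mu=MU, scaling=3)

print(f"M. domesticus Ne ~ {ne_domesticus:.3g}")
print(f"M. musculus   Ne ~ {ne_musculus:.3g}")
```

Because Ne scales inversely with the assumed mutation rate, any error in μ carries over to the estimate in the same proportion.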
Levels of divergence with respect to M. caroli were consistently ∼3% (Table 2), similar to previous observations (She et al. 1990). Average divergence between M. musculus or M. domesticus and M. spretus (∼1%) was also consistent with previous estimates (Nachman 1997). The average pairwise divergence (k) between alleles from M. musculus and alleles from M. domesticus was 0.43%. We can use this value to estimate the time of separation of these species. Under a neutral model, k = 2μt + 3Neμ for X-linked loci, where μ is the neutral mutation rate, t is the species divergence time in generations, and Ne is the effective population size of the ancestral population. We can estimate the ancestral value of 3Neμ as the average of current nucleotide diversity in M. musculus and M. domesticus [π = 3Neμ = (0.0003 + 0.0007)/2 = 0.0005]. Assuming a neutral mutation rate of 4 × 10⁻⁹ (Waterston et al. 2002), this leads to an estimate of t = (k − 3Neμ)/2μ = (0.0043 − 0.0005)/(8 × 10⁻⁹) = 475,000 generations. Although mice in captivity produce several generations per year, mice in the wild often breed seasonally and may produce only one or two generations per year. This suggests that these species diverged on the order of 237,500–475,000 years ago. This is a very rough estimate but is consistent with a previous estimate of 350,000 years based on DNA–DNA hybridization data (She et al. 1990).
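The divergence-time calculation above can be written out as a small Python sketch. The function name is hypothetical; the inputs are the values quoted in the text, and the conversion to years simply divides by the assumed one or two generations per year.

```python
def divergence_time_generations(k, ancestral_theta, mu):
    """Solve k = 2*mu*t + ancestral_theta for t, the divergence time in generations."""
    return (k - ancestral_theta) / (2 * mu)


k = 0.0043                               # average pairwise divergence between the species
ancestral_theta = (0.0003 + 0.0007) / 2  # mean of current X-linked diversity, used for 3*Ne*mu
mu = 4e-9                                # neutral mutation rate assumed in the text

t_generations = divergence_time_generations(k, ancestral_theta, mu)

# One or two generations per year in the wild, as assumed in the text
years_upper = t_generations / 1
years_lower = t_generations / 2

print(f"{t_generations:,.0f} generations; {years_lower:,.0f} to {years_upper:,.0f} years")
```

Subtracting the ancestral diversity term matters here: ignoring it would attribute all 0.43% divergence to time since the split and inflate the estimate.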
We examined patterns of linkage disequilibrium (LD) both within and between genes. Within genes, we found no evidence for recombination: we never observed all four gametic types in pairwise comparisons between sites within a gene (Hudson's minimum number of recombination events, Rm, was 0 for each gene). Our sequences span 4–5 kb for each gene, suggesting that LD extends over distances greater than this for X-linked loci, similar to patterns seen for many X-linked loci in humans (e.g., Hammer et al. 2004). Between genes we found no LD. Of 4950 pairwise comparisons between all sites in our data set, 372 were found to be in significant LD using a Fisher's exact test (FET). However, after a Bonferroni correction for multiple tests, only 86 of these values were significant, and all of those described intralocus pairs of sites.
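The pairwise check described above (whether all four gametic types occur at a pair of sites) can be sketched in Python as follows. This illustrates only the four-gamete criterion that underlies Rm, not the full Hudson and Kaplan (1985) algorithm as implemented in DnaSP. The function names and toy haplotypes are hypothetical, and the significance level of 0.05 used for the Bonferroni threshold is an assumption, since the level is not stated in the text.

```python
from itertools import combinations
from math import comb


def four_gametes_present(haplotypes, i, j):
    """True if all four two-site combinations occur at biallelic sites i and j."""
    return len({(h[i], h[j]) for h in haplotypes}) == 4


def has_recombination_signal(haplotypes):
    """Apply the four-gamete check to every pair of biallelic sites."""
    length = len(haplotypes[0])
    biallelic = [i for i in range(length) if len({h[i] for h in haplotypes}) == 2]
    return any(four_gametes_present(haplotypes, i, j)
               for i, j in combinations(biallelic, 2))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after a Bonferroni correction."""
    return alpha / n_tests


# Hypothetical toy haplotypes, not data from this study
no_signal = ["AA", "AT", "TT"]           # only three gametic types
with_signal = ["AA", "AT", "TA", "TT"]   # all four gametic types

print(has_recombination_signal(no_signal), has_recombination_signal(with_signal))

# 4950 pairwise comparisons is the number of unordered pairs among 100 sites
n_tests = comb(100, 2)
threshold = bonferroni_threshold(alpha=0.05, n_tests=n_tests)  # alpha is an assumption
print(n_tests, threshold)
```

Under the assumed alpha, each of the 4950 comparisons would need P below roughly 1 × 10⁻⁵ to remain significant after correction.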
We also considered the relationship among haplotypes (Figure 2). Within M. domesticus, there was generally a single common haplotype and several rare haplotypes, often separated by one or a few mutational steps from the most common haplotype. One exception was Maoa, where several intermediate-frequency haplotypes were observed within M. domesticus. Within M. musculus, the pattern varied among loci, with some showing a single common haplotype (Msn) and others showing more intermediate-frequency haplotypes (Amelx, Maoa). Reciprocal monophyly between M. domesticus and M. musculus was unambiguously observed at Maoa, Dach2, and Msn (Figure 2). At Amelx, we observed two divergent lineages corresponding to M. musculus and M. domesticus, with one exception. A single M. musculus haplotype was identical to the most common M. domesticus haplotype over its entire 4-kb length, including all insertion/deletion (indel) variants, a microsatellite locus, and all nucleotide sites. Hybridization is known to occur between these two species not far from the sampling locality at which this mouse was trapped (Munclinger et al. 2002). Given the geographic origin of this mouse and its Amelx haplotype that is identical to haplotypes otherwise seen only in M. domesticus, this allele may represent recent introgression from M. domesticus into M. musculus. Excluding this individual, the haplotype network at Amelx is consistent with reciprocal monophyly. At Dmd, the haplotype network is unresolved, as the root (M. spretus) falls at a trichotomy, with one branch leading to all M. musculus haplotypes and two branches each leading to M. domesticus haplotypes. This pattern is not inconsistent with reciprocal monophyly between the species, but additional data are needed to resolve this trichotomy.
We performed several statistical tests of neutrality within M. domesticus and within M. musculus. First we performed tests based on the distribution of allele frequencies, including Tajima's D (Tajima 1989), Fu and Li's D (Fu and Li 1993), and Fay and Wu's H (Fay and Wu 2000). In M. domesticus, we generally observed negative values of Tajima's D and Fu and Li's D, consistent with an excess of rare polymorphisms, although most of these values were not significant and no locus showed significant values for all tests (Table 2). In M. musculus, both positive and negative values for Tajima's D and Fu and Li's D were observed, and none was significant. We also calculated Tajima's D for short and long indels within M. domesticus and within M. musculus, summed across loci. None of these values was significant (P > 0.05 for each). Second, we compared the ratio of polymorphism to divergence across multiple loci using the HKA test (Hudson et al. 1987). This test was applied in a pairwise manner and for all five loci simultaneously, using M. caroli as the outgroup. Neither the pairwise tests (Table 2) nor the multilocus test (sum of deviations = 6.389) rejected a neutral model.
Mice are known to live in highly structured demes at the local level (e.g., Delong 1967), but large-scale geographic structure was less evident when mapped onto a haplotype network for M. domesticus (supplemental Figure 4 at http://www.genetics.org/supplemental/). All geographic regions included the most common haplotype and several rare haplotypes. We calculated FST to assess the degree of population subdivision in M. domesticus. Values of FST obtained for each gene in comparisons among major geographic regions ranged from 0 to 0.5, with a mean value of 0.189 (Table 3). Some comparisons revealed significant structure. For example, the average FST value at Maoa was 0.312 and comparisons between geographic regions for this locus were all individually significant. An AMOVA analysis was carried out for a concatenated data set within M. domesticus. Consistent with a lack of strong population structure on a continental scale (across Western Europe), the greatest proportion of variation was found within populations (99.15%). Almost none of the variation was due to differences among regions (0.05%). In this multilocus analysis, FST was low and marginally significant (0.09, P = 0.06). Both ϕCT and ϕSC, measures of differentiation among populations and among regions, respectively, were low and nonsignificant (0.008, P = 0.09 and 0.0006, P = 0.95, respectively).
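To make concrete what FST measures, the sketch below computes a simple sequence-based estimator, FST = 1 − Hw/Hb, where Hw is the mean number of pairwise differences within populations and Hb is the mean between populations. This estimator is swapped in for illustration only; it is not the AMOVA-based calculation performed by Arlequin in this study. The function names and toy sequences are hypothetical, and each population is assumed to contain at least two sequences.

```python
from itertools import combinations, product


def mean_pairwise_differences(pairs):
    """Average number of differing sites over an iterable of sequence pairs."""
    pairs = list(pairs)
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


def fst_from_pairwise_differences(pop1, pop2):
    """FST = 1 - Hw/Hb for two populations of aligned sequences."""
    hw = (mean_pairwise_differences(combinations(pop1, 2))
          + mean_pairwise_differences(combinations(pop2, 2))) / 2
    hb = mean_pairwise_differences(product(pop1, pop2))
    if hb == 0:
        return 0.0  # no differences between populations at all
    return 1 - hw / hb


# Hypothetical toy populations, not data from this study
fixed_difference = fst_from_pairwise_differences(["AA", "AA"], ["TT", "TT"])
partial_structure = fst_from_pairwise_differences(["AA", "AT"], ["TA", "TT"])

print(fixed_difference, partial_structure)
```

With small samples this estimator can be negative when within-population differences exceed between-population differences, which is usually interpreted as no detectable structure.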
Levels of variation and gene genealogies for 11 X-linked and 18 autosomal loci:
We compared levels of variation at the five X-linked loci that we resequenced to levels of variation at six X-linked loci previously published (Table 4). Mean π was slightly higher for the six loci in Baines and Harr (2007) than for the five loci sequenced here, both within M. domesticus (0.10% vs. 0.07%) and within M. musculus (0.07% vs. 0.03%). These differences might be due to sampling differences or stochastic variation in levels of π among loci. The mean length of the five loci sequenced here was 4234 bp while the mean length of the six loci in Baines and Harr (2007) was 523 bp. We also compared the number of polymorphisms and fixed differences among the five X-linked loci that we resequenced and the six loci in Baines and Harr (2007), and we found no significant differences (FET, P = 0.869). These data sets are therefore pooled in the analyses below. The average level of nucleotide variability on the X chromosome (π = 0.08%) was considerably lower than on the autosomes (π = 0.26%), as previously noted (Baines and Harr 2007).
Neighbor-joining trees for the 11 X-linked and 18 autosomal loci are shown in Figures 2 and 3, and the numbers of fixed differences, shared polymorphisms, and exclusive polymorphisms are given in Table 4. All three kinds of genealogies that are depicted in Figure 1 can be seen among these 29 loci (Figures 2 and 3). Some loci showed reciprocal monophyly between M. musculus and M. domesticus (e.g., Bnc1), while others were paraphyletic (e.g., Melk) or polyphyletic (e.g., Nkd1). Some gene genealogies are unresolved due to an absence of informative sites, especially on the X chromosome, consistent with its low level of variation. Nonetheless, there are some interesting differences between the X chromosome and the autosomes. Excluding unresolved genealogies, the X chromosome included five monophyletic, no paraphyletic, and no polyphyletic gene genealogies, while the autosomes included nine monophyletic, three paraphyletic, and two polyphyletic gene genealogies. Thus, there was unambiguous evidence for paraphyly and polyphyly only at autosomal loci. Similarly, the proportion of fixed differences was greater on the X chromosome (69/212 = 33%) than on the autosomes (57/271 = 21%; Table 4). We compared the number of fixed differences, shared polymorphisms, polymorphisms within M. musculus, and polymorphisms within M. domesticus on the X chromosome (the counts in each category, respectively, are 69, 0, 35, and 108) and on the autosomes (57, 6, 71, and 137; Table 4). We used a Monte Carlo procedure (Lewontin and Felsenstein 1965) to ask whether the counts are significantly different between the X chromosome and the autosomes in this 2 × 4 contingency table. We generated 100,000 random tables with marginal sums equal to the observed data, using a program kindly provided by Bill Engels, and found that the observed table is highly unlikely (P = 0.0005). 
Similarly, the ratio of fixed differences to total polymorphisms was greater on the X (69:143) than on the autosomes (57:214; FET, P = 0.004). Thus, both the gene genealogies and the distribution of polymorphic and fixed nucleotide sites support the notion that the X chromosome is more differentiated than the autosomes between M. musculus and M. domesticus.
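The logic of the Monte Carlo test on the 2 × 4 table can be sketched in Python. This is not the program used in the study: it assumes a Pearson chi-square statistic as the measure of departure, generates random tables by shuffling site categories between the two chromosome classes (which keeps both sets of marginal sums fixed), and uses fewer replicates than the 100,000 reported. All names are hypothetical, so the resulting P-value is expected to be of similar magnitude to, but not identical with, the published value.

```python
import random


def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def monte_carlo_p(table, n_reps=10000, seed=1):
    """Fraction of random two-row tables (fixed margins) at least as extreme as observed."""
    rng = random.Random(seed)
    col_totals = [sum(col) for col in zip(*table)]
    n_first_row = sum(table[0])
    # One entry per site, labeled by its category (column index)
    categories = [j for j, total in enumerate(col_totals) for _ in range(total)]
    observed_stat = chi_square(table)
    hits = 0
    for _ in range(n_reps):
        rng.shuffle(categories)
        row0 = [0] * len(col_totals)
        for c in categories[:n_first_row]:  # sites assigned to the first row
            row0[c] += 1
        row1 = [t - a for t, a in zip(col_totals, row0)]
        if chi_square([row0, row1]) >= observed_stat - 1e-12:
            hits += 1
    return hits / n_reps


# Counts from the text: fixed, shared, M. musculus only, M. domesticus only
observed = [[69, 0, 35, 108],   # X chromosome
            [57, 6, 71, 137]]   # autosomes

p_value = monte_carlo_p(observed, n_reps=5000)
print(p_value)
```

Shuffling category labels and taking the first 212 sites as "X-linked" is equivalent to sampling tables from the null distribution with both margins fixed.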
To further investigate the differentiation of the X chromosome compared to the autosomes, we analyzed data from the Wellcome Trust Centre for Human Genetics in which ∼8500 SNPs were typed in seven wild-derived M. domesticus and eight wild-derived M. musculus (http://gscan.well.ox.ac.uk/gs/strains.cgi). Perl scripts were written to parse X-linked and autosomal variation. On the autosomes, we observed roughly equal numbers of fixed differences and shared polymorphisms, while on the X chromosome there were 112 fixed differences but only one shared polymorphism (Table 5). The relatively greater differentiation of the X chromosome in the Wellcome Trust data compared to the data shown in Table 4 may reflect bias in the ascertainment of SNPs in the Wellcome Trust data (Boursot and Belkhir 2006; Harr 2006a,b).
Amount of wild variation captured by classical inbred strains:
The haplotypes seen in the nine classical inbred strains are shown in Figures 2 and 3. The number and origin of SNPs found among inbred strains for the 29 loci surveyed here are given in Table 6. These nine classical inbred strains capture only a small proportion of the variation seen in nature. For example, the strains were invariant at 12 of the 29 loci. Similarly, the inbred strains contained only 87 SNPs (Table 6), compared to 483 SNPs among wild mice at these same loci (Table 4).
There are notable differences between the X chromosome and the autosomes in the levels of variation captured in the inbred strains. The inbred strains were invariant at 8/11 = 73% of loci on the X chromosome and 4/18 = 22% of loci on the autosomes, a difference that was significant (FET, P = 0.02). This can also be seen in the number of SNPs on the X chromosome and on the autosomes; the inbred strains contained 6.1% of the SNPs present in the wild on the X chromosome and 26.6% of the SNPs present in the wild on the autosomes, a difference that was also significant (Table 7). Not only did the X chromosome contain less variation than the autosomes, but also the origin of the SNP variation on the X chromosome was different from the autosomes. The ratio of fixed differences to polymorphisms on the X (9/4) was significantly greater than on the autosomes (22/52; Table 7). In other words, much of the SNP variation among inbred strains on the X chromosome corresponds to differences between species, while most of the SNP variation among inbred strains on the autosomes corresponds to differences within species. It is important to bear in mind that this conclusion derives from consideration of only 29 loci in nine strains; sequencing of additional loci in samples of wild and laboratory mice will be needed to fully describe the origin of genetic variation among lab mice.
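The 2 × 2 comparisons in this section can be checked with a small, dependency-free Python sketch of a two-sided Fisher's exact test. The function name is hypothetical, and the two-sided P-value is defined here as the sum of probabilities of all tables no more likely than the observed one; the text does not say which two-sided convention was used, so small differences from the reported values are possible.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))


# Invariant vs. variable loci among inbred strains: X (8 of 11) vs. autosomes (4 of 18)
p_invariant_loci = fisher_exact_two_sided(8, 3, 4, 14)

# Fixed differences vs. polymorphisms among inbred-strain SNPs: X (9/4) vs. autosomes (22/52)
p_snp_origin = fisher_exact_two_sided(9, 4, 22, 52)

print(round(p_invariant_loci, 3), round(p_snp_origin, 3))
```

Python's integer division handles the large binomial coefficients exactly before converting to floating point, so no external library is needed for tables of this size.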
Our data allowed us to identify the species origin of individual haplotypes among inbred strains. We surveyed nine strains at 29 loci, representing 261 gene copies (supplemental Table 2 at http://www.genetics.org/supplemental/). Of these, 16.5% were of uncertain origin, usually because the corresponding haplotype was shared between M. musculus and M. domesticus, 5.4% were of M. musculus origin, and 75.1% were of M. domesticus origin. Thus, for these 29 loci, these inbred strains are predominantly of M. domesticus origin, consistent with other recent studies (Frazer et al. 2007; Yang et al. 2007).
Comparisons to MPD:
We were interested in asking more generally how SNP variation in our wild sample compares to SNP variation in existing databases, including all laboratory strains (both classical and wild derived). This is important since these laboratory strains represent the existing tools for most current mapping efforts. We were interested in asking two questions: (1) How many of the SNPs in existing databases were found in our survey of wild mice? and (2) How many of the SNPs in our sample of wild mice were found in existing databases? The MPD is the central repository for SNPs among all laboratory strains of mice. A total of 117 SNPs were found in MPD for the regions that we sequenced, and 107 of these were observed in wild mice (supplemental Table 3 at http://www.genetics.org/supplemental/). Thus, nearly all of the SNPs known from laboratory strains were captured in these wild samples. A total of 483 SNPs were identified in the samples of wild mice and 107 of these (∼22%) were found in MPD. Thus, wild mice provide a large reservoir of additional genetic variation not currently captured in laboratory strains.
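The two overlap questions posed above reduce to a set intersection, sketched here in Python. The function name is hypothetical, and the integer identifiers are placeholders constructed only to reproduce the reported counts (117 SNPs in MPD, 483 in wild mice, 107 shared); they are not real SNP positions.

```python
def overlap_fractions(database_snps, wild_snps):
    """Return (fraction of database SNPs seen in the wild, fraction of wild SNPs in the database)."""
    shared = database_snps & wild_snps
    return len(shared) / len(database_snps), len(shared) / len(wild_snps)


# Placeholder identifiers chosen so the set sizes match the counts in the text
mpd_snps = set(range(0, 117))     # 117 SNPs found in MPD for the sequenced regions
wild_snps = set(range(10, 493))   # 483 SNPs identified in wild mice; 107 overlap with MPD

frac_mpd_in_wild, frac_wild_in_mpd = overlap_fractions(mpd_snps, wild_snps)
print(f"{frac_mpd_in_wild:.1%} of MPD SNPs seen in wild mice")
print(f"{frac_wild_in_mpd:.1%} of wild SNPs present in MPD")
```

In practice the keys would be genomic coordinates on a common genome build, so that the same site is matched across data sets.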
We studied DNA sequence variation at 11 X-linked and 18 autosomal loci in wild and inbred mice. We draw four main conclusions:
Levels of genetic variation in wild mice are generally low, although slightly higher than in humans. Effective population sizes for mice are on the order of 10⁵.
M. musculus and M. domesticus diverged recently, and many gene genealogies are reciprocally monophyletic between these species, while others are paraphyletic or polyphyletic. In general, the X chromosome is more differentiated than the autosomes.
Nine commonly used inbred strains contain only a small amount of the genetic variation seen in wild mice. The X chromosome contains proportionately less variation among inbred strains than do the autosomes. SNP variation on both the X chromosome and the autosomes derives from differences between species as well as differences within species, although the proportion of SNPs deriving from interspecific variation was greater on the X than on the autosomes for the genes that we surveyed.
Public SNP databases for the mouse, which include variants from all inbred strains, still contain only a small fraction of the SNPs seen in wild mice.
Below we discuss each of these conclusions in turn.
Levels and patterns of genetic variation in wild mice:
We resequenced five X-linked loci and found generally low levels of nucleotide variation and little geographic structure on a continental scale. Previous studies of genetic variation in natural populations of house mice have included work on allozymes (e.g., Selander et al. 1969a,b; Hunt and Selander 1973; Britton-Davidian et al. 1989), mtDNA (e.g., Nachman et al. 1994; Prager et al. 1996; Tryfonopoulos et al. 2005), MHC alleles (e.g., Arden and Klein 1982; Potts et al. 1991), t-haplotypes (e.g., Ardlie and Silver 1998; Dod et al. 2003), chromosomal polymorphisms (e.g., Pialek et al. 2005), microsatellites (e.g., Ihle et al. 2006), and DNA sequence variation at nuclear genes (e.g., Nachman and Aquadro 1994; Nachman 1997; Karn and Nachman 1999; Harr 2006a; Ihle et al. 2006; Baines and Harr 2007). Our results are concordant with some of these earlier studies in several respects. First, the average level of nucleotide diversity reported here for introns of five X-linked genes in M. domesticus (π = 0.07%) matches very nearly the value previously reported for 6022 bp of X-linked intron sequence from a sample of 10 M. domesticus (π = 0.08%; Nachman 1997) and 6 X-linked genes in six to eight mice from across the range of M. domesticus (π = 0.10%; Baines and Harr 2007). Second, studies of mtDNA and nuclear genes documented higher levels of genetic variation in M. domesticus than in M. musculus (Prager et al. 1996; Baines and Harr 2007), and this is corroborated by each of the five X-linked loci that we studied. Third, studies of allozymes (Britton-Davidian et al. 1989), mtDNA (Nachman et al. 1994), and nuclear DNA (Baines and Harr 2007) found relatively little geographic structure across Western Europe in M. domesticus, comparable to our findings. The lack of strong geographic structure for M. domesticus is consistent with an archaeological record indicating that this species expanded its range into Western Europe from the Fertile Crescent within the past 10,000 years (Auffray et al. 1990; Cucchi et al. 2005). The generally negative values of Tajima's D and Fu and Li's D suggest a population expansion, and this may have accompanied the known range expansion.
The observed levels of variation at X-linked loci suggest a long-term effective population size for mice of 58,000 for M. domesticus and 25,000 for M. musculus. As pointed out by Baines and Harr (2007), however, levels of variation on the X chromosome may be reduced by selection, relative to levels of variation on the autosomes. The average level of nucleotide variation on the autosomes (Baines and Harr 2007) suggests an effective population size of 160,000 for M. domesticus and 100,000 for M. musculus [M. domesticus: Ne = θ/(4μ) = (2.6 × 10⁻³)/(1.6 × 10⁻⁸) = 1.6 × 10⁵; M. musculus: Ne = (1.6 × 10⁻³)/(1.6 × 10⁻⁸) = 10⁵]. Given uncertainties in the estimates of mutation rate, these values should be taken as rough approximations; however, it seems that M. domesticus has a long-term species-wide Ne on the order of 10⁵.
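The arithmetic behind these estimates can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mutation rate of 4 × 10⁻⁹ per site per generation is the value implied by the 4μ = 1.6 × 10⁻⁸ denominator above, nucleotide diversity is used as the estimate of θ, and the X-linked relation θ = 3Neμ (equal breeding sex ratio) is our assumption rather than something stated in this section. The function names are hypothetical.

```python
# Per-site, per-generation mutation rate implied by 4*mu = 1.6e-8 in the text.
MU = 4e-9


def ne_autosomal(theta, mu=MU):
    """Effective population size from autosomal diversity (theta = 4 * Ne * mu)."""
    return theta / (4 * mu)


def ne_x_linked(theta, mu=MU):
    """Effective population size from X-linked diversity.

    Assumption (not stated in this section): theta = 3 * Ne * mu, which holds
    for the X chromosome when the breeding sex ratio is equal.
    """
    return theta / (3 * mu)


if __name__ == "__main__":
    # Autosomal diversity values quoted in the text (as proportions, not %).
    print(f"M. domesticus, autosomes: {ne_autosomal(2.6e-3):,.0f}")
    print(f"M. musculus, autosomes:   {ne_autosomal(1.6e-3):,.0f}")
    # X-linked diversity for M. domesticus (pi = 0.07%).
    print(f"M. domesticus, X-linked:  {ne_x_linked(7e-4):,.0f}")
```

Under these assumptions the three values come out near 1.6 × 10⁵, 10⁵, and 5.8 × 10⁴, in line with the rounded figures given above. Changing `mu` shows directly how sensitive the estimates are to the assumed mutation rate.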
This study allows us to compare patterns of DNA sequence variation in mice to patterns seen in humans, the only other mammalian species for which extensive population samples of DNA sequence variation have been obtained. Levels of variation on the X chromosome in mice are nearly identical to levels of nucleotide diversity at X-linked introns in humans (π = 0.07%; Hammer et al. 2004), although levels of variation on the autosomes appear to be roughly twice as high in mice (π = 0.26%; Table 4) as in humans (π = 0.11%; Li and Sadler 1991; Aquadro et al. 2001). Similarly negative values of Tajima's D are seen in non-African populations of humans and in European populations of mice at X-linked loci (Hammer et al. 2004), probably consistent with population expansions associated with range expansions in both species. Levels of intralocus LD are also similarly high in both species (e.g., Hammer et al. 2004). Recombination rates in mice (∼0.5 cM/Mb; Shifman et al. 2006) are roughly half as high as in humans (∼1 cM/Mb; Kong et al. 2002); however, the larger population size of mice suggests that the population recombination rate (i.e., 4Nec) may be roughly similar in both species. Recent work has shown that the decay of LD occurs over similar genomic distances in mice and humans (Laurie et al. 2007). The similarities between mice and humans suggest that wild mice will provide a useful comparison to humans for understanding the forces governing genetic variation in nature. In addition, wild mice might serve as useful models for genetic association studies.
Divergence between M. musculus and M. domesticus:
Our results suggest that M. musculus and M. domesticus diverged <500,000, or ∼5Ne, generations ago. The average coalescence time for all alleles in a neutral genealogy is 4Ne generations; however, the variance in the coalescent process is large, and even after 5Ne generations, reciprocal monophyly is not expected for all loci between diverging taxa (Tajima 1983). Consistent with these theoretical expectations, we observed all three patterns in Figure 1 in our data (Figures 2 and 3). Nonetheless, 14/19 = 74% of all unambiguous genealogies were reciprocally monophyletic between M. domesticus and M. musculus. These data highlight the fact that these groups are genetically well differentiated.
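To illustrate how large the variance of the coalescent process is, the following Python sketch computes the mean and standard deviation of the time to the most recent common ancestor for a sample of n alleles under the standard neutral coalescent. It assumes a single panmictic population of constant size, so it is not a model of reciprocal monophyly between two diverging taxa, and the function name and the `scale` parameter are hypothetical.

```python
import math


def tmrca_mean_sd(n, ne, scale=4.0):
    """Mean and SD (in generations) of the time to the MRCA of n sampled alleles.

    While k lineages remain, the waiting time to the next coalescence is
    exponential with mean scale * ne / (k * (k - 1)) generations.
    scale=4 corresponds to autosomal loci in a diploid population.
    """
    if n < 2:
        raise ValueError("need at least two sampled alleles")
    means = [scale * ne / (k * (k - 1)) for k in range(2, n + 1)]
    mean = sum(means)
    # Waiting times are independent exponentials, so each variance is mean**2.
    sd = math.sqrt(sum(m * m for m in means))
    return mean, sd


if __name__ == "__main__":
    mean, sd = tmrca_mean_sd(n=100, ne=1e5)
    print(f"mean = {mean:,.0f} generations, SD = {sd:,.0f} generations")
```

With Ne = 10⁵ and n = 100, the mean is close to 4Ne generations and the standard deviation is roughly 2Ne generations. A divergence time of 5Ne generations is therefore well within the spread of coalescence times, which fits the mixture of genealogical patterns observed.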
Despite this overall high level of differentiation, the X chromosome appears to be more differentiated than the autosomes. Excluding the unresolved genealogies, all X-linked loci were reciprocally monophyletic, while only 64% of autosomal loci were reciprocally monophyletic. We found clear evidence for paraphyly or polyphyly only at autosomal loci. Similarly, the ratio of polymorphic to fixed nucleotide differences was significantly lower on the X chromosome than on the autosomes. The greater differentiation of the X chromosome compared to the autosomes is also seen in our analysis of the Wellcome Trust data (Table 5), although these SNPs were ascertained in classical inbred strains and then typed in wild-derived inbreds and thus may contain some bias (Harr 2006a,b; Boursot and Belkhir 2006).
The greater differentiation of the X chromosome compared to the autosomes may be due to several factors. One is the difference in effective population size. The average coalescence time for all alleles is 3Ne generations for the X chromosome and 4Ne generations for autosomes, and thus the X is expected to achieve reciprocal monophyly more quickly. Another possible explanation comes from levels of gene flow across the musculus–domesticus hybrid zone. A number of studies have documented reduced introgression of the X chromosome compared to the autosomes (Tucker et al. 1992; Dod et al. 1993; Prager et al. 1997), and reproductive incompatibilities map to the X chromosome (Oka et al. 2004; Storchova et al. 2004). Third, if positive selection is more frequent on the X chromosome than on the autosomes (e.g., Charlesworth et al. 1987), then the X chromosome would be expected to exhibit shallower gene genealogies and reciprocal monophyly more often than the autosomes. Finally, it is possible that additional sampling will uncover X-linked genes with shared polymorphism.
Amount and pattern of genetic variation among inbred strains:
This study is the first to compare variation at multiple loci in the classical inbred strains of mice to variation in large samples of wild mice. Three important patterns emerge.
First, the inbred strains of mice capture a small percentage of the variation seen in nature. The inbred strains were invariant at 41% of the loci surveyed (12/29) and contained 85 SNPs compared to 483 in wild mice. The lack of variation among inbred strains has important implications for their use in identifying genes underlying traits of interest. There are good mouse models for many complex diseases (Peters et al. 2007). If laboratory strains are used to identify the genetic architecture of disease phenotypes and if laboratory strains contain a small fraction of the variation seen in nature, then some genes of importance may go undetected. For example, the recent “collaborative cross” uses a set of eight inbred lines that will be intercrossed and then used to produce a set of 1000 recombinant inbred lines (Churchill et al. 2004). This panel will provide a powerful tool for mapping genes to ∼1 cM resolution. The total amount of variation present in this panel, however, is limited by the variation present in the initial founders. It is important to recognize that inbred strains of mice may contain a sufficient number of SNPs to “tag” nearly every gene in the genome. However, the reduction in variation in lab strains compared to wild mice is likely to have excluded many functionally important alleles, especially if they were rare in the wild. Wild mice could provide an additional source of functional genetic variation for mapping efforts. Yang et al. (2007) note that the genomes of laboratory mice contain large regions of extremely low diversity and that these represent “blind spots” for the study of complex traits. Our results suggest that wild mice could fill in these genetic blind spots.
Second, despite the overall low level of variation seen among inbred strains, there is significantly less variation on the X compared to autosomes, and this difference is greater than expected on the basis of levels of variation seen in the wild (Table 7). Thus, X chromosome variation is underrepresented in lab strains of mice. This could be caused by a small ratio of females to males in the founding of inbred strains (Ferris et al. 1982). Another possibility is that the X chromosome contained a large number of incompatibilities in the crosses that were used to establish lab strains, resulting in selection against many X-linked alleles. The fact that hybrid male sterility maps to the X chromosome (Oka et al. 2004; Storchova et al. 2004) is consistent with this view.
Third, among inbred strains, most of the SNPs on the X derive from differences between species, while most of the SNPs on the autosomes derive from differences within species at the genes that we surveyed (Table 7). The genes contributing to fixed differences on the X in our study, Tex16 and Dach2, lie in a large region also identified by Yang et al. (2007) as deriving from different species (on the basis of a single inbred mouse from each species). However, for both the X and the autosomes, a nontrivial proportion of SNPs derive from fixed differences between species. Such species-specific alleles have not been tested by natural selection in combination with species-specific alleles at other loci, and this may give rise to epistatic incompatibilities (e.g., Payseur and Hoekstra 2005). The extent to which such interactions underlie phenotypes of interest in mouse models is unknown, but it is clear that such trans-species polymorphisms do not serve as an accurate model for most human genetic variation.
SNP databases include a small fraction of the variation in nature:
MPD includes a database of all mouse SNPs compiled from many sources. The database contains >10 million SNPs representing >100 classical and wild-derived inbred strains of mice. It is not comprehensive in the sense that not all strains have been sequenced for all genomic regions, but it probably contains a large fraction of all SNPs among all strains since it includes the >8 million Perlegen SNPs derived from resequencing the genomes of 15 major inbred strains (Frazer et al. 2007). The Perlegen data are based on oligonucleotide arrays that cover ∼58% of the reference genome (C57Bl/6J) with a high false-negative rate (∼40%). Although there are many genomic gaps in these data, including many of the regions that we studied here, they represent a very large catalog of SNPs among mouse strains. Nonetheless, we identified only 117 SNPs in MPD at these 29 autosomal and X-linked loci, corresponding to 24% of the SNPs discovered among wild mice (supplemental Table 3 at http://www.genetics.org/supplemental/). This underrepresentation of variation supports the conclusion that studying even a diverse collection of inbred strains of mice may miss important alleles underlying complex traits. Fixed differences between M. domesticus and M. musculus constitute a large proportion of the SNPs in MPD at these 29 loci (supplemental Table 3) on both the X (62%) and the autosomes (40%), which further suggests that the genetic architecture of traits mapped using these strains may not correspond to the architecture of the same traits in human populations.
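The proportions quoted in this discussion can be checked with a short calculation. The Python sketch below uses only the counts given in the text (483 SNPs in wild mice, 85 among classical inbred strains, 117 in MPD, and 12 invariant loci out of 29); the helper name `percent` and the constant names are ours.

```python
def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


WILD_SNPS = 483      # SNPs found in wild mice at the 29 loci
INBRED_SNPS = 85     # SNPs found among classical inbred strains
MPD_SNPS = 117       # SNPs listed in MPD at the same loci
INVARIANT_LOCI = 12  # loci with no variation among inbred strains
TOTAL_LOCI = 29

if __name__ == "__main__":
    print(f"Inbred SNPs / wild SNPs: {percent(INBRED_SNPS, WILD_SNPS):.1f}%")
    print(f"MPD SNPs / wild SNPs:    {percent(MPD_SNPS, WILD_SNPS):.1f}%")
    print(f"Invariant loci:          {percent(INVARIANT_LOCI, TOTAL_LOCI):.1f}%")
```

The second and third ratios round to the 24% and 41% reported in the text. The first, about 18%, is our own derived figure for the share of wild variation captured by the classical inbred strains at these loci.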
P. Basset, M. D. Dean, J. M. Good, E. Kelleher, L. Kent, M. Sans-Fuentes, M. A. Saunders, and G. Wlasiuk provided useful discussion and suggestions. We also thank D. J. Begun and three anonymous reviewers for helpful comments on the manuscript, J. G. Krenz and M. A. D'Urso for technical assistance, and M. D. Dean and M. Carneiro for computational assistance. T.S. was supported by a fellowship from the National Science Foundation (NSF) Integrative Graduate Education and Research Traineeship Program in Evolutionary, Functional and Computational Genomics at the University of Arizona. A.G. was supported by a fellowship from the Fundacao para a Ciencia e a Tecnologia (SFRH/BPD/24743/2005). This work was supported by NSF and National Institutes of Health grants to M.W.N.
- Received August 4, 2007.
- Accepted October 15, 2007.
- Copyright © 2007 by the Genetics Society of America