In recent years in silico analysis of common laboratory mice has been introduced and subsequently applied, in slightly different ways, as a methodology for gene mapping. Previously we have demonstrated some limitation of the methodology due to sporadic genetic correlations across the genome. Here, we revisit the three main aspects that affect in silico analysis. First, we report on the use of marker maps: we compared our existing 20,000 SNP map to the newly released 140,000 SNP map. Second, we investigated the effect of varying strain numbers on power to map QTL. Third, we introduced a novel statistical approach: a cladistic analysis, which is well suited for mouse genetics and has increased flexibility over existing in silico approaches. We have found that in our examples of complex traits, in silico analysis by itself does fail to uniquely identify quantitative trait gene (QTG)-containing regions. However, when combined with additional information, it may significantly help to prioritize candidate genes. We therefore recommend using an integrated work flow that uses other genomic information such as linkage regions, regions of shared ancestry, and gene expression information to obtain a list of candidate genes from the genome.
THE mouse represents one of the most commonly used model organisms in science. Its fast breeding rate, ability to control the environment in which it lives, and ability to control its genetic background make it the ideal organism to study the genetics of a multitude of traits and diseases. To date, the most common approach for identifying genes underlying traits [quantitative trait genes (QTG)] using mice is linkage analysis, a method that generally consists of crossing two strains. By studying the recombinations in the following generation(s) one can identify large regions that potentially contain QTG. Those regions subsequently are analyzed in greater detail using additional mice, a process that takes years to get to the gene(s) and is rarely successful (for a recent review, see Flint et al. 2005). Narrowing QTL regions still remains a “major challenge” (Mott 2006). In 2001, Grupe et al. (2001) illustrated how an alternative approach based on comparing phenotypes from different strains could be applied to gene mapping. Although their initial approach was based on a small number of genotypes and strains, it did pave the way for a novel approach, which, simply put, tests for association between a trait and a genotype treating different strains of mice as individuals from a population. The approach is intuitive; however, its chance of success as initially applied is debatable (Darvasi 2001).
Over the past few years, reports have started to appear using in one way or another the idea of in silico analysis to identify genes (Pletcher et al. 2004; Cervino et al. 2005; Hillebrandt et al. 2005). Significant efforts have resulted in the release of large amounts of genotyped single nucleotide polymorphisms (SNPs) in inbred strains (Wiltshire et al. 2003; Cervino et al. 2005; Wade and Daly 2005). Phenotypic information has also become increasingly available through public databases (for example, the Phenome database and the Eumorphia project). As a result, many in silico projects are ongoing and many more are being planned. It is therefore timely to revisit some of the strengths and limitations of an in silico-based approach.
The goal of in silico analysis is to identify, if not a gene, a short genomic region or list of candidate genes containing a gene influencing the trait under study. To successfully achieve this goal, many factors need to be considered: the trait under study and its genetic component, the number of strains used in the in silico experiment, the number of genetic markers used in the in silico analysis, and the statistical algorithms used in the analysis. Here we investigated the effect of these factors on both experimental and simulated data sets. First, we looked at the effect of using a 20,000 SNP (20K SNP) map vs. the 140,000 SNP (140K SNP) β-map just released by Wade and Daly (2005), which we refer to as the BROAD/MIT data. Second, we investigated the effect of strain numbers on the power to map QTL using simulated traits as well as the cholesterol phenotype. We selected the cholesterol phenotype because it has been measured in the largest number of strains. Third, we implemented a novel analysis to test for association between haplotypes of varying lengths and traits and compared it to a number of different algorithms on both Mendelian and complex traits. The underlying assumption to our cladistic model—that similar haplotypes correspond to similar disease risks—is particularly well suited to the field of mouse genetics where common laboratory strains all come from a very limited number of founders. This type of analysis takes into account the genetic proximity of related strains, handles missing data, allows for repeated measures, and is fully flexible as it can be used to model different families of distributions. Finally, we illustrate the use of an integrated strategy that combines information from linkage analysis, blocks of shared ancestry (identical by descent, or IBD), in silico analysis, and gene expression. When applied to complex traits, our strategy results in successfully narrowing down the entire genome to a handful of candidate genes.
MATERIALS AND METHODS
Genotypes from 48 strains were downloaded as text files from the BROAD/MIT website and imported into our in-house database. IBD regions were then estimated by identifying regions of at least 1 Mb between any two strains without informative SNPs (Cervino et al. 2006). The images from the 140K map were added to our existing website, which previously contained IBD information on 61 strains using ∼20K SNPs. Both the 20K and the 140K set map to the mouse genome build NCBIM33. For the purpose of comparing the 20K map to the 140K map, we assumed that BALB/cJ, which was used in our existing database, and BALB/cByJ, which was part of the BROAD/MIT list of strains, were the same strains.
In silico methods:
To test for genetic association with any given trait, we used four different algorithms: one S20 or S140 (“#” refers to the number of SNPs used) algorithm tests for association between a genotype and a trait, and the other three algorithms test for association between a haplotype and a trait using a cladistic model: C20.3 or C140.3, C20.10 or C140.10, and C20.sli or C140.sli. Following are the four algorithms that we used:
S#: Single point marker analysis, which returns the P-value from a Student's t-test when the data are continuous or a chi-square if the data are binary. Since the markers are SNPs and the strains are all inbred, only two groups corresponding to the two different alleles exist at each position.
C#.3: Cladistic analysis using a haplotype of three consecutive SNPs. Starting at each position, the first three SNPs are assembled to define a haplotype, a measure of genetic distance between the haplotypes is calculated as the number of alleles differing between each haplotype, and then a tree (using the hierarchical function hclust in the R language) is built on the haplotypes that then define new groups. At each branch of the tree, a test is performed using general linear models with a Gaussian link or a binomial link, depending on whether the data are continuous or binary. The P-value is obtained from comparing the null model of no association with the alternative model of association with the groups as an independent variable, using the chi-square statistic. The algorithm is applied at each SNP position; consecutive haplotypes therefore overlap by two SNPs.
C#.10: Cladistic analysis using a haplotype of 10 consecutive SNPs; this is the same as C#.3 except that the haplotype blocks are longer as they are based on 10 SNPs. The algorithm is applied at each SNP position; consecutive haplotypes therefore overlap by 9 SNPs.
C#.sli: Cladistic analysis using a haplotype of however many SNPs are present in a 100-kb window; This is the same as C#.3 except that the haplotypes are defined by selecting all SNPs in 100-kb windows. The algorithm is applied at each 100-kb interval; consecutive haplotypes do therefore not overlap.
The distribution of P-values using the cladistic model is clearly not random. Adjacent SNPs are in linkage disequilibrium with each other and contribute to adjacent haplotypes. This physical correlation results in clear mountain-like patterns. We tried to take advantage of the patterns to estimate the most likely position of the underlying gene. Toward that effect, we applied a smoothed spline to the distribution of P-values and used the 1-LOD drop from the maximum (similar to what has been done in linkage analysis) to identify intervals containing the candidate gene. The smoothing was done in R using the smooth.spline function with a smooth parameter of 0.5.
Strains and phenotypes:
For the cholesterol trait, we used the raw HDL cholesterol data measure in females from the Phenome database available as part of the Paigen2 project (measurement number: 9904). Of the original 40 strains, 38 strains were used in the 140K analysis and 36 strains were used in the 20K analysis. When strains had similar names like BALB.cByJ (140K) and BALB.cJ (20K), we assumed that the genetic difference was small enough between the two strains to justify using the same genotypes.
For the hepatic fibrosis trait, we used the imputed hepatic hydroxyproline contents measured in seven strains (Hillebrandt et al. 2002).
For the blindness phenotype, we used 37 strains for the 20K analysis and 46 strains for the 140K analysis. The binary traits were obtained from the Jackson laboratory and were the same that had been previously analyzed (Pletcher et al. 2004).
For the albinism phenotype, we used the 12 strains reported in Wade et al. (2002). The trait for albinism is binary.
This simulated trait represents a single-SNP model. The “causative” SNP was randomly selected from the 140K map and maps to 93 Mb on chromosome 14. The trait was simulated in 41 strains, which corresponds to the number of strains that had been genotyped at that SNP. Under our model, the means for the two alleles were 0 and 1 with a standard deviation of 0.01; such a strong effect is similar to a Mendelian trait. SIM1delQTL is the same trait, but we removed the causative SNP from the 140K map when performing the analysis.
This trait was simulated using the following algorithm: two SNPs were selected at random from the 20K SNPs. For each SNP, the first encountered allele was selected as the risk allele and increased the trait (baseline level was set to 0) by 0.5. This two-SNP model represents an additive model that fully explains all of the trait variation. For example, the trait for a strain with no risk allele was set to 0, and strains with two risk alleles were set to 1. The trait values were then transferred to the strains represented in the 140K map. The trait was simulated in 35 strains for the 20K analysis and resulted in 33 strains for the 140K analysis. The two QTL mapped to chromosomes 3 and 12.
The effect of SNP maps:
Previously we had assembled public genotypes from 21,514 SNPs in 61 strains of mice (Cervino et al. 2006); we refer to these genotypes as the 20K SNP map. The 140K SNP map corresponds to the β-release data from the BROAD/MIT, which correspond to 138,793 SNPs genotyped in 49 strains (Wade and Daly 2005). The same assembly (mouse genome build NCBIM33) was used in both maps. We looked at the effect of using the 140K SNP map vs. the 20K map on estimating IBD blocks as well as on the in silico analysis.
The Scripps Florida IBD database provides estimates of SNP-poor regions corresponding to areas of shared ancestry between pairs of strains based on 61 common strains that had been genotyped in ∼20K SNPs (Cervino et al. 2006). When using 20K SNPs, only large blocks of ∼1 Mb could be reasonably well estimated. Clearly the IBD blocks can be more accurately estimated using additional genotypes; we therefore recomputed the IBD blocks using the 140K genotypes from 48 strains and updated our database with the estimated positions of the new IBD blocks. We also updated our website with the corresponding images.
Thirty-five strains were common to both data sets. On the basis of those 35 strains we compared the estimated IBD blocks and found that in general there was good agreement between the two maps. Closely related strains showed the least differences in block estimates and more distant strains showed the greatest differences. Fewer blocks were identified when using the 140K map between distant strains (http://mouseibd.florida.scripps.edu). For example, the overall amount of IBD on chromosome 10 between A and CZECH dropped from 23 to 12%. Looking at closely related strains such as 129S1 and 129X1, we noted that “SNP singletons,” i.e., individual SNPs situated in large areas of IBD blocks, were not present when using the 140K genotypes. Going back to the original SNP data, we found that most such singletons were submissions from Genomics Institute of the Novartis Research Foundation (GNF) and that those genotypes were not in agreement with genotypes from other experiments, making us conclude that those singletons were genotyping errors and that the quality of the BROAD/MIT genotypes and the Merck genotypes are of greater quality than the GNF published genotypes.
In silico analysis:
To assess the effect of using the two different maps, we compared the results from a single marker analysis as well as a novel clade-based analysis using a three-SNP haplotype (see materials and methods). We used two Mendelian traits as positive controls: the albinism phenotype, which is caused by a mutation at the tyrosinase locus (Yokoyama et al. 1990), and the blindness phenotype, which is caused by a mutation in the Pde6b gene. For albinism, the phenotype from 12 strains was used (Wade et al. 2002); for blindness, 48 strains were used (Pletcher et al. 2004).
For the albinism phenotype, we report the P-values at each SNP on the 19 autosomal chromosomes as well as the P-values based on the cladistic analysis. Twelve inbred strains were used. Figure 1A represents the −log of the P-value for each marker along all 19 chromosomes using the 20K map at the top and the 140K map at the bottom. Figure 1B represents the −log of the P-value using a three-SNP window (plotted at the start of the window) along all 19 chromosomes using the 20K map at the top and the 140K map at the bottom.
Figure 1 shows that chromosome 7 is correctly identified among the markers with the highest P-value in both maps. However, there are more significant markers when using the 140K map (which is to be expected, given the increase in the number of tests performed), which leads to a higher rate of false positives (if selecting on the basis of maximum P-value). When using a single-marker analysis (Figure 1A), 8 of 19 chromosomes are among the markers with the highest P-value; when using the 140K map, this goes up to 9 of 19. Using the cladistic analysis, the number of chromosomes with association is 9/19 for the 20K map, and 10/19 for the 140K map. At the chromosome level, we therefore see no gain from increasing the map resolution in this case. If one did not have prior linkage data to identify the correct chromosome carrying the QTG, one would have to search half of the chromosomes. This illustrates the importance of combining classical QTL analysis with in silico analysis. Given a small number of strains like 12 (which is standard for in silico analysis), an in-silico-only based analysis would pick up the right chromosome along with half of the other chromosomes that make up the mouse genome. However, combined with the knowledge from classic linkage analysis, one knows that the correct chromosome is chromosome 7 and, when looking at chromosome 7 in further detail (Figure 2), one can correctly identify the location of the underlying gene. The position of the tyrosinase gene is represented by a red vertical bar. As is apparent from the 20K map in Figure 2, markers in the neighborhood of the genes have the most significant P-values; this is also the case using the 140K map. However, there are more associated markers when using the 140K map, which is to be expected, given the overall increase in number of markers. As noted at the genomewide level, there is no benefit in this case from using the 140K map over the 20K map.
For the blindness phenotype, we did the same as for the albinism trait and calculated the P-values at each SNP on the 19 autosomal chromosomes as well as the P-values based on the cladistic analysis (Figure 3). For the 20K-map-based analysis, 37 strains were used while for the 140K map, 46 strains were used. Chromosome 5 is correctly identified as the only chromosome with the highest P-value using either map and either method. No other chromosome has an association more significant than 10−6. Compared to the albinism phenotype, this case has a larger sample size (∼40 strains instead of 12), which leads to smaller P-values, and hence one is able to distinguish the correct chromosome from the others. It therefore appears that having large numbers of strains is key to identifying a chromosome when no prior linkage data are available. This will be further investigated in The effect of strain numbers. When looking at chromosome 5 only (Figure 3), again the two maps lead to the same conclusion since they identify the genomic area in the region of the gene (indicated by a red bar).
When using the strongest P-value to select chromosomes or candidate regions, we do not see any gain from using the 140K map compared to the 20K map when performing in silico analysis on simple Mendelian traits.
The effect of strain numbers:
When planning in silico experiments, it is important to estimate the number of strains required to achieve a certain level of power. Wang et al. (2005) looked at the effect of strain numbers on power under varying genetic models and for varying numbers of haplotypes. In their example, a sample size of 10 was really powerful only when looking at very strong genetic effects such as Mendelian traits. For a trait with a genetic effect contributing in the range of 30% to the total variance, 30 strains or more are required for acceptable power. In real examples of complex traits, the genetic contribution is much lower than 30%. To look at the effect of strain numbers on mapping power, we analyzed subsets of simulated data as well as real data. We simulated two data sets (see materials and methods): the first simulated trait, SIM1QTL, corresponds to a single high-risk allele; the second trait, SIM2QTL, was generated assuming two causative SNPs situated on different chromosomes with equal risk and an additive effect. As an example of real data, we selected the cholesterol trait, since HDL cholesterol levels were the phenotype that had been measured in the largest number of strains (n = 38). For both simulated and real traits, we then randomly selected 100 subsets of 30, 20, and 10 strains and applied our cladistic analysis of size 3 followed by spline smoothing to investigate how often the correct QTL region was identified as the most likely region. We defined a 5-Mb interval around the SNP or gene as the QTL region and tested only the chromosomes where the QTL was, on the basis of the results from the in silico analysis discussed above. The results are presented in Table 1. There is a dramatic loss of power as the traits become complex and the genetic effect decreases. Interestingly, whether or not the causative SNP is included in the map (SIM1QTL vs. SIM1delQTL) does not affect power, highlighting the strength of our analytical approach and the informativeness of existing maps. The power decreases proportionally to the number of strains. Surprisingly, in our simulated example of two additive SNPs with extremely large effects, we completely fail to identify the first SNP and the mapping of the second SNP is a few megabases off. The two loci (SIM2QTL) were simulated using the same procedure and both were absent in the 140K map. The difference in the statistical power between the two loci is therefore most likely due to the number of informative SNPs at neighboring makers and to the distance to the nearest markers. A different choice of markers would have led to a difference in power. When we simulated traits with three to five underlying QTL, none or only one QTL was identified using 30 strains (data not shown). With <30 strains or with equal additive multigenic effects, there is little rationale supporting an in silico analysis of complex traits. It is therefore important when planning in silico analysis to obtain an accurate estimate of the number of QTL involved and their associated risks. After that, one should target a corresponding minimum number of strains or use a different strategy such as the one described in Gene mapping in complex traits: an integrated approach.
Since our analytical approach is based on three SNP haplotypes, we would expect it to perform better in classical inbred strains as opposed to wild-derived strains since the causative SNP is more likely to be represented on the same haplotypes. To investigate if there was a loss of power associated with including wild-derived strains in the in silico analysis, we compared the group of sets that correctly mapped the QTL to the frequency of wild-derived strains in our simulated data. If the inclusion of wild-derived strains had resulted in loss of power, then we would have expected an overrepresentation of wild strains in the random subsets that failed to map the QTL. In our simulated data, we did not see any evidence for a loss of power associated with including the wild strains in the analysis. For example, the average number of wild strains in the 93 groups of 30 strains that successfully mapped the single QTL (SIM1QTL) was 3.7 compared to 3.3 in the seven groups of 30 strains that failed to map the same QTL.
The effect of statistical algorithms:
In in silico analysis, one tests for association between a trait and a genetic marker in a population of inbred strains. Cervino et al. (2005) tested for association between individual SNPs and the trait of interest. Pletcher et al. (2004) and Hillebrandt et al. (2005) used a haplotype defined by a three-SNP window. Here, we have implemented for the first time a cladistic analysis to test for association between a trait and haplotypes of sizes 3 SNPs, 10 SNPs, and of length 100 kb. Given the recent shared history of most of the laboratory mice, applying a cladistic analysis to this type of data is a natural choice. The cladistic analysis is a haplotypic analysis, where the haplotypes get merged into groups in a stepwise fashion (see Figure 4). The cladistic approach will be especially successful when the risk-carrying mutation is fully associated with a single haplotype. Such a situation is most likely to be the case in laboratory mice since they were derived from a limited number of founders. This method has been applied to human populations (Durrant et al. 2004; Durrant and Morris 2005) with increased power when compared to more standard methods, but has never been implemented to study inbred mice. Because laboratory mice are inbred, their haplotypes are known.
To assess the performance of the cladistic analysis, we compared single-marker analysis to the cladistic analysis with windows of sizes 3, 10, and of length 100 kb on four traits: two Mendelian traits and two complex traits with known or strong candidates. We report the results for the 140K maps, since we would assume that most scientists would use the denser map; results for the 20K map are available from the authors upon request.
The ultimate goal of the statistical analysis is not to identify an association with a marker but to estimate the most likely position of the risk-carrying gene. To that end, we implemented the following approaches: For the single marker analysis, we report the position of the first and last markers that correspond to the minimum P-value at the selected chromosome. For the cladistic analysis, we applied two approaches: first we report the positions of the most strongly associated haplotypes at the selected chromosome. The second approach is based on the P-values resulting from the cladistic analysis. For each SNP position along the chromosome, a P-value is calculated from testing for association between a clade and the trait. The resulting distribution of P-values along the chromosome is a not random distribution. Given the correlation between adjacent P-values, we assumed that applying a smoother would help us in estimating the interval containing the QTG. Smoothing does allow the extrapolation to a continuous scale; i.e., we have information at every position, not just where the SNPs are, and it favors associations with multiple windows over single isolated associations, which we intuitively felt more convincing. We therefore used a 1-LOD drop in the distribution of the P-values after fitting a smoothed spline (see materials and methods). We chose to apply a spline that is a nonparametric smoother since the true distribution of the P-values is not known. The results are reported in Table 2.
The first thing that one notes is that for the two Mendelian phenotypes (blindness and albinism), all algorithms map the association within 1 Mb of the underlying gene. Different methods are closer or further away from the gene and have intervals of varying length for estimating the gene position. Application of the spline smoother results in larger confidence intervals for the blindness traits when compared to using the raw haplotype-based P-values. However, when multiple haplotypes are associated with the trait, as is the case for the albinism trait (Figure 5), the application of a smoother to the distribution of P-values results in tighter intervals.
Overall, the application of a “spline smoother” to the P-value distribution results in intervals from 2 to 14 Mb with an average of ∼5 Mb, which are relatively large intervals, although they are considerably smaller than intervals from linkage analysis. When looking at the results from the analysis of complex traits such as cholesterol level (Figure 6) and hepatic fibrosis, one notes that different methods map the most significant association to different parts of the chromosome. HDL cholesterol and hepatic fibrosis each have genes that are known to influence the traits. For cholesterol, it is Apoa2 (Hedrick et al. 2001; Wang and Paigen 2002; Suto 2006); other genes have also been mapped to the same chromosome (Wang et al. 2003; Cervino et al. 2005), which will influence the analysis and decrease the power to identify ApoA2. On the basis of recombinant inbred (RI) sets (N × SM), 74% of the HDL cholesterol variance can be explained by that locus (Mehrabian et al. 1993); in an independent F2 cross (B6 × RR), ApoA2 was found to explain 33.1% of the variance (Suto et al. 2004). For hepatic fibrosis, two QTL have been identified that explain 2–5% of the total variance (Hillebrandt et al. 2002); subsequently the second QTL was narrowed down to a QTG: Hc (Hillebrandt et al. 2005). Only the cladistic analysis with a window of size 3 followed by spline smoothing identifies the position within 1 Mb of the location of those genes; the cladistic analysis with windows of 100 kb estimates the position of Hc as just over 1 Mb away while the other methods identify other regions. Both the single point analysis and the cladistic analysis of haplotypes of size 3 alone failed to identify either regions.
From these examples, it appears that the cladistic analysis of size 3 with spline smoothing is the best-performing algorithm followed closely by the cladistic analysis of 100-kb haplotypes with spline smoothing. The intervals when applying the smoothing technique on the P-values and the use of a 1-LOD drop do result in larger intervals than when using a single marker or haplotype-based analysis when there is a single marker or a single haplotype with the strongest association. However, when there are multiple haplotypes or single markers with the same level of association, and in different parts of the chromosome, one has no way of differentiating which one is more likely to be the true one. Fitting a spline to smooth the distribution of the P-values, as we present here for the first time, does have the advantage that observations from nearby SNPs all contribute to the calculations, thereby helping to identify the true genomic region containing the underlying gene.
Gene mapping in complex traits: an integrated approach:
As we have just shown, in silico analysis by itself performed on a limited number of strains will most likely fail at identifying the correct region underlying complex traits. To help with the gene-mapping process, it is therefore necessary to bring information from multiple sources such as linkage data, IBD information, and gene expression. We now illustrate how this integrated approach was applied to two complex traits: cholesterol and hepatic fibrosis.
To illustrate the application of our approach when using solely public data, we chose cholesterol levels. To identify a gene influencing cholesterol levels, we used the linkage and phenotypic data from the Jackson laboratory. Looking at their informatics website, we selected HDL levels as a trait and selected the chromosome 1 QTL as it had the highest LOD score and had been confirmed by others (Korstanje and Paigen 2002). The nearest marker to the peak was D1Mit291, which maps to 184.5 Mb on assembly 33. We therefore used the 170- to 200-Mb region as the QTL region. The F2 cross used to identify the QTL was between SM/J and NZB/B1NJ. We investigated the IBD structure of those two strains in the 30-Mb region of chromosome 1. Three IBD blocks were identified in that region, totaling ∼6 Mb, and were therefore excluded from the candidate region. We added expression information from the GNF atlas to find the genes present in those regions that are expressed in the liver. A total of 190 genes were found in the candidate region (which excluded the IBD blocks). When applying a filter of 3× over the median for liver expression, six genes emerged as the final list of candidates: Ush2a, Fh1, Kmo, Apoa2, Hsd17b7, and Nr1i3. Given that the in silico analysis estimates the position with an uppermost limit at 172 (Table 2), the most likely candidates are therefore Apoa2, Hsd17b7, and Nri3.
By combining information from QTL experiments, IBD blocks, in silico association study, and expression analysis, we narrowed down the candidates to three (Figure 7), one of the three being a known cholesterol gene.
To illustrate the application of our integrated approach to data generated in a single laboratory, we used hepatic fibrosis as an example. Hillebrandt et al. (2002) studied the genetics of hepatic fibrosis following chronic liver injury using a mouse model where mice were treated with CCl4. Linkage analysis was performed by crossing BALB/cJ and A/J strains, and an association study was also done in seven inbred strains. The linkage analysis identified linkage regions on chromosomes 2 and 15. In their follow-up work, they identified Hc as the underlying gene on chromosome 2.
We performed in silico analysis using the latest 140K map and the algorithms described here. Chromosome 15 showed a strong association pattern at ∼68 Mb, while the association on chromosome 2 was much weaker. For the linkage region on chromosome 2, we used the first 35 Mb (up to D2Mit238) and for chromosome 15, we used the region from 30 to 70 Mb, which is around D15Mit26 and D15Mit122. Four blocks were identified on chromosome 2. Adding the information of the in silico results, which peaks ∼30 Mb, the two most likely regions are now 18.7–26.9 Mb and 30.3–35 Mb. In the 30.3- to 35-Mb interval, 172 genes are reported in the GNF atlas; filtering with a median fold increase of three for liver expression, three candidate genes exist: Ass1, Hc, and Slc25a25. Another six known genes were found in the 18.7- to 26.9-Mb interval. The final list of nine candidate genes did, as in the cholesterol example, include the correct one, illustrating the power of this integrated approach. On chromosome 15, four large blocks were identified in the 30- to 70-Mb region. Since in silico results peak at 68 Mb, the 58.5–70 Mb are the most likely region to contain a QTG. Looking at the gene expression information, only the following two candidate genes pass the threefold criteria: Mtss1 and A1bg.
With the β-release of the 140K map from Wade and Daly (2005), and the many ongoing projects to measure phenotypes in different inbred strains, in silico analysis is well on its way to becoming a standard methodology for identifying genes underlying complex traits. Under certain genetic models, such as Mendelian traits, it can be used by itself; however, we have found that used by itself in the complex traits presented here, it failed to identify the correct QTG regions. We therefore concluded that in complex traits it is best to use in silico analysis in conjunction with other sources of information such as linkage analysis (RI, F2, or BC1), expression data [expression QTL (eQTL) analysis or high expression in a tissue of interest], and knowledge about IBD regions. The gene expression information as well as the genotypic information are already well covered by public sources such as the GNF and the Jackson laboratory websites. Although we did not find that the in silico analysis was significantly improved by using this denser map, we did find that the IBD map from the 140K map was of higher quality than the previous 20K map, which was collected from multiple sources.
One of the interesting aspects of in silico analysis is the statistical methodology used to test for association. Here we introduced a novel cladistic approach to test for association between a haplotype and a trait. We varied the haplotype's length since fixed window sizes are easy to compute but do lack biological meaning. We therefore used a haplotype of 100 kb as well as fixed SNP sizes. Although intuitively we preferred that method of determining haplotypes, there did not seem to be any increase in power over a three-SNP window. One could further vary the length of the haplotype, depending on the area of the chromosome, as one might benefit from having shorter haplotypes in regions with a lot of diversity and longer haplotypes in regions that are conserved (for example, in SNP-poor regions across all tested strains). Recent reports that have investigated the length of haplotype blocks in mice could be used to help determine the optimal haplotype length (e.g., Wade et al. 2002; Frazer et al. 2004; Yalcin et al. 2004; Adams et al. 2005). The advantage of the cladistic approach is that it seems to naturally apply to inbred mice since it builds a tree based on the genetic distances and hence it recreates the phylogeny of the strains for that particular region, which is a better approach than specifying what the distance is on the basis of the whole genome. The use of a statistical framework has the added benefit of allowing more complex analyses. Missing genotypic data are handled appropriately, and methods such as repeated-measures models can easily be implemented, taking into account the intrastrain variation. In our examples, we tried to correctly identify the positions of known genes. Except in our simulated data, we focused mostly on single-gene analysis and the most significant P-value, in the case of complex traits; however, there are multiple genes that contribute to the trait. In the future it would be interesting to assess—once there is such a list of validated genes contributing to a complex trait—how the different analyses perform.
We successfully applied our work flow to the analysis of selected phenotypes. The integrated approach proposed here, however, is dependent on having sufficient power and sufficient additional genomic information. To narrow down the genome to a list of candidate genes, we used information on IBD blocks and expression data. When IBD information is not available on the strains originally used in the linkage or RI parental strains, this step can be skipped or IBD blocks estimated from closely related strains can be used. Either expression data can be generated in house at the same time as the linkage/RI/chromosome substitution strains experiments after which eQTL methods can be applied (Schadt 2005) or the data can be generated from the tissue of the inbred strains after which association with the inbred strains can be tested or overexpression in a relevant tissue can be searched for, as we have done here. For eQTL analysis, webQTL is an excellent public resource. For expression-level analysis, the GNF atlas is currently our recommended resource. As with all profiling experiments, the choice of the appropriate tissue can be challenging. Especially for complex traits such as cholesterol, many organs play an important role and one may want to look at expression levels across multiple tissues. In the future, we hope that existing databases will incorporate more and more information, allowing the integrated approach to be implemented systematically. Alternatively, to overcome the power limitations of the in silico approach, but still take advantage of the strain sequence and SNP information, the yin–yang crosses strategy has been suggested (Shifman and Darvasi 2005). Similarly, here we have found that combining the in silico approach with other sources of information may have great value. Therefore, we strongly believe that using an integrated work flow like the one described here is the way to address the limitations of in silico analysis. Our proposed approach quickly provides a short list of candidate genes at no cost and we were able to show that in most cases the final list does include a known QTG.
The authors thank Frank Lammert for providing the hepatic fibrosis measurements and for his comments. This work was supported by the Florida Funding Corporation.
Communicating editor: T. R. Magnuson
- Received August 28, 2006.
- Accepted September 28, 2006.
- Copyright © 2007 by the Genetics Society of America